Spark Overview & Installation


As part of this section, we will see how to set up Spark components while exploring some of the key concepts of this very important service.

We will work with two versions of Spark and set up both Spark 1.6 and Spark 2.3. To set up Spark 2.3, we need CDS (Cloudera's distribution of Apache Spark 2), which is delivered as a Parcel. We will therefore convert our cluster to Parcels before setting up Spark 2.3.

  • Setup and Validate Spark 1.6.x
  • Review Important Properties
  • Spark Execution Life Cycle
  • Convert Cluster to Parcels
  • Setup Spark 2.3.x
  • Run Spark Jobs – Spark 2.3.x

Cluster Topology

We are setting up the cluster on 7+1 nodes. We start with 7 nodes and then we will add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1
  • Masters
    • bigdataserver-2
      • Zookeeper
      • Active/Standby Namenode
    • bigdataserver-3
      • Zookeeper
      • Active/Standby Namenode
      • Active/Standby Resource Manager
    • bigdataserver-4
      • Zookeeper
      • Active/Standby Resource Manager
      • Job History Server
      • Spark History Server
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager
    • bigdataserver-6 – Datanode, Node Manager
    • bigdataserver-7 – Datanode, Node Manager

Learning Process

We follow the same standard learning process whenever we add a software-based service.

  • Downloading and Installing – Even though we have already set up the software using Packages, that is not sufficient for Spark 2. With the Cloudera Distribution we need to use Parcels to set up Spark 2. Hence we will see how to migrate the cluster from Packages to Parcels.
  • Configuration – we need to understand architecture and plan for the configuration.
    • Architecture – Uses HDFS for File System and YARN for Resource Management.
    • Components – Spark Job History Server
    • Configuration Files
      • Spark 1.6.x: /etc/spark/conf
      • Spark 2.3.x: /etc/spark2/conf
    • With Cloudera the location is a bit different; we will see it after setting up the service.
  • Service Logs – /var/log/spark
  • Service Data – Spark is a distributed computing framework and can use any file system supported by the HDFS APIs (such as HDFS, AWS S3, etc.)

Setup and Validate Spark 1.6.x

Even though Cloudera supports both Spark 1.6.x and Spark 2.3.x, only Spark 1.6.x can be set up with Packages.

  • Go to the Cloudera Manager Dashboard
  • Click on Add Service in the drop-down menu of the cluster
  • Choose Spark 1.6.x (don’t choose Stand Alone)
  • We will be using bigdataserver-4 as Spark Job History Server.
  • Review properties and complete the setup process.

Spark is a distributed computing engine that uses file systems supported by the HDFS APIs for storage and YARN or Mesos for resource management. With distributions like Cloudera, we configure Spark to run on YARN.

Run Spark Jobs – Spark 1.6.x

Spark provides APIs as well as a framework for distributed processing.

  • Developers build Spark-based applications using Scala, Python, or Java.
  • When code is released, it is the responsibility of the developers to provide a run guide for their applications.
  • The Spark setup ships with examples, which can be submitted using the spark-submit command. Let us review some of the arguments we can pass to spark-submit to control the run-time behavior of a Spark application.
  • We can also launch the Scala REPL with Spark dependencies using spark-shell and the Python CLI with Spark dependencies using pyspark.
  • After running the jobs, let us also review the UI to monitor running or completed jobs.
  • Here Spark is integrated with YARN, so a Spark job or application is simply a YARN application.
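
To make the spark-submit arguments mentioned above concrete, here is a minimal sketch in Python that assembles a spark-submit command line for the bundled SparkPi example. The helper name build_spark_submit is hypothetical, and the examples jar path is an assumption for a package-based install; verify the actual location on your cluster before running the printed command.

```python
import shlex


def build_spark_submit(app_jar, main_class, app_args=(), *, launcher="spark-submit",
                       master="yarn", deploy_mode="client", num_executors=None,
                       executor_memory=None, executor_cores=None, conf=None):
    """Return the argv list for a spark-submit invocation (hypothetical helper)."""
    argv = [launcher, "--master", master, "--deploy-mode", deploy_mode,
            "--class", main_class]
    # Executor-related options are only added when explicitly requested,
    # so that the cluster defaults apply otherwise.
    if num_executors is not None:
        argv += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        argv += ["--executor-memory", executor_memory]
    if executor_cores is not None:
        argv += ["--executor-cores", str(executor_cores)]
    # Arbitrary Spark properties are passed as --conf key=value.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", "{}={}".format(key, value)]
    argv.append(app_jar)
    argv += [str(arg) for arg in app_args]
    return argv


# Assumption: location of the examples jar on a package-based install.
EXAMPLES_JAR = "/usr/lib/spark/lib/spark-examples.jar"

cmd = build_spark_submit(EXAMPLES_JAR, "org.apache.spark.examples.SparkPi", ["10"],
                         num_executors=2, executor_memory="1g", executor_cores=1)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell on the gateway node. Passing launcher="spark2-submit" produces the equivalent command for Spark 2.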

Review Important Properties

Let us review some of the important properties of Spark.

  • Like Map Reduce, Spark creates containers to process data.
  • These containers are called Executors.
  • With Map Reduce, the number of Map Tasks is based on the number of blocks of the underlying files, whereas Spark creates containers based on the configured allocation.
  • There are 2 types of allocation – static and dynamic.
  • In plain vanilla Spark, allocation is static by default.
  • In the Cloudera Distribution, allocation is dynamic by default.
  • Let us review the properties related to executors as well as allocation for Spark applications. From the command prompt we can check spark-env.sh and spark-defaults.conf under /etc/spark/conf
  • spark-env.sh is a shell script that sets environment variables, whereas spark-defaults.conf is a properties file that controls the run-time behavior of Spark jobs.
  • Unlike Hadoop configuration files, Spark configuration files are not XML files; they are standard properties files where properties are defined as key-value pairs.
  • In the files generated by Cloudera Manager, key and value are separated by “=”. Whitespace is also accepted as a separator.
  • Memory Settings are primarily under
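
As a rough illustration of the properties-file format described above, the following Python sketch parses spark-defaults.conf style text and reports whether dynamic allocation is switched on. The function names and the sample values are hypothetical; the parser is deliberately simplified and ignores escapes and line continuations. The property names themselves (spark.dynamicAllocation.enabled, spark.executor.memory and so on) are standard Spark settings.

```python
import re

# key, optional separator ("=" or ":" or whitespace), then the value
_PROPERTY = re.compile(r"^([^\s=:]+)\s*[=:]?\s*(.*)$")


def parse_spark_defaults(text):
    """Parse properties-style text into a dict (simplified, no escape handling)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _PROPERTY.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def allocation_mode(props):
    """Return 'dynamic' if dynamic allocation is enabled, else 'static'."""
    enabled = props.get("spark.dynamicAllocation.enabled", "false")
    return "dynamic" if enabled.lower() == "true" else "static"


# Hypothetical sample; on a cluster, read /etc/spark/conf/spark-defaults.conf instead.
SAMPLE = """\
# illustrative values only
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0
spark.shuffle.service.enabled=true
spark.executor.memory 1g
"""

props = parse_spark_defaults(SAMPLE)
print(allocation_mode(props), props["spark.executor.memory"])
```

With no spark.dynamicAllocation.enabled entry at all, allocation_mode falls back to "static", which mirrors the plain vanilla Spark default mentioned in the list.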

Spark Execution Life Cycle

Let us understand the execution life cycle of Spark. You can also review it in the official Spark documentation.

  • We submit the job from the client. The JVM launched on the client typically acts as the Driver Program.
  • It talks to the Resource Manager, which creates the Application Master.
  • The Application Master talks to the Worker Nodes on which Node Managers are running and provisions resources based on the allocation settings. Allocation can be either static or dynamic.
  • These resources are nothing but Executors. From the YARN perspective they are Containers.
  • An Executor is a JVM that can run multiple tasks as concurrent threads until the job is complete.
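
Because each Executor is a YARN Container, the memory YARN has to provide is a little larger than the executor heap. The sketch below, with a hypothetical function name, estimates that request using Spark's default overhead rule on YARN (10% of executor memory, with a 384 MB minimum). It does not model any rounding that the YARN scheduler applies on top, so treat the result as an approximation.

```python
def yarn_container_memory_mb(executor_memory_mb, overhead_mb=None):
    """Estimate the memory (in MB) requested from YARN for one executor.

    If no explicit overhead is given, apply the default rule:
    max(10% of executor memory, 384 MB).
    """
    if overhead_mb is None:
        overhead_mb = max(int(0.10 * executor_memory_mb), 384)
    return executor_memory_mb + overhead_mb


for heap in (1024, 4096, 8192):
    print(heap, "->", yarn_container_memory_mb(heap))
```

For small executors the 384 MB floor dominates, which is why a 1 GB executor needs noticeably more than 1 GB from the Node Manager.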

Convert Cluster to Parcels

We can configure Spark 2.3.x on a Cloudera-based cluster using Cloudera Manager, but only if the cluster uses Parcels.

  • Cloudera recommends parcels over packages to build the cluster.
  • We do not need to set up local repositories with parcels. Cloudera Manager downloads and caches the parcels on the node where it is running and takes care of distributing and installing them on all the nodes in the cluster.
  • To use parcels, the server on which Cloudera Manager is running should be able to connect to the Internet.
  • Binaries that come as part of Packages are available under /usr/lib, whereas binaries that come as part of Parcels are available under /opt/cloudera/parcels/CDH/lib
  • Let us see the steps to convert from packages to parcels.
    • Download: Go to Parcels and Download CDH 5. Files will be downloaded and cached on to the server where Cloudera Manager is running.
    • Distribute and Activate: Click on Distribute to deploy parcel-based binaries/jar files on to all the nodes in the cluster, then Activate the parcel.
    • Restart and Deploy: Restart the cluster and redeploy the configurations.
    • Uninstall Packages: Uninstall packages from all the nodes – ansible all -i hosts -a "sudo yum remove -y 'bigtop-*' hue-common impala-shell solr-server sqoop2-client hbase-solr-doc avro-libs crunch-doc avro-doc solr-doc" --private-key=~/.ssh/google_compute_engine
    • Restart Cloudera Agents: We can restart Cloudera Agent by running this command on all servers – sudo systemctl restart cloudera-scm-agent
    • Here is the ansible command to restart all Cloudera Agents in one shot – ansible all -i hosts -a "sudo systemctl restart cloudera-scm-agent" --private-key=~/.ssh/google_compute_engine
    • Update Paths: Make sure applications refer to the new location for binaries – /opt/cloudera/parcels/CDH/lib . This applies wherever executables are referenced using a fully qualified path.
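
The Update Paths step can be sketched as a small Python helper that maps a package-era path to its parcel location. The function name and the CDH_COMPONENTS set are hypothetical and only illustrative; the set exists so that unrelated directories under /usr/lib (for example a JDK) are left untouched. Extend it to match what is actually installed on your cluster.

```python
PACKAGE_LIB = "/usr/lib"
PARCEL_LIB = "/opt/cloudera/parcels/CDH/lib"

# Assumption: an illustrative, incomplete list of component directories.
CDH_COMPONENTS = {"hadoop", "hadoop-hdfs", "hadoop-yarn", "spark", "zookeeper"}


def to_parcel_path(path):
    """Rewrite /usr/lib/<component>/... to the parcel location, else return as-is."""
    prefix = PACKAGE_LIB + "/"
    if not path.startswith(prefix):
        return path
    rest = path[len(prefix):]
    component = rest.split("/", 1)[0]
    if component not in CDH_COMPONENTS:
        return path  # not a known CDH component, leave it alone
    return PARCEL_LIB + "/" + rest


for old in ("/usr/lib/spark/bin/spark-submit", "/usr/lib/jvm/java", "/etc/spark/conf"):
    print(old, "->", to_parcel_path(old))
```

A helper like this is handy for scanning scripts or cron entries that hard-code fully qualified paths before the packages are uninstalled.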

Setup Spark 2.3.x

Once Parcels are set up, we can set up Spark 2.3.x on the existing cluster.

  • Download Oracle JDK 1.8 on all the servers – ansible all -i hosts -a " wget --no-check-certificate -c --header 'Cookie: oraclelicense=accept-securebackup-cookie' " --private-key=~/.ssh/google_compute_engine
  • Install Oracle JDK 1.8 on all the servers – ansible all -i hosts -a " rpm -ivh jdk-8u191-linux-x64.rpm " --become --private-key=~/.ssh/google_compute_engine
  • Uninstall JDK 1.7 from all the servers – ansible all -i hosts -a "sudo yum -y remove java-1.7.0-openjdk*" --private-key=~/.ssh/google_compute_engine
  • Also remove Cloudera’s Oracle JDK 1.7 – ansible all -i hosts -a "sudo yum -y remove oracle-j2sdk1.7.x86_64" --private-key=~/.ssh/google_compute_engine
  • Download CSD file – wget
  • Copy CSD file to standard location – sudo cp SPARK2_ON_YARN-2.3.0.cloudera4.jar /opt/cloudera/csd
  • Restart Cloudera Manager – sudo systemctl restart cloudera-scm-server
  • Restart Cloudera Management Service from Cloudera Manager UI
  • Go to Parcels, then Download, Distribute, and Activate the Spark 2 parcel
  • Click on Add Service and add Spark 2 to the existing cluster
  • Both Spark 1.6.x and Spark 2.3.x can co-exist. We can use spark2-shell or pyspark2 and spark2-submit to submit jobs.

Run Spark Jobs – Spark 2.3.x

Spark provides APIs as well as a framework for distributed processing.

  • Even though application development differs between Spark 1.6.x and Spark 2.3.x, deployment does not change much.
  • Instead of spark-submit, we need to use spark2-submit. The same applies to spark-shell and pyspark: we have to use spark2-shell and pyspark2.
  • We can run the same commands we have seen earlier by changing spark-submit to spark2-submit, spark-shell to spark2-shell, and pyspark to pyspark2.
  • The execution life cycle is the same for Spark 1.6.x and Spark 2.3.x.
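
The launcher renaming described above is mechanical, as this small Python sketch shows. The function name to_spark2_command is hypothetical; it only rewrites the first word of a command line and leaves all arguments as they are.

```python
import shlex

# Launcher names as listed above: Spark 1.6.x on the left, Spark 2.3.x on the right.
SPARK2_LAUNCHERS = {
    "spark-submit": "spark2-submit",
    "spark-shell": "spark2-shell",
    "pyspark": "pyspark2",
}


def to_spark2_command(command_line):
    """Swap a Spark 1.6.x launcher for its Spark 2 counterpart, keeping the arguments."""
    argv = shlex.split(command_line)
    if argv and argv[0] in SPARK2_LAUNCHERS:
        argv[0] = SPARK2_LAUNCHERS[argv[0]]
    return " ".join(shlex.quote(part) for part in argv)


print(to_spark2_command("spark-submit --master yarn --deploy-mode client app.py"))
print(to_spark2_command("pyspark --master yarn"))
```

Commands that do not start with one of the three launchers are returned unchanged, so the helper is safe to apply to a whole run guide line by line.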

By this time you should have your cluster running with Parcels along with Zookeeper, HDFS and YARN including High Availability, Spark 1.6.0 and Spark 2.3.0. Also, you should be familiar with all relevant Web UIs for the above services and some of the commands, especially from the administration perspective.

Make sure to stop the services in Cloudera Manager and shut down the servers provisioned from GCP or AWS to conserve credits and control costs.