Exercise 04 - Deploy and run a word count application on the cluster

Resources:

  • Click here for $35 coupon for CCA 175 Spark and Hadoop Developer using Python.
  • Click here for $35 coupon for CCA 175 Spark and Hadoop Developer using Scala.
  • Click here for $25 coupon for HDPCD:Spark using Python.
  • Click here for $25 coupon for HDPCD:Spark using Scala.
  • Click here for access to state of the art 13 node Hadoop and Spark Cluster

Description

  • This exercise is intended to help you understand the job execution cycle of a Spark application on the cluster
  • Data location in HDFS: /public/randomtextwriter
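Before submitting the job, it may help to confirm the input data exists and gauge its size; a minimal sketch using standard HDFS shell commands, run on a cluster gateway node (path as given above):

```shell
# List the input directory referenced in this exercise
hdfs dfs -ls /public/randomtextwriter

# Check the total size in human-readable form
# (useful for anticipating the number of input splits, and hence tasks)
hdfs dfs -du -s -h /public/randomtextwriter
```

These commands assume HDFS client binaries are on the PATH and you have read access to the directory.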

Problem Statement

  • Run the jar file on the cluster in YARN mode using spark-submit (reference: https://spark.apache.org/docs/1.6.2/submitting-applications.html)
  • Read data from HDFS and write the output back to HDFS
  • Validate the output directory to make sure the data is stored in HDFS
  • Determine the number of stages used to run the job
  • Determine the number of executors used to run the job
  • Determine the number of tasks used to run each stage
  • Determine how many tasks were run by each executor in each stage
  • Change the number of executors and observe the behavior
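The steps above can be sketched as a single spark-submit invocation followed by a validation check. This is a sketch, not the exercise's exact command: the jar name (`wordcount.jar`), class name (`WordCount`), and output path are placeholders you should replace with your own values.

```shell
# Submit the word count jar to the cluster in YARN mode.
# --num-executors is the knob to change when observing how
# executor count affects task distribution across stages.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class WordCount \
  --num-executors 4 \
  wordcount.jar \
  /public/randomtextwriter \
  /user/$USER/wordcount_output

# Validate that the output directory exists and contains part files
# (a _SUCCESS marker indicates the job completed normally)
hdfs dfs -ls /user/$USER/wordcount_output
```

The number of stages, executors, and tasks per stage can be observed in the Spark application UI while the job runs (or in the Spark History Server after it completes); rerun with a different `--num-executors` value and compare.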