
Quick Review Of APIs

Let us have a quick review of the Core APIs that are available in Spark. We will cover the DataFrame APIs and Spark SQL at a later point in time.

  • SparkContext exposes APIs such as textFile and sequenceFile to read data from files into a distributed collection called an RDD (see the first sketch after this list).
  • RDD stands for Resilient Distributed Dataset and it is nothing but a distributed collection.
  • It is typically loaded onto the executors created at the time of execution.
  • RDD exposes APIs categorized as Transformations and Actions.
  • Transformations take one RDD as input and return another RDD as output, while Actions trigger execution and bring data back to the driver program.
  • Examples of Transformations (illustrated in the second sketch after this list)
    • Row Level Transformations - map, filter, flatMap, etc.
    • Aggregations - reduceByKey, aggregateByKey
    • Joins - join, leftOuterJoin, rightOuterJoin
    • Sorting - sortByKey
    • Ranking - groupByKey followed by flatMap with a lambda function
    • Except for Row Level Transformations, most other transformations have to go through a shuffle phase and trigger a new stage.
    • Row Level Transformations are also known as narrow transformations.
    • Transformations that trigger a shuffle and a new stage are also called wide transformations.
  • Examples of Actions (see the final sketch after this list)
    • Preview Data: take, takeSample, top, takeOrdered
    • Convert into a Python List: collect
    • Total Aggregation: reduce
    • Writing into Files: saveAsTextFile, saveAsSequenceFile
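Here is a minimal PySpark sketch of reading a file into an RDD using SparkContext. The application name, the local master URL, and the file path /tmp/orders.txt are illustrative assumptions, not from this article.

    from pyspark import SparkConf, SparkContext

    # Build a SparkContext; app name and local[2] master are placeholders
    conf = SparkConf().setAppName("Quick Review of APIs").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # textFile returns an RDD of strings, one element per line in the file
    orders = sc.textFile("/tmp/orders.txt")
    print(orders.first())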
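The next sketch demonstrates the transformations listed above on a tiny in-memory dataset. The records (order_id,status pairs) are made up for illustration, sc.parallelize stands in for data read from files, and sc is the SparkContext from the previous sketch.

    # Hypothetical records in the form "order_id,status"
    orders = sc.parallelize(["1,CLOSED", "2,COMPLETE", "3,CLOSED", "4,PENDING"])

    # Row level (narrow) transformations: no shuffle, stay in the same stage
    closed = orders.filter(lambda line: line.endswith("CLOSED"))
    pairs = orders.map(lambda line: (line.split(",")[1], 1))
    tokens = orders.flatMap(lambda line: line.split(","))

    # Aggregation (wide) transformation: triggers a shuffle and a new stage
    count_by_status = pairs.reduceByKey(lambda total, one: total + one)

    # Sorting, also a wide transformation
    sorted_counts = count_by_status.sortByKey()

    # Join between two pair RDDs on the key (status)
    lookup = sc.parallelize([("CLOSED", "no further updates"),
                             ("PENDING", "awaiting payment")])
    joined = count_by_status.join(lookup)

    # Ranking pattern: groupByKey followed by flatMap with a lambda function
    # (here, the highest order id per status)
    id_by_status = orders.map(lambda line: (line.split(",")[1],
                                            int(line.split(",")[0])))
    top_per_status = id_by_status.groupByKey().flatMap(
        lambda kv: [(kv[0], v) for v in sorted(kv[1], reverse=True)[:1]])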
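Finally, a sketch of the actions, continuing from the RDDs built above. All transformations so far are lazy; only these actions trigger execution. The output directory /tmp/count_by_status is a placeholder, and saveAsTextFile fails if that directory already exists.

    # Preview data without pulling the whole RDD to the driver
    print(count_by_status.take(2))
    print(count_by_status.top(1))

    # collect brings the entire RDD to the driver as a Python list
    print(count_by_status.collect())

    # Total aggregation across all elements
    print(pairs.values().reduce(lambda x, y: x + y))

    # Writes one output file per partition under the given directory
    count_by_status.saveAsTextFile("/tmp/count_by_status")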
