Quick Review Of APIs
Let us have a quick review of the Core APIs available in Spark. We will cover Data Frame APIs and Spark SQL at a later point in time.
- SparkContext exposes APIs such as textFile and sequenceFile to read data from files into a distributed collection called an RDD.
- RDD stands for Resilient Distributed Dataset and it is nothing but a distributed collection.
- It is typically loaded onto the executors created at the time of execution.
- RDD exposes APIs called Transformations and Actions.
- Transformations take one RDD as input and return another RDD as output, while Actions trigger execution and bring data into the driver program.
- Examples of Transformations
    - Row Level Transformations - map, filter, flatMap etc.
    - Aggregations - reduceByKey, aggregateByKey
    - Joins - join, leftOuterJoin, rightOuterJoin
    - Sorting - sortByKey
    - Ranking - groupByKey followed by flatMap with a lambda function
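As a plain-Python sketch of what these transformations compute (PySpark itself evaluates them lazily across partitions; the input data here is illustrative, and the PySpark equivalents appear in the comments):

```python
from collections import defaultdict

lines = ["spark makes rdds", "rdds are distributed"]  # illustrative input

# Row-level (narrow) transformations operate on each record independently
words = [w for line in lines for w in line.split()]   # rdd.flatMap(lambda l: l.split())
pairs = [(w, 1) for w in words]                       # rdd.map(lambda w: (w, 1))
long_words = [w for w in words if len(w) > 4]         # rdd.filter(lambda w: len(w) > 4)

# Aggregation (wide): reduceByKey merges values per key, which needs a shuffle
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n                                    # rdd.reduceByKey(lambda a, b: a + b)

# Sorting (wide): sortByKey orders records by key after a shuffle
by_key = sorted(counts.items())                       # countsRdd.sortByKey()
print(by_key)
```

In Spark the same lambdas are passed to the RDD methods shown in the comments; the narrow steps stay within a stage, while reduceByKey and sortByKey each start a new one.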
- Except for Row Level Transformations, most of the other transformations have to go through a shuffle phase and trigger a new stage.
- Row Level Transformations are also known as Narrow Transformations.
- Transformations that trigger a shuffle and a new stage are also called Wide Transformations.
- Example of Actions
    - Preview Data: take, takeSample, top, takeOrdered
    - Convert into Python List: collect
    - Total Aggregation: reduce
    - Writing into Files: saveAsTextFile, saveAsSequenceFile
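The file-writing actions aside, the others can be mirrored in plain Python the same way (the dataset is illustrative; the PySpark calls are in the comments):

```python
import heapq
from functools import reduce

nums = [5, 3, 8, 1, 9, 2]  # illustrative dataset

# Preview actions pull only a few records back to the driver
first_three = nums[:3]                     # rdd.take(3) -- first elements, no total ordering
top_two = heapq.nlargest(2, nums)          # rdd.top(2) -- largest elements, descending
bottom_two = heapq.nsmallest(2, nums)      # rdd.takeOrdered(2) -- smallest elements, ascending

# collect brings the entire dataset into a Python list on the driver
as_list = list(nums)                       # rdd.collect()

# reduce performs a total aggregation and returns a single value
total = reduce(lambda a, b: a + b, nums)   # rdd.reduce(lambda a, b: a + b)
print(first_three, top_two, bottom_two, total)
```

Unlike transformations, each of these triggers execution immediately; collect should be used with care, since it loads the whole dataset into the driver's memory.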