#Data type of RDD after each step
scala> val wcData = sc.textFile("hello.txt")
wcData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:24

scala> wcData.collect
16/12/18 03:33:15 INFO DAGScheduler: Job 0 finished: collect at <console>:24, took 0.345433 s
res0: Array[String] = Array(Hi this is spark shell, you are using spark wordcount)
scala> val flatMappedData = wcData.flatMap(_.split(" "))
flatMappedData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at flatMap at <console>:23
scala> val mappedData = flatMappedData.map(rec => (rec, 1))
mappedData: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD at map at <console>:25
scala> val reducedData = mappedData.reduceByKey(_ + _)
reducedData: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:27
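The same pipeline can be sketched with plain Scala collections (no Spark required), which makes the type after each step explicit. The input lines are the two lines from hello.txt above; the collection types stand in for the RDD types shown in the transcript.

```scala
// Stand-in for sc.textFile("hello.txt"): the two lines of the input file
val wcData: Seq[String] = Seq("Hi this is spark shell", "you are using spark wordcount")

// flatMap: one line in, many words out -> Seq[String]
val flatMappedData: Seq[String] = wcData.flatMap(_.split(" "))

// map: each word becomes a (word, 1) pair -> Seq[(String, Int)]
val mappedData: Seq[(String, Int)] = flatMappedData.map(rec => (rec, 1))

// reduceByKey(_ + _) equivalent on collections: group by word, then sum the counts
val reducedData: Map[String, Int] =
  mappedData.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2).sum }

println(reducedData("spark")) // "spark" appears once in each line, so 2
```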
#Difference between map, flatMap and filter
map: returns a new RDD by applying a function to each element of the RDD. The function in map returns exactly one item per input element.
flatMap: similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened. The function in flatMap can return 0 or more elements per input element.
filter: returns a new RDD containing only the elements that satisfy a predicate.
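The three operations behave the same way on plain Scala collections, which makes the difference easy to see in a small sketch:

```scala
val nums = List(1, 2, 3)

// map: exactly one output element per input element
val doubled = nums.map(_ * 2)                      // List(2, 4, 6)

// flatMap: each element may produce 0 or more elements; the result is flattened
val expanded = nums.flatMap(n => List.fill(n)(n))  // List(1, 2, 2, 3, 3, 3)

// filter: keeps only the elements satisfying the predicate
val evens = nums.filter(_ % 2 == 0)                // List(2)
```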
#Why have we used flatMap and map?
flatMap first returns a flattened collection of words, which is needed to split each line on spaces and extract the individual words.
map then returns a pair RDD holding the (word, 1) key-value pairs.
#Why have we chosen reduceByKey over aggregateByKey and groupByKey?
groupByKey performs no map-side aggregation: every (key, value) pair is shuffled across the network before grouping.
reduceByKey combines values locally on each partition before shuffling, and returns an aggregated RDD whose output value type is the same as the input value type.
aggregateByKey is used when the output value type differs from the input value type.
#Difference between transformation and action
Transformations create a new RDD from an existing RDD, e.g. map, reduceByKey and filter. Transformations are executed on demand, i.e. they are computed lazily.
Actions return the final results of RDD computations. An action triggers execution using the lineage graph: it loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system.
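Lazy evaluation can be demonstrated without Spark using a Scala Iterator, whose map is also lazy: building the mapped view (like a transformation) computes nothing, and only materializing the result (like an action) forces the work. This is an analogy for the transformation/action split, not Spark code.

```scala
var computed = 0

// Like a transformation: building the mapped view runs no element yet
val lazyMapped = List(1, 2, 3).iterator.map { n => computed += 1; n * 2 }
println(computed) // 0 -- nothing has been evaluated so far

// Like an action: materializing the result forces the computation
val result = lazyMapped.toList
println(computed) // 3 -- all three elements were computed now
println(result)   // List(2, 4, 6)
```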