Exercise 13 - Understanding the word count program

apache-spark
scala
#1

Explain what you understand from this code

sc.textFile("/Users/dgadiraju/Research/data/wordcount.txt").
flatMap(.split(" ")).
map(rec => (rec, 1)).
reduceByKey(
+_).
collect().foreach(println)

Please provide the following details:

  • Data type of RDD after each step
  • Difference between map, flatMap and filter
  • Why have we used flatMap and map?
  • Why have we considered reduceByKey over aggregateByKey and groupByKey?
  • Difference between transformation and action

#2

@itversity This is a great exercise with well-formed questions.

Please find the answers below:

1. Data type of RDD after each step

sc.textFile("/Users/dgadiraju/Research/data/wordcount.txt").
textFile always returns RDD[String], so the resulting RDD contains String records, where each line of the file becomes one record in the RDD.

flatMap(_.split(" ")).
This line in the word count example above also returns an RDD[String], where each record is now a single word.

map(rec => (rec, 1)).
The map function above returns an RDD[(String, Int)].

reduceByKey(_ + _).
The line above in the word count example returns an RDD[(String, Int)].

collect().foreach(println)
In the line above, collect() is an action, so it returns the result (an Array[(String, Int)]) to the driver and not an RDD.


2. Difference between map, flatMap and filter

Although all three are Spark transformations, they solve different problems in practice; a quick sketch follows the definitions.
a) map
It returns a new RDD by applying a function to all elements of the input RDD.
b) flatMap
It returns a new RDD by first applying a function to all elements of the input RDD, and then flattening the results.
c) filter
It returns a new RDD containing only the elements that satisfy a predicate.


3. Why have we used flatMap and map?

In the word count example, we need to split each line by whitespace (" ") to get the words, and we also need to flatten the per-line results to get just one list of all words. For this kind of use case, flatMap works exceptionally well.

In order to tag each word with the integer 1, we used map, as the sketch below shows.
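
A minimal sketch of the two steps, assuming a hypothetical two-line input:

val lines = sc.parallelize(Seq("big data", "big deal"))
lines.map(_.split(" "))                             // RDD[Array[String]] -- nested: one array per line, not what we want
lines.flatMap(_.split(" "))                         // RDD[String] -- flattened: one word per record
lines.flatMap(_.split(" ")).map(word => (word, 1))  // RDD[(String, Int)] -- each word tagged with the integer 1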


4. Why have we considered reduceByKey over aggregateByKey and groupByKey?

reduceByKey and aggregateByKey are similar operations in that both aggregate the values of each key, whereas groupByKey does not aggregate the data; it just groups the values and returns iterables.

reduceByKey is a specific case of aggregateByKey. Although both perform better than groupByKey, using reduceByKey in this case is the obvious choice because it is much simpler to use than aggregateByKey, as the sketch below shows.
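
The same word count written with each operation, as a sketch (assuming mapped is the hypothetical name of the RDD[(String, Int)] produced by the map step above):

val counts1 = mapped.reduceByKey(_ + _)               // simplest: one function, value type unchanged
val counts2 = mapped.aggregateByKey(0)(_ + _, _ + _)  // equivalent here: zero value plus seqOp and combOp
val counts3 = mapped.groupByKey().mapValues(_.sum)    // works, but shuffles every (word, 1) pair first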


5. Difference between transformation and action

Although both transformations and actions are part of Spark RDD operations, they behave quite differently. The primary differences between them are:

  1. Transformations always return an RDD, whereas actions return a result
  2. Transformations are lazily evaluated, whereas actions are evaluated immediately (see the sketch below)
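
A small sketch of the laziness difference (the file name is hypothetical):

val words = sc.textFile("wordcount.txt").flatMap(_.split(" "))  // transformations only: no job is submitted yet
val total = words.count()                                       // action: triggers the job and returns a Long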


#3

# Data type of RDD after each step

scala> val wcData = sc.textFile("hello.txt")

scala> wcData.collect()

16/12/18 03:33:15 INFO DAGScheduler: Job 0 finished: collect at <console>:24, took 0.345433 s
res0: Array[String] = Array(Hi this is spark shell, you are using spark wordcount)

scala> val flatMappedData = wcData.flatMap(_.split(" "))
flatMappedData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23

scala> val mappedData = flatMappedData.map(rec => (rec, 1))
mappedData: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25

scala> val reducedData = mappedData.reduceByKey(_ + _)
reducedData: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:27

scala> reducedData.collect().foreach(println)

(are,1)
(this,1)
(is,1)
(using,1)
(shell,1)
(spark,2)
(you,1)
(wordcount,1)
(Hi,1)


# Difference between map, flatMap and filter

map: It returns a new RDD by applying a function to each element of the RDD. The function in map can return only one item.

flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.
Also, the function in flatMap can return a list of elements (0 or more), as the sketch below shows.

filter: It returns a new RDD containing only the elements that satisfy the predicate passed to filter().
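
A sketch of the "0 or more" point (the sample data is hypothetical):

val nums = sc.parallelize(Seq(1, 2, 3))
nums.map(n => Seq.fill(n)(n)).collect()      // Array(List(1), List(2, 2), List(3, 3, 3)) -- always exactly one output
nums.flatMap(n => Seq.fill(n)(n)).collect()  // Array(1, 2, 2, 3, 3, 3) -- the lists are flattened
nums.flatMap(n => if (n % 2 == 0) Seq(n) else Seq.empty).collect()  // Array(2) -- flatMap can also emit nothing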


# Why have we used flatMap and map?

flatMap initially returns a flattened RDD, which is required to extract the words separated by spaces.

map returns a paired RDD which holds the key-value pairs.


# Why have we considered reduceByKey over aggregateByKey and groupByKey?

groupByKey does not perform any aggregation

reduceByKey returns an aggregated RDD where the type of the output value is the same as that of the input values

aggregateByKey is used when the return type differs from the input type, as the sketch below illustrates
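
A sketch of the type difference (sample data and names hypothetical):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val sums = pairs.reduceByKey(_ + _)                            // RDD[(String, Int)] -- (Int, Int) => Int keeps the type
val sets = pairs.aggregateByKey(Set.empty[Int])(_ + _, _ ++ _) // RDD[(String, Set[Int])] -- output type differs from Int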


# Difference between transformation and action

Transformations create a new RDD from an existing RDD, e.g. map, reduceByKey and filter. Transformations are executed on demand, which means they are computed lazily.

Actions return the final results of RDD computations. An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system.
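
The lineage graph an action will execute can be inspected with toDebugString, e.g. on the reducedData RDD built above:

scala> println(reducedData.toDebugString)  // prints the chain: ShuffledRDD <- MapPartitionsRDD <- ... <- textFile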


#4

Data type of RDD after each step


val wc = sc.textFile("C:/Users/mataraj/Desktop/mr.txt")
wc: org.apache.spark.rdd.RDD[String] = C:/Users/mataraj/Desktop/mr.txt MapPartitionsRDD[1] at textFile at <console>:21

val flatMapRdd = wc.flatMap(a => a.split(" "))
flatMapRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23

val mapRdd = flatMapRdd.map(a => (a, 1))
mapRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25

val count = mapRdd.reduceByKey(_ + _)
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:27

scala> count.collect().foreach(println)
(allows,1)
(scale,1)
(is,2)
(computers,1)
(Apache,1)
(machines,1)
(data,1)
(simple,1)
(using,1)
(servers,1)
(framework,1)
(models.,1)
(offering,1)
(from,1)
(single,1)
(up,1)
(The,1)
(that,1)
(a,1)
(computation,1)
(across,1)
(each,1)
(large,1)
(sets,1)
(to,2)
(processing,1)
(library,1)
(local,1)
(of,3)
(clusters,1)
(It,1)
(for,1)
(software,1)
(thousands,1)
(designed,1)
(programming,1)
(distributed,1)
(and,1)
(storage.,1)
(the,1)
(Hadoop,1)

Difference between map, flatMap and filter

map: map applies a function to each element in the RDD and returns an RDD of the results. It provides a 1-to-1 mapping.
flatMap: flatMap applies a function to each element in the RDD and returns an RDD of the contents of the iterators returned. It provides a 1-to-0-or-many mapping.
filter: filter returns an RDD consisting of only the elements that pass the condition passed to filter().


Why have we used flatMap and map?

First we have used flatMap to get each word of the text file, the words being separated by whitespace, and then map to get a key-value pair ("word", 1), i.e. a 1-to-1 mapping for each word.


Why have we considered reduceByKey over aggregateByKey and groupByKey?

reduceByKey returns an aggregated result where the types of the input and output values are the same. We can use aggregateByKey when we need an output type different from the input type.

When the combiner and reducer functionality is the same we can use reduceByKey, and when the functionality is different we can use aggregateByKey (e.g. to compute an average, as sketched below).

groupByKey: it groups the values with the same key. It does not provide aggregation.
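
A sketch of the average example (data and names hypothetical): the seqOp builds a (sum, count) accumulator and the combOp merges partial accumulators, so the output type (Int, Int) differs from the input type Int.

val scores = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30)))
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the (sum, count) accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge accumulators across partitions
)
val avg = sumCount.mapValues { case (sum, count) => sum.toDouble / count }  // ("a" -> 15.0, "b" -> 30.0)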


Difference between transformation and action

Transformations create a new RDD from an existing RDD. Transformations are lazy evaluations, e.g. map, flatMap.
Actions return output from computations on transformations. When an action is triggered, all the RDDs on which the action depends get evaluated, e.g. collect(), count().


#5

Data type of RDD after each step

val loadFile = sc.textFile("file:///home/jasonbourne/testpartitioner.txt")
loadFile: org.apache.spark.rdd.RDD[String] = file:///home/jasonbourne/testpartitioner.txt MapPartitionsRDD[14] at textFile at <console>:27

val flatMapFile = loadFile.flatMap(rec => rec.split(" "))
flatMapFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at flatMap at <console>:29

val mapFile = flatMapFile.map(rec => (rec, 1))
mapFile: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:31

val reduceFile = mapFile.reduceByKey((acc, value) => (acc + value))
reduceFile: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:33

reduceFile.take(15).foreach(println)
(Bennu.,1)
(Sun,1)
(its,3)
(There,1)
(Lauretta,1)
(Over,1)
(have,2)
(satellites,1)
(Trojan,6)
(two-year,1)
(serves,1)
(behind,1)
(only,2)
(we,2)
(This,1)

Difference between map, flatMap and filter

map: Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap: each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

filter: Return a new dataset formed by selecting those elements of the source on which func returns true.

Why have we used flatMap and map?
flatMap is used to flatten the records after each line is split by whitespace, which returns a sequence of words.
map is used to transform each word into (word, 1), where "word" acts as the key and 1 as the value, so that it can be used when doing the reduceByKey.

Why have we considered reduceByKey over aggregateByKey and groupByKey?
groupByKey: It is more generic and does not use a combiner. It turns (K, V) pairs into (K, Iterable[V]).
aggregateByKey: It uses a combiner and is mostly used when the combiner and reduce logic are different.
reduceByKey: It is used here because it uses a combiner, as the sketch below shows.
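
A sketch of why the combiner matters, using the mapFile RDD built above: reduceByKey combines partial counts on each partition before the shuffle, while groupByKey ships every (word, 1) pair across the network first.

mapFile.reduceByKey(_ + _)             // map-side combine: only partial sums travel over the network
mapFile.groupByKey().mapValues(_.sum)  // no combine: every single pair is shuffled, then summed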

Difference between transformation and action
Transformations return a new RDD (a lazy handle, not computed data) and follow lazy evaluation, whereas actions return a value once performed on an RDD.

0 Likes