I am trying to understand: does Spark automatically cache an RDD when a transformation causes a shuffle?
Let me give an example.
Here is a sample line from the text file:
2,'SampleFirstNameRow2','GA',50000 ... and so on
Let's assume I have a program that does two things:
- Count of customers by State
- Customer having highest income by State
val v = sc.textFile("") // path to the input file omitted
// build a (state, income) key-value pair RDD; split each line once
val VKeyVal = v.map { r =>
  val cols = r.split(",")
  (cols(2), cols(3).toDouble)
}
val v_GBK = VKeyVal.groupByKey()
val v_Cust_Per_State = v_GBK.map(r => (r._1, r._2.size))
val v_HighestIncomeByState = // code goes here
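For completeness, here is one way the missing step could look. This is only a sketch of what I have in mind: it reuses v_GBK so that both outputs depend on the same shuffle, and since the pair RDD only carries (state, income), it finds the highest income per state rather than the customer's name.

// one possible sketch: highest income per state, reusing the grouped RDD
val v_HighestIncomeByState = v_GBK.map(r => (r._1, r._2.max))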
Question 1: I haven't cached v_GBK. Is Spark going to cache it internally?
Question 2: If I now write the code to find the customer with the highest income by state, is the job going to recompute everything from the very beginning?
When I checked the stages created for the v_HighestIncomeByState output in the Spark UI, it looked as though Spark had cached the shuffled RDD: it skipped the tasks that re-read the RDD from disk. Does that mean Spark caches shuffled RDDs? I checked the "Storage" tab of the Spark UI and did not find anything there.
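For reference, this is how I would cache it explicitly if I wanted it to show up in the Storage tab (a minimal sketch reusing v_GBK from above; the count() is just an action to materialize the cache):

// explicit caching: only persisted/cached RDDs appear in the Storage tab
val v_GBK_cached = v_GBK.cache() // or .persist(StorageLevel.MEMORY_ONLY)
v_GBK_cached.count()             // running an action materializes the cache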
Thanks in advance.