Question about RDD cache after Shuffle transformation

apache-spark
rdd-api

#1

Hi ItVersity,

I am trying to understand whether Spark automatically caches an RDD when a transformation causes a shuffle.

Let me give an example.

Here is a sample text file format:

CustomerID,CustomerName,State,Income
1,'SampleFirstNameRow1','TX',40000
2,'SampleFirstNameRow2','GA',50000… and so on

Let's assume I have a program that does two things:

  1. Count of customers by State
  2. Customer having highest income by State

val v = sc.textFile("")
v.first
// Build (State, Income) key-value pairs, splitting each row only once
val VKeyVal = v.map { r => val f = r.split(","); (f(2), f(3).toDouble) }
// e.g. ('TX',40000)
val v_GBK = VKeyVal.groupByKey
// My first output
val v_Cust_Per_State = v_GBK.map(r => (r._1, r._2.size))
v_Cust_Per_State.take(10).foreach(println)
// My second output
val v_HighestIncomeByState = // Code goes here
v_HighestIncomeByState.take(10).foreach(println)
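For illustration only, the second output could be derived from the same grouped RDD roughly as follows. This is a minimal sketch meant to be pasted into spark-shell, not the code from the original program: the inline rows, the app name, and the `local[2]` master are hypothetical stand-ins for the real text file and cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local setup; in spark-shell getOrCreate returns the existing context
val conf = new SparkConf().setAppName("highest-income-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical rows standing in for the text file (unquoted fields for simplicity)
val v = sc.parallelize(Seq(
  "1,SampleFirstNameRow1,TX,40000",
  "2,SampleFirstNameRow2,GA,50000",
  "3,SampleFirstNameRow3,TX,65000"
))

// (State, Income) pairs, then one shuffle to group incomes by state
val VKeyVal = v.map { r => val f = r.split(","); (f(2), f(3).toDouble) }
val v_GBK = VKeyVal.groupByKey

// First output: number of customers per state
val v_Cust_Per_State = v_GBK.mapValues(_.size)

// Second output: highest income per state, built on the same grouped RDD
val v_HighestIncomeByState = v_GBK.mapValues(_.max)

// Collect both results to the driver as plain maps
val custPerState = v_Cust_Per_State.collect().toMap
val highestIncomeByState = v_HighestIncomeByState.collect().toMap
```

Because both outputs start from `v_GBK`, they share one shuffle. Writing the second output as `VKeyVal.reduceByKey((a, b) => math.max(a, b))` would avoid building the per-state groups, but it defines a separate shuffle of its own.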

Question 1: I haven't cached v_GBK. Is Spark going to cache it internally?
Question 2: If I write code to find the customer with the highest income by State, is it going to start from the very beginning?

When I checked the stages created for the v_HighestIncomeByState output in the Spark UI, it looked as though Spark had cached the shuffled RDD and skipped the task that reads the RDD from disk. Does that mean Spark caches a shuffled RDD? I checked the “Storage” tab of the Spark UI and did not find anything there.

Thanks in Advance.



#2

This is what the documentation says:

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

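It says “may be recomputed”, which means that unless you persist the RDD, triggering another action might re-execute the DAG from the beginning. If you run the next action right away you might not see that happening: Spark keeps the shuffle output files from the earlier job on local disk, so the stages before the shuffle can show up as skipped in the Spark UI. That is not the same as caching, which is why nothing appears under “Storage”. If you persist or cache the RDD, the data is kept in memory or at another storage level, and the RDD will not be recomputed when it is used multiple times in the same application.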
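To make the difference concrete, here is a minimal sketch of persisting the grouped RDD explicitly and checking its storage level before and after. It is meant for spark-shell; the sample pairs, the app name, and the `local[2]` master are assumptions made for illustration, and `MEMORY_ONLY` is just one possible storage level.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical local setup; in spark-shell getOrCreate returns the existing context
val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical (State, Income) pairs standing in for the parsed file
val pairs = sc.parallelize(Seq(("TX", 40000.0), ("GA", 50000.0), ("TX", 65000.0)))
val grouped = pairs.groupByKey()

// A shuffle alone does not mark the RDD as persisted
val levelBefore = grouped.getStorageLevel

// Ask Spark to keep the grouped partitions in memory (cache() is equivalent)
grouped.persist(StorageLevel.MEMORY_ONLY)
val levelAfter = grouped.getStorageLevel
val registered = sc.getPersistentRDDs.contains(grouped.id)

// The first action materializes the persisted partitions; the second reuses them
val countsByState = grouped.mapValues(_.size).collect().toMap
val maxByState = grouped.mapValues(_.max).collect().toMap

// Release the cached partitions once they are no longer needed
grouped.unpersist()
```

Here `levelBefore` is `StorageLevel.NONE` and `levelAfter` is `StorageLevel.MEMORY_ONLY`. Only an RDD persisted this way should be listed under “Storage” in the Spark UI, once an action has materialized it.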