Need explanation of the Scala statement below

Tags: scala, spark-shell
#1

Can anyone please explain what is being done in the Scala statement below?

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

The above line is a part of below code snippet

val name = sc.textFile("spark8/data.csv")

val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))

val swapped = namePairRDD.map(_.swap)

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

combinedOutput.repartition(1).saveAsTextFile("spark8/result.txt")


#2

Where did you get it? Can you share the data from spark8/data.csv?


#3

data.csv
1,Lokesh
2,Bhupesh
2,Amit
2,Ratan
2,Dinesh
1,Pavan
1,Tejas
2,Sheela
1,Kumar
1,Venkat


#4

@Tarun_Das: I think this is one of the exercises on hadooppass.com. Based on the problem statement they have given, we need to produce a list of all names grouped by ID.

For that, we can ignore the swap step mentioned as part of the solution; I guess that is a mistake.
Regarding combineByKey():

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

List(_) -> this is the createCombiner function. Whenever a new key is encountered while processing the data from namePairRDD, it wraps that key's first value in a one-element list (List(_) is shorthand for (v: String) => List(v)). So for the input given above, two lists will be started: one for key 1 and another for key 2.

(x: List[String], y: String) => y :: x -> this is the mergeValue function. The accumulator x is a list of strings (x: List[String] is the Scala way of declaring that type), and y is the next value seen for the same key. y :: x (note the two colons) prepends y to the list x, so for every key we keep adding values to its list.
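A minimal sketch of what :: does, using names from the sample data in this thread:

```scala
// Demo of :: (prepend) on an accumulator list, as used in mergeValue.
object PrependDemo {
  val x: List[String] = List("Bhupesh")   // the accumulator so far
  val y: String = "Amit"                  // the next value for the same key
  val merged: List[String] = y :: x       // :: prepends, so the newest value is first

  def main(args: Array[String]): Unit =
    println(merged)                       // List(Amit, Bhupesh)
}
```

Note that because :: prepends, values end up in reverse order of arrival within each list.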

(x: List[String], y: List[String]) => x ::: y -> this is the mergeCombiners function, a kind of reducer that is applied to the intermediate lists produced in the previous step on different partitions. Here both x and y are lists of strings. x ::: y (note the three colons) is the Scala way of concatenating two lists, so all the lists built for a key are merged into a single list.
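The three functions can be sketched in plain Scala, with no Spark needed. This is only an illustration of the semantics (folding pairs the way combineByKey would within one partition), using a subset of the sample data shared in this thread:

```scala
// Pure-Scala sketch of combineByKey's three functions.
object CombineSketch {
  // createCombiner: the first value for a key becomes a one-element list
  val createCombiner: String => List[String] = List(_)
  // mergeValue: prepend the next value onto the per-key list
  val mergeValue: (List[String], String) => List[String] = (x, y) => y :: x
  // mergeCombiners: concatenate lists built on different partitions
  val mergeCombiners: (List[String], List[String]) => List[String] = (x, y) => x ::: y

  def main(args: Array[String]): Unit = {
    val pairs = List(("1", "Lokesh"), ("2", "Bhupesh"), ("2", "Amit"), ("1", "Pavan"))
    // Fold the pairs the way combineByKey would within a single partition
    val combined = pairs.foldLeft(Map.empty[String, List[String]]) {
      case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }
    println(combined("1")) // List(Pavan, Lokesh)
    println(combined("2")) // List(Amit, Bhupesh)
    // mergeCombiners joins per-key lists coming from two partitions
    println(mergeCombiners(List("Tejas"), List("Kumar", "Venkat")))
  }
}
```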

@itversity: Please correct me if I am wrong.


#5

Ok, the question does not make much sense. I do not think the official exam will have this kind of question, especially on combineByKey, which is not part of the official Spark documentation.


#6

I am not sure where these questions came from, because we are a group of friends and we share questions and challenge each other :slight_smile:


#7

But thanks a lot for the explanation, God bless you.


#8

Sir,

Below is the exact question:

You have been given a file named spark8/data.csv (type,name).

1. Load this file from HDFS and save it back as (id, (all names of the same type)) in a results directory.
However, make sure that while saving, it writes to a single file.

How would I be able to achieve this without combineByKey? Thanks in advance.
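One common alternative is groupByKey, which produces the same per-key grouping. In spark-shell it would look roughly like the (untested) sketch in the comments below; the same idea on plain Scala collections is shown runnable underneath:

```scala
// In spark-shell (sketch, not run against a real cluster):
//   val name = sc.textFile("spark8/data.csv")
//   val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
//   val grouped = namePairRDD.groupByKey().mapValues(_.toList)
//   grouped.repartition(1).saveAsTextFile("spark8/result")
//
// Equivalent grouping on plain Scala collections:
object GroupSketch {
  // Parse "id,name" lines and group names by id, mirroring
  // namePairRDD.groupByKey().mapValues(_.toList) in Spark.
  def groupNames(lines: List[String]): Map[String, List[String]] =
    lines
      .map { line =>
        val cols = line.split(",")
        (cols(0), cols(1))
      }
      .groupBy(_._1)                                   // like groupByKey
      .map { case (k, pairs) => k -> pairs.map(_._2) } // keep only the names

  def main(args: Array[String]): Unit = {
    val sample = List("1,Lokesh", "2,Bhupesh", "2,Amit", "1,Pavan")
    println(groupNames(sample))
  }
}
```

Note that groupByKey shuffles all values, whereas combineByKey (and aggregateByKey) can combine values on the map side first; for this small exercise either approach works.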


#9

If possible, please include me in that group…
What is needed? A mobile number or mail ID?


#10

No no, we just sit in somebody's house; no group email or anything, we are not so sophisticated :slight_smile:
