Need explanation of the Scala statement below

Can anyone please explain the Scala statement below? What is being done here?

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

The above line is part of the code snippet below:

val name = sc.textFile("spark8/data.csv")

val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))

val swapped = namePairRDD.map(item => item.swap)

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

combinedOutput.repartition(1).saveAsTextFile("spark8/result.txt")

Where did you get it from? Can you share the data from spark8/data.csv?

data.csv
1,Lokesh
2,Bhupesh
2,Amit
2,Ratan
2,Dinesh
1,Pavan
1,Tejas
2,Sheela
1,Kumar
1,Venkat

@Tarun_Das : I think this is one of the exercises on hadooppass.com. Based on the problem statement they have given, we need to produce a list of all names for each ID.

For that we can ignore the swap function mentioned as part of the solution. That is a mistake, I guess.
Regarding combineByKey():

val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)

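List(_) -> this is the createCombiner function. It does not create an empty list: whenever a key is encountered for the first time (within a partition), it wraps that key's first value in a one-element list. So for the input given above, one list is started for key 1 and another for key 2 in each partition where the key appears.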

(x:List[String],y:String) => y::x -> this is the mergeValue function. The accumulator x is a List[String] (the Scala way of writing a list of strings) and y is a String, the next value seen for the same key. y::x (note the two colons) prepends y to the list x. So, for every key, each new value is added to the front of that key's list.

(x:List[String],y:List[String]) => x:::y -> this is the mergeCombiners function, a kind of reducer that is applied to the intermediate lists built in the previous step. Here both x and y are lists of strings, built for the same key in different partitions. x:::y (note the three colons) is the Scala way of concatenating two lists. So we combine all the lists produced for a key into a single list.
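
To make the three steps concrete, here is a rough sketch in plain Scala collections (no Spark needed) that mimics what combineByKey does with the data.csv rows above. The helpers combinePartition and mergePartitions are hypothetical names for illustration only, not Spark APIs, and the split of the ten rows into two partitions is an assumption; Spark decides the real partitioning.

```scala
// Plain-Scala imitation of combineByKey; the helper names are hypothetical.
object CombineByKeySketch {
  // The three functions from the statement under discussion.
  val createCombiner: String => List[String] = List(_)
  val mergeValue: (List[String], String) => List[String] = (x, y) => y :: x
  val mergeCombiners: (List[String], List[String]) => List[String] = (x, y) => x ::: y

  // Hypothetical helper: fold one partition's records into per-key lists.
  def combinePartition(records: Seq[(String, String)]): Map[String, List[String]] =
    records.foldLeft(Map.empty[String, List[String]]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None       => acc + (k -> createCombiner(v))   // first value for this key
        case Some(list) => acc + (k -> mergeValue(list, v)) // later values are prepended
      }
    }

  // Hypothetical helper: merge the per-partition results key by key.
  def mergePartitions(parts: Seq[Map[String, List[String]]]): Map[String, List[String]] =
    parts.foldLeft(Map.empty[String, List[String]]) { (acc, part) =>
      part.foldLeft(acc) { case (m, (k, list)) =>
        m + (k -> m.get(k).map(existing => mergeCombiners(existing, list)).getOrElse(list))
      }
    }

  // Assumption: the ten rows of data.csv happen to land in two partitions like this.
  val partition1 = Seq("1" -> "Lokesh", "2" -> "Bhupesh", "2" -> "Amit", "2" -> "Ratan", "2" -> "Dinesh")
  val partition2 = Seq("1" -> "Pavan", "1" -> "Tejas", "2" -> "Sheela", "1" -> "Kumar", "1" -> "Venkat")

  val result: Map[String, List[String]] =
    mergePartitions(Seq(combinePartition(partition1), combinePartition(partition2)))

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Because y::x prepends, the names inside each partition come out in reverse order of arrival. In real Spark the order in which partitions are merged is not guaranteed, so only the set of names per key should be relied on, not their order.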

@itversity : Please correct me if I am wrong.

OK, the question does not make much sense. I do not think the official exam will have this kind of question, especially on combineByKey, which is not listed among the transformations in the official Spark programming guide.

I am not sure where these questions came from, because we are a group of friends who share questions and challenge each other :slight_smile:

But thanks a lot for the explanation, God bless you.

Sir,

Below is the exact question:

You have been given a file named spark8/data.csv (type,name).

1. Load this file from HDFS and save it back as (id,(all names of same type)) in the results directory.
However, make sure that while saving, the output is written to a single file.

How would I be able to achieve this without combineByKey? Thanks in advance.
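
One possible way to do this without combineByKey, as a minimal sketch: groupByKey collects all names for a type, and coalesce(1) leaves a single partition so that only one part file is written. The object name GroupNamesByType, the helper groupNames, the local master setting, and the output directory spark8/result are all assumptions made for illustration; in spark-shell you would reuse the existing sc instead of creating a new context.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupNamesByType {
  // Hypothetical helper: gather all names per type without combineByKey.
  def groupNames(pairs: RDD[(String, String)]): RDD[(String, List[String])] =
    pairs.groupByKey().mapValues(_.toList) // Iterable[String] -> List[String]

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick trial run.
    val conf = new SparkConf().setAppName("GroupNamesByType").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each line of data.csv has the form "type,name".
    val namePairRDD = sc.textFile("spark8/data.csv").map { line =>
      val fields = line.split(",")
      (fields(0), fields(1))
    }

    // coalesce(1) reduces the output to one partition, hence one part file.
    // Assumption: "spark8/result" is a hypothetical output directory.
    groupNames(namePairRDD).coalesce(1).saveAsTextFile("spark8/result")

    sc.stop()
  }
}
```

Note that saveAsTextFile always creates a directory; with a single partition it contains one part-00000 file. The order of names within each list is not guaranteed.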

If possible, please include me in that group…
What is needed? Mobile number or mail ID?

No no, we just sit in somebody's house; no group email or anything. We are not so sophisticated :slight_smile: