Pyspark save text file for each group and the name of the file should reflect the group name (salary)

pyspark
#1

I need to save an RDD as a text file for each group, with the file name tagged with one of the RDD's columns. For example, if I group by salary, then the file name should contain the salary. I was able to produce an individual file per group, but not to add the salary value to the file name.

EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat

EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000

Now write Spark code which loads the above files, joins them, and produces the name and salary values.
Then save the data in multiple files grouped by salary (meaning each file holds the names of the employees with the same salary). Make sure the file name includes the salary as well.

empnamecsv= sc.textFile("/user/cloudera/spark33/EmployeeName.csv")

emprdd= empnamecsv.map(lambda m: (m.split(",")[0],m.split(",")[1]))

empsalcsv= sc.textFile("/user/cloudera/spark33/EmployeeSalary.csv")

empsalrdd = empsalcsv.map(lambda m: (m.split(",")[0],m.split(",")[1]))

empjoinsal = emprdd.join(empsalrdd)

empsalname = empjoinsal.map(lambda x: (x[1][1], x[1][0])).sortByKey()

empsalname.saveAsTextFile("/user/cloudera/spark32/employee32")

I was able to generate an individual file for each group, but the file name should also contain the salary; I was not able to include that part.

output
[cloudera@quickstart spark33]$ hdfs dfs -cat /user/cloudera/spark32/employee32/part-00000
(u'10000', u'Venkat')
(u'10000', u'Sheela')
(u'10000', u'Kumar')
[cloudera@quickstart spark33]$ hdfs dfs -cat /user/cloudera/spark32/employee32/part-00001
(u'45000', u'Pavan')
(u'45000', u'Amit')
(u'45000', u'Ratan')
[cloudera@quickstart spark33]$ hdfs dfs -cat /user/cloudera/spark32/employee32/part-00002
(u'50000', u'Tejas')
(u'50000', u'Bhupesh')
(u'50000', u'Dinesh')
(u'50000', u'Lokesh')

Solution required for: make sure the file name includes the salary as well.



#2

I found a solution for this:

empnamecsv= sc.textFile("/user/cloudera/spark33/EmployeeName.csv")

emprdd= empnamecsv.map(lambda m: (m.split(",")[0],m.split(",")[1]))

empsalcsv= sc.textFile("/user/cloudera/spark33/EmployeeSalary.csv")

empsalrdd = empsalcsv.map(lambda m: (m.split(",")[0],m.split(",")[1]))

empjoinsal = emprdd.join(empsalrdd)

empsalname=empjoinsal.map(lambda x: (x[1][1], x[1][0]))

empfinal=empsalname.groupByKey().map(lambda x: (x[0],list(x[1]))).collect()

empoutput = sc.parallelize(empfinal)

empoutput.saveAsTextFile("/user/cloudera/spark33/outputgrouplist")
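Since `empfinal` is already a collected Python list of (salary, [names]) pairs on the driver, another option is plain Python file I/O, which lets you name each file after its salary directly. A minimal sketch, using inline stand-in data and a temporary local directory (illustrative: names like `employees_<salary>.txt` and the directory are assumptions, and on a cluster you would write to HDFS instead):

```python
import os
import tempfile

# Stand-in for the collected `empfinal` list: one (salary, [names]) pair
# per group, matching the output shown earlier in the thread.
empfinal = [("10000", ["Venkat", "Sheela", "Kumar"]),
            ("45000", ["Pavan", "Amit", "Ratan"]),
            ("50000", ["Tejas", "Bhupesh", "Dinesh", "Lokesh"])]

out_dir = tempfile.mkdtemp()  # illustrative; any writable directory works

# The data is small and already on the driver, so ordinary file writes
# can put the salary into each file's name.
for salary, names in empfinal:
    path = os.path.join(out_dir, "employees_%s.txt" % salary)
    with open(path, "w") as f:
        f.write("\n".join(names) + "\n")
```

This only works because `collect()` brought every group to the driver; for data too large to collect, the filter-and-save-per-key pattern (or a custom Hadoop output format) is the usual route.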
