PySpark: zip two RDDs

pyspark
#1

Header RDD: [u'id', u'topic', u'hits']

userdata.collect()
[[u'Rahul', u'scala', u'120'], [u'Nikita', u'spark', u'80'], [u'Mithun', u'spark', u'1'], [u'myself', u'cca175', u'180']]

I want to pair the header with each row's values using zip. Below is the sample Scala code; I need a similar solution in PySpark.

The detailed question and a solution in Scala are provided below. I am looking for a similar PySpark solution for step 5 of the Scala code.

Write Spark code that removes the header and creates an RDD of maps as below, one for each row. Also, if the id is 'myself', filter out that row.
Map(id -> Rahul, topic -> scala, hits -> 120)

Input data
id,topic,hits
Rahul,scala,120
Nikita,spark,80
Mithun,spark,1
myself,cca175,180
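
For this input, after dropping the header and the 'myself' row, the result should contain one map per remaining row, along these lines (shown as Python dicts since PySpark is the target; this just restates the expected data):

{'id': 'Rahul', 'topic': 'scala', 'hits': '120'}
{'id': 'Nikita', 'topic': 'spark', 'hits': '80'}
{'id': 'Mithun', 'topic': 'spark', 'hits': '1'}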

Solution in Scala

step 1: val csv = sc.textFile("spark6/user.csv")

step 2: val headerAndRows = csv.map(line => line.split(",").map(_.trim))

step 3: val header = headerAndRows.first

step 4: val data = headerAndRows.filter(_(0) != header(0))

step 5: val maps=data.map(splits=>header.zip(splits.toMap)

step 6: val result = maps.filter(map => map("id") != "myself")

step 7: result.saveAsTextFile("spark8/result.txt")

My PySpark attempt so far

empcsv = sc.textFile("/user/cloudera/spark34/user.csv")

headerandata = empcsv.map(lambda d: d.split(","))

header = headerandata.first()

userdata = headerandata.filter(lambda x: x != header)

I need a PySpark equivalent of step 5 in the Scala solution above.



#2

@balanightingale1994

The step 5 line above is syntactically wrong: the closing parenthesis is unbalanced, and .toMap is applied to splits instead of to the zipped pairs. I assume the functionality you wanted to achieve is

val maps = data.map(splits => (header zip splits).toMap)

Now, for the problem you have given, the code for the same in pyspark is as mentioned below.

csv = sc.textFile("Input File Path")
headerandata = csv.map(lambda d: d.split(","))
header = headerandata.first()
userdata = headerandata.filter(lambda x: x != header)
maps = userdata.map(lambda splits: zip(header, splits))
map1 = maps.map(lambda x: dict(x))
result = map1.filter(lambda x: x['id'] != 'myself')
result.saveAsTextFile("Output Directory Path")
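
To see what the zip-and-dict step produces, here is a quick plain-Python illustration using the header and one row from your data (a minimal sketch; no Spark needed for this part):

header = ['id', 'topic', 'hits']
row = ['Rahul', 'scala', '120']

# zip pairs each header field with the row value at the same position
pairs = zip(header, row)    # ('id', 'Rahul'), ('topic', 'scala'), ('hits', '120')

# dict turns those pairs into the map structure the question asks for
record = dict(pairs)        # {'id': 'Rahul', 'topic': 'scala', 'hits': '120'}

You can also collapse the two map steps into one if you prefer:

maps = userdata.map(lambda splits: dict(zip(header, splits)))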

Hope it helps. Happy coding!


#3

Thanks for the reply.
Have you completed the Spark developer certification using Python?


#4

I have actually done a course on Spark using Python. I'm not sure if I want to go for the certification; maybe I'd like to explore this area more first, as it is pretty intriguing to work with this technology, and I'm looking forward to learning much more.
