PySpark error while sorting by price for each category

Hi,
I was following along with the video
"Hadoop Certification - CCA - Pyspark - Sorting and Ranking by key using groupByKey".

I coded as
prdRDD= sc.textFile("/user/cloudera/sqoop_import/products")
prdCat = prdRDD.map(lambda x: x.split(",")).map(lambda x: (x[1],x))
prdgrp = prdCat.groupByKey();
prdsrt = prdgrp.flatMap(lambda rec: sorted( rec[1] ,key=lambda k: k.split(",")[4]))
for i in prdsrt.collect():print(i)

I am getting a PySpark error: 'list' object has no attribute 'split'.

What am I doing wrong?

You are doing an extra map on the products RDD, which converts each product record into an array:
prdCat = prdRDD.map(lambda x: x.split(",")).map(lambda x: (x[1],x))

Since each value is already an array, calling split on it fails. For your code to work, change the key function to:

prdsrt = prdgrp.flatMap(lambda rec: sorted(rec[1], key=lambda k: k[4])) (index 4 or 5, depending on which field holds the price)
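The failure is easy to see outside Spark with plain Python (made-up sample rows, with the price in field index 4):

```python
# A value produced by the extra .map(lambda x: x.split(",")) is a list,
# so the original key function k.split(",") blows up on it.
row = "1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,,59.98,urlA"
fields = row.split(",")          # ['1', '2', 'Quest...', '', '59.98', 'urlA']

try:
    fields.split(",")            # what key=lambda k: k.split(",")[4] attempts
except AttributeError as e:
    print(e)                     # 'list' object has no attribute 'split'

# Indexing the list directly works; field 4 is the price in this layout.
rows = [row, "2,2,Under Armour Men's Highlight MC Football Clea,,19.99,urlB"]
ordered = sorted((r.split(",") for r in rows), key=lambda k: k[4])
print([k[4] for k in ordered])   # ['19.99', '59.98']
```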

I would need to run the code to confirm this, but I do not see the need for two map transformations one after the other in this step.

prdCat = prdRDD.map(lambda x: x.split(",")).map(lambda x: (x[1],x))

Instead, you could do:

prdCat = prdRDD.map(lambda x: (x.split(",")[1], x))

I am also not sure what the point is of passing rec[1] as the first parameter of sorted; should it not be just rec, since you are sorting rec based on the value coming from the key parameter (key=lambda k: k.split(",")[4]), which is the product price?
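For what it's worth, after groupByKey() each element handed to the flatMap lambda is a (key, iterable-of-values) pair, so rec[1] selects the values to sort. A small mock of one such element (made-up data, a plain list standing in for Spark's ResultIterable):

```python
# rec mimics one element flatMap receives after groupByKey():
# the key, plus an iterable of the full record strings for that key.
rec = ("2", ["1,2,Item A,,59.98,urlA", "2,2,Item B,,19.99,urlB"])

# Sorting rec itself would mix the key string with the value list;
# rec[1] is the collection of product records to order by price (field 4).
ordered = sorted(rec[1], key=lambda k: k.split(",")[4])
print(ordered)   # Item B (19.99) comes before Item A (59.98)
```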

Hi ,
This is working now.

prdRDD = sc.textFile("/user/cloudera/sqoop_import/products")
prdCat = prdRDD.map(lambda x: (x.split(",")[1], x))
prdgrp = prdCat.groupByKey()
prdsrt = prdgrp.flatMap(lambda rec: sorted(rec[1], key=lambda k: k.split(",")[4]))
for i in prdsrt.collect(): print(i)

It worked.
Thanks
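For anyone following along, the working pipeline can be sanity-checked without a cluster by mimicking each transformation in plain Python (the rows below are made up; a defaultdict stands in for groupByKey()):

```python
from collections import defaultdict

rows = [
    "1,2,Item A,,59.98,urlA",
    "2,2,Item B,,19.99,urlB",
    "3,3,Item C,,89.99,urlC",
    "4,3,Item D,,99.99,urlD",
]

# map: (category id, full record line), like prdCat
prd_cat = [(r.split(",")[1], r) for r in rows]

# groupByKey: category id -> list of record lines, like prdgrp
prd_grp = defaultdict(list)
for cat, rec in prd_cat:
    prd_grp[cat].append(rec)

# flatMap + sorted: order each category's records by the price field.
# Note this compares prices as strings, just like the original code.
prd_srt = [rec
           for grp in prd_grp.values()
           for rec in sorted(grp, key=lambda k: k.split(",")[4])]
for r in prd_srt:
    print(r)
```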

1. .map()
prdCat = prdRDD.map(lambda x: x.split(",")).map(lambda x: (x[1],x)) results in a (key, value) tuple where the value is the full record as a list of fields, which later gives an iterable of values for each key:
(u'2', [u'1', u'2', u'Quest Q64 10 FT. x 10 FT. Slant Leg Instant U', u'', u'59.98', u'http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy'])

prdCat = prdRDD.map(lambda x: (x.split(",")[1],x)) results in
(u'2', u'1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,59.98,http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy')

2. groupByKey()
Applying groupByKey() produces an iterable of values for each key:
(u'11', <pyspark.resultiterable.ResultIterable object at 0x4795050>)

Before groupByKey() is applied, records that share a key are still separate (key, value) pairs:
(u'2', u'1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,59.98,http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy')
(u'2', u"2,2,Under Armour Men's Highlight MC Football Clea,129.99,http://images.acmesports.sports/Under+Armour+Men's+Highlight+MC+Football+Cleat")
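The effect of grouping can be checked in plain Python (made-up rows): without groupByKey() each (key, value) pair is its own record, and after grouping the values for a key are merged under one entry:

```python
from collections import defaultdict

rows = ["1,2,Item A,,59.98,urlA", "2,2,Item B,,129.99,urlB"]

# Before grouping: one (key, value) pair per record, same key repeated.
pairs = [(r.split(",")[1], r) for r in rows]
print(pairs)             # [('2', '1,2,...'), ('2', '2,2,...')]

# After grouping: one record per key, values merged into an iterable
# (a plain list here; Spark would hand back a ResultIterable).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
print(dict(grouped))     # {'2': ['1,2,...', '2,2,...']}
```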

Hope this answers your query.