PySpark - Sorting and Ranking Using Group By

This is related to video 2 of PySpark sorting and ranking (Top N prices for each category).

ERROR - empty string to float()

The above error is displayed after running the commands below:

productsRDD = sc.textFile("/user/cloudera/sqoop_import/products")
# key each line by product_category_id (field 1), then group by category
product_groupby = productsRDD.map(lambda x: (x.split(",")[1], x)).groupByKey()

def topN(rec, N):
    import itertools
    # sort the grouped product lines by product_price (field 4), highest first
    l = sorted(rec[1], key=lambda k: float(k.split(",")[4]), reverse=True)
    return (y for y in itertools.islice(l, 0, N))

for i in product_groupby.flatMap(lambda x: topN(x, 3)).collect():
    print(i)

Can someone help here?

There is a record which has an issue. You need to filter that record out.


@Varun_Joshi - filter product_id 685 and try again
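A minimal sketch of the suggested filter, assuming the comma-delimited products file where `product_id` is field 0 (the `keep_record` helper name and the sample field values are illustrative, not from the thread):

```python
# Hedged sketch: drop the known-bad record with product_id 685 before grouping.
def keep_record(line):
    """Return False for the record whose first field (product_id) is 685."""
    return line.split(",")[0] != "685"

# In the PySpark job this predicate would be applied before the groupByKey, e.g.:
#   productsRDD.filter(keep_record)

# Quick local check on sample lines (field values are illustrative):
print(keep_record("685,31,TaylorMade SLDR Irons - (Steel) 4-PW, AW,,899.99,img"))  # False
print(keep_record("100,31,Some Other Club,,59.98,img"))                            # True
```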


Thanks! It worked now.


But what is the problem with this record? The product price is available.

You should have a look at the data and try to understand the problem.

I did check the data but didn't find anything wrong.

| product_id | product_category_id | product_name | product_description | product_price | product_image |
| --- | --- | --- | --- | --- | --- |
| 685 | 31 | TaylorMade SLDR Irons - (Steel) 4-PW, AW | | 899.99 | http://images.acmesports.sports/TaylorMade+SLDR+Irons+-+(Steel)+4-PW%2C+AW |

What is the delimiter for your data?

The delimiter of the data is a pipe '|'.

Then there shouldn't be any issue for you; you can try without filtering the data. When the data is ','-delimited there is an issue, because of the ',' inside the product name.
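The point above can be demonstrated locally. Below is a hedged reconstruction of record 685 as a comma-delimited line, based on the table shown earlier (the exact file layout is an assumption); the comma inside the product name shifts every later field by one position after `split(",")`:

```python
# Record 685 rendered as a comma-delimited line (assumed layout):
line = ("685,31,TaylorMade SLDR Irons - (Steel) 4-PW, AW,,"
        "899.99,http://images.acmesports.sports/TaylorMade+SLDR+Irons+-+(Steel)+4-PW%2C+AW")

fields = line.split(",")
print(fields[4])   # empty string: the comma in the name shifted the fields,
                   # so the (blank) description landed at index 4
print(fields[5])   # 899.99 - the price now sits one position later
# float(fields[4]) therefore raises ValueError, which matches the reported error
```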


Ok. Thank you so much

Instead of calling the function, can we use the sortBy() and take() functions?
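One caveat on that idea: sortBy() and take() operate on the whole RDD, so they return a single *global* top N rather than a top N per category. A per-category top N can still be written without itertools; the sketch below (the `top_n_in_group` helper name and sample data are assumptions, not from the thread) pulls the per-group logic into a testable function:

```python
# Hedged sketch: per-category top N without itertools.islice.
def top_n_in_group(kv, n=3):
    # kv is (category_id, iterable of comma-delimited product lines);
    # product_price is field index 4, as in the original topN()
    return sorted(kv[1], key=lambda rec: float(rec.split(",")[4]), reverse=True)[:n]

# In the PySpark job (assumed usage):
#   product_groupby.flatMap(top_n_in_group).collect()

# Local check on a fake category with three products:
group = ("2", ["1,2,A,,5.00,u", "2,2,B,,9.00,u", "3,2,C,,1.00,u"])
print(top_n_in_group(group, 2))  # the two highest-priced lines, highest first
```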