PySpark - Sorting and Ranking Using Group By

#1

Related to video 2 of PySpark sorting and ranking (top N prices for each category).
ERROR: Empty string to float()

The above error is displayed after running the code below:

productsRDD = sc.textFile("/user/cloudera/sqoop_import/products")
# Key each record by product_category_id (field 1)
productsmap = productsRDD.map(lambda x: (x.split(",")[1], x))
products_groupby = productsmap.groupByKey()

def topN(rec, N):
    import itertools
    # rec is (category_id, records); sort the records by price (field 4), highest first
    l = sorted(rec[1], key=lambda k: float(k.split(",")[4]), reverse=True)
    return list(itertools.islice(l, 0, N))

for i in products_groupby.flatMap(lambda x: topN(x, 3)).collect():
    print(i)

Can someone help here?

0 Likes

#2

There is one record that has an issue. You need to filter that record out.

1 Like

#3

@Varun_Joshi - filter out product_id 685 and try again.
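
A minimal sketch of that filter in plain Python, assuming comma-delimited lines with product_id in field 0. `keep_record` is a hypothetical helper name, and the first sample line is made up for illustration. In the Spark job the same predicate could probably be passed to `productsRDD.filter(...)` before the `map` step.

```python
BAD_PRODUCT_ID = 685  # the record reported in this thread

def keep_record(line):
    """Return True for every product line except product_id 685."""
    return int(line.split(",")[0]) != BAD_PRODUCT_ID

sample_lines = [
    # Hypothetical record, made up for illustration only
    "1,2,Sample Product,,59.98,http://example.com/sample",
    # Record 685 as it would look in a comma-delimited file
    "685,31,TaylorMade SLDR Irons - (Steel) 4-PW, AW,,899.99,"
    "http://images.acmesports.sports/TaylorMade+SLDR+Irons+-+(Steel)+4-PW%2C+AW",
]

# Keep only the lines that pass the predicate
kept = [line for line in sample_lines if keep_record(line)]
print(len(kept))
```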

1 Like

#4

Thanks, it works now.

1 Like

#5

But what is the problem with this record? The product price is available.

0 Likes

#6

You should take a look at the data and try to understand the problem.

0 Likes

#7

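I did check the data but didn’t find anything wrong.

+------------+---------------------+------------------------------------------+---------------------+---------------+----------------------------------------------------------------------------+
| product_id | product_category_id | product_name                             | product_description | product_price | product_image                                                              |
+------------+---------------------+------------------------------------------+---------------------+---------------+----------------------------------------------------------------------------+
| 685        | 31                  | TaylorMade SLDR Irons - (Steel) 4-PW, AW |                     | 899.99        | http://images.acmesports.sports/TaylorMade+SLDR+Irons+-+(Steel)+4-PW%2C+AW |
+------------+---------------------+------------------------------------------+---------------------+---------------+----------------------------------------------------------------------------+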

0 Likes

#8

What is the delimiter for your data?

0 Likes

#9

The delimiter of the data is a pipe, ‘|’.

0 Likes

#10

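Then there shouldn’t be any issue for you; you can try without filtering the data. The problem occurs only when the data is ‘,’ delimited: the ‘,’ in the product name (4-PW, AW) is treated as a field separator, so every later field shifts by one and field 4 becomes the empty product description instead of the price.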
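
To see the shift concretely, here is a small plain-Python sketch that splits record 685 the way the original job does. The record string is rebuilt from the table posted above, assuming a comma-delimited file. Reading the price from the right-hand side is only one possible workaround and is an assumption, not something suggested in this thread; it relies on the product name being the only field that can contain a comma.

```python
# Record 685 as a comma-delimited line (rebuilt from the table in this thread)
record = (
    "685,31,TaylorMade SLDR Irons - (Steel) 4-PW, AW,,899.99,"
    "http://images.acmesports.sports/TaylorMade+SLDR+Irons+-+(Steel)+4-PW%2C+AW"
)

fields = record.split(",")
print(len(fields))      # 7 fields instead of the expected 6
print(repr(fields[4]))  # '' (the empty description, not the price)

# This is the call that fails inside the sort key
try:
    float(fields[4])
except ValueError as err:
    print("float() failed:", err)

# Assumed workaround: count from the right, where the extra comma has no effect
price = float(fields[-2])
print(price)            # 899.99
```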

1 Like

#11

Ok. Thank you so much

0 Likes

#12

Instead of calling the function, can we use the sortBy() function and take()?

0 Likes