Apache Spark - PySpark - distinct() issue

pyspark
#1

Hi Guys,

I have an issue with the distinct() method in PySpark. I am trying to retrieve distinct records from the Order Items RDD based on the key 'order_id', but the resulting RDD does not contain distinct elements, and the elements also come out unordered.
Please go through the transformations below and suggest a fix:
orderItemsRDD = sc.textFile("itversity/data/order_items")
orderItemsMapRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), rec))
orderItemsDistinct = orderItemsMapRDD.distinct()
orderItemsDistinct.count()
orderItemsMapRDD.count()

In the 'orderItemsDistinct' RDD I can see that the elements are not distinct, and the last two count() calls give exactly the same result.

Kindly help me to solve this.


#2

@ravi.tejarockon Please find the answers below:

Actually, you were not performing distinct() based on order_id alone; you were performing it on the tuple (order_id, rec). You can see this by executing orderItemsMapRDD.first().
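For illustration, here is a minimal, self-contained sketch (using parallelize with two made-up order item lines instead of the itversity file) showing that distinct() compares the whole (order_id, rec) tuple, so two different records with the same order_id both survive:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Two hypothetical order item lines that share order_id = 1 but differ elsewhere
sample = ["1,1,957,1,299.98,299.98",
          "2,1,1073,1,199.99,199.99"]

pairs = sc.parallelize(sample).map(lambda rec: (int(rec.split(",")[1]), rec))

pairs.first()             # (1, '1,1,957,1,299.98,299.98')
pairs.distinct().count()  # 2 -- both tuples are kept because the values differ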

I hope this gives you an idea of why the elements were not distinct. Now let's look at why the results were unordered after the distinct() operation.
Here the distinct() operation causes the data to be shuffled (a Spark shuffle). In most cases where a shuffle is involved, you lose the ordering; i.e. if a Spark operation involves a shuffle, there is a very high chance that you will lose the ordering of the data in the RDD.
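If the ordering matters after distinct(), you can sort explicitly. A minimal sketch, assuming you want the pairs ordered by order_id (reusing the orderItemsMapRDD from your post), could be:

orderItemsDistinctSorted = orderItemsMapRDD.distinct().sortByKey()
orderItemsDistinctSorted.take(5)

Note that sortByKey() triggers another shuffle, so it is best done as the last step before you collect or save the results.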

You can read more about it here and here.

Hope this helps! :)


#3

@venkatreddy-amalla you are right, I was performing distinct() on the tuple, so I was not getting the required result. I was under the misconception that distinct() works only on the key and ignores the value (like DISTINCT on a single column in SQL). I have corrected it as below and it works:

orderID = orderItemsRDD.map(lambda rec: int(rec.split(",")[1]))
orderIDdistinct = orderID.distinct()
orderIDdistinct.count()
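For anyone who needs one full record per order_id rather than just the distinct IDs, a sketch along these lines (keeping an arbitrary record per key via reduceByKey; the orderPairs / onePerOrder names are just for illustration) should also work:

orderPairs = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), rec))
onePerOrder = orderPairs.reduceByKey(lambda a, b: a)  # keeps one arbitrary record per order_id
onePerOrder.count()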

Thanks a lot! :)
