Issue with the distinct() method in PySpark. I am trying to retrieve distinct records from an order-items RDD based on the key 'order_id', but the resulting RDD still contains duplicates (and is unsorted as well).
Please go through the transformations below and suggest a fix:
orderItemsRDD = sc.textFile("itversity/data/order_items")
# order_id is assumed to be the second comma-separated field (index 1)
orderItemsMapRDD = orderItemsRDD.map(lambda rec: (int(rec.split(",")[1]), rec))
orderItemsDistinct = orderItemsMapRDD.distinct()
In the 'orderItemsDistinct' RDD I can see that the elements are not distinct: count() returns exactly the same value before and after distinct().
Kindly help me to solve this.
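For context on why this happens: distinct() removes only elements that are identical in full. Since each element here is an (order_id, whole_line) pair and every line is unique, no two pairs are ever equal, so distinct() removes nothing. A plain-Python sketch of the semantics (no Spark required; the sample lines and field layout are illustrative):

```python
# Three order-item lines; the last two share order_id 2 but differ elsewhere.
lines = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
]

# What the map step builds: (order_id, whole_line) pairs.
pairs = [(int(line.split(",")[1]), line) for line in lines]

# distinct() keeps one copy of each *identical* element; every pair
# here is unique, so nothing is removed and the count stays the same.
distinct_pairs = set(pairs)
assert len(distinct_pairs) == len(pairs)

# To keep one record per order_id, reduce by key instead — in PySpark:
#   orderItemsMapRDD.reduceByKey(lambda a, b: a)
# The dict below mimics that: keep the first record seen per key.
first_per_order = {}
for order_id, line in pairs:
    first_per_order.setdefault(order_id, line)

print(sorted(first_per_order))  # → [1, 2]
```

If you only need the distinct keys themselves, `orderItemsMapRDD.keys().distinct()` is simpler; and if you also want sorted output, follow up with `sortByKey()`, since distinct() gives no ordering guarantee.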