reduceByKey Key Value Pair RDD

rdd1=sc.parallelize([('prod1', 10), ('prod2', 20), ('prod1', 30), ('prod4', 40), ('prod2', 200), ('prod2', 10), ('prod4', 5)])
rdd1.reduceByKey(lambda x,y:(x+y)).collect()

Output -> [('prod1', 40), ('prod2', 230), ('prod4', 45)]

My understanding of how reduceByKey() works is:
x - key from the key-value pair RDD
y - value from the key-value pair RDD
reduceByKey() then aggregates/reduces rows by key (x in this example), applying the expression given in the body of the lambda function.

However, that does not explain the 2 examples below. Can someone explain how their output is generated?
Example 1

rdd1=sc.parallelize([('prod1', 10), ('prod2', 20), ('prod1', 30), ('prod4', 40), ('prod2', 200), ('prod2', 10), ('prod4', 5)])
rdd1.reduceByKey(lambda x,y:(x+x)).collect()

Output -> [('prod1', 20), ('prod2', 80), ('prod4', 80)]


Example 2

rdd1=sc.parallelize([('prod1', 10), ('prod2', 20), ('prod1', 30), ('prod4', 40), ('prod2', 200), ('prod2', 10), ('prod4', 5)])
rdd1.reduceByKey(lambda x,y:(y+y)).collect()

Output -> [('prod1', 60), ('prod2', 20), ('prod4', 10)]

@ganesh1146 Sorry to say, but your understanding of reduceByKey() is not correct. x and y are not the key and the value: both are values that belong to the same key.

Here is how it works:

All available key value pairs:

('prod1', 10), ('prod2', 20), ('prod1', 30),  ('prod4', 40), 
('prod2', 200), ('prod2', 10), ('prod4', 5)

Let's group all values that share the same key:

key: 'prod1' values: [10, 30]
key: 'prod2' values: [20, 200, 10]
key: 'prod4' values: [40, 5]

Now reduce the values of each key using the provided lambda function (here lambda x,y:(x+y)).
First, what does lambda x,y:(x+y) do?
It takes 2 values at a time and returns their sum. Simple, isn't it?

  1. Result for key 1: 'prod1'
    lambda 10, 30: (10+30), which returns 40

  2. Result for key 2: 'prod2'
    lambda 20, 200: (20+200), which returns 220
    lambda 220, 10: (220+10), which returns 230; no values are left, so the final result is 230

  3. Result for key 3: 'prod4'
    lambda 40, 5: (40+5), which returns 45

So the final result of reduceByKey() is [('prod1', 40), ('prod2', 230), ('prod4', 45)]

In the same way you can calculate results for your 2 examples.
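
If it helps, the steps above can be sketched in plain Python, without Spark. Note that reduce_by_key_local is a hypothetical helper written for this illustration, not part of PySpark, and it assumes the values of each key are combined left to right in their original order. Real Spark may combine values in a different order across partitions, which is why the function passed to reduceByKey() is expected to be commutative and associative; for x+x and y+y the result can depend on how the data is partitioned.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key_local(pairs, func):
    """Hypothetical local stand-in for reduceByKey(): group values by key,
    then fold each group left to right with func (an assumed ordering)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # func receives two VALUES of the same key, never the key itself
    return [(key, reduce(func, values)) for key, values in grouped.items()]


pairs = [('prod1', 10), ('prod2', 20), ('prod1', 30), ('prod4', 40),
         ('prod2', 200), ('prod2', 10), ('prod4', 5)]

sum_result = reduce_by_key_local(pairs, lambda x, y: x + y)
example1_result = reduce_by_key_local(pairs, lambda x, y: x + x)
example2_result = reduce_by_key_local(pairs, lambda x, y: y + y)

print(sum_result)       # [('prod1', 40), ('prod2', 230), ('prod4', 45)]
print(example1_result)  # [('prod1', 20), ('prod2', 80), ('prod4', 80)]
print(example2_result)  # [('prod1', 60), ('prod2', 20), ('prod4', 10)]
```

For Example 1, 'prod2' goes (20, 200) -> 20+20 = 40, then (40, 10) -> 40+40 = 80. For Example 2, 'prod2' goes (20, 200) -> 200+200 = 400, then (400, 10) -> 10+10 = 20.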

Excellent explanation.
Good job @venkatreddy-amalla