Need help with PySpark 'aggregateByKey()'

Hello guys,

I am unable to understand how aggregateByKey() is implemented in the example below:
https://gist.github.com/tmcgrath/dd8a0f5fb19201deb65f

Even though it explains things in a simple way, I can't follow it properly because it is in Scala.
Could you please explain the same example using PySpark? It would be very helpful, as I am stuck.

Note: I understand how map(), reduce(), and reduceByKey() work, but I am finding it hard to understand the semantics and implementation of aggregateByKey().

Thanks
Gautham P

Hello @gthm777

In the aggregation example, he uses both reduceByKey and aggregateByKey for the same aggregation, and both give the same result.

Let me rewrite the aggregateByKey call in a more understandable way, and show more than one way of achieving the same result:

babyNamesCSV.aggregateByKey(0)((acc, value) => acc + value, (acc, value) => acc + value)

The aggregation above does the same thing as the GitHub code. Now let's see another way of achieving the same result:
babyNamesCSV.aggregateByKey(0)((acc, value) => acc + value, (value, acc) => acc + value)

Here we are performing a simple addition, so it makes no difference whether we write a + b or b + a; that is all he has done. In his code he also converts 'v' with toInt, which is not required, because the RDD is already in the correct data types.
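Since you asked for PySpark, here is a minimal sketch of the same sum-by-key aggregation in Python. The names and counts below are made up for illustration; the gist reads them from the baby-names CSV.

```python
from pyspark import SparkContext

sc = SparkContext("local", "aggregateByKey example")

# Hypothetical (name, count) pairs standing in for the baby-names CSV in the gist
baby_names = sc.parallelize([
    ("David", 6),
    ("David", 4),
    ("Abby", 5),
    ("Abby", 10),
])

# aggregateByKey(zeroValue, seqFunc, combFunc):
#   zeroValue - initial accumulator for each key (0 here, since we are summing)
#   seqFunc   - folds one value into the accumulator within a partition
#   combFunc  - merges accumulators coming from different partitions
sums = baby_names.aggregateByKey(
    0,
    lambda acc, value: acc + value,
    lambda acc1, acc2: acc1 + acc2,
)

# Same result as baby_names.reduceByKey(lambda a, b: a + b)
print(sums.collect())  # e.g. [('David', 10), ('Abby', 15)]
```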


Please refer to the link below for a simple explanation of the differences between the aggregations:

groupByKey vs. reduceByKey vs. aggregateByKey
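If it helps, here is a rough PySpark comparison of the three (the data is made up for illustration): groupByKey shuffles every value across the network, reduceByKey combines values on each partition before the shuffle, and aggregateByKey additionally lets the accumulator have a different type than the values.

```python
from pyspark import SparkContext

sc = SparkContext("local", "byKey comparison")

# Hypothetical (key, value) pairs for illustration
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey: ships every value across the network, then you aggregate yourself
grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey: combines values per partition before the shuffle (map-side combine)
reduced = pairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey: like reduceByKey, but the accumulator type can differ from the
# value type; here we build (sum, count) per key so we can compute an average
sum_count = pairs.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold one value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge two (sum, count) accumulators
)
averages = sum_count.mapValues(lambda t: t[0] / t[1])

print(grouped.collect())   # e.g. [('a', 4), ('b', 6)]
print(reduced.collect())   # e.g. [('a', 4), ('b', 6)]
print(averages.collect())  # e.g. [('a', 2.0), ('b', 3.0)]
```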


Thank you @ravi.tejarockon!
