[Spark Core Exercise]: groupByKey vs. reduceByKey vs. aggregateByKey

Spark provides several APIs for by-key aggregation, and it is important for any developer to understand the differences between them.

Please answer the questions below; they will help you keep the differences in mind and build effective solutions.

  • What is the difference between aggregateByKey, groupByKey and reduceByKey?
  • What is the syntax?
  • When should reduceByKey be used over aggregateByKey?
  • If you have prior experience, please provide examples.

reduceByKey - reduceByKey(func, [numTasks])

Values are first combined within each partition using func, so each partition holds at most one value per key before the shuffle.
Only these partial results are then shuffled over the network, and func is applied again on the receiving executor to produce the final value for each key.

groupByKey - groupByKey([numTasks])

It doesn’t merge the values for a key before the shuffle. Every key-value pair is shuffled,
so the amount of data sent over the network is almost the same as the initial data.

Any merging of values for each key happens only after the shuffle.
All values for a key end up on a single executor, so a key with many values can cause an out-of-memory error.
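
To make the difference in shuffle volume concrete, here is a minimal plain-Python model of the two behaviours. It is not Spark code and does not run on a cluster: the partitions list, the function names reduce_by_key_model and group_by_key_model, and the sample data are all made up for illustration. In PySpark the corresponding calls would be rdd.reduceByKey(lambda x, y: x + y) and rdd.groupByKey().mapValues(sum).

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]


def reduce_by_key_model(partitions, func):
    """Model of reduceByKey: combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # After the local combine there is at most one record per key
        # per partition, and only those records are "shuffled".
        shuffled.extend(local.items())
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


def group_by_key_model(partitions):
    """Model of groupByKey: every record is shuffled, grouping happens after."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


reduced, reduced_shuffled = reduce_by_key_model(partitions, lambda x, y: x + y)
grouped, grouped_shuffled = group_by_key_model(partitions)
summed_after_grouping = {key: sum(values) for key, values in grouped.items()}

print(reduced, "records shuffled:", reduced_shuffled)
print(summed_after_grouping, "records shuffled:", grouped_shuffled)
```

Both approaches give the same sums in this model, but the reduceByKey-style path moves one record per key per partition, while the groupByKey-style path moves every input record.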

aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
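It is similar to reduceByKey (values are also combined within each partition before the shuffle), but you provide an initial zeroValue, and the result type can differ from the value type. seqOp merges a value into the accumulator within a partition, and combOp merges accumulators from different partitions.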
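
As a rough illustration of how zeroValue, seqOp and combOp fit together, the sketch below models aggregateByKey in plain Python and uses it to compute a per-key average through a (sum, count) accumulator. It is only a model, not Spark itself: aggregate_by_key_model, the scores data and the subject names are hypothetical. In PySpark the equivalent call would be rdd.aggregateByKey(zeroValue, seqFunc, combFunc).

```python
def aggregate_by_key_model(partitions, zero_value, seq_op, comb_op):
    """Model of aggregateByKey over a list of partitions of (key, value) pairs."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            # seq_op folds one value into the per-partition accumulator,
            # starting from zero_value the first time a key is seen.
            local[key] = seq_op(local.get(key, zero_value), value)
        partials.append(local)
    result = {}
    for local in partials:
        for key, acc in local.items():
            # comb_op merges accumulators that came from different partitions.
            result[key] = comb_op(result[key], acc) if key in result else acc
    return result


# Hypothetical input: integer scores per subject, spread over two partitions.
scores = [
    [("math", 80), ("art", 70), ("math", 90)],
    [("math", 100), ("art", 50)],
]

sum_count = aggregate_by_key_model(
    scores,
    (0, 0),                                   # zeroValue: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # seqOp: int into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # combOp: merge two accumulators
)
averages = {key: total / count for key, (total, count) in sum_count.items()}

print(sum_count)
print(averages)
```

The input values are plain integers while the accumulator is a (sum, count) tuple. This type change is something a single reduceByKey function cannot express without first mapping each value to a tuple.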

Use of reduceByKey
reduceByKey is suitable for large data sets, because combining values within each partition reduces the amount of data shuffled.
Prefer reduceByKey over aggregateByKey when the input and output value types are the same.