ReduceByKey reduceByKey(func, [numTasks])-
Data is combined so that at each partition there should be at least one value for each key.
And then shuffle happens and it is sent over the network to some particular executor for some action such as reduce.
GroupByKey - groupByKey([numTasks])
It doesn’t merge the values for the key but directly the shuffle process happens
and here lot of data gets sent to each partition, almost same as the initial data.
And the merging of values for each key is done after the shuffle.
Here lot of data stored on final worker node so resulting in out of memory issue.
aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey but you can provide initial values when performing aggregation.
Use of ReduceByKey
ReduceByKey can be used when we run on large data set.
ReduceByKey when the input and output value types are of same type over aggregatebykey