I have seen two functions passed as arguments to the aggregateByKey method (reference: URL).
For this to produce the desired output (in the reference, the combiner finds the max), all the data for a particular key would have to reach the same combiner. Won't that shuffle a lot of data across the network, so that bandwidth becomes a problem?
And if all the data for a particular key does reach the same combiner, the second operation, which the reduce step normally performs (the sum of orders), could be calculated in that same combiner, right? In that case we could drop the reduce function altogether.
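To make the question concrete, here is a minimal sketch in plain Python (not Spark itself) of how I understand the two functions are applied: the first one folds values within each partition, and the second one merges the per-partition results for the same key. The helper `aggregate_by_key`, the names `seq_op` and `comb_op`, and the sample data are all hypothetical, made up only for illustration.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Simulate the two-function contract of aggregateByKey.

    partitions: list of lists of (key, value) pairs (hypothetical layout).
    seq_op:  folds one value into the accumulator inside a partition.
    comb_op: merges two accumulators that come from different partitions.
    """
    # Step 1: inside each partition, fold values per key with seq_op.
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    # Step 2: merge the per-partition accumulators per key with comb_op.
    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Hypothetical data: the same keys appear in two different partitions.
two_partitions = [
    [("a", 3), ("a", 7), ("b", 2)],
    [("a", 5), ("b", 9)],
]
one_partition = [two_partitions[0] + two_partitions[1]]

max_op = max                     # first function: max within a partition
sum_op = lambda x, y: x + y      # second function: sum across partitions

split_result = aggregate_by_key(two_partitions, 0, max_op, sum_op)
single_result = aggregate_by_key(one_partition, 0, max_op, sum_op)

print(split_result)   # {'a': 12, 'b': 11}
print(single_result)  # {'a': 7, 'b': 9}
```

In this sketch only one partial value per key per partition reaches the second function, and the result changes with how the data is partitioned when the two functions differ. Whether that matches what Spark actually does is exactly what I am trying to confirm.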