Clarification on Spark Core API: aggregateByKey data flow between combiner and reducer

pyspark
cca-175
apache-spark

#1

Hello ItVersity,

I have seen two functions being passed to the aggregateByKey method as arguments (reference: URL).

In this case, to get the correct output (in the reference we are finding the max in the combiner), all the data for a particular key would have to reach the same combiner. Won't that cause a lot of data to be shuffled across the network, so that bandwidth could become an issue?

And if all the data for a particular key is already reaching the same combiner, then the second operation that the reducer normally performs (the sum of orders) could be calculated in that same combiner itself, right? In that case we could ignore the reduce function altogether.
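For concreteness, here is a minimal sketch of the kind of aggregateByKey call I mean. The dataset and the exact max-plus-sum aggregation are my assumptions, since the referenced example is only linked above:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (order_id, order_item_subtotal) pairs standing in
# for the dataset in the referenced example.
order_items = sc.parallelize([
    (1, 199.99), (1, 250.0),
    (2, 129.99), (2, 49.98), (2, 299.95),
])

# zeroValue is (running max, running sum) per key.
# seqOp folds one raw value into a partial result within a partition;
# combOp merges two partial results from different partitions.
result = order_items.aggregateByKey(
    (float('-inf'), 0.0),
    lambda acc, v: (max(acc[0], v), acc[1] + v),   # seqOp (combiner side)
    lambda a, b: (max(a[0], b[0]), a[1] + b[1])    # combOp (merge side)
)

print(result.collect())
# e.g. [(1, (250.0, 449.99)), (2, (299.95, 479.92))]
```

As I understand it, seqOp runs within each partition before the shuffle and combOp merges the per-partition partials afterwards, which is exactly the behaviour I would like confirmed.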

