Coalesce and Repartition

What are coalesce and repartition in Spark, and how do they differ?

Typically, each input file is read into at least one RDD partition. If you have too many files, you end up with too many RDD partitions, which in turn means too many tasks to process the data at each stage.

Coalesce and repartition are used to reduce the number of RDD partitions so that fewer tasks are needed to process the data.

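The difference between coalesce and repartition is that repartition always performs a full shuffle of the data, while coalesce (by default) merges existing partitions without a full shuffle. As a result, repartition can either increase or decrease the number of partitions, whereas coalesce can only decrease it.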
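To make the difference concrete, here is a minimal sketch that prints partition counts after each call. The local SparkSession, the app name and the generated dataset are assumptions made for illustration; they are not from the original question.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[4]")
      .getOrCreate()

    // Hypothetical dataset: 1000 rows explicitly split into 8 partitions.
    val df = spark.range(0, 1000, 1, 8).toDF("id")
    println(s"initial:         ${df.rdd.getNumPartitions}")

    // coalesce merges existing partitions and avoids a full shuffle.
    println(s"coalesce(2):     ${df.coalesce(2).rdd.getNumPartitions}")

    // Per the Dataset API docs, asking coalesce for more partitions than
    // currently exist leaves the partition count unchanged.
    println(s"coalesce(16):    ${df.coalesce(16).rdd.getNumPartitions}")

    // repartition redistributes rows with a full shuffle, so it can also
    // increase the number of partitions.
    println(s"repartition(16): ${df.repartition(16).rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Calling explain() on the coalesced and repartitioned DataFrames should also show the difference in the physical plan: an Exchange (shuffle) step for repartition and a Coalesce step for coalesce.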

If you want to generate a single part file in the output folder, use coalesce(1).
coalesce(1) reduces the data to one partition, so each write produces a single part file, and it does so without a full shuffle.
Example: purchaseDF.coalesce(1).write.mode(SaveMode.Append).insertInto("sales")

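repartition(n), on the other hand, shuffles the data into exactly n partitions, so the output contains up to n part files. repartition(1) would also give a single part file, but at the cost of a full shuffle.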
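One way to check the number of part files yourself is to write to a scratch directory and count them. This is only a sketch: the temp directory, the Parquet format, the generated purchaseDF stand-in and the countPartFiles helper are all hypothetical, chosen so the example runs locally without a Hive table.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartFileCount {
  // Hypothetical helper: counts files named "part-*" in a local directory.
  // Checksum files start with a dot, so they are not counted.
  def countPartFiles(dir: String): Int =
    Option(new File(dir).listFiles())
      .getOrElse(Array.empty[File])
      .count(f => f.getName.startsWith("part-"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("part-file-count")
      .master("local[4]")
      .getOrCreate()

    // Stand-in for purchaseDF: 1000 generated rows in 8 partitions.
    val purchaseDF = spark.range(0, 1000, 1, 8).toDF("id")
    val base = Files.createTempDirectory("coalesce-demo").toString

    // One partition in, so one part file is expected out.
    purchaseDF.coalesce(1).write.mode(SaveMode.Overwrite).parquet(s"$base/coalesced")

    // Four partitions after the shuffle, so up to four part files.
    purchaseDF.repartition(4).write.mode(SaveMode.Overwrite).parquet(s"$base/repartitioned")

    println(s"coalesce(1) part files:    ${countPartFiles(s"$base/coalesced")}")
    println(s"repartition(4) part files: ${countPartFiles(s"$base/repartitioned")}")

    spark.stop()
  }
}
```

Note that with SaveMode.Append, as in the insertInto example, every write adds its own part file, so the folder holds one file per write rather than one file in total.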
