I ran the same job on two different input files.
The first file is 20 MB, and the output goes to only 1 of the 200 part files (the remaining 199 are empty).
The second file is 700 MB, and again the output goes to only 1 of the 200 part files (the remaining 199 are empty).
This is a 3-node AWS EMR cluster (1 master, 1 core, 1 task).
If I apply coalesce or repartition, it works fine; even partitionBy works fine.
But I want to understand the default behavior: which partitioner does a Spark DataFrame use by default, the hash partitioner or the range partitioner?
Note: if I run the same data through the RDD API, it generates 3 or 4 part files.
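For context on the hash-vs-range question: a hash partitioner assigns each row to a partition as `hash(key) % numPartitions`. Below is a minimal pure-Python sketch of that rule (illustrative only; it uses Python's built-in `hash` as a stand-in for the JVM `hashCode` that Spark's `HashPartitioner` actually calls, and the key/partition values are made up). It shows how 200 partitions can still yield 199 empty part files when all rows share one key:

```python
def hash_partition(key, num_partitions):
    # Assign a key to a partition index, the same rule a hash
    # partitioner applies: hash(key) modulo the partition count.
    # Python's hash() stands in here for the JVM hashCode.
    return hash(key) % num_partitions

num_partitions = 200

# Hypothetical skewed data: every row carries the same key, so every
# row maps to the same partition and 199 output files stay empty.
keys = ["same_key"] * 10
used_partitions = {hash_partition(k, num_partitions) for k in keys}
print(len(used_partitions))  # prints 1

# With distinct keys, rows spread across multiple partitions.
spread = {hash_partition(k, num_partitions) for k in range(1000)}
print(len(spread) > 1)  # prints True
```

A range partitioner, by contrast, samples the keys and assigns contiguous key ranges to partitions, which is why sorted output (e.g. after `orderBy`) distributes differently than hashed output.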