Spark-shell generating 200 files

apache-spark

#1

I am starting spark-shell like below…

spark-shell --master yarn-client --num-executors 5 --total-executor-cores 4

Still, when I try to save the data to HDFS, it creates 200 files at the HDFS location. How do I restrict this? I want to minimize the number of output files.
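
For context, the save is roughly along these lines (the column and path names below are only placeholders, not my exact job):

val df = sqlContext.read.json("hdfs:///input/data.json")  // placeholder input path
val agg = df.groupBy("someColumn").count()                // any shuffle (groupBy/join) repartitions to spark.sql.shuffle.partitions
agg.write.parquet("hdfs:///output/dir")                   // one part file per partition, i.e. 200 by default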


#2

sqlContext.setConf("spark.sql.shuffle.partitions", "2") // the default is 200; this is the programmatic way, specific to one application.

You can do the same thing at the cluster configuration level also…
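
For example (a sketch; the value 2 is only an illustration), you can pass it when launching spark-shell for a single application:

spark-shell --master yarn-client --conf spark.sql.shuffle.partitions=2

or set it for every application on the cluster by adding a line to conf/spark-defaults.conf:

spark.sql.shuffle.partitions 2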


#3

The default number of shuffle partitions is 200, but why is the data going into a single file while the remaining files are empty?


#4

Each stage may create output of about one block size.
Check whether the output file size falls within the range of one HDFS block.
Then try the job with a larger input file and see what happens in the output.
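
For example, you can compare the configured block size against the sizes of the output part files with something like this (the output path is a placeholder):

hdfs getconf -confKey dfs.blocksize   # configured HDFS block size, in bytes
hadoop fs -du -h /path/to/output      # size of each part file under the output directory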


#5

I ran the job with 2 different files:

  • The 1st file contains 20 MB, and the output goes to 1 file out of 200 part files (the remaining 199 are empty).

  • The 2nd file contains 700 MB, and the output still goes to 1 file out of 200 part files (the remaining 199 are empty).

This is a 3-node AWS EMR cluster (1 master, 1 task, 1 core).

If I do coalesce or repartition then it works fine; even partitionBy works fine (for example, the lines below).
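
For reference, this is roughly what I mean (df stands for the DataFrame being written; the output paths are placeholders):

df.coalesce(1).write.parquet("hdfs:///output/coalesced")         // merges down to 1 partition without a full shuffle -> 1 part file
df.repartition(4).write.parquet("hdfs:///output/repartitioned")  // full shuffle into 4 partitions -> 4 part files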

But I want to know the default behavior.

By default, which partitioner does a Spark DataFrame use (hash partitioner or range partitioner)?

Note: if the same data is run through an RDD, it generates 3 or 4 part files.


#6

Even though the input file is 700 MB, we need to look at how much data is being shuffled and what the size is of the output file that you say contains all the data. Can you share the hadoop fs -ls output of that output directory?
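
For example (the path is a placeholder for your actual output directory):

hadoop fs -ls /path/to/output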