Number of Output Files (CCA 175)


#1

Hi All,

I have a query regarding writing output as TextFile in CCA175.

I am converting a DataFrame to an RDD and then using the RDD's saveAsTextFile method to write the result to HDFS. The output is written into 200 files, which matches the default value of spark.sql.shuffle.partitions.
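For reference, a minimal sketch of what I am doing (the query, column names, and output path below are only illustrative):

    // Spark 1.6.x, Scala, in spark-shell (sc and sqlContext already exist).
    // A query with a GROUP BY triggers a shuffle, so the result ends up with
    // spark.sql.shuffle.partitions (default 200) partitions.
    val df = sqlContext.sql(
      "SELECT order_status, COUNT(*) AS cnt FROM orders GROUP BY order_status")

    df.rdd
      .map(row => row.mkString(","))            // comma-delimited lines
      .saveAsTextFile("/user/abhishek/output")  // produces 200 part files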

My question is: if a question does not specify that the output should be stored in N files, should we leave the default of 200 output files, or can we change the number of output files as we see fit?

Regards
Abhishek




#2

There is no need to convert your DataFrame to an RDD. A DataFrame itself has APIs such as df.write.text or df.write.csv (the latter natively in Spark 2.x) to write to the file system directly.

Which version of Spark are you using?

Spark 1.6.x: sqlContext.setConf("spark.sql.shuffle.partitions", "2")
Spark 2.x: spark.conf.set("spark.sql.shuffle.partitions", "2")
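
For example, on Spark 1.6.x in the spark-shell (the table name, delimiter, and output path are placeholders):

    // Limit shuffle partitions so joins/aggregations produce 2 partitions
    // instead of the default 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")

    import org.apache.spark.sql.functions.concat_ws

    val df = sqlContext.table("orders")   // placeholder source table

    // write.text needs a single string column, so concatenate the columns
    // with the delimiter of your choice; coalesce(1) gives one output file.
    df.select(concat_ws("\t", df.columns.map(c => df.col(c).cast("string")): _*).alias("value"))
      .coalesce(1)
      .write
      .text("/user/abhishek/output/orders_text")   // placeholder HDFS path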


#3

You can do it either way; it's your personal choice.

I stored the output in a single file, while several acquaintances of mine stored the output in 200 files. All of us passed the exam.


#4

Hi Sir,

I am using Spark 1.6.3 and plan to appear for CCA 175 next week. The reason for converting to a text file is to store the data with a different delimiter and to follow the standard approach shown in the practice exercises.
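The approach I am following looks roughly like this (the delimiter, number of files, and output path are only examples):

    // Spark 1.6.x, Scala. Pick the delimiter in the map and the number
    // of output files with coalesce before saving.
    df.rdd
      .map(row => row.mkString("\t"))
      .coalesce(4)
      .saveAsTextFile("/user/abhishek/output/orders_tab")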

I just wanted to confirm that we can change the number of output files as we see fit when it is not specified in the question.

Regards
Abhishek


#5

Hi Mayank,

Many thanks for your quick reply.

Regards
Abhishek