Reg: File Compression While Storing in HDFS Using PySpark for JSON/AVRO/ORC

pyspark
apache-spark

#1

Hello,

I am trying to compress AVRO / JSON / ORC files while storing them into HDFS using PySpark. I tried the following code.

For JSON:

sqlContext.setConf("spark.sql.json.compression.codec", "gzip")
datafrm.coalesce(1).write.json("")

datafrm.toJSON().saveAsTextFile("", classOf[org.apache.hadoop.io.compress.GzipCodec])
Traceback (most recent call last):
File "", line 1, in
NameError: name 'classOf' is not defined

datafrm.write.json("", classOf[org.apache.hadoop.io.compress.GzipCodec])
Traceback (most recent call last):
File "", line 1, in
NameError: name 'classOf' is not defined

Similarly, I tried for other file formats like ORC and AVRO:
sqlContext.setConf("spark.sql.orc.compression.codec", "gzip") for ORC
sqlContext.setConf("spark.sql.avro.compression.codec", "gzip") for AVRO

It didn't work. Kindly give me the solution.

Thanks,
Karthik


#2

The examples you tried are Scala; classOf is not valid Python syntax, which is why you see the NameError.

Here is the Python code for JSON; I will explain the others later:

# Create dataframe
orders = sqlContext.read.json("/public/retail_db_json/orders")

# Convert to an RDD of JSON strings and pass the codec class name as a string
orders. \
  toJSON(). \
  saveAsTextFile("/user/training/orders_json_compressed", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

Validation:

[training@gw02 ~]$ hadoop fs -ls /user/training/orders_json_compressed
Found 3 items
-rw-r--r-- 3 training training 0 2018-04-30 03:49 /user/training/orders_json_compressed/_SUCCESS
-rw-r--r-- 3 training training 265258 2018-04-30 03:49 /user/training/orders_json_compressed/part-00000.gz
-rw-r--r-- 3 training training 268635 2018-04-30 03:49 /user/training/orders_json_compressed/part-00001.gz


The above code was tested in our state-of-the-art lab using Spark 1.6.2.
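
If you are on Spark 2.x or later, the DataFrameWriter also accepts a compression option directly, so dropping down to the RDD-level saveAsTextFile is not needed. A minimal sketch, assuming a Spark 2.x SparkSession named spark and a hypothetical output path:

# Minimal sketch for Spark 2.x+ (assumes SparkSession `spark`; output path is hypothetical)
orders = spark.read.json("/public/retail_db_json/orders")

# Write gzip-compressed JSON directly through the DataFrameWriter
orders.coalesce(1). \
  write. \
  option("compression", "gzip"). \
  json("/user/training/orders_json_gzip")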



#3

Here is the Python code for compressing data in the Parquet file format:

sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

sqlContext.read.json("/public/retail_db_json/orders"). \
write.parquet("/user/training/orders_parquet")

Validation:
You can see snappy in the file names in the output.

[training@gw02 ~]$ hadoop fs -ls /user/training/orders_parquet
Found 5 items
-rw-r--r-- 3 training training 0 2018-04-30 06:03 /user/training/orders_parquet/_SUCCESS
-rw-r--r-- 3 training training 495 2018-04-30 06:03 /user/training/orders_parquet/_common_metadata
-rw-r--r-- 3 training training 1668 2018-04-30 06:03 /user/training/orders_parquet/_metadata
-rw-r--r-- 3 training training 266423 2018-04-30 06:03 /user/training/orders_parquet/part-r-00000-94167198-af45-4b71-86c5-80bbddacbb48.snappy.parquet
-rw-r--r-- 3 training training 268441 2018-04-30 06:03 /user/training/orders_parquet/part-r-00001-94167198-af45-4b71-86c5-80bbddacbb48.snappy.parquet
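
The original question also asked about ORC. Note that gzip is not an ORC codec; ORC supports zlib, snappy, lzo, and none. A minimal sketch, assuming Spark 2.x with Hive support and a hypothetical output path:

# Minimal sketch for ORC with zlib compression (Spark 2.x; output path is hypothetical)
sqlContext.read.json("/public/retail_db_json/orders"). \
  write. \
  option("compression", "zlib"). \
  orc("/user/training/orders_orc")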


#4

Let us see how we can write data in Avro format with or without compression.

  • Avro is not supported out of the box
  • We need to use --packages or --jars to use Avro as part of our code
  • By default Avro output is compressed using snappy; deflate and uncompressed are also supported
  • We can use df.write.format with the fully qualified name of the Avro data source - com.databricks.spark.avro

Launch:

pyspark --master yarn --conf spark.ui.port=12891 \
--jars /usr/hdp/2.5.0.0-1245/spark/lib/spark-avro_2.10-2.0.1.jar

Execute code:

#Specifying compression algorithm - snappy is default
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")

#Deflate with compression level 5
sqlContext.setConf("spark.sql.avro.compression.codec", "deflate")
sqlContext.setConf("spark.sql.avro.deflate.level", "5")

#No compression
sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")

sqlContext.read.json("/public/retail_db_json/orders"). \
write.format("com.databricks.spark.avro").save("/user/training/orders_avro")

Validation:
Observe the differences in sizes

#Snappy Compression (default)
[training@gw02 ~]$ hadoop fs -ls /user/training/orders_avro
Found 3 items
-rw-r--r-- 3 training training 0 2018-04-30 06:47 /user/training/orders_avro/_SUCCESS
-rw-r--r-- 3 training training 411010 2018-04-30 06:47 /user/training/orders_avro/part-r-00000-ae7dd429-dfeb-416d-81cd-ac34f6ffa411.avro
-rw-r--r-- 3 training training 424676 2018-04-30 06:47 /user/training/orders_avro/part-r-00001-ae7dd429-dfeb-416d-81cd-ac34f6ffa411.avro

#Deflate
[training@gw02 ~]$ hadoop fs -ls /user/training/orders_avro
Found 3 items
-rw-r--r-- 3 training training 0 2018-04-30 06:49 /user/training/orders_avro/_SUCCESS
-rw-r--r-- 3 training training 223304 2018-04-30 06:49 /user/training/orders_avro/part-r-00000-042ec7e3-160c-4c03-807b-71ccd9e3fa78.avro
-rw-r--r-- 3 training training 225668 2018-04-30 06:49 /user/training/orders_avro/part-r-00001-042ec7e3-160c-4c03-807b-71ccd9e3fa78.avro

#Uncompressed
[training@gw02 ~]$ hadoop fs -ls /user/training/orders_avro
Found 3 items
-rw-r--r-- 3 training training 0 2018-04-30 06:50 /user/training/orders_avro/_SUCCESS
-rw-r--r-- 3 training training 1439410 2018-04-30 06:50 /user/training/orders_avro/part-r-00000-7cd2008a-032f-41bd-9eb8-cf44e5ef0204.avro
-rw-r--r-- 3 training training 1442931 2018-04-30 06:50 /user/training/orders_avro/part-r-00001-7cd2008a-032f-41bd-9eb8-cf44e5ef0204.avro
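
For reference, from Spark 2.4 onward the Avro data source is maintained as part of Spark itself (org.apache.spark:spark-avro), so it can be pulled in with --packages and referenced by the short format name "avro". A minimal sketch, assuming Spark 2.4+ launched with --packages org.apache.spark:spark-avro_2.11:2.4.0 and a hypothetical output path:

# Minimal sketch for Spark 2.4+ Avro support (output path is hypothetical)
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

spark.read.json("/public/retail_db_json/orders"). \
  write.format("avro").save("/user/training/orders_avro_deflate")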