Applying compression on ORC


#1

I have tried three compression codecs (gzip, zlib, snappy) while saving data in ORC format, but it gets saved as plain ORC. The files created are the same size for all three.

Code used:
sqlContext.setConf("spark.sql.orc.compression.codec", "zlib")
sc.parallelize(1 to 10).toDF().write.mode("overwrite").format("orc").save("/user/cloudera/zlib")

Is there any other way to apply compression on ORC?
PS: I think ORC is already compressed by default, but I am not sure.
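
This is roughly how I compared the output sizes (a quick sketch using the Hadoop FileSystem API from the spark-shell; the gzip and snappy paths are just the illustrative directories I used alongside /user/cloudera/zlib):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Total bytes under an output directory on HDFS
def dirSize(p: String): Long = fs.getContentSummary(new Path(p)).getLength

Seq("/user/cloudera/gzip", "/user/cloudera/zlib", "/user/cloudera/snappy")
  .foreach(p => println(s"$p -> ${dirSize(p)} bytes"))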



#2

By default, ORC is compressed (zlib is the default codec). We cannot use sqlContext.setConf for this; that setting only works with Parquet and Avro.

I need to troubleshoot further how to apply the different compression algorithms to ORC. For the other formats, you can go through this topic.
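
In the meantime, one approach that may be worth trying is passing the codec through the DataFrameWriter instead of sqlContext.setConf. This is a minimal sketch, assuming Spark 2.x, where the ORC writer accepts a compression option (none, snappy, zlib, lzo); the output paths are illustrative:

val df = sc.parallelize(1 to 10).toDF()

// Request a codec explicitly per write instead of relying on a global conf
df.write.mode("overwrite").option("compression", "snappy").format("orc").save("/user/cloudera/orc_snappy")
df.write.mode("overwrite").option("compression", "zlib").format("orc").save("/user/cloudera/orc_zlib")

With Spark 2.x the codec usually shows up in the part file names (for example part-...snappy.orc), which is a quick way to check whether it took effect.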


#3

From the exam's perspective, I have covered reading/writing Avro, JSON, Parquet, and sequence files with compression. Is that sufficient?
Only ORC is pending, along with compression, though I have covered ORC reading/writing without compression.