Snappy/Gzip compression on ORC files using Scala


#1

How do I generate compressed ORC files?
sqlContext.setConf("spark.sql.orc.compression.codec", "snappy") does not work. I also tried sqlContext.setConf("spark.io.compression.codec", "snappy"), but that didn't work either.

Please advise.


#2

According to the Spark repo on GitHub: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala
the supported codecs are none, uncompressed, snappy, zlib, and lzo, and snappy is the default codec.
However, I am not able to write files with a different codec. None of these worked:
sqlContext.setConf("spark.sql.hive.orc.compression.codec", "zlib")

sqlContext.setConf("spark.sql.hive.orc.compress.codec", "zlib")

sqlContext.setConf("spark.io.compression.codec", "zlib")

If you find any information please let me know.
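
A minimal sketch of a per-write alternative to setConf, assuming Spark 2.0 or later, where the linked OrcOptions class reads a "compression" option from the writer. The object name OrcCompressionSketch and the helper writeOrc are hypothetical, and on Spark versions before 2.3 the ORC data source may also need a Hive-enabled session. This has not been checked against the sqlContext-based setup described in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object OrcCompressionSketch {
  // Codec names as listed in post #2 (taken from OrcOptions.scala).
  val supportedCodecs: Set[String] =
    Set("none", "uncompressed", "snappy", "zlib", "lzo")

  // Hypothetical helper: writes df as ORC with the requested codec.
  // The codec is passed per write instead of through sqlContext.setConf.
  def writeOrc(df: DataFrame, path: String, codec: String): Unit = {
    val name = codec.toLowerCase
    // Fail early on names that are not in the list above.
    require(supportedCodecs.contains(name), s"unsupported ORC codec: $codec")
    df.write
      .mode(SaveMode.Overwrite)     // replace any existing output at path
      .option("compression", name)  // assumption: read by OrcOptions in Spark 2.0+
      .orc(path)
  }
}
```

Usage would look like OrcCompressionSketch.writeOrc(ordersDF, "/user/hive/warehouse/orders_orc_zlib", "zlib"), where ordersDF and the output path are placeholders.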


#5

It seems ORC and Snappy are not working together as expected.


#6

Any update on this? I am not able to set any codec for ORC either.
It could be an exercise on CCA exams!!!


#7

Run using this:
spark-shell --master yarn \
  --conf spark.io.compression.codec=snappy


#8

Snappy is the default codec for ORC files, but other codecs are not working.


#9

Were you able to find a fix for this issue? Other compression codecs are still not working for ORC.


#10

Workable solution (verified):
It works with a CREATE TABLE query. I always run it through sqlContext:

sqlContext.sql("CREATE TABLE orders_orc_hive STORED AS orc LOCATION '/user/hive/warehouse/tableName' TBLPROPERTIES('orc.compress'='SNAPPY') AS SELECT * FROM order_orc")

Note: You can omit the LOCATION clause. Also, order_orc is the table registered via registerTempTable.
ORC uses Zlib compression by default. For Snappy, you have to follow the approach above. Also, make sure 'SNAPPY' is in all capital letters.
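
To avoid typing the statement by hand each time, the query above can be assembled by a small helper. This is only a sketch: the object OrcCtasSketch and the function createOrcTableSql are hypothetical names, the LOCATION clause is optional as noted above, and the codec is upper-cased so that 'snappy' becomes 'SNAPPY'. It only builds the string and does not validate table names.

```scala
object OrcCtasSketch {
  // Hypothetical helper: builds the CREATE TABLE ... AS SELECT statement
  // from this post. The codec is upper-cased, as the note above requires.
  def createOrcTableSql(
      target: String,
      source: String,
      codec: String,
      location: Option[String] = None): String = {
    // LOCATION is optional; leave the clause out when no path is given.
    val locationClause = location.map(p => s" LOCATION '$p'").getOrElse("")
    s"CREATE TABLE $target STORED AS orc$locationClause " +
      s"TBLPROPERTIES('orc.compress'='${codec.toUpperCase}') " +
      s"AS SELECT * FROM $source"
  }
}
```

The result can then be passed to sqlContext, for example sqlContext.sql(OrcCtasSketch.createOrcTableSql("orders_orc_hive", "order_orc", "snappy")).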