How to read a snappy-compressed parquet file?

Hi All,

How do I read a snappy-compressed parquet file?

Thanks in advance


The demo below is done on our state-of-the-art Big Data cluster with Hadoop, Spark, etc. - https://labs.itversity.com


Here is the sample code to generate data in parquet format with the snappy compression codec:

val orders = sqlContext.read.json("/public/retail_db_json/orders")
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
orders.write.parquet("/user/itversity/orders_snappy")

Valid options for spark.sql.parquet.compression.codec are uncompressed, gzip, snappy, etc.; gzip is the default.
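On Spark 2.x and later, the codec can also be set per write through the DataFrame writer's compression option instead of a global conf. A minimal sketch, assuming a Spark 2.x spark-shell with a SparkSession named spark (the output path is just illustrative):

// Per-write codec on Spark 2.x+ (assumption: SparkSession named spark)
val orders = spark.read.json("/public/retail_db_json/orders")
orders.write.option("compression", "snappy").parquet("/user/itversity/orders_snappy_v2")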

Validation:

[itversity@gw02 ~]$ hadoop fs -ls orders_snappy
Found 5 items
-rw-r--r-- 3 itversity hdfs 0 2018-04-10 11:33 orders_snappy/_SUCCESS
-rw-r--r-- 3 itversity hdfs 495 2018-04-10 11:33 orders_snappy/_common_metadata
-rw-r--r-- 3 itversity hdfs 1668 2018-04-10 11:33 orders_snappy/_metadata
-rw-r--r-- 3 itversity hdfs 266423 2018-04-10 11:33 orders_snappy/part-r-00000-3dc4646d-67ec-4d3d-8369-2551b6199b39.snappy.parquet
-rw-r--r-- 3 itversity hdfs 268441 2018-04-10 11:33 orders_snappy/part-r-00001-3dc4646d-67ec-4d3d-8369-2551b6199b39.snappy.parquet

How to read data from a snappy-compressed parquet file?

sqlContext.read.parquet("/user/itversity/orders_snappy").show
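Note that no codec needs to be specified on the read side: Parquet records the compression codec in the file metadata, so the same read works whether the files were written as snappy, gzip, or uncompressed. A quick sanity check (ordersSnappy is just an illustrative name):

val ordersSnappy = sqlContext.read.parquet("/user/itversity/orders_snappy")
ordersSnappy.printSchema  // schema (and codec) come from the Parquet metadata
ordersSnappy.count        // should match the row count of the source JSON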


Thank you for your reply @dgadiraju sir!
My bad, I was not using the correct path in my case 🙂

Please be more elaborate when raising issues 🙂

@dgadiraju, @Itversity_Training

Suppose I have a snappy-compressed parquet file of size 200 MB, and Spark is configured with, say, 4 partitions (not the default; set while launching spark-shell). When I read that file, will only 1 partition be created, since snappy compression is not splittable?

So do I need to call df.repartition(4) after reading that file?
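For what it is worth, Parquet applies snappy compression to individual column chunks inside the file rather than to the file as a whole, so snappy parquet files remain splittable (the non-splittable concern applies to whole-file snappy compression, e.g. snappy-compressed text files). The easiest way to confirm on your cluster is to check the partition count after the read; a minimal sketch, reusing the path from above:

val df = sqlContext.read.parquet("/user/itversity/orders_snappy")
df.rdd.getNumPartitions   // partitions actually created by the read
// Repartition only if the read produced fewer partitions than desired
df.repartition(4).rdd.getNumPartitions  // forces 4 partitions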