Problem 4 - Explore different file formats


#1

Originally published at: http://www.itversity.com/lessons/problem-4-explore-different-file-formats/

In this problem, we will focus on conversion between different file formats using Spark or Hive. This is a very important exam topic. I recommend that you master the file format conversion techniques and understand their limitations. You should have an alternate method of accomplishing a solution to the problem in case your primary method…
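As a rough illustration of the kind of conversion this problem covers, here is a minimal sketch for the Spark shell (started with --packages com.databricks:spark-avro_2.10:2.0.1, as in the posts below) that reads an Avro directory and writes it back out as Snappy-compressed Parquet. The paths are placeholders, not part of the original lesson.

import com.databricks.spark.avro._

// read the Avro input (placeholder path)
val df = sqlContext.read.avro("/user/example/fileformats/ord_avro")

// request Snappy compression for the Parquet output, then write it (placeholder path)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/user/example/fileformats/ord_parquet_snappy")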


#2

Hi,

In step 4, I am trying to convert the Avro data to a sequence file.

val datafiles= sqlContext.read.avro("/user/rkathiravan/fileformats/ord_avro")
scala> datafiles.map(r=> (r(0).toInt,r)).saveAsSequenceFile("/user/rkathiravan/fileformats/ex/ord_avro_seq_nocomp")
<console>:31: error: value toInt is not a member of Any
datafiles.map(r=> (r(0).toInt,r)).saveAsSequenceFile("/user/rkathiravan/fileformats/ex/ord_avro_seq_nocomp")
When I cast to Int, it throws an error, but if I cast to String, it works:
datafiles.map(r=> (r(0).toString,r(0)+"\t"+r(1)+"\t"+r(2)+"\t"+r(3))).saveAsSequenceFile("/user/rkathiravan/fileformats/ex/ord_avro_seq_nocomp")

Why does the cast to Int not work?
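Not an official answer, but the error is usually because r(0) (Row.apply) is typed as Any, and Any has no toInt method, while toString is defined on every object, which is why the String version compiles. A minimal sketch of two workarounds, assuming the first column really holds an Int (use getLong if it is a Long) and using placeholder output paths:

// convert via String, so the key becomes a proper Int
datafiles.map(r => (r(0).toString.toInt, r.mkString("\t"))).saveAsSequenceFile("/user/example/fileformats/ex/ord_avro_seq_int")

// or use the typed accessor on Row
datafiles.map(r => (r.getInt(0), r.mkString("\t"))).saveAsSequenceFile("/user/example/fileformats/ex/ord_avro_seq_getint")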


#3

Hi,

Step 5: converting Parquet to Avro with Snappy compression.

I did the following:
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 --master yarn --conf spark.ui.port=12345
import com.databricks.spark.avro._
val parquetSnappy= sqlContext.read.parquet("/user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy")
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")
parquetSnappy.write.avro("/user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1")

It runs successfully, but when I list the output location:
[rkathiravan@gw01 ~]$ hadoop fs -ls /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1
Found 5 items
-rw-r--r-- 3 rkathiravan hdfs 0 2017-09-21 15:27 /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1/_SUCCESS
-rw-r--r-- 3 rkathiravan hdfs 195253 2017-09-21 15:27 /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1/part-r-00000-17e6c260-7622-41ba-b715-a3d3df2846e0.avro
-rw-r--r-- 3 rkathiravan hdfs 169247 2017-09-21 15:27 /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1/part-r-00001-17e6c260-7622-41ba-b715-a3d3df2846e0.avro
-rw-r--r-- 3 rkathiravan hdfs 193189 2017-09-21 15:27 /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1/part-r-00002-17e6c260-7622-41ba-b715-a3d3df2846e0.avro
-rw-r--r-- 3 rkathiravan hdfs 172071 2017-09-21 15:27 /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1/part-r-00003-17e6c260-7622-41ba-b715-a3d3df2846e0.avro

I don't see any indication of whether it is Snappy-compressed.
Can anyone help me out?
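Not from the thread, just a sanity check that might help: spark-avro picks up the codec from the SQLContext configuration at write time, so it is worth confirming the property is actually set in the same session before the write. A minimal sketch (output path is a placeholder):

// should print "snappy" if the setConf above took effect in this session
println(sqlContext.getConf("spark.sql.avro.compression.codec"))

// then re-run the write in the same session
parquetSnappy.write.avro("/user/example/fileformats/ex/ord_avro_snappy_check")

If avro-tools is available on the cluster, running its getmeta command against one of the part files should also print an avro.codec entry showing which codec was actually used.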


#4

I checked the sizes of the files as well, and it does not look like it is Snappy-compressed. Why?
[rkathiravan@gw01 ~]$ hadoop fs -du -s -h /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy
575.2 K /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy

[rkathiravan@gw01 ~]$ hadoop fs -du -s -h /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1
712.7 K /user/rkathiravan/fileformats/ex/ord_avro_parquet_snappy_to_avro_snappy1

Can anyone help?
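One note on the numbers above (not from the original thread): comparing the Avro output against the Parquet input does not by itself show whether Snappy was applied, because Parquet's columnar encoding is usually more compact than row-oriented Avro even when both use the same codec. An apples-to-apples check is to write the same DataFrame as Avro twice, once uncompressed and once with Snappy, and compare the two directories with the same hadoop fs -du -s -h command used above. A minimal sketch, assuming spark-avro accepts "uncompressed" as a codec value and using placeholder paths:

// write once without compression (placeholder path)
sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")
parquetSnappy.write.avro("/user/example/fileformats/ex/ord_avro_size_check_uncompressed")

// and once with snappy (placeholder path)
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")
parquetSnappy.write.avro("/user/example/fileformats/ex/ord_avro_size_check_snappy")

If the Snappy directory comes out noticeably smaller, the codec setting is being honoured.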