Unable to read Sequence File after doing sqoop

pyspark
rdd-api
sqoop

#1

Hi Team,

I imported data with Sqoop, saving it in SequenceFile format. When I try to read it with sc.sequenceFile("path") in PySpark, I get a Writable-related error.
Can anyone help me get rid of this error?
Thank you.


#2

Can you paste a sample of the file format?


#3

Sure, Vinod.

image

You can find the attached image. The path provided points to the Sqooped data in SequenceFile format. When I run it, I end up with errors.


#4

@Tarun_Teja,

Let me explain the issue: the SequenceFile format is not standardized, and it is not a preferred file format in real-world use. The reason is simple: SerDes. The serializer/deserializer implementations are not consistent across all the tools that produce SequenceFiles (Sqoop, Flume, Hive, and so on), even within a single distribution (Cloudera/Hortonworks).

This means that when you import data with Sqoop as a SequenceFile and then try to read it in Hive, Hive cannot read it, because the two SerDe implementations differ. That is why standardized file formats such as ORC, Avro, and Parquet were established.

If you still want to read that Sqooped SequenceFile data in Spark, identify the SerDe (record) class from the JAR that Sqoop generated when it executed the job, put that JAR on Spark's classpath, and use the same classes through the "new Hadoop API" readers on RDDs.


#5

Thank you for the detailed explanation.