Reading and Writing Sequence Files


#1

I’m facing difficulties reading sequence files from HDFS. Can anyone provide clear instructions on how to read sequence files using the Spark context?


#2

Demo is done on our state-of-the-art Big Data cluster - https://labs.itversity.com
This code works with almost all versions of Spark, especially 1.6.x and later.



Here is a demo of writing and reading data in sequence file format, using Scala as the programming language.

The Sequence File format has become popular with HDFS. As Spark uses HDFS APIs to interact with files, we can save data in sequence file format as well as read it, as long as we have some information about the metadata.

Here are a few things to keep in mind while dealing with Sequence Files:

  • Define the key and the value.
  • As the key and value have to be Hadoop Writable classes, we need to import classes from the org.apache.hadoop.io package (it has classes such as Text, IntWritable, etc.).
  • To read the data, we need to know the metadata of the key and the value. We can get it by going through the first line (header) of the file.
  • Typically we apply map to convert the data into key-value pairs before writing it as a sequence file, or after reading the data back.

Writing the data in sequence file format

// Importing org.apache.hadoop.io package
import org.apache.hadoop.io._

// We need data in sequence file format before we can read it, so let us see how to write it first
// Reading data from text file format
val dataRDD = sc.textFile("/public/retail_db/orders")

// Using null as the key; the value will be of type Text when saved in sequence file format
// For Int and String we do not need to convert the types into IntWritable and Text,
// but for other types we need to wrap them in Writable objects ourselves
// For example, if the key/value is of type Long, we might have to
// wrap it by saying new LongWritable(value)
// Note: saveAsSequenceFile fails if the target directory already exists,
// so each example below writes to its own directory
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/orders_seq_null")
// Make sure to replace `whoami` with the appropriate OS user id

// Saving in a sequence file with key of type Int and value of type String
dataRDD.
  map(x => (x.split(",")(0).toInt, x.split(",")(1))).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id
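
As noted in the comments above, types without built-in conversions can be wrapped explicitly in their Writable counterparts. Here is a minimal sketch, assuming the first field of each record parses as a Long (the orders_seq_long output path is just an illustrative placeholder):

// Explicitly wrapping a Long key and a String value in Writables
dataRDD.
  map(x => (new LongWritable(x.split(",")(0).toLong), new Text(x))).
  saveAsSequenceFile("/user/`whoami`/orders_seq_long")
// Make sure to replace `whoami` with the appropriate OS user id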

Once you write the data, you can look at the contents of the sequence file, especially the first line, to get the key type and the value type. Even if you want to read data that was not written by you, you can get this metadata by looking at the header.

e.g.: SEQ org.apache.hadoop.io.IntWritableorg.apache.hadoop.io.Text�����g��2?W
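
If eyeballing the binary header is inconvenient, the key and value classes can also be obtained programmatically using Hadoop's SequenceFile.Reader API. A minimal sketch, assuming the part file path below (a placeholder) points to one of the files written above:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Open one part file and print the key/value class names from its header
val reader = new SequenceFile.Reader(sc.hadoopConfiguration,
  SequenceFile.Reader.file(new Path("/user/`whoami`/orders_seq/part-00000")))
println(reader.getKeyClassName)   // e.g. org.apache.hadoop.io.IntWritable
println(reader.getValueClassName) // e.g. org.apache.hadoop.io.Text
reader.close()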

Reading sequence file

We need to look at the metadata to come up with the appropriate classes to pass to sc.sequenceFile for the key and the value.

// Considering data where key is of type Int and value is of type String
import org.apache.hadoop.io._
sc.sequenceFile("/user/`whoami`/orders_seq", classOf[IntWritable], classOf[Text]).
  map(rec => rec.toString()).
  take(100).
  foreach(println)
  // Make sure to replace `whoami` with the appropriate OS user id
  // Data will be read in the form of tuples
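
Since the records come back as (IntWritable, Text) tuples, it is often handy to convert them to native Scala types right away. A small sketch, following on from the read above:

sc.sequenceFile("/user/`whoami`/orders_seq", classOf[IntWritable], classOf[Text]).
  map(rec => (rec._1.get(), rec._2.toString)).
  take(10).
  foreach(println)
// Converting inside map also avoids surprises from Hadoop reusing Writable objects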

#3

Thanks Durga sir, this is one of the finest explanations I have seen so far on the internet.
However, I am still stuck.
I have made a sqoop import into HDFS using the below command:

sqoop import-all-tables --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user -P \
  --warehouse-dir /user/sameerrao20118/sqoop_importAllTables \
  -z --as-sequencefile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --autoreset-to-one-mapper \
  --null-non-string -1

My data looks like this.
hdfs dfs -cat /user/sameerrao20118/sqoop_importAllTables/products/part-m-00002 | head -2
SEQ!org.apache.hadoop.io.LongWritabproducts)org.apache.hadoop.io.compress.SnappyCod���б�fT���.���������б�fT���.��P�����:��
���6

It looks like the data was not converted into key-value pairs. In this kind of scenario, is it possible to read the data?
Also, how can I import hadoop.io in PySpark, as it throws an error:

import org.apache.hadoop.io._
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named org.apache.hadoop.io._

Could you please assist?