Query: Sequence File in Scala

Hi all,

When a file has only two fields, we use the first field as the key and the second field as the value when saving it as a sequence file, as below.

val data1RDD = sc.textFile("/user/cloudera/sqoop_import/departments")

data1RDD.map(x => (x.split(",")(0), x.split(",")(1))).saveAsSequenceFile("/user/cloudera/scalaspark/departmentsSeq")

When a file has more fields, how can we split each record into a KEY (the first field) and a VALUE (the rest of the fields) when saving it as a sequence file in Scala?

Below is the scenario:

val data2RDD = sc.textFile("/user/cloudera/sqoop_import/orders")

Record: 68883,2014-07-23 00:00:00.0,5533,COMPLETE

Expected output as (key, value): (68883,(2014-07-23 00:00:00.0,5533,COMPLETE))

I tried splitting each record into the first field and the rest of the fields, then saving it as a sequence file, but got the error below.

val dat1 = data2RDD.map(x => (x.split(",")(0), (x.split(",")(1), x.split(",")(2), x.split(",")(3))))
output : (68883,(2014-07-23 00:00:00.0,5533,COMPLETE))

Saving it as a sequence file:

scala> dat1.map(x => (x._1, x._2)).saveAsSequenceFile("/user/cloudera/scalaspark/ordersSeq3")
<console>:45: error: value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[(String, (String, String, String))]
dat1.map(x => (x._1, x._2)).saveAsSequenceFile("/user/cloudera/scalaspark/ordersSeq3")

Are there any other ways to save this as a sequence file?

Thanks in advance

@itversity I have the same question. Can you please suggest a solution, something similar to PySpark's dataRDD.map(lambda x: tuple(x.split(",",1)))?
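
For what it's worth, the error in the question seems to come from the value type: saveAsSequenceFile is only available on pair RDDs whose key and value can be converted to Hadoop Writables, and a tuple like (String, String, String) has no such conversion, while a plain String does. A rough Scala counterpart of the PySpark split(",", 1) is split(",", 2), where the 2 is a limit meaning "at most two pieces". Below is a minimal sketch of that idea; the helper name toKeyValue is made up for illustration, and returning an empty value for a line without a comma is an assumption.

```scala
// Hypothetical helper: split a CSV line into (first field, rest of the line).
// The limit 2 means "at most two pieces", so only the first comma is used.
def toKeyValue(line: String): (String, String) =
  line.split(",", 2) match {
    case Array(key, rest) => (key, rest)
    case Array(key)       => (key, "") // no comma found: empty value (an assumption)
  }

val sample = "68883,2014-07-23 00:00:00.0,5533,COMPLETE"
println(toKeyValue(sample)) // prints (68883,2014-07-23 00:00:00.0,5533,COMPLETE)
```

In spark-shell this could presumably be used as data2RDD.map(toKeyValue).saveAsSequenceFile("/user/cloudera/scalaspark/ordersSeq3"), since the result is an RDD of (String, String), the same shape that already worked for departments.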

@chaitu405 @N_Chakote The following approach should help, although a better way is to create custom classes to write custom sequence files in Spark.

import org.apache.hadoop.io.{Text, IntWritable}

scala> val ordersFile = "/user/cloudera/cca175/retail_db/orders"
scala> val ordersLines = sc.textFile(ordersFile)
scala> val ordersRdd = ordersLines.map(_.split(","))
scala> val ordersKvRdd = ordersRdd.map{x=> (x(0).toInt, (x(0).toInt, x(1),x(2).toInt,x(3)))}
scala> val ordersKvSeqRdd = ordersKvRdd.map{case(x,y) => (new IntWritable(x), new Text(y.toString))}
scala> ordersKvSeqRdd.saveAsSequenceFile("/user/cloudera/cca175/retail_db/orders_sequence")
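
One thing to be aware of with the approach above: calling toString on a tuple keeps the surrounding parentheses, so they end up inside the stored Text value. If a plain comma-separated value is preferred, joining the fields after the key with mkString is one option. The sketch below only illustrates the difference on a single sample record in plain Scala; the variable names are made up for illustration.

```scala
// The sample order record from the question, already split on commas.
val fields = Array("68883", "2014-07-23 00:00:00.0", "5533", "COMPLETE")

// Tuple.toString wraps the fields in parentheses.
val asTuple = (fields(0).toInt, fields(1), fields(2).toInt, fields(3)).toString

// Joining everything after the first field gives a plain CSV value without the key.
val asCsv = fields.tail.mkString(",")

println(asTuple) // prints (68883,2014-07-23 00:00:00.0,5533,COMPLETE)
println(asCsv)   // prints 2014-07-23 00:00:00.0,5533,COMPLETE
```

Applied to the RDD of split arrays, that would mean building the pair as (new IntWritable(x(0).toInt), new Text(x.tail.mkString(","))) instead of going through the tuple, assuming every record has at least one field.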