How do I read a sequence file in Spark?


#1

Hello,

I have some data in sequence format here:

[paslechoix@gw03 ~]$ hdfs dfs -ls orders03131
Found 5 items
-rw-r--r-- 3 paslechoix hdfs 0 2018-03-13 12:04 orders03131/_SUCCESS
-rw-r--r-- 3 paslechoix hdfs 741614 2018-03-13 12:04 orders03131/part-m-00000
-rw-r--r-- 3 paslechoix hdfs 753022 2018-03-13 12:04 orders03131/part-m-00001
-rw-r--r-- 3 paslechoix hdfs 752368 2018-03-13 12:04 orders03131/part-m-00002
-rw-r--r-- 3 paslechoix hdfs 752940 2018-03-13 12:04 orders03131/part-m-00003

The data looks like this:
[paslechoix@gw03 ~]$ hdfs dfs -cat orders03132_seq/part-m-00000 | head
SEQ!org.apache.hadoop.io.LongWritableordeG�Y���&���]E�@��-OCLOSED@��PENDING_PAYMENT@��/COMPLETE@��"{CLOSED@��,COMPLETE@�COMPLETE@��COMPLET@��

I wonder what the right way is to read them into an RDD. I tried the following and got an error:
sc.sequenceFile("orders03132_seq", classOf[Int], classOf[String]).first

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 17 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 18 more

This is my first time reading sequence data in Spark and I believe my command is wrong. Any help would be appreciated, thank you very much.


#2

Hello,
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable
scala> val result = sc.sequenceFile("/tmp/seq-output/part-00001", classOf[Text], classOf[IntWritable]).map{ case (x, y) => (x.toString, y.get()) }
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:29

scala> result.collect
res14: Array[(String, Int)] = Array((Kay2,2), (Key3,2))

      http://dmtolpeko.com/2015/02/23/spark-reading-sequence-files-generated-by-hive/
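The key and value classes have to match the ones recorded in the SequenceFile header. For your directory, a minimal sketch along the same lines (LongWritable/Text below are only a guess for illustration, not read from your file):

import org.apache.hadoop.io.{LongWritable, Text}

// Key and value classes must match the classes written into the SequenceFile header;
// LongWritable and Text here are an assumption until you check the header of one part file.
val orders = sc.sequenceFile("orders03132_seq", classOf[LongWritable], classOf[Text]).
  map { case (k, v) => (k.get(), v.toString) }

orders.take(5).foreach(println)

If the value class recorded in the header is a custom Writable rather than a standard Hadoop one, that class also needs to be on the classpath.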

#3

Thank you very much for your reply; it throws the following error in my case:

18/03/14 06:25:21 INFO DAGScheduler: Job 0 failed: collect at :32, took 0.620121 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders

Note I am using my sequence file here:

scala> val result = sc.sequenceFile("orders0312seq/part-m-00000", classOf[Text], classOf[IntWritable]).map{ case (x, y) => (x.toString, y.get()) }
18/03/14 06:25:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 336.9 KB, free 336.9 KB)
18/03/14 06:25:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 28.4 KB, free 365.4 KB)
18/03/14 06:25:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42548 (size: 28.4 KB, free: 511.1 MB)
18/03/14 06:25:04 INFO SparkContext: Created broadcast 0 from sequenceFile at :29
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at :29

The sequence files were generated this way:
sqoop import \
  --connect=jdbc:mysql://ms.itversity.com/retail_db \
  --username=retail_user \
  --password=itversity \
  --table=orders \
  --target-dir="orders0312seq" \
  --as-sequencefile
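
As a sanity check (just a standard shell one-liner, nothing Sqoop-specific), the first bytes of a part file list the key and value class names that were written into the header:

hdfs dfs -cat orders0312seq/part-m-00000 | head -c 200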


#4

I also tried the following:
scala> import org.apache.hadoop.io.LongWritable
scala> val result = sc.sequenceFile("orders0312seq/part-m-00000", classOf[LongWritable]).map{ case (x, y) => (x.toString, y.get()) }

Error:
java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders


#5

This is my latest try:

scala> val result = sc.sequenceFile("orders0312seq/part-m-00000", classOf[LongWritable], classOf[Text]).map{ case (x, y) => (x.toString, y.toString) }
18/03/14 06:55:35 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 337.0 KB, free 1433.1 KB)
18/03/14 06:55:35 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 28.4 KB, free 1461.5 KB)
18/03/14 06:55:35 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:42548 (size: 28.4 KB, free: 511.0 MB)
18/03/14 06:55:35 INFO SparkContext: Created broadcast 7 from sequenceFile at :30
result: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[7] at map at :30

scala> result.collect
18/03/14 06:55:39 INFO FileInputFormat: Total input paths to process : 1
18/03/14 06:55:39 INFO SparkContext: Starting job: collect at :33
18/03/14 06:55:39 INFO DAGScheduler: Got job 4 (collect at :33) with 2 output partitions
18/03/14 06:55:39 INFO DAGScheduler: Final stage: ResultStage 4 (collect at :33)
18/03/14 06:55:39 INFO DAGScheduler: Parents of final stage: List()
18/03/14 06:55:39 INFO DAGScheduler: Missing parents: List()
18/03/14 06:55:39 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[7] at map at :30), which has no missing parents
18/03/14 06:55:39 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.1 KB, free 1464.7 KB)
18/03/14 06:55:39 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1845.0 B, free 1466.5 KB)
18/03/14 06:55:39 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:42548 (size: 1845.0 B, free: 511.0 MB)
18/03/14 06:55:39 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
18/03/14 06:55:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (MapPartitionsRDD[7] at map at :30)
18/03/14 06:55:39 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
18/03/14 06:55:39 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, localhost, partition 0,ANY, 2175 bytes)
18/03/14 06:55:39 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 8, localhost, partition 1,ANY, 2175 bytes)
18/03/14 06:55:39 INFO Executor: Running task 0.0 in stage 4.0 (TID 7)
18/03/14 06:55:39 INFO Executor: Running task 1.0 in stage 4.0 (TID 8)
18/03/14 06:55:39 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/paslechoix/orders0312seq/part-m-00000:440079+440080
18/03/14 06:55:39 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/paslechoix/orders0312seq/part-m-00000:0+440079
18/03/14 06:55:39 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 7)
java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more
18/03/14 06:55:39 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 7, localhost): java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more

18/03/14 06:55:39 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; aborting job
18/03/14 06:55:39 INFO TaskSchedulerImpl: Cancelling stage 4
18/03/14 06:55:39 INFO Executor: Executor is trying to kill task 1.0 in stage 4.0 (TID 8)
18/03/14 06:55:39 INFO TaskSchedulerImpl: Stage 4 was cancelled
18/03/14 06:55:39 INFO DAGScheduler: ResultStage 4 (collect at :33) failed in 0.027 s
18/03/14 06:55:39 INFO DAGScheduler: Job 4 failed: collect at :33, took 0.032049 s
18/03/14 06:55:39 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 8)
java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more
18/03/14 06:55:39 INFO TaskSetManager: Lost task 1.0 in stage 4.0 (TID 8) on executor localhost: java.lang.RuntimeException (java.io.IOException: WritableName can’t load class: orders) [duplicate 1]
18/03/14 06:55:39 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 7, localhost): java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1953)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.collect(RDD.scala:933)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:42)
at $iwC$$iwC$$iwC$$iwC.(:44)
at $iwC$$iwC$$iwC.(:46)
at $iwC$$iwC.(:48)
at $iwC.(:50)
at (:52)
at .(:56)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2103)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2033)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1883)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1832)
at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1846)
at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: orders
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2101)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class orders not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2114)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more

scala>

Note the sequence file is below:

[paslechoix@gw03 ~]$ hdfs dfs -cat orders03132_seq/part-m-00000 | head
SEQ!org.apache.hadoop.io.LongWritableordeG�Y���&���]E�@��-OCLOSED@��PENDING_PAYMENT@��/COMPLETE@��"{CLOSED@��,COMPLETE@�COMPLETE@��COMPLET@��

As the error indicates, it seems the process cannot load a class called orders that is referenced in the data.

The head of the sequence file shows only org.apache.hadoop.io.LongWritable; does that mean there is no <K, V> pair in the data? So I modified my command as below:

scala> val result = sc.sequenceFile("orders0312seq/part-m-00000", classOf[LongWritable])

It is not accepted, though:

<console>:30: error: type mismatch;
 found   : Class[org.apache.hadoop.io.LongWritable]
 required: Int
       val result = sc.sequenceFile("orders0312seq/part-m-00000", classOf[LongWritable])
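
If I read the API right, the two-argument form is sequenceFile(path, minPartitions: Int), which would explain the "required: Int" above; the key and value classes have to be passed together, as in my earlier attempts. And judging from the file header and the ClassNotFoundException, the value class recorded in the file looks like the class Sqoop generated for the orders table, so I suspect Spark would also need that generated class on its classpath to deserialize the values. A rough sketch of what I think that would look like (the jar path is hypothetical, taken from wherever Sqoop's codegen output ended up):

# hypothetical: start the shell with the Sqoop-generated jar for the orders table on the classpath
spark-shell --jars /path/to/orders.jar

scala> import org.apache.hadoop.io.LongWritable
scala> // 'orders' is the Sqoop-generated record class; LongWritable matches the key class in the header
scala> val result = sc.sequenceFile("orders0312seq", classOf[LongWritable], classOf[orders]).
     |   map { case (k, v) => (k.get(), v.toString) }
scala> result.take(5).foreach(println)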

Thank you, any clue is greatly appreciated.