PySpark - Unable to load class while reading Sequence File


#1

Hi all, I am getting an "Unable to load class" error while reading a sequence file. Please help me resolve the issue.

Steps which I followed:

  • Imported the table departments as a sequence file through Sqoop. When I cat the sequence file's location, my understanding is that department_id is stored as the key (IntWritable) and department_name is stored as the value (LongWritable). Please let me know if my understanding is wrong.
  • Then I tried to read the sequence file.

SequenceFile location in the VM:
[cloudera@quickstart ~]$ hadoop fs -cat /user/cloudera/sqoop_import/updatedepartments3/part*
SEQ!org.apache.hadoop.io.LongWritable
departmentsCP݀�’%��FitnessFootwearApparelGolfOutdoorFan Shop

Science Mathematics
Engineering eTamil1
English1
gMaths1
h
Science5{Tamil�
Quality CheckSEQ!org.apache.hadoop.io.LongWritable
departmentsW�d�9/�$�_\��o�EQ!org.apache.hadoop.io.LongWritable
departments�����W����\�ކ�SEQ!org.apache.hadoop.io.LongWritable
departmentsP�,*^C

Error while reading the sequence file:

dataseqRDD = sc.sequenceFile("/user/cloudera/sqoop_import/updatedepartments3", "org.apache.hadoop.io.IntWritable", "org.apache.hadoop.io.LongWritable")
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 191.1 KB, free 622.8 KB)
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 22.1 KB, free 645.0 KB)
17/02/28 16:33:54 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:56179 (size: 22.1 KB, free: 530.2 MB)
17/02/28 16:33:54 INFO spark.SparkContext: Created broadcast 9 from sequenceFile at PythonRDD.scala:474
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 191.1 KB, free 836.1 KB)
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 22.1 KB, free 858.2 KB)
17/02/28 16:33:54 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:56179 (size: 22.1 KB, free: 530.2 MB)
17/02/28 16:33:54 INFO spark.SparkContext: Created broadcast 10 from broadcast at PythonRDD.scala:475
17/02/28 16:33:54 INFO mapred.FileInputFormat: Total input paths to process : 4
17/02/28 16:33:54 INFO spark.SparkContext: Starting job: take at SerDeUtil.scala:201
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Got job 3 (take at SerDeUtil.scala:201) with 1 output partitions
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (take at SerDeUtil.scala:201)
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Missing parents: List()
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[7] at map at PythonHadoopUtil.scala:181), which has no missing parents
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_11 stored as values in memory (estimated size 3.3 KB, free 861.5 KB)
17/02/28 16:33:54 INFO storage.MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 1999.0 B, free 863.5 KB)
17/02/28 16:33:54 INFO storage.BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:56179 (size: 1999.0 B, free: 530.2 MB)
17/02/28 16:33:54 INFO spark.SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:1006
17/02/28 16:33:54 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[7] at map at PythonHadoopUtil.scala:181)
17/02/28 16:33:54 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/02/28 16:33:54 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,ANY, 2192 bytes)
17/02/28 16:33:54 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/02/28 16:33:54 INFO rdd.HadoopRDD: Input split: hdfs://quickstart.cloudera:8020/user/cloudera/sqoop_import/updatedepartments3/part-m-00000:0+627
17/02/28 16:33:54 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.RuntimeException: java.io.IOException: WritableName can’t load class: departments
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2107)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2037)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1878)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1827)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1841)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: WritableName can’t load class: departments
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:77)
at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:2105)
… 20 more
Caused by: java.lang.ClassNotFoundException: Class departments not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
at org.apache.hadoop.io.WritableName.getClass(WritableName.java:75)
… 21 more


#2

@email2dgk, @pratyush04, could you please help me resolve the issue below?
I am getting an "Unable to load class" error while reading the sequence file.

Steps which I followed:

  • Imported the table departments as a sequence file through Sqoop. When I cat the sequence file's location, I understand that department_name is stored as the value (LongWritable). How do I determine the data type of the key? Please let me know if my understanding is wrong.
  • Then I tried to read the sequence file with the command:
    dataseqRDD = sc.sequenceFile("/user/cloudera/sqoop_import/updatedepartments3", "org.apache.hadoop.io.IntWritable", "org.apache.hadoop.io.LongWritable")

The SequenceFile contents (hadoop fs -cat) and the full error log are the same as in post #1 above.


#3

First, you need to start the pyspark shell with the jar file generated during the Sqoop import on its classpath, and then use sc.sequenceFile("path-to-seq-file", "org.apache.hadoop.io.LongWritable", "departments", valueConverter="valueConverterClass").
If you are using pyspark, you need to write a valueConverter class, as described at http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.
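For reference, a minimal pyspark sketch of what that call might look like, assuming the Sqoop-generated jar is named departments.jar and the converter is a class com.example.DepartmentsToStringConverter (both names are placeholders, not from this thread):

    # Start the shell with the Sqoop-generated classes on the classpath, e.g.:
    #   pyspark --jars departments.jar --driver-class-path departments.jar
    rdd = sc.sequenceFile(
        "/user/cloudera/sqoop_import/updatedepartments3",
        keyClass="org.apache.hadoop.io.LongWritable",   # key class recorded in the file header
        valueClass="departments",                       # Sqoop-generated record class
        valueConverter="com.example.DepartmentsToStringConverter",  # placeholder converter class
    )
    print(rdd.take(5))
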
But if you are using Scala, it is much easier. Just run spark-shell with the --jars option to load the jar file generated during the Sqoop import, and then use:
val data = sc.sequenceFile[LongWritable, departments]("path-to-seq-file")
data.map(tup => (tup._1.get(), tup._2.toString())).collect


#4

Thank you so much for your answer.


#6

Hi Pratyush,

I am reading your post and I have some questions. I am using pyspark, and the error message I get is java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>(). In my sequence file, the key is org.apache.hadoop.io.IntWritable and the value is org.apache.hadoop.io.ArrayWritable. Could you give me some details on how to load a sequence file with pyspark?

Thank you very much!

Rice Song


#7

Hi Femy,

Did you solve your problem? I am having a similar one. In my sequence file, the key is "org.apache.hadoop.io.IntWritable" and the value is "org.apache.hadoop.io.ArrayWritable". When I use data = sc.sequenceFile(inFile, "org.apache.hadoop.io.IntWritable", "org.apache.hadoop.io.ArrayWritable") in pyspark, my error message is:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, wsc-b001.rcac.purdue.edu, executor 2): java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>()

Could you tell me how you fixed the problem? I am new to pyspark. Thank you very much for your time.

Rice Song


#9

You can try this too

var data = sc.sequenceFile(inFile, classOf[org.apache.hadoop.io.IntWritable], classOf[org.apache.hadoop.io.ArrayWritable])…

Also, please check the first line of inFile to make sure the file is in the correct format.
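
If it helps, here is a minimal sketch of one way to peek at that header from Python after copying one part file to the local filesystem with hadoop fs -get (the local path below is a placeholder). A SequenceFile begins with the bytes 'SEQ', a version byte, and then the key and value class names, so the class names the reader will try to load are visible right at the top:

    # Peek at the SequenceFile header to see the recorded key/value class names.
    with open("part-m-00000", "rb") as f:   # placeholder path, copied via hadoop fs -get
        header = f.read(256)
    print(header[:4])   # expect b'SEQ' followed by a version byte
    print(header)       # the key and value class names appear near the start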


#10

Did you solve it through Scala? I am running a comparison for which I have to use pyspark. Do you have any suggestions for handling this problem with pyspark? Thank you very much!


#12

Yes, I solved the problem via pyspark. The above post was useful for me.