Task not serializable while using the REPL


#1

Another basic thing going wrong.

I am trying to read a file in the Spark context and skip the header of the file by doing this:

scala> val read = sc.textFile("file path")
scala> val header = read.first
scala> val restfile = read.map(rec => row != header)

With this I get the error "org.apache.spark.SparkException: Task not serializable".

How does serialization work in this scenario? I want to understand the basics.

Note: I know there are other methods to skip the header of the file. However, I would like to understand the concept of serialization in this context. Please share your views.
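
For background: Spark serializes the function passed to map/filter, together with everything that function references, and ships it to the executors, so everything the function captures must be Serializable. A minimal sketch of that requirement, outside of Spark, using plain Java serialization and placeholder class names:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class NotShippable                     // stands in for something like SparkContext
case class Shippable(header: String)   // case classes are Serializable by default

def trySerialize(label: String, obj: AnyRef): Unit =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    println(s"$label: serialized fine")
  } catch {
    case _: NotSerializableException => println(s"$label: not serializable")
  }

trySerialize("Shippable", Shippable("some header line"))   // ok: could be shipped to an executor
trySerialize("NotShippable", new NotShippable)             // fails, just like SparkContext would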



#2

Correction: there is a typo in the code. It should be:

scala> val read = sc.textFile("file path")
scala> val header = read.first
scala> val restfile = read.map(rec => rec != header)
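
Note that even with the typo fixed, map keeps every line and only turns it into a Boolean; to actually drop the header line you would use filter, as the code further down the thread does:

scala> val restfile = read.filter(rec => rec != header)   // keeps every line except the header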


#3

Where are you running this? Can you paste the complete log you got?


#4

Here is the code I am running:
val read = sc.textFile("hdfs:///user/edureka/data/ls2014.tsv")
val header = read.first
val data = read.filter(row => (row != header))

The error that I get:

scala> val data = read.filter(row => (row != header))
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
… 46 elided
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@1eab8437)
- field (class: $iw, name: sc, type: class org.apache.spark.SparkContext)
- object (class $iw, $iw@4402ed61)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@ad6255e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@77511e9c)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@35d145fb)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@2a6871ad)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@42d96745)
- field (class: $line15.$read, name: $iw, type: class $iw)
- object (class $line15.$read, $line15.$read@4cddc3d9)
- field (class: $iw, name: $line15$read, type: class $line15.$read)
- object (class $iw, $iw@363e2009)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@15af06f)
- field (class: $anonfun$1, name: $outer, type: class $iw)
- object (class $anonfun$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
… 55 more
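
What the serialization stack above is saying: the closure row => (row != header) refers to header, which the REPL keeps as a field of a generated wrapper object ($iw). That wrapper chain also holds sc (see the line "field (class: $iw, name: sc, type: class org.apache.spark.SparkContext)"), so when Spark serializes the closure to ship it to the executors it pulls in the whole chain and fails on the non-serializable SparkContext.

One common workaround is to drop the header without referencing any REPL-defined value inside the closure at all. A sketch, assuming the header is simply the first line of the first partition (which holds when textFile reads a single file):

val read = sc.textFile("hdfs:///user/edureka/data/ls2014.tsv")
val data = read.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // drop only the first line of the first partition
}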


#5

I am running the code in the REPL.


#6

You are using the environment provided by Edureka. We are not sure how their environment is set up, so it is tough for us to troubleshoot these issues.

I would highly recommend using the Cloudera QuickStart VM or the Hortonworks Sandbox, which are thoroughly tested.

You might have to check with Edureka to troubleshoot these issues.