Products filter is not giving me a result

scala

#1

Please find the code below:

sqoop import \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--table products \
--as-textfile \
--target-dir /user/shubhaprasadsamal/problem2/products \
--fields-terminated-by '|';

val product=sc.textFile("/user/shubhaprasadsamal/problem2/products") // This folder does not contain the problematic line 685. I was able to print records from this product RDD.

product.filter(r=>(r.split("|")(4).toFloat < 100) ).take(4).foreach(println) // This throws an error.

18/09/20 13:00:00 INFO SparkContext: Starting job: take at :32
18/09/20 13:00:00 INFO DAGScheduler: Got job 1 (take at :32) with 1 output partitions
18/09/20 13:00:00 INFO DAGScheduler: Final stage: ResultStage 1 (take at :32)
18/09/20 13:00:00 INFO DAGScheduler: Parents of final stage: List()
18/09/20 13:00:00 INFO DAGScheduler: Missing parents: List()
18/09/20 13:00:00 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[2] at filter at :29), which has no missing parents
18/09/20 13:00:00 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 510.7 MB)
18/09/20 13:00:00 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2034.0 B, free 510.7 MB)
18/09/20 13:00:00 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:42936 (size: 2034.0 B, free: 511.1 MB)
18/09/20 13:00:00 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
18/09/20 13:00:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[2] at filter at :29)
18/09/20 13:00:00 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/09/20 13:00:00 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 4, localhost, partition 0,ANY, 2186 bytes)
18/09/20 13:00:00 INFO Executor: Running task 0.0 in stage 1.0 (TID 4)
18/09/20 13:00:00 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/user/shubhaprasadsamal/problem2/products/part-m-00000:0+41419
18/09/20 13:00:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 4)
java.lang.NumberFormatException: For input string: "Q"

Please help. Thanks


#2

Try this.
product.filter(r=>(r.split('|')(4).toFloat < 100) ).take(4).foreach(println)

Note: I'm using single quotes in the split function, so the delimiter is passed as a Char and matched literally instead of being treated as a regex.


#3

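The error is due to split("|"). A String argument to split is treated as a regular expression, and an unescaped | is the regex alternation operator. It matches the empty string, so the line is split between every character and index 4 is a single character ("Q" in your case) instead of the price.
You need to use split('|') or split("\\|") or split("""\|""").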

product.filter(r=>(r.split('|')(4).toFloat < 100) ).take(4).foreach(println)
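
If you want to see the difference for yourself, here is a small sketch you can paste into spark-shell or a plain Scala REPL. The sample line is made up for illustration (it only mimics the pipe-delimited layout with the price in the fifth field), and the value names are hypothetical.

```scala
// Hypothetical sample record; the values are invented, only the layout matters.
val line = "1|2|Quest Tent||59.98|http://example.com/img"

// String argument: interpreted as a regex. A bare "|" is an alternation of
// two empty patterns, so it matches between every character.
val byRegex = line.split("|")
println(byRegex(4))   // prints one character of the line, not the price

// Char argument: matched literally as the delimiter.
val byChar = line.split('|')
println(byChar(4))    // prints the fifth field, the price

// Escaped regex forms that behave like the Char version.
val byEscaped = line.split("\\|")
val byRaw = line.split("""\|""")
```

With the Char or escaped forms, field 4 is the price string, so toFloat succeeds on it.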