Not able to query count for each order status

apache-spark

#1

sqoop import --connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" --username retail_dba --password itversity --table orders --warehouse-dir /user/shubhaprasadsamal/data/ --as-textfile --fields-terminated-by '|' --lines-terminated-by '\n'

hadoop fs -tail /user/shubhaprasadsamal/data/orders/part-m-00000

17198|2013-11-09 00:00:00.0|642|CLOSED
17199|2013-11-09 00:00:00.0|7246|PENDING_PAYMENT
17200|2013-11-09 00:00:00.0|4846|PENDING_PAYMENT
17201|2013-11-09 00:00:00.0|10506|PENDING_PAYMENT
17202|2013-11-09 00:00:00.0|4145|PROCESSING
17203|2013-11-09 00:00:00.0|6725|COMPLETE
17204|2013-11-09 00:00:00.0|3960|CLOSED
17205|2013-11-09 00:00:00.0|2715|CLOSED
17206|2013-11-09 00:00:00.0|2848|PROCESSING
17207|2013-11-09 00:00:00.0|8986|COMPLETE

val orders = sc.textFile("/user/shubhaprasadsamal/data/orders")

orders.take(5).foreach(println)

1|2013-07-25 00:00:00.0|11599|CLOSED
2|2013-07-25 00:00:00.0|256|PENDING_PAYMENT
3|2013-07-25 00:00:00.0|12111|COMPLETE
4|2013-07-25 00:00:00.0|8827|CLOSED
5|2013-07-25 00:00:00.0|11318|COMPLETE

val ordersMap = orders.map(x => (x.split("|")(3),1))
ordersMap: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[2] at map at <console>:29

ordersMap.take(5).foreach(println)

Actual Result:

(0,1)
(0,1)
(0,1)
(0,1)
(0,1)

Expected Result:

(CLOSED,1)
(PENDING_PAYMENT,1)
(COMPLETE,1)
(PROCESSING,1)

Please help me get the expected result.





#2

Hi,

Try using single quotes instead of double quotes when splitting:

val ordersMap = orders.map(x => (x.split('|')(3),1))

instead of

val ordersMap = orders.map(x => (x.split("|")(3),1))


#3

Pipe is a special character in regular expressions, and String.split treats its argument as a regex, so you need to escape it.

Here is the working code:

val orders = sc.textFile("/user/shubhaprasadsamal/data/orders")
val ordersMap = orders.map(x => (x.split("\\|")(3),1))
ordersMap.take(10).foreach(println)

Here is the output:

(CLOSED,1)
(PENDING_PAYMENT,1)
(COMPLETE,1)
(CLOSED,1)
(COMPLETE,1)
(COMPLETE,1)
(COMPLETE,1)
(PROCESSING,1)
(PENDING_PAYMENT,1)
(PENDING_PAYMENT,1)
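
To get the count for each order status that the question asks for, aggregate these (status, 1) pairs with reduceByKey; a minimal sketch building on the RDDs above:

// Sum the 1s per status key to get a count per order status
val statusCounts = ordersMap.reduceByKey(_ + _)
statusCounts.collect().foreach(println)

// Alternative: countByValue on the status field returns a local Map[String, Long]
orders.map(_.split("\\|")(3)).countByValue().foreach(println)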

The issue was reproduced and fixed on our state-of-the-art Big Data cluster.



#4

Resolved. Thank you so much!