PySpark/Scala - storing totalRevenue in HDFS

#1

I am trying to store the final result in HDFS but am getting the error message below.

# Get total revenue from order_items
1)
orderItemsRDD = sc.textFile("/user/gnanaprakasam/sqoop_import/order_items")
orderItemsMap = orderItemsRDD.map(lambda x: float(x.split(",")[4]))
orderItemsRevenue = orderItemsMap.reduce(lambda rev1, rev2 : rev1 + rev2)
orderItemsRevenue.saveAsTextFile("/user/gnanaprakasam/pyspark/totalRevenue")

'float' object has no attribute 'saveAsTextFile'

Result => 34322619.930019915

2) I hit the same issue while trying to store using Scala, and the result differs too. Any input on this?

val orderItemsRDD = sc.textFile("/user/gnanaprakasam/sqoop_import/order_items")
val orderItemsMap = orderItemsRDD.map(x => (x.split(",")(4).toDouble))
val orderItemsRevenue = orderItemsMap.reduce((acc, value) => acc + value)
orderItemsRevenue.saveAsTextFile("/user/gnanaprakasam/sparkscala/totalRevenue")

value saveAsTextFile is not a member of Double

Result => 3.4322619930019915E7

#2

Hi, saveAsTextFile is an API on RDD. The reduce here returns a plain double, not an RDD, so you cannot save it as a text file, but you can print it to the console.
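
To make the distinction concrete, here is a minimal Scala sketch (reusing the path and variable name from the question): reduce collapses the RDD to a single Double on the driver, which can be printed but has no RDD methods.

// reduce collapses the RDD down to one Double on the driver
val orderItemsRevenue = sc.textFile("/user/gnanaprakasam/sqoop_import/order_items")
  .map(_.split(",")(4).toDouble)
  .reduce(_ + _)

println(orderItemsRevenue)                 // fine: a Double can be printed
// orderItemsRevenue.saveAsTextFile(...)   // does not compile: Double has no saveAsTextFile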

#3

You have to use Scala APIs to change the formatting of the result - try applying the round function. The two results are actually the same value; Scala simply prints the Double in scientific notation.
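
For example, a small sketch of the formatting options, using only standard Scala and the orderItemsRevenue value from the question:

// math.round returns a Long, dropping the fractional part
val rounded = math.round(orderItemsRevenue)    // 34322620

// the f interpolator avoids scientific notation and fixes the decimals
val formatted = f"$orderItemsRevenue%.2f"      // "34322619.93"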

Here is the response showing how to save a single value as a file (from this link):

// So either do this
scala> val sumRDD = sc.parallelize(mysum.toString)
sumRDD: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[4] at parallelize at <console>:16

// or do this
scala> val sumRDD = sc.parallelize(Array(mysum))
sumRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:16
The next step is to store it at a particular path: the file:// scheme stores it on the local machine, while hdfs:// stores it on HDFS.

scala> sumRDD.saveAsTextFile("file:///home/hdfs/a12")
15/08/19 16:14:57 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
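
Putting it together for this thread, a minimal sketch assuming the orderItemsRevenue Double from the original Scala code and the same output path. Wrapping the value in a one-element collection (the second option above) keeps it as a single record, whereas parallelizing the String itself yields an RDD[Char] with one character per element.

// one-element RDD[String] holding the formatted total
val revenueRDD = sc.parallelize(Seq(f"$orderItemsRevenue%.2f"))
revenueRDD.saveAsTextFile("/user/gnanaprakasam/sparkscala/totalRevenue")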

#4

Thanks @pramodvspk @itversity
