Spark RDD reduceByKey - Getting AVG product price

rdd-api

#1

Hi all,

I am new to Spark and learning from the ITVersity playlist on Udemy, and also practicing using the lab.

While working with reduceByKey aggregation, I am able to perform SUM, MIN, and MAX, but how do I calculate the average from a given list?

Should I use a different aggregator, or can I achieve it with reduceByKey alone? Please guide me. Thanks!





#2

You can use aggregateByKey. If you can provide more details, like the table schema, I can give you a solution. You can also use reduceByKey, but then you need to compute the sum and the count in one step and divide sum by count in a map in the next step, as in the sketch below.
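
A minimal sketch of that (sum, count) idea, using hypothetical toy data rather than your actual schema:

val prices = sc.parallelize(Seq((1, 10.0f), (1, 20.0f), (2, 5.0f)))

// Carry (runningSum, runningCount) through reduceByKey,
// then divide in a final map to get the average per key
val avg = prices.
  mapValues(p => (p, 1)).
  reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)).
  mapValues { case (sum, count) => sum / count }

avg.collect.foreach(println)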


#3

Hi, thanks for the reply, bro. Below is the table and my code for the sum of revenue. Please help me find the average, either using aggregateByKey or reduceByKey. Thank you!

// Load order_items; each record is a comma-delimited line
val orderItems = sc.textFile("/public/retail_db/order_items")

// Build (order_id, order_item_subtotal) pairs
val orderItemsMap = orderItems.
  map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat))

orderItemsMap.take(5).foreach(println)

// Sum the subtotals per order_id
val revenuePerOrderId = orderItemsMap.
  reduceByKey((total, revenue) => total + revenue)


#4

@Fayaz
reduceByKey:

val orderItems = sc.textFile("/public/retail_db/order_items")
// Pair each order_id with (subtotal, 1) so sum and count accumulate together
val orderItemsMap = orderItems.
  map(oi => (oi.split(",")(1).toInt, (oi.split(",")(4).toFloat, 1)))
// Merge the (sum, count) pairs per key, then divide sum by count
val avgPerOrderId = orderItemsMap.
  reduceByKey((total, revenue) => (total._1 + revenue._1, total._2 + revenue._2)).
  map(i => (i._1, i._2._1 / i._2._2))

avgPerOrderId.take(5).foreach(println)
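
The reason for carrying the extra 1 along: an average is not associative, so it cannot be computed directly inside reduceByKey. The (sum, count) pairs, however, can be merged in any order, and the single division at the end in map turns them into the average.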

aggregateByKey:

val orderItems = sc.textFile("/public/retail_db/order_items")
val orderItemsMap = orderItems.
  map(oi => (oi.split(",")(1).toInt, oi.split(",")(4).toFloat))
// Accumulate (sum, count) per key starting from (0.0, 0), then divide
val avgPerOrderId = orderItemsMap.
  aggregateByKey((0.0, 0))(
    (total, revenue) => (total._1 + revenue, total._2 + 1),
    (v1, v2) => (v1._1 + v2._1, v1._2 + v2._2)).
  map(i => (i._1, i._2._1 / i._2._2))

avgPerOrderId.take(5).foreach(println)
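
A note on the three pieces of aggregateByKey above: (0.0, 0) is the zero value for the per-key (sum, count) accumulator, the first function folds one subtotal into the accumulator within a partition, and the second function merges accumulators coming from different partitions; the trailing map divides sum by count.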

Code is tested on the online Big Data cluster supported by https://labs.itversity.com


#5

Thanks a lot, bro :slight_smile: it helps me a lot. Apart from min, max, avg, and sum, are there any other aggregations we can perform?


#6

@Fayaz from an exam point of view, count is important as well (see the sketch below). Good luck!!
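
For reference, a minimal sketch of a per-key count on the same order_items data, following the same pattern as above:

// Count order items per order_id by summing 1s
val orderItems = sc.textFile("/public/retail_db/order_items")
val countPerOrderId = orderItems.
  map(oi => (oi.split(",")(1).toInt, 1)).
  reduceByKey(_ + _)

countPerOrderId.take(5).foreach(println)

// countByKey is a built-in shortcut, but it returns a local Map[Int, Long]
// rather than an RDD, so it suits small result sets only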