Aggregating data sets using PySpark – by key

Originally published at: http://www.itversity.com/topic/aggregating-data-sets-using-pyspark-by-key/

Aggregations – by key. Aggregations can be broadly categorized into totals and by key; as part of this topic we will be covering aggregations by key. Objectives:

- Load data from HDFS and store results back to HDFS using Spark
- Join disparate datasets together using Spark
- Calculate aggregate statistics (e.g., average or sum) using Spark
- Filter data…
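For anyone following along, here is a minimal sketch of the by-key aggregation discussed below, built around the same `ordersPerDayPerCustomer` lambda quoted in this thread. The HDFS paths, the `SparkContext` setup, and the retail_db column layout (orders: order_id, order_date, order_customer_id, order_status; order_items: order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price) are assumptions based on the standard itversity retail dataset, not a definitive implementation:

```python
from pyspark import SparkContext

sc = SparkContext(appName="AggregationsByKey")

# Load both data sets from HDFS (paths are illustrative)
orders = sc.textFile("/public/retail_db/orders")
orderItems = sc.textFile("/public/retail_db/order_items")

# Key each RDD by order id so the two data sets can be joined
ordersMap = orders.map(lambda rec: (int(rec.split(",")[0]), rec))
orderItemsMap = orderItems.map(lambda rec: (int(rec.split(",")[1]), rec))

# After this join, rec is (order_id, (order_item_record, order_record)),
# which is why rec[1][0] is the order item and rec[1][1] is the order
ordersJoinOrderItems = orderItemsMap.join(ordersMap)

# Key: (order_date, customer_id); value: order_item_subtotal
ordersPerDayPerCustomer = ordersJoinOrderItems.map(
    lambda rec: ((rec[1][1].split(",")[1], rec[1][1].split(",")[2]),
                 float(rec[1][0].split(",")[4])))

# Aggregate the subtotals by key and store the result back to HDFS
revenuePerDayPerCustomer = ordersPerDayPerCustomer.reduceByKey(lambda a, b: a + b)
revenuePerDayPerCustomer.saveAsTextFile("/user/training/revenue_per_day_per_customer")
```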

Just a small correction to the above topic; this line

ordersPerDayPerCustomer = ordersJoinOrderItems.map(lambda rec: ((rec[1][1].split(",")[1], rec[1][1].split(",")[2]), float(rec[1][0].split(",")[4])))

should instead read

ordersPerDayPerCustomer = ordersJoinOrderItems.map(lambda rec: ((rec[1][1].split(",")[1], rec[1][1].split(",")[2]), float(rec[1][0].split(",")[3])))

since the order item sub_total is the 4th field in the dataset and is accessed by array index/subscript 3.


Sorry, the line below is correct; blame my bad eyes today.

ordersPerDayPerCustomer = ordersJoinOrderItems.map(lambda rec: ((rec[1][1].split(",")[1], rec[1][1].split(",")[2]), float(rec[1][0].split(",")[4])))
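For anyone else second-guessing the indices, a quick way to settle it is to print one order_items record with the position of each field. This is just a sanity-check sketch; the HDFS path is illustrative, and the expected layout in the comments assumes the standard retail_db dataset used in this course:

```python
from pyspark import SparkContext

sc = SparkContext(appName="FieldIndexCheck")

# Print each field of a single order_items record alongside its index
sample = sc.textFile("/public/retail_db/order_items").first()
for idx, field in enumerate(sample.split(",")):
    print(idx, field)

# In the standard retail_db layout this prints:
# 0 order_item_id, 1 order_item_order_id, 2 order_item_product_id,
# 3 order_item_quantity, 4 order_item_subtotal, 5 order_item_product_price
# so the subtotal is the 5th field and is accessed with index 4,
# matching the original (correct) line above.
```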