I am preparing for the CCA 175 certification. I have been practicing with the data provided on the itversity labs and I am stuck on this task. Can someone please help me?
I am trying to find the total orders and total amount per status per day, using PySpark 2 Core API methods. My final output should contain order_date, order_status, total_orders, total_amount. This is what I have done so far:
orders = spark.sparkContext.textFile("/user/sindhu27/practicedata/retail_db/orders")
orderItems = spark.sparkContext.textFile("/user/sindhu27/practicedata/retail_db/order_items")
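# Key each order by order_id (field 0, assuming the standard retail_db column order), keeping the full line as the value
ordersMap = orders.map(lambda l: (int(l.split(",")[0]), l))
# Key each order item by order_item_order_id (field 1)
orderItemsMap = orderItems.map(lambda l: (int(l.split(",")[1]), l))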
orderJoin = ordersMap.join(orderItemsMap)
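# Each joined record is (order_id, (order_line, item_line)); build ((order_date, order_status), order_item_subtotal)
orderJoinResult = orderJoin.map(lambda l: ((l[1][0].split(",")[1], l[1][0].split(",")[3]), float(l[1][1].split(",")[4])))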
orderJoinResultReduce = orderJoinResult.groupByKey()
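# l is (key, grouped_values): count the values and sum them per (order_date, order_status)
orderJoinResultReduceAdd = orderJoinResultReduce.map(lambda l: (l[0], len(l[1]), sum(l[1])))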
I got output of the form ((order_date, order_status), total_orders, total_amount):
((u'2014-01-02 00:00:00.0', u'COMPLETE'), 94, 12758.899999999983)
((u'2013-08-11 00:00:00.0', u'CANCELED'), 4, 544.95)
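One thing worth checking in this result: a join of orders with order_items produces one record per order item, so taking len() over the grouped values counts order items rather than orders. If total_orders is meant to count distinct orders, one possible approach is to carry the order_id along and collect it into a set while summing the amounts. The sketch below shows the two combining functions in plain Python so it runs without a cluster. The names seq_op and comb_op and the sample rows are made up for illustration and are not part of the code above. On the cluster, the same two functions could be passed to aggregateByKey((set(), 0.0), seq_op, comb_op) on an RDD shaped like ((order_date, order_status), (order_id, subtotal)).

```python
from functools import reduce

def seq_op(acc, value):
    """Fold one (order_id, subtotal) value into the (order_ids, total) accumulator."""
    order_ids, total = acc
    order_id, subtotal = value
    # Build a new set instead of mutating the accumulator in place
    return (order_ids | {order_id}, total + subtotal)

def comb_op(a, b):
    """Merge two partial accumulators, e.g. from different partitions."""
    return (a[0] | b[0], a[1] + b[1])

# Made-up rows: order 1 has two items, order 2 has one
key = ("2014-01-02 00:00:00.0", "COMPLETE")
rows = [(key, (1, 100.0)), (key, (1, 50.0)), (key, (2, 25.0))]

# Simulate two partitions being folded separately and then merged
part_a = reduce(seq_op, [v for _, v in rows[:2]], (set(), 0.0))
part_b = reduce(seq_op, [v for _, v in rows[2:]], (set(), 0.0))
order_ids, total_amount = comb_op(part_a, part_b)

result = (key, len(order_ids), total_amount)
print(result)
```

With these rows the three items collapse to two distinct orders, whereas len() over the grouped values would report three.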
Now I need to sort by order_date descending, order_status ascending, total_orders ascending, and total_amount descending.
Can someone please help me do this using the PySpark 2 Core API?
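A mixed ascending/descending sort like this can be sketched with a single composite key: keep the ascending fields as they are and negate the descending ones. A date string cannot be negated, so the sketch below turns it into an integer such as 20140102 first. This is only one possible approach. The function name sort_key is made up for illustration, the third sample row is hypothetical, and the example uses Python's built-in sorted() so it runs without a cluster.

```python
def sort_key(rec):
    """Composite key: date desc, status asc, total_orders asc, total_amount desc."""
    (order_date, order_status), total_orders, total_amount = rec
    # "2014-01-02 00:00:00.0" -> 20140102; negated so that later dates sort first
    date_as_int = int(order_date[:10].replace("-", ""))
    return (-date_as_int, order_status, total_orders, -total_amount)

sample = [
    (("2013-08-11 00:00:00.0", "CANCELED"), 4, 544.95),
    (("2014-01-02 00:00:00.0", "COMPLETE"), 94, 12758.899999999983),
    (("2014-01-02 00:00:00.0", "CLOSED"), 10, 100.0),  # hypothetical row
]

ranked = sorted(sample, key=sort_key)
for rec in ranked:
    print(rec)
```

On the RDD, the same function should work as the key function of sortBy, for example orderJoinResultReduceAdd.sortBy(sort_key), since sortBy orders records by the tuple the function returns.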