Reduce by key in Python on order_items revenue calculation


#1

In PySpark practice exercise 75, I have the following code and am stuck here:

[paslechoix@gw03 ~]$ hdfs dfs -cat p90_order_items/*
152860,61113,957,1,299.98,299.98
152861,61114,1014,5,249.9,49.98

order_items = sc.textFile("p90_order_items")

Now, what I need to do is:

  1. Convert all the fields to integer or float;
  2. Calculate the sum of the revenue, which is the #4 field (zero-based).

Can someone tell me how to do this in Python?

Thank you very much.


#2

Please post the queries you have practiced so far, so that it is easier to trace what is blocking you.
Thank you


#3

Step 1: Import a single table.

[paslechoix@gw03 ~]$ sqoop import -m 1 \
--connect jdbc:mysql://ms.itversity.com/retail_db \
--username retail_user \
--password itversity \
--table order_items \
--target-dir p90_order_items

Step 2: Check the import below:

[paslechoix@gw03 ~]$ hdfs dfs -cat p90_order_items/*
152860,61113,957,1,299.98,299.98
152861,61114,1014,5,249.9,49.98
152862,61115,725,1,108.0,108.0
152863,61115,1073,1,199.99,199.99
152864,61115,191,4,399.96,99.99
152865,61115,1014,4,199.92,49.98

Step 3: Create an RDD on this data:

order_items = sc.textFile("p90_order_items")
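
To sanity-check the RDD before converting anything, you can peek at the first records (take(2) returns the first two raw lines as strings):

for line in order_items.take(2):
    print(line)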

All I need is how to convert the fields into different data types; in this case I need to convert them to int and float for the calculation on the created RDD. Thank you.


#4

Hi,
Try this to convert the fields of order_items to their proper types. Casting everything to int would truncate the revenue values, so cast the fields that contain a decimal point to float and the rest to int:

# fields with a decimal point become float, all others int
order_items_typed = order_items.map(lambda line: [float(f) if "." in f else int(f) for f in line.split(",")])
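
To finish the original task, here is a minimal sketch assuming order_items_typed from above and that field #4 (the 299.98 column) is the revenue. The reduceByKey part keys on the order id (field #1), which is a guess at the "reduce by key" goal in the thread title:

# total revenue: pull field #4 out of every row and add them all up
total_rev = order_items_typed.map(lambda f: f[4]).reduce(lambda a, b: a + b)
print(total_rev)

# revenue per order: key each row by its order id (field #1), then sum per key
rev_per_order = order_items_typed.map(lambda f: (f[1], f[4])).reduceByKey(lambda a, b: a + b)
for order_id, rev in rev_per_order.take(5):
    print(order_id, rev)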