My Request to Everyone


#1

Hello all,

My request to everyone who raised issues in this community, please respond back if your issue resolved or still exist. This way, other community members also learn from your issues and resolutions.

Thank you
Venkat


#2

Hi Venkat,
I still have one unresolved issue with me.

Thannks
Aparna


#3

@AparnaSen:

Can you please post your issue?
Thanks
Venkat


#4

Hi Venkat,
Thanks for your response.

The Problem statement is to Get Daily Revenue Per Customer Name.
Tables to be used : orders, order_items, customers
I tried two methods for this.

In my first approach, I have used Broadcast variable for customer (with customer _id and customer_name). In this approach, the final count I am getting is 47654

In my 2nd approach, I have joined customer table with the joined resultset of orders and order_items, to replace the customer_id with the customer_name, but in this case, the final count I am getting is 57047.

I think the count should be the same in both cases.
Please look into my code and let me know where I am wrong.

Get Daily revenue per customer:(using broadcast variable)

#Joining orders and order_items -

orders = sc.textFile("/user/aparna149/aparna/orders")
orderItems = sc.textFile("/user/aparna149/aparna/order_items")
ordersMap = orders.map(lambda x:(int(x.split(",")[0]),(x.split(",")[1],int(x.split(",")[2]))))
orderItemsMap = orderItems.map(lambda x: (int(x.split(",")[1]), float(x.split(",")[4])))
ordersJoinOrderItems = ordersMap.join(orderItemsMap)
for i in ordersJoinOrderItems.take(20): print(i)

ordersJoinOrderItemsMap = ordersJoinOrderItems.map(lambda x: x[1])
for i in ordersJoinOrderItemsMap.take(20): print(i)

#creating broadcast variable for customer-

customers = open("/home/aparna149/customers").read().splitlines()
customersMap = dict(map(lambda c: (int(c.split(",")[0]),(c.split(",")[1]) + " " + (c.split(",")[2])), customers))
customersBV = sc.broadcast(customersMap)

#Substituting customer_id by customer_name -

RevenueCustomerMap = ordersJoinOrderItemsMap.map(lambda x:((x[0][0], customersBV.value[x[0][1]]),x[1]))
for i in RevenueCustomerMap.take(20): print(i)

RevenuePerDatePerCustomer = RevenueCustomerMap.reduceByKey(lambda x, y: x+y)
for i in RevenuePerDatePerCustomer.take(20): print(i)

FinalRevenueOutput = RevenuePerDatePerCustomer.map(lambda x:(x[0][0]+ “\t” + x[0][1] + “\t” + str(x[1])))

Get Daily revenue per customer: (using join)

orders = sc.textFile("/user/aparna149/aparna/orders")
orderItems = sc.textFile("/user/aparna149/aparna/order_items")
customers = sc.textFile("/user/aparna149/aparna/customers")
ordersMap = orders.map(lambda x:(int(x.split(",")[0]),(x.split(",")[1],int(x.split(",")[2]))))
orderItemsMap = orderItems.map(lambda x: (int(x.split(",")[1]), float(x.split(",")[4])))
ordersJoinOrderItems = ordersMap.join(orderItemsMap)
for i in ordersJoinOrderItems.take(20): print(i)

ordersJoinOrderItemsMap = ordersJoinOrderItems.map(lambda x: x[1])
for i in ordersJoinOrderItemsMap.take(20): print(i)

DailyRevenuePerCustomer = ordersJoinOrderItemsMap.reduceByKey(lambda x,y: x+y)
for i in DailyRevenuePerCustomer.take(20): print(i)

DailyRevenuePerCustomerMap = DailyRevenuePerCustomer.map(lambda x: (x[0][1],(x[0][0], x[1])))
customersMap = customers.map(lambda x:(int(x.split(",")[0]),(x.split(",")[1],x.split(",")[2])))
customersJoinDailyRevenue = customersMap.join(DailyRevenuePerCustomerMap)
for i in customersJoinDailyRevenue.take(20): print(i)

RevenuePerCustomerName = customersJoinDailyRevenue.map(lambda x: (x[1][0][0]+ " “+ x[1][0][1] + “\t” + x[1][1][0] +”\t" +str(x[1][1][1])))

It would be a great help if I could get to know what’s the issue?
Thanks in advance,
Aparna


#5

Hello Aparna

I found the problem statement to be interesting and gave an attempt using Spark SQL. Here is my solution. Can you let me know if it solves the use case of the problem

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Get Daily Revenue Per Customer Name

object test {

  val conf = new SparkConf().setAppName("test").setMaster("local")
  val sc = new SparkContext(conf)

  val sqlContext=new SQLContext(sc)

  import sqlContext. implicits._

  def main(args: Array[String]): Unit = {
    val orders = sc.textFile("/home/varunu28/data/retail_db/orders")
    val order_items = sc.textFile("/home/varunu28/data/retail_db/order_items")
    val customers = sc.textFile("/home/varunu28/data/retail_db/customers")

    val ordersDF = orders.
                    map(rec => (rec.split(",")(0).toInt, rec.split(",")(1).split(" ")(0), rec.split(",")(2).toInt)).
                    toDF("order_id", "order_date", "order_customer_id")

    val order_itemsDF = order_items.
                          map(rec => (rec.split(",")(1).toInt, rec.split(",")(2).toInt, rec.split(",")(4).toFloat)).
                          toDF("order_item_id", "order_item_order_id", "order_item_subtotal")

    val customersDF = customers.
                        map(rec => (rec.split(",")(0).toInt, rec.split(",")(1) + " " + rec.split(",")(2))).
                        toDF("customer_id", "customer_name")

    ordersDF.registerTempTable("orders")
    order_itemsDF.registerTempTable("order_items")
    customersDF.registerTempTable("customers")

    val res = sqlContext.sql("SELECT order_date, customer_name, ROUND(SUM(order_item_subtotal), 2) AS revenue " +
                             "FROM orders JOIN order_items ON orders.order_id=order_items.order_item_order_id " +
                             "JOIN customers ON orders.order_customer_id=customers.customer_id " +
                             " GROUP BY order_date, customer_name " +
                             "ORDER BY order_date").count()

    println(res)
  }

}


#6

Hi Varun,
Thank you very much for your help. Really appreciate.
Though I don’t know scala, still I tried your code in pyspark and got the count 47654, which exactly matches with my 1st result.
but still I am curious to know what is wrong in my 2nd approach?
Could you Please look into the code again (2nd one) and let me know ?

Thanks ,
Aparna


#7

@AparnaSen:

REF : Get Daily revenue per customer: (using join)

All your steps are correct except last step:
You SHOULD put double quotes(" ") INSTEAD of double apostrophes (“ “) in this step.
Now check the count. It is 57047
Proof my execution is below along with sample output data.
Hope this helps your issue.

Your RDD:
RevenuePerCustomerName = customersJoinDailyRevenue.map(lambda x: (x[1][0][0]+ " “+ x[1][0][1] + “\t” + x[1][1][0] +”\t" +str(x[1][1][1])))

My RDD:
RevenuePerCustomerName = customersJoinDailyRevenue.map(lambda x: (x[1][0][0]+ " “+ x[1][0][1] + “\t” + x[1][1][0] +”\t" +str(x[1][1][1])))

###########################################################################

orders = sc.textFile("/user/vanampudi/retail_db/orders")
order_items = sc.textFile("/user/vanampudi/retail_db/order_items")
customers = sc.textFile("/user/vanampudi/retail_db/customers")

ordersMap = orders.map(lambda x:(int(x.split(",")[0]),(x.split(",")[1],int(x.split(",")[2]))))
ordersMap.first()
(1, (u’2013-07-25 00:00:00.0’, 11599))

orderItemsMap = order_items.map(lambda x: (int(x.split(",")[1]), float(x.split(",")[4])))
orderItemsMap.first()
(1, 299.98)

ordersJoinOrderItems = ordersMap.join(orderItemsMap)
for i in ordersJoinOrderItems.take(10):print(i)

(32768, ((u’2014-02-12 00:00:00.0’, 1900), 199.99))
(32768, ((u’2014-02-12 00:00:00.0’, 1900), 129.99))
(32768, ((u’2014-02-12 00:00:00.0’, 1900), 299.98))
(32768, ((u’2014-02-12 00:00:00.0’, 1900), 399.98))
(49152, ((u’2014-05-27 00:00:00.0’, 9778), 299.98))
(4, ((u’2013-07-25 00:00:00.0’, 8827), 49.98))
(4, ((u’2013-07-25 00:00:00.0’, 8827), 299.95))
(4, ((u’2013-07-25 00:00:00.0’, 8827), 150.0))
(4, ((u’2013-07-25 00:00:00.0’, 8827), 199.92))
(50192, ((u’2014-06-04 00:00:00.0’, 1083), 129.99))

ordersJoinOrderItemsMap = ordersJoinOrderItems.map(lambda x: x[1])
for i in ordersJoinOrderItemsMap.take(20): print(i)
((u’2014-02-12 00:00:00.0’, 1900), 199.99)
((u’2014-02-12 00:00:00.0’, 1900), 129.99)
((u’2014-02-12 00:00:00.0’, 1900), 299.98)
((u’2014-02-12 00:00:00.0’, 1900), 399.98)
((u’2014-05-27 00:00:00.0’, 9778), 299.98)
((u’2013-07-25 00:00:00.0’, 8827), 49.98)
((u’2013-07-25 00:00:00.0’, 8827), 299.95)
((u’2013-07-25 00:00:00.0’, 8827), 150.0)
((u’2013-07-25 00:00:00.0’, 8827), 199.92)
((u’2014-06-04 00:00:00.0’, 1083), 129.99)

DailyRevenuePerCustomer = ordersJoinOrderItemsMap.reduceByKey(lambda x,y: x+y)
for i in DailyRevenuePerCustomer.take(10): print(i)
((u’2013-11-15 00:00:00.0’, 915), 329.98)
((u’2014-04-18 00:00:00.0’, 9415), 49.98)
((u’2013-11-21 00:00:00.0’, 9970), 499.97)
((u’2014-02-02 00:00:00.0’, 9812), 179.97)
((u’2013-11-07 00:00:00.0’, 4486), 149.94)
((u’2013-08-15 00:00:00.0’, 4417), 759.8500000000001)
((u’2014-06-21 00:00:00.0’, 3949), 1029.87)
((u’2013-10-02 00:00:00.0’, 8180), 979.86)
((u’2014-07-08 00:00:00.0’, 6493), 549.88)
((u’2014-02-12 00:00:00.0’, 6815), 549.94)

DailyRevenuePerCustomerMap = DailyRevenuePerCustomer.map(lambda x: (x[0][1],(x[0][0], x[1])))
for in DailyRevenuePerCustomerMap.take(10):print(i)
(915, (u’2013-11-15 00:00:00.0’, 329.98))
(9415, (u’2014-04-18 00:00:00.0’, 49.98))
(9970, (u’2013-11-21 00:00:00.0’, 499.97))
(9812, (u’2014-02-02 00:00:00.0’, 179.97))
(4486, (u’2013-11-07 00:00:00.0’, 149.94))
(4417, (u’2013-08-15 00:00:00.0’, 759.8500000000001))
(3949, (u’2014-06-21 00:00:00.0’, 1029.87))
(8180, (u’2013-10-02 00:00:00.0’, 979.86))
(6493, (u’2014-07-08 00:00:00.0’, 549.88))
(6815, (u’2014-02-12 00:00:00.0’, 549.94))

customersMap = customers.map(lambda x:(int(x.split(",")[0]),(x.split(",")[1],x.split(",")[2])))
for i in customersMap.take(10):print(i)
(1, (u’Richard’, u’Hernandez’))
(2, (u’Mary’, u’Barrett’))
(3, (u’Ann’, u’Smith’))
(4, (u’Mary’, u’Jones’))
(5, (u’Robert’, u’Hudson’))
(6, (u’Mary’, u’Smith’))
(7, (u’Melissa’, u’Wilcox’))
(8, (u’Megan’, u’Smith’))
(9, (u’Mary’, u’Perez’))
(10, (u’Melissa’, u’Smith’))

customersJoinDailyRevenue = customersMap.join(DailyRevenuePerCustomerMap)
for i in customersJoinDailyRevenue.take(10):print(i)
(8196, ((u’Kimberly’, u’Sheppard’), (u’2013-11-01 00:00:00.0’, 829.97)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2013-09-06 00:00:00.0’, 119.98)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2013-10-13 00:00:00.0’, 119.98)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2014-03-22 00:00:00.0’, 969.9200000000001)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2013-08-24 00:00:00.0’, 499.97)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2014-02-25 00:00:00.0’, 129.99)))
(8196, ((u’Kimberly’, u’Sheppard’), (u’2013-12-05 00:00:00.0’, 949.96)))
(6, ((u’Mary’, u’Smith’), (u’2014-02-13 00:00:00.0’, 649.97)))
(6, ((u’Mary’, u’Smith’), (u’2013-09-09 00:00:00.0’, 1049.84)))
(6, ((u’Mary’, u’Smith’), (u’2013-09-10 00:00:00.0’, 629.82)))

RevenuePerCustomerName = customersJoinDailyRevenue.map(lambda x: (x[1][0][0]+ " “+ x[1][0][1] + “\t” + x[1][1][0] +”\t" +str(x[1][1][1])))
for i in RevenuePerCustomerName.take(10):print(i)
Kimberly Sheppard 2013-11-01 00:00:00.0 829.97
Kimberly Sheppard 2013-09-06 00:00:00.0 119.98
Kimberly Sheppard 2013-10-13 00:00:00.0 119.98
Kimberly Sheppard 2014-03-22 00:00:00.0 969.92
Kimberly Sheppard 2013-08-24 00:00:00.0 499.97
Kimberly Sheppard 2014-02-25 00:00:00.0 129.99
Kimberly Sheppard 2013-12-05 00:00:00.0 949.96
Mary Smith 2014-02-13 00:00:00.0 649.97
Mary Smith 2013-09-09 00:00:00.0 1049.84
Mary Smith 2013-09-10 00:00:00.0 629.82

RevenuePerCustomerName.count() = 57047

Thanks
Venkat


#8

@Aparana Sen:

Ignore my answer.

This is good issue and interesting as well. We have to review whether lookup with broadcast variable will give same result as join.

Thanks
Venkat


#9

Hi Venkat,
Thanks a lot for your time and effort. I think you got my issue, why there is a mismatch in the count.
I am sure it will be resolved soon. :slight_smile:

Thanks
Aparna


#10

@avr8082 @Varun_Upadhyay1
Hi Venkat and Varun,
Keeping everything else the same , I executed the below query in sparkSql and I am surprised to see the count 57047.

Please look into the code.

sqlContext.sql(“select order_date,concat(concat(customer_fname,’ '),customer_lname) customer_name,Daily_Revenue
from (select order_customer_id, order_date,sum(order_item_subtotal) Daily_Revenue
from orders o join order_items oi on o.order_id = oi.order_item_order_id
group by order_customer_id,order_date) q join customers c
on q.order_customer_id=c.customer_id”). count()

Thanks
Aparna