Hi, please refer to the code below, as explained by Durga sir in one of his videos on how to use a broadcast variable.
import sys
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Orders Join Order Items").setMaster(sys.argv[1])
sc = SparkContext(conf=conf)
# Read the data
inputPath = sys.argv[2]
orders = sc.textFile(inputPath + "orders")
orderItems = sc.textFile(inputPath + "order_items")
# Join orders, order_items and customers.
# To join, we need to convert the data into (key, value) tuples.
# First, work on joining orders and order_items.
# ordersMap: (order_id, (order_date, order_customer_id))
ordersMap = orders.map(lambda o: (int(o.split(",")[0]), (o.split(",")[1], int(o.split(",")[2]))))
# orderItemsMap: (order_item_order_id, order_item_subtotal)
orderItemsMap = orderItems.map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))
ordersJoin = ordersMap.join(orderItemsMap)
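# Each joined record looks like: (order_id, ((order_date, customer_id), order_item_subtotal))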
# Get revenue per day per customer id.
# Read customers to get customer details and broadcast.
customersPath = sys.argv[3]
customers = open(customersPath).read().splitlines()
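# Note: the customers file is read on the driver with plain Python I/O (not sc.textFile),
# because the resulting dict is what gets broadcast to the executors.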
# customersMap: customer_id -> "first_name last_name"
customersMap = dict(map(lambda c: (int(c.split(",")[0]), c.split(",")[1] + " " + c.split(",")[2]), customers))
customersBV = sc.broadcast(customersMap)
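# On the executors, customersBV.value gives read-only access to the whole dict,
# so each task can look up a customer name without shipping the dict with every task.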
for i in ordersJoin.take(10): print(i)
# Key by (date, customer name) via the broadcast lookup, sum the subtotals, then format.
revenuePerDatePerCustId = ordersJoin. \
    map(lambda o: ((o[1][0][0], customersBV.value[o[1][0][1]]), o[1][1])). \
    reduceByKey(lambda t, v: t + v). \
    map(lambda rec: rec[0][0] + "\t" + rec[0][1] + "\t" + str(rec[1]))
# Final output: date(tab)customer name(tab)revenue
# Customer name is computed from first name and last name.
My question here is: if we don't use a broadcast variable, what we need to do instead is compute revenue per date per customer_id with a proper map() and reduceByKey(), and then join that result with the customers data using customer_id as the join key, as in the sketch below. Should that give the same answer? When I executed it this way, I got the same total count, but there is a mismatch when the count for any particular customer/date is considered.
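To make the comparison concrete, here is a minimal sketch of the join-based approach I mean, assuming the same retail_db column layout as above (customersRDD, revenueByCustId and finalOutput are names I made up for this post):

customersRDD = sc.textFile(inputPath + "customers"). \
    map(lambda c: (int(c.split(",")[0]), c.split(",")[1] + " " + c.split(",")[2]))

# Same aggregation as before, but keep customer_id in the key instead of the name
revenuePerDatePerCustId = ordersJoin. \
    map(lambda o: ((o[1][0][0], o[1][0][1]), o[1][1])). \
    reduceByKey(lambda t, v: t + v)

# Re-key by customer_id so it can be joined with customersRDD
revenueByCustId = revenuePerDatePerCustId. \
    map(lambda rec: (rec[0][1], (rec[0][0], rec[1])))

finalOutput = revenueByCustId.join(customersRDD). \
    map(lambda rec: rec[1][0][0] + "\t" + rec[1][1] + "\t" + str(rec[1][0][1]))

As far as I can tell, this should produce the same records as the broadcast version.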
Could anyone please help me correct this?
Thanks in advance,