Need help: Hive query and PySpark code giving different results

Hive query:

select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc


from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

categoryRDD = sqlContext.sql("select * from categories")
productRDD = sqlContext.sql("select * from products")
orderRDD = sqlContext.sql("select * from order_items")

# (category_id, category_name)
categorymap = categoryRDD.map(lambda x: (x.category_id, x.category_name))
# (product_id, product_category_id)
productmap = productRDD.map(lambda x: (x.product_id, x.product_category_id))
# (order_item_product_id, order_item_quantity)
ordermap = orderRDD.map(lambda x: (x.order_item_product_id, float(x.order_item_quantity)))

# join orders with products on product_id -> (product_id, (quantity, category_id))
productorder = ordermap.join(productmap)

# re-key by category_id -> (category_id, quantity)
productordermap = productorder.map(lambda x: (int(x[1][1]), int(x[1][0])))

# join with categories on category_id -> (category_id, (quantity, category_name))
ordercat = productordermap.join(categorymap)

# aggregate per category_name
ordercount = ordercat.map(lambda x: (x[1][1], int(x[1][0]))).reduceByKey(lambda x, y: x + y)

for i in ordercount.collect():
    print(i)


I’ve tried a lot to figure out what went wrong. The PySpark code always gives much larger numbers than the Hive query. The Hive query above is from the exercise in the Cloudera VM tutorial (http://quickstart.cloudera/#/tutorial/ingest_structured_data); it finds the most popular product categories.

@itversity Durga Sir, could you please help me.

I’m eagerly waiting for the solution… Can anyone help, please?

Hi,

I guess in your Python code we have to mention the join column as well, i.e. which column to apply the join on; otherwise it will do a cross join.

In the Python code the join is done on data sets of <K,V> pairs, so there is no need to mention a column name. The RDD join API has no parameter that takes a column name; it always joins on the key of each pair.
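To illustrate, here is a minimal pure-Python sketch of what `rdd1.join(rdd2)` does on (K, V) pairs (no Spark needed; the tiny sample data below is made up, just shaped like `ordermap` and `productmap` above):

```python
from collections import defaultdict

def pair_join(left, right):
    # For every key present in BOTH data sets, emit (key, (left_value, right_value)).
    # This mirrors the inner-join semantics of RDD.join on (K, V) pairs.
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    return [(k, (v, w)) for k, v in left for w in right_by_key[k]]

# Hypothetical sample data:
# (order_item_product_id, quantity) joined with (product_id, category_id)
ordermap = [(1, 2.0), (2, 1.0), (1, 5.0)]
productmap = [(1, 10), (2, 20)]

print(pair_join(ordermap, productmap))
# -> [(1, (2.0, 10)), (2, (1.0, 20)), (1, (5.0, 10))]
```

Note that the key is matched automatically, which is why there is nowhere to pass a column name.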

@itversity Sir, can you please help? I’m not getting the answer to this question.

@mahendra971 I think you are getting different results because when the PySpark code executes the `select * from categories` command, it pulls all the data present across the cluster, and there is no guarantee the rows come back in order (1 2 3 4 …). Since RDDs/DataFrames are distributed across the cluster, we don’t know which partition is processed first; the 4th partition might come back first, so the data can arrive as 4 1 2 3 … This might be why you got different results from Hive.