Spark SQL with more than 2 tables

Can we join more than 2 tables in a spark sql? (not by registering temp table or using hive sql). I tried below query but its not working.

ordersDF.join(orderItemsDF, ordersDF(“order_id”) === orderItemsDF(“order_item_order_id”)).join(productsDF, orderItemsDF(“order_item_product_id”) === product
sDF(“product_id”)).join(categoriesDF, productsDF(“product_category_id”) === categoriesDF(“category_id”)).join(deptDF, categoriesDF(“category_department_id”) === de
ptDF(“department_id”)).first()

@sarang yes you can join more than 2 tables with spark sql
I am not writing the entire syntax
we have 3 tables: table1, table2, table3
first you need to join 2 tables lets say join1 = table1.join(table2)
then next join2 = join1.join(table3)

1 Like

@Rahul…thats true…I was trying to accommodate that in single statement…Thanks

@sarang I think you can’t do it in single statement. It will throw u error

@Rahul…in fact we can join more than 2 tables with both DF and DF SQL. problem was with the dataset i was using…Product file has 1 invalid record…once you discard it , it will work

@sarang Can you just share me the code of 3 or 4 tables join

@Rahul here are the snippet,

With DF -

ordersDF.filter(ordersDF(“order_status”) === “COMPLETE” || ordersDF(“order_status”) === “PENDING”).join(orderItemsDF, ordersDF(“order_id”) === orderItemsDF(
“order_item_order_id”)).join(productsDF, orderItemsDF(“order_item_product_id”)===productsDF(“product_id”))
.join(categoriesDF, productsDF(“product_category_id”)===categoriesDF(“category_id”)).join(deptDF, categoriesDF(“category_department_id”)===deptDF(“department_id”)).groupBy(deptDF(“department_name”))
.agg(sum(orderItemsDF(“order_item_subtotal”))).sort(deptDF(“department_name”)).rdd.saveAsTextFile(outputPath)

With DF SQL -

ordersDF.registerTempTable(“orders”)
orderItemsDF.registerTempTable(“orderItems”)
productsDF.registerTempTable(“products”)
categoriesDF.registerTempTable(“categories”)
deptDF.registerTempTable(“dept”)

  sqlContext.sql("select d.department_name,sum(oi.order_item_subtotal) total_rev from orders o join orderItems oi on o.order_id = oi.order_item_order_id "+
            " join products p on oi.order_item_product_id = p.product_id join categories c on p.product_category_id = c.category_id join dept d on "+
            " c.category_department_id = d.department_id group by d.department_name").rdd.saveAsTextFile(outputPath)