Apache Spark Python - Transformations - Using Inner Joins

Let us understand inner joins in Spark. Here are the steps we typically follow for joining data frames.

  • Read the data sets to be joined from files into their respective data frames.

  • Optionally, filter the data if the requirements call for it (see the workflow sketch after this list).

  • Join the two data frames using an inner join. The rows satisfying the join condition are returned as a new data frame.

  • Once the joined data frame is created, we can refer to columns from either side via the original data frames for further processing.
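Here is a minimal end-to-end sketch of that workflow. It assumes a SparkSession named spark (as in the examples below) and the retail_db JSON layout; the COMPLETE-status filter is only an illustrative requirement, not part of the original exercise:

# Step 1: read the data sets to be joined
orders = spark.read.json('/public/retail_db_json/orders')
order_items = spark.read.json('/public/retail_db_json/order_items')

# Step 2 (optional): filter as per the requirements, e.g. completed orders only
completed_orders = orders.filter(orders.order_status == 'COMPLETE')

# Step 3: inner join on the common key
orders_join = completed_orders.join(
    order_items,
    completed_orders.order_id == order_items.order_item_order_id,
    how='inner'
)

# Step 4: refer to columns from either side for further processing
orders_join.select(completed_orders.order_date, order_items.order_item_subtotal).show()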

Reading Data Sets

To join data frames in Spark, first read the data sets from files into Spark data frames:

# Read the orders and order_items data sets from JSON files
orders = spark.read.json('/public/retail_db_json/orders')
order_items = spark.read.json('/public/retail_db_json/order_items')
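If you want to confirm that the reads picked up the expected columns before joining, printSchema is a quick check (this step is optional and not part of the original flow):

orders.printSchema()
order_items.printSchema()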

Performing Inner Join

Now, let’s perform an inner join between the ‘order_id’ column of the ‘orders’ data frame and the ‘order_item_order_id’ column of the ‘order_items’ data frame:

orders_join = orders.join(
    order_items,
    orders.order_id == order_items.order_item_order_id,
    how='inner'
)
orders_join.show()
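Since ‘inner’ is Spark’s default join type, the how argument could be omitted here. Also note that the joined data frame keeps both key columns, order_id and order_item_order_id; if that duplication is unwanted, one of them can be dropped, as in this small sketch:

# Drop the redundant key column from the join result
orders_join.drop('order_item_order_id').show()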

Projecting Fields

You can also project specific fields from the joined data frames. Here is an example where we project ‘order_id’, ‘order_date’, and ‘order_status’ from ‘orders’ and ‘order_item_subtotal’ from ‘order_items’:

orders.join(
    order_items,
    on=orders['order_id'] == order_items['order_item_order_id'],
).select(
    orders.order_id,
    orders.order_date,
    orders.order_status,
    order_items.order_item_subtotal
).show()
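If you instead need every column from ‘orders’ plus just ‘order_item_subtotal’ from ‘order_items’ (as in hands-on task 3 below), you can select one side’s columns with the '*' shorthand. A brief sketch of that variant:

# Project all fields from orders, plus the subtotal from order_items
orders.join(
    order_items,
    on=orders['order_id'] == order_items['order_item_order_id'],
).select(orders['*'], order_items['order_item_subtotal']).show()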

Hands-On Tasks

  1. Read the ‘orders’ and ‘order_items’ data sets from files into respective data frames.
  2. Perform an inner join on ‘orders’ and ‘order_items’ data frames based on the common key.
  3. Project all the fields from ‘orders’ and then ‘order_item_subtotal’ from ‘order_items’.
  4. Project ‘order_id’, ‘order_date’, ‘order_status’ from ‘orders’ and ‘order_item_subtotal’ from ‘order_items’.

Conclusion

In this article, we learned how to perform an inner join in Spark using PySpark. By working through the step-by-step guide and the hands-on tasks, you can practice joining data frames and projecting specific fields. Don’t hesitate to engage with the community for further learning and practice.