Data Engineering using Spark SQL - Basic Transformations - Joining Tables - Outer

Let us understand how to perform outer joins using Spark SQL. There are 3 different types of outer joins. Let’s start a spark context for this Notebook so that we can execute the code provided.

Key Concepts Explanation

LEFT OUTER JOIN (default): Get all the records from both the datasets which satisfy the JOIN condition along with those records which are in the left side table but not in the right side table.

Example:

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
LIMIT 10

RIGHT OUTER JOIN: Get all the records from both the datasets which satisfy the JOIN condition along with those records which are in the right side table but not in the left side table.

Example:

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
LIMIT 10

Hands-On Tasks

  1. Perform LEFT OUTER JOIN between ‘orders’ and ‘order_items’ tables.
  2. Perform RIGHT OUTER JOIN between ‘orders’ and ‘order_items’ tables.

Conclusion

In this article, we covered the concept of outer joins in Spark SQL using LEFT and RIGHT joins. We provided examples to clarify the usage of these joins. Practice these tasks to reinforce your understanding of outer joins in Spark SQL.

Watch the video tutorial here