Data Engineering using Spark SQL - Basic Transformations - Joining Tables - Inner

In this article, we will understand how to join data from multiple tables. We will go through the basics of inner joins using Spark SQL.

Key Concepts Explanation

INNER JOIN

INNER JOIN is used to retrieve all records from both datasets that satisfy the join condition. Here is an example of an INNER JOIN query:

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10

You can join more than two tables by extending the same concept. Remember to use the WHERE clause for additional filters and keep the ON clause exclusively for join conditions. Non-equal join conditions can also be used, but they are less common.

Hands-On Tasks

  1. Execute the provided INNER JOIN query to retrieve order details.
  2. Modify the query to add additional filtering based on order status.
  3. Enhance the query further to filter orders placed in a specific month.

Conclusion

In this article, we learned about INNER JOIN in Spark SQL and how to join data from multiple tables. By practicing the provided tasks, you will have a better understanding of applying INNER JOIN in your SQL queries. Remember to engage with the community for further learning opportunities.

Watch the video tutorial here