Apache Spark Python - Transformations - Left or Right Outer Join

This article offers a concise guide to performing left and right outer joins in Spark to combine datasets. It covers key concepts such as one-to-many relationships and why outer joins matter, and uses code examples and hands-on tasks to show how to compare outer and inner join results, identify unmatched records, and extract insights from the combined data. The article concludes with encouragement to engage with the community for further learning and practice in mastering Spark data processing.

Left or Right Outer Join

Performing an outer join between two datasets returns the rows from both datasets that satisfy the join condition, along with the rows from the driving table that do not. This is useful when we want to display records even when there is no matching record in the other dataset.

In the example provided:

  • Customers and orders data are used for demonstration.
  • One-to-many relationship between customers and orders.
  • Outer join can be performed to get the revenue or number of orders placed by customers.
  • Using left or right outer join can help include all records from the driving table.

Code Execution


# Imports needed by the code below (sum and expr are used for the
# aggregation later; importing sum from pyspark.sql.functions shadows
# the Python built-in)
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, sum

# Initialize Spark session (username is assumed to be defined earlier,
# e.g. username = getpass.getuser())
spark = SparkSession.builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

# Read the datasets and perform an inner join between customers and orders
# (used as the baseline for comparing against the outer join below)
orders = spark.read.json('/public/retail_db_json/orders')
customers = spark.read.json('/public/retail_db_json/customers')

customer_order_details = customers.join(
    orders,
    on=customers['customer_id'] == orders['order_customer_id'],
    how='inner'
)

# Compare the raw orders count with the inner join count
orders_count = orders.count()
customer_order_count = customer_order_details.count()

# Identify customers with no orders using left outer join
customer_order_details_left = customers.join(
    orders, 
    on=customers['customer_id'] == orders['order_customer_id'],
    how='left'
)

customers_with_no_orders = customer_order_details_left. \
    filter(orders['order_id'].isNull()). \
    count()

# Get order counts for each customer in 2013
orders_filtered = orders.filter("order_date LIKE '2013%'")
customer_order_details_left_2013 = customers.alias('c'). \
    join(
        orders_filtered.alias('o'),
        on=customers['customer_id'] == orders_filtered['order_customer_id'],
        how='left'
    )

customer_order_counts_2013 = customer_order_details_left_2013. \
    groupBy('customer_id', 'customer_email'). \
    agg(sum(expr('CASE WHEN order_id IS NULL THEN 0 ELSE 1 END')).alias('order_count')). \
    orderBy('order_count', 'customer_id')


Hands-On Tasks

  1. Perform Outer Join: Execute an outer join between customers and orders and compare the count with an inner join.
  2. Identify Customers with No Orders: Utilize left outer join results to extract customers who have never placed orders.
  3. Get Order Counts for 2013: Determine the number of orders placed by each customer in the year 2013, assigning a count of 0 for customers with no orders.
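Tasks 1 and 2 are linked by a simple counting identity: the left outer join row count equals the inner join row count plus the number of unmatched driving rows. A plain-Python sketch with hypothetical miniature data (no Spark required) makes this concrete:

```python
# Toy data: customer 3 has no orders
customers = [1, 2, 3]                      # customer_ids
orders = [(101, 1), (102, 1), (103, 2)]    # (order_id, order_customer_id)

# Rows an inner join would produce: one per matching (customer, order) pair
inner_rows = sum(1 for c in customers for o in orders if o[1] == c)

# Customers with no orders: these appear in a left join with NULL order columns
no_order_customers = sum(1 for c in customers if all(o[1] != c for o in orders))

# A left join keeps every customer: all matched rows, plus one NULL row
# for each customer with no orders
left_rows = sum(max(1, sum(1 for o in orders if o[1] == c)) for c in customers)

print(inner_rows, no_order_customers, left_rows)   # 3 1 4
assert left_rows == inner_rows + no_order_customers
```

This is why comparing the two counts in task 1 immediately tells you how many customers task 2 will find.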

Conclusion

In this article, we explored the concept of left or right outer join in Spark and how it can be used to combine data from two datasets based on certain conditions. By following the step-by-step tasks provided, readers can gain practical experience in performing outer joins and analyzing data relationships.

Remember to engage with the community for further learning and practice these concepts to master Spark data processing. Join the discussion and share your insights with fellow learners.