Programming Essentials Python - Manipulating Collections - Joining Data Sets

This article explains how to join multiple collections using loops and conditionals. It focuses on the Nested Loops approach and uses the ‘orders’ and ‘order_items’ datasets, joined on order_id.

Key Concepts Explanation

  • Nested Loops: As used in this article, this approach builds a dictionary from one dataset (‘orders’) and then loops over the other dataset (‘order_items’), looking up each record's order in that dictionary.

  • Sort Merge: A strategy for joining datasets that sorts both datasets by the join key (order_id) and then walks through them in step, matching records whose keys are equal.

  • Hash Join: This join technique involves creating an in-memory hash table for one dataset and scanning the other dataset to find matching records based on the hash key.
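To make the contrast with the dictionary-based approach concrete, here is a minimal sketch of the Sort Merge strategy. The function name sort_merge_join, the tuple layouts, and the sample records are assumptions made up for illustration; they are not taken from the actual datasets.

```python
def sort_merge_join(orders, order_items):
    """Join two lists of tuples on order_id using a sort-merge strategy.

    Assumed (hypothetical) layouts:
      orders:      (order_id, order_date), with order_id unique
      order_items: (order_id, subtotal), possibly many per order
    """
    # Step 1: sort both sides by the join key.
    left = sorted(orders, key=lambda rec: rec[0])
    right = sorted(order_items, key=lambda rec: rec[0])

    joined = []
    i = j = 0
    # Step 2: walk both sorted lists in step.
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1  # this order has no (more) matching items
        elif left[i][0] > right[j][0]:
            j += 1  # this item has no matching order
        else:
            # Keys match: emit a joined record and advance only the
            # item side, because several items can share one order.
            joined.append((left[i][0], left[i][1], right[j][1]))
            j += 1
    return joined


# Made-up sample records, deliberately unsorted.
orders = [(2, "2013-07-25"), (1, "2013-07-25"), (4, "2013-07-26")]
order_items = [(4, 49.98), (1, 299.98), (2, 199.99), (2, 250.0), (3, 10.0)]

result = sort_merge_join(orders, order_items)
for record in result:
    print(record)
```

The item with order_id 3 is dropped because no order with that key exists, which is the behaviour of an inner join.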

Hands-On Tasks

The following tasks put the Nested Loops approach into practice:

  1. Implement a function get_orders_dict to build a dictionary containing ‘order_id’ as keys and ‘order_date’ as values based on the ‘orders’ dataset and a given order status.
  2. Develop a function get_daily_revenue that calculates the daily revenue by joining ‘orders’ and ‘order_items’ datasets only for orders with a specific status.
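One possible way to approach both tasks is sketched below. It assumes that each record is a comma-separated string, that ‘orders’ records are laid out as order_id, order_date, customer_id, order_status, and that ‘order_items’ records carry the order_id in the second field and the item subtotal in the fifth. These layouts and the sample records are assumptions for illustration, so adjust the field positions to match the real data.

```python
def get_orders_dict(orders, order_status):
    """Map order_id -> order_date for orders with the given status.

    Assumed record layout: order_id,order_date,customer_id,order_status
    """
    orders_dict = {}
    for order in orders:
        fields = order.split(",")
        if fields[3] == order_status:
            orders_dict[int(fields[0])] = fields[1]
    return orders_dict


def get_daily_revenue(orders, order_items, order_status):
    """Total item subtotals per order_date for orders with the given status.

    Assumed item layout: item_id,order_id,product_id,quantity,subtotal,price
    """
    orders_dict = get_orders_dict(orders, order_status)
    daily_revenue = {}
    for item in order_items:
        fields = item.split(",")
        order_id = int(fields[1])
        # The dictionary lookup performs the join on order_id.
        if order_id in orders_dict:
            order_date = orders_dict[order_id]
            daily_revenue[order_date] = (
                daily_revenue.get(order_date, 0.0) + float(fields[4])
            )
    # Round to cents to hide floating-point noise in the totals.
    return {date: round(total, 2) for date, total in daily_revenue.items()}


# Made-up sample records in the assumed layouts.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "5,2013-07-26 00:00:00.0,11318,COMPLETE",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,3,502,5,250.0,50.0",
    "4,3,403,1,129.99,129.99",
    "5,5,365,5,299.95,59.99",
]

complete_orders = get_orders_dict(orders, "COMPLETE")
daily_revenue = get_daily_revenue(orders, order_items, "COMPLETE")
for order_date, revenue in sorted(daily_revenue.items()):
    print(order_date, revenue)
```

Filtering by status while building the dictionary keeps the lookup table small, so items that belong to orders with any other status are skipped by the membership test.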

Conclusion

In conclusion, this article has walked through joining multiple datasets using the Nested Loops approach. Working through the tasks above should give you a solid understanding of how to join ‘orders’ and ‘order_items’ on order_id.

Joining Data Sets

Use the get_daily_revenue function to get the daily revenue, considering only orders whose status is COMPLETE.
