Programming Essentials Python - Map Reduce Libraries - Preparing Data Sets

We will primarily use the orders and order_items data sets to understand how to manipulate collections.

  • orders is available at path /data/retail_db/orders/part-00000

  • order_items is available at path /data/retail_db/order_items/part-00000

  • orders - columns

    • order_id - it is of type integer and unique

    • order_date - it can be treated as a string

    • order_customer_id - it is of type integer

    • order_status - it is of type string

  • order_items - columns

    • order_item_id - it is of type integer and unique

    • order_item_order_id - it is of type integer and refers to orders.order_id

    • order_item_product_id - it is of type integer and refers to products.product_id

    • order_item_quantity - it is of type integer and represents the number of units of the product in that order item.

    • order_item_subtotal - it is the item-level revenue (the product of order_item_quantity and order_item_product_price)

    • order_item_product_price - it is the unit price of the product for that order item.

  • orders is the parent data set of order_items and contains one record per order. Each order can contain multiple items.

  • order_items is the child data set of orders and can contain multiple entries for a given order_item_order_id.
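
The column descriptions above can be turned into a small parsing sketch. This is only an illustration under stated assumptions: the sample records are made up, the comma delimiter and the field order (same as the column lists above) are assumed, and treating the subtotal and price as floats is also an assumption. The function names parse_order and parse_order_item are hypothetical.

```python
import math

# Made-up sample records; the comma delimiter and field order are assumptions
order_line = '1,2013-07-25 00:00:00.0,11599,CLOSED'
order_item_line = '1,1,957,2,399.98,199.99'

def parse_order(line):
    """Split an orders record and convert the integer columns."""
    order_id, order_date, customer_id, status = line.split(',')
    return int(order_id), order_date, int(customer_id), status

def parse_order_item(line):
    """Split an order_items record and convert the numeric columns."""
    item_id, order_id, product_id, quantity, subtotal, price = line.split(',')
    return (int(item_id), int(order_id), int(product_id),
            int(quantity), float(subtotal), float(price))

order = parse_order(order_line)
order_item = parse_order_item(order_item_line)

# order_item_subtotal should equal quantity * product price (within float tolerance)
quantity, subtotal, price = order_item[3], order_item[4], order_item[5]
subtotal_matches = math.isclose(subtotal, quantity * price)
```

Keeping order_date as a string matches the description above; converting it to a date type is left out on purpose.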

Task 1 - Read orders into collection

Let us read the orders data set into a collection called orders. It will be used later.

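orders_path = '/data/retail_db/orders/part-00000'
# Open the file in a with block so it is closed automatically after reading
with open(orders_path) as orders_file:
    orders_raw = orders_file.read()
# Split the raw text into a list with one string per record
orders = orders_raw.splitlines()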
orders[:10]
len(orders) # same as the number of records in the file
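
Since Task 1 and Task 2 follow the same read-then-split pattern, it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original material: read_lines is a hypothetical name, and the temporary file with made-up records only exists so the example runs without the retail_db files.

```python
import os
import tempfile

def read_lines(path):
    """Return the contents of the file as a list of lines without newline characters."""
    with open(path) as f:
        return f.read().splitlines()

# Write two made-up records to a temporary file so the helper can be tried anywhere
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
    tmp.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    sample_path = tmp.name

sample_orders = read_lines(sample_path)
os.remove(sample_path)  # clean up the temporary file
```

With the actual data available, the same helper could be called as read_lines('/data/retail_db/orders/part-00000').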

Task 2 - Read order_items into collection

Let us read the order_items data set into a collection called order_items. It will be used later.

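order_items_path = '/data/retail_db/order_items/part-00000'
# Open the file in a with block so it is closed automatically after reading
with open(order_items_path) as order_items_file:
    order_items_raw = order_items_file.read()
# Split the raw text into a list with one string per record
order_items = order_items_raw.splitlines()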
order_items[:10]
len(order_items) # same as the number of records in the file
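
Once both collections are available, the parent-child relationship between orders and order_items can be checked with plain loops. The following is a minimal sketch under assumptions: it uses small made-up lists (sample_orders, sample_order_items) instead of the real files, and it assumes comma-delimited records with order_id as the first field of orders and order_item_order_id as the second field of order_items.

```python
# Made-up sample records; delimiter and field positions are assumptions
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
]

# Count how many order items each order has (a child can repeat the parent id)
items_per_order = {}
for line in sample_order_items:
    order_id = int(line.split(',')[1])
    items_per_order[order_id] = items_per_order.get(order_id, 0) + 1

# Collect the parent ids, which are expected to be unique
order_ids = {int(line.split(',')[0]) for line in sample_orders}

# Order items whose order id has no matching parent record
orphans = [oid for oid in items_per_order if oid not in order_ids]
```

Here order 2 has two items while order 1 has one, and orphans stays empty because every item refers to an existing order.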
