We will primarily use the orders and order_items data sets to understand how to manipulate collections.
- orders is available at path /data/retail_db/orders/part-00000
- order_items is available at path /data/retail_db/order_items/part-00000
- orders - columns
  - order_id - it is of type integer and unique
  - order_date - it can be treated as a string
  - order_customer_id - it is of type integer
  - order_status - it is of type string
- order_items - columns
  - order_item_id - it is of type integer and unique
  - order_item_order_id - it is of type integer and refers to orders.order_id
  - order_item_product_id - it is of type integer and refers to products.product_id
  - order_item_quantity - it is of type integer and represents the number of units of the product in the order item
  - order_item_subtotal - it is the item-level revenue (product of order_item_quantity and order_item_product_price)
  - order_item_product_price - it is the price of the product for each item within an order
- orders is the parent data set to order_items and contains one record per order. Each order can contain multiple items.
- order_items is the child data set to orders and can contain multiple entries for a given order_item_order_id.
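To make the column layout and the parent-child relationship concrete, here is a minimal sketch. It assumes that each record is a comma-separated line with the fields in the order listed above, and that the subtotal and price can be read as floats; neither assumption is stated by the data set description itself. The sample records and the helper names parse_order and parse_order_item are hypothetical and exist only for illustration.

```python
# Hypothetical sample records; the values are made up for illustration.
# Assumption: fields are comma-separated and appear in the order listed above.
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
]


def parse_order(line):
    """Split one orders record and convert the integer columns."""
    order_id, order_date, order_customer_id, order_status = line.split(',')
    return int(order_id), order_date, int(order_customer_id), order_status


def parse_order_item(line):
    """Split one order_items record; subtotal and price are read as floats."""
    fields = line.split(',')
    return (
        int(fields[0]),    # order_item_id
        int(fields[1]),    # order_item_order_id
        int(fields[2]),    # order_item_product_id
        int(fields[3]),    # order_item_quantity
        float(fields[4]),  # order_item_subtotal
        float(fields[5]),  # order_item_product_price
    )


# Group the child records by order_item_order_id
items_by_order = {}
for line in sample_order_items:
    item = parse_order_item(line)
    items_by_order.setdefault(item[1], []).append(item)

# For each parent order, show how many child items refer to it
for line in sample_orders:
    order = parse_order(line)
    print(order[0], order[3], len(items_by_order.get(order[0], [])))
```

In this sample, order 1 has one item and order 2 has two, which is the one-to-many relationship described above.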
Task 1 - Read orders into collection
Let us read the orders data set into a collection called orders. It will be used later.
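orders_path = '/data/retail_db/orders/part-00000'
# The with block closes the file as soon as the contents are read
with open(orders_path) as orders_file:
    orders_raw = orders_file.read()
# Each line of the file becomes one string in the list
orders = orders_raw.splitlines()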
orders[:10]
len(orders) # same as the number of records in the file
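The count matches the number of records because splitlines does not produce an empty string for the newline at the end of the file. The short sketch below contrasts it with split('\n') on a hypothetical two-record string; the string contents are made up for illustration.

```python
# Hypothetical file contents: two records and a trailing newline
raw = 'first record\nsecond record\n'

lines_splitlines = raw.splitlines()
lines_split = raw.split('\n')

print(len(lines_splitlines))  # 2
print(len(lines_split))       # 3, the last element is an empty string
```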
Task 2 - Read order_items into collection
Let us read the order_items data set into a collection called order_items. It will be used later.
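order_items_path = '/data/retail_db/order_items/part-00000'
# The with block closes the file as soon as the contents are read
with open(order_items_path) as order_items_file:
    order_items_raw = order_items_file.read()
# Each line of the file becomes one string in the list
order_items = order_items_raw.splitlines()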
order_items[:10]
len(order_items) # same as the number of records in the file
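Both tasks follow the same steps, so they could be wrapped in one small function. The sketch below is only one possible way to do that; the name read_lines is hypothetical, and the demonstration uses a temporary file so that it runs even where the retail_db files are not present.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file as a list of strings (hypothetical helper)."""
    with open(path) as f:
        return f.read().splitlines()


# Demonstration on a temporary file with made-up contents
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('record one\nrecord two\n')
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
print(demo_lines)

# With the data set in place, the same helper would cover both tasks:
# orders = read_lines('/data/retail_db/orders/part-00000')
# order_items = read_lines('/data/retail_db/order_items/part-00000')
```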