Programming Essentials Python - Manipulating Collections - Row Level Transformations

Let us understand how to perform row level transformations using orders data set. Here are the details about orders.

  • Data is in text file format
  • Each line in the file contains one record.
  • Each record contains 4 attributes which are separated by “,”
    • order_id
    • order_date
    • order_customer_id
    • order_status
ls -ltr /data/retail_db/orders/part-00000
tail /data/retail_db/orders/part-00000
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000

orders_file = open(path)
type(orders_file)
orders_raw = orders_file.read()
type(orders_raw)
orders_raw.splitlines?
orders = orders_raw.splitlines()
type(orders)
orders[:10]
type(orders[0])
len(orders)
wc -l /data/retail_db/orders/part-00000

Task 1

Get all order ids and associated statuses. Each record in the output should be a comma-separated string.

order = '1,2013-07-25 00:00:00.0,11599,CLOSED' # -> '1,CLOSED'
':'.join(['1', '2', '3', '4'])
order.split(',')[0]
order.split(',')[3]
[order.split(',')[0], order.split(',')[3]]
','.join([order.split(',')[0], order.split(',')[3]])

# Generate the order statuses
order_statuses = []

for order in orders:
    order_statuses.append(','.join([order.split(',')[0], order.split(',')[3]]))
order_statuses[:10]
len(order_statuses)

# Alternative Solution
order_statuses = [','.join([order.split(',')[0], order.split(',')[3]]) for order in orders]
order_statuses[:10]
len(order_statuses)

Task 2

Get all order ids, the dates on which the order is placed, and order status. Each record in the output should be a dictionary with the following column names as keys: ‘order_id’, ‘order_date’, ‘order_status’

def get_order_details(order):
    order_values = order.split(',')
    return ({
        'order_id': int(order_values[0]),
        'order_date': order_values[1],
        'order_status': order_values[3]
    })

get_order_details('1,2013-07-25 00:00:00.0,11599,CLOSED')

# Extract order details for all orders
order_details = []

for order in orders:
    order_details.append(get_order_details(order))
order_details[:10]
len(order_details)

# Alternative Solution
order_details = [get_order_details(order) for order in orders]
order_details[:10]
len(order_details)

Watch the video tutorial here