Programming Essentials Python - Manipulating Collections - Overview of Standard Transformations

In this article, we will delve into the standard transformations that can be applied to data in collections. We will cover various aspects such as filtering, row-level transformations, total aggregations, grouped aggregations, sorting, and ranking. These transformations are essential for cleaning and processing data efficiently.

Key Concepts Explanation

Filtering

Filtering involves selecting specific data based on certain criteria. For example:

# Get all the revenue generating orders with order_status as COMPLETE or CLOSED
revenue_orders = orders.filter(lambda x: x['order_status'] == 'COMPLETE' or x['order_status'] == 'CLOSED')

# Get sales details of dealers within a state
state_sales = sales.filter(lambda x: x['state'] == 'California')

Row-Level Transformations

Row-level transformations involve standardizing, masking, or cleansing data at an individual row level. For instance:

# Convert names to upper case
df['Name'] = df['Name'].apply(lambda x: x.upper())

# Standardize Phone Numbers
df['Phone'] = df['Phone'].apply(lambda x: standardize_phone(x))

Total Aggregations

Total aggregations compute the sum of values for a specific attribute. For example:

# Calculate revenue generated by a product
product_revenue = sales.groupby('product_id')['revenue'].sum()

Grouped Aggregations

Grouped aggregations involve aggregating data based on a specific grouping attribute. For instance:

# Calculate revenue generated by each product
product_revenue = sales.groupby('product_id')['revenue'].sum()

# Calculate sales commission for each channel partner
partner_commission = sales.groupby('channel_partner')['commission'].sum()

Sorting and Ranking

Sorting entails arranging data in a particular order, while ranking assigns a specific numerical rank to each element based on a certain criterion. For example:

# Get top 5 stores by revenue
top_stores = sales.groupby('store_id')['revenue'].sum().nlargest(5)

# Get top 5 car dealers by sales
top_dealers = sales.groupby('dealer_id')['sales'].sum().nlargest(5)

Hands-On Tasks

  1. Filter orders with ‘CLOSED’ status
  2. Standardize the phone numbers in the dataset
  3. Calculate the total revenue generated by each store

Conclusion

In this article, we explored the standard transformations that can be applied to data collections. By implementing these concepts, you can efficiently clean and process your data for further analysis. Practice these tasks and engage with the community for further learning and development.

Watch the video tutorial here