Data Engineers - Overview of Python

As part of this session, we will cover the features one should learn in Python to become a Cloud based Data Engineer. This is part of the Roadmap to Data Engineering Bootcamp series, which is tentatively planned for January 2020 for students graduating in Fall 2019.

You might not be able to understand every aspect of Python used as part of the demonstration; it is primarily meant to give a practical overview of Python. You can sign up for our free course on Ubuntu to cover most of the features highlighted below.

Before getting into the details of this session, we will also discuss the topics below.

As the topic is vast, it is covered in 2 live sessions.

  • Overview of Python
  • Overview of Collections and Data Frames

Agenda for this session:

  • Common Mistakes
  • Recap of SDLC
  • Pre-requisites
  • Overview of Python CLI or Jupyter Notebook
  • Getting Started - Python
  • Define Problem Statement
  • Overview of Programming Constructs
  • Overview of Collections
  • Reading Data from Files into Collections
  • Processing Data using Loops
  • Processing Data using Pandas
  • Externalizing Properties
  • Focus Areas for Data Engineers

Common Mistakes

Let me highlight some common mistakes made by newcomers to Linux environments. You should overcome these issues as quickly as possible.

  • Typos - a serious issue caused by lack of typing skills. Practice typing and stay focused to avoid typos.
  • Spaces in paths and file names.
  • Mixing up case. In Windows, case is generally not that important, but in Linux based environments it is. We typically use camel case or underscores while naming our programs.
  • Using keywords as program names. At any cost, avoid naming programs python,, etc.

Recap of SDLC

Let us recap the Software Development Life Cycle (SDLC), so that we understand the relevance of some of the aspects covered later in this session.

  • Requirements
  • Design
  • Development (where we are involved as developers)
  • Integration (Optional)
  • UAT or Functionality Testing
  • Performance Testing (Optional)
  • Production

Only tested and certified code should be deployed to higher environments such as UAT, Production etc.

Overview of Python CLI or Jupyter Notebook

We can use either Python CLI or Jupyter Notebook to explore the APIs.

  • One can launch the Python CLI using the python command.
  • We can write one liners using the Python CLI.
  • We need to set up Jupyter Notebook. One way to set it up is by using Anaconda.
  • Once set up, we need to start the web service so that we can open the Python run time environment in a browser and write code in an interactive fashion.
  • We will mainly be using PyCharm for application development and the Python CLI to explore the APIs.
  • We can use help as part of the Python CLI to get documentation for the APIs we want to use, e.g. help(str).
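For instance, exploring the str API interactively might look like this (a minimal sketch; the help output is printed to the console):

```python
# Explore an API interactively, as you would in the Python CLI or Jupyter
s = 'hello'
print(s.upper())            # one-liner style exploration -> HELLO
print('upper' in dir(str))  # dir lists the attributes of str -> True
# help(str)                 # prints the full documentation for str
```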


Pre-requisites

Here are the pre-requisites to learn Python from a Cloud based Data Engineer perspective.

  • Make sure you are using a Mac or Ubuntu based Desktop.
  • Make sure you have PyCharm set up.
  • You should have our GitHub repository for data cloned.
  • You can follow this post to set up the Development Environment on either a Mac or an Ubuntu based Desktop.

Getting Started - Python

We will be primarily using PyCharm for application development using Python as programming language.

  • Make sure PyCharm is set up.
  • Search for PyCharm and then create a new project as demonstrated.
  • Develop a Hello World program to validate that we can use PyCharm for application development.
  • Let us also understand how we can pass run time arguments. We need to use the sys package to read run time arguments in Python programs.
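A minimal Hello World that also reads a run time argument could look like this (the script name is hypothetical):

```python
# (hypothetical script name)
import sys

def greet(name):
    return f'Hello, {name}!'

if __name__ == '__main__':
    # sys.argv[0] is the script path; run time arguments start at index 1
    name = sys.argv[1] if len(sys.argv) > 1 else 'World'
    print(greet(name))
```

Running python Alice would print Hello, Alice!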

Define Problem Statement

Let us define a problem statement to develop a program using basic programming constructs in Python.

  • Problem Statement: Get revenue for each order id
  • We have already set up the data sets on our computer.
  • There is a retail_db folder which contains 6 other folders.
  • One of them is order_items. It has 6 fields - order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price
  • order_items is a child table of orders. There will be multiple order items for a given order_item_order_id.
  • Using order_item_order_id as the key, we will compute revenue by adding up order_item_subtotal.

Overview of Programming Constructs

Here are some of the important aspects of Python Programming which one should know to be comfortable in Python.

  • Variables - Dynamically Typed
  • Defining the scope - indentation
  • Conditionals
  • Looping
  • Exception Handling (will not be covered as part of this session)
  • Developing Functions
  • Arguments and Return Values
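A toy example tying these constructs together - a dynamically typed variable, indentation based scope, a conditional inside a loop, and a function with arguments and a return value (not part of the problem statement, just a sketch):

```python
# Dynamically typed variable - no type declaration needed
threshold = 100.0

def count_large_amounts(amounts, threshold):
    # The scope of the function body is defined by indentation
    large = 0
    for amount in amounts:       # looping
        if amount >= threshold:  # conditional
            large += 1
    return large                 # return value

print(count_large_amounts([50.0, 150.0, 250.0], threshold))  # 2
```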

Overview of Collections

Fundamentally, there are 3 types of collections in Python.

  • list - Group of elements where duplicates are allowed and insertion order is preserved.
  • set - Group of unique elements
  • dict - Hash Map with keys and values where keys are unique.

For our problem statement we will

  • read data from file into list
  • construct dict with order_item_order_id as key and revenue as value

Let us see a few operations on list and dict.

  • Creating list - l = [2, 4, 3, 4, 1]
  • Getting APIs available on top of list - help(l)
  • Accessing elements in list using index
  • Using functions on top of list
  • Creating dict - d = {'order_id': 1, 'order_date': '2013-07-25', 'order_customer_id': 1, 'order_status': 'COMPLETE'}
  • Getting APIs available on top of dict - help(d)
  • Accessing values in dict using key
  • Using functions on top of dict
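The operations above can be tried out directly in the Python CLI:

```python
l = [2, 4, 3, 4, 1]
print(l[0], l[-1])                # accessing elements by index
print(len(l), sum(l), sorted(l))  # functions on top of list

d = {'order_id': 1, 'order_date': '2013-07-25',
     'order_customer_id': 1, 'order_status': 'COMPLETE'}
print(d['order_status'])          # accessing value using key
print(list(d.keys()))             # functions on top of dict
```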

Reading Data from Files into Collections

Let us understand how to read data from files into collections. I would highly recommend focusing on data processing to become proficient in Python, rather than on programs like printing triangles, quick sort etc.

We will use orders data set for now.

  • Open File - orders/part-00000
  • Read Data from file into memory
  • Convert into collection (list in this case)
  • Perform list operations on top of data.
orders_file = open('/Users/itversity/Research/data/retail_db/orders/part-00000')
orders_raw =   # read the entire file into one string
orders = orders_raw.splitlines()          # convert into a list of records

# Preview the first 10 records
for i in orders[0:10]:
  print(i)

Processing Data using Loops

Let us understand how we can use loops and conditions to compute revenue for each order id.

  • Read order_items data from files into list.
  • Process data by grouping on order_item_order_id and compute revenue for each order_id.
  • We can use dict with order_item_order_id as key and revenue as value.
d = {1: 100.0, 2: 300.0}
d[2] = d.get(2) + 200.0  # works because key 2 already exists
# d[3] = d.get(3) + 100.0 would fail - d.get(3) returns None,
# so we need to check for emptiness and then set the value instead of adding

if d.get(3):
  d[3] = d.get(3) + 100.0
else:
  d[3] = 100.0
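A more concise alternative is to pass a default value to dict.get, which avoids the explicit emptiness check:

```python
d = {1: 100.0, 2: 300.0}
# get(key, default) returns the default when the key is absent,
# so the same line works for both new and existing keys
d[3] = d.get(3, 0.0) + 100.0
d[2] = d.get(2, 0.0) + 200.0
print(d)  # {1: 100.0, 2: 500.0, 3: 100.0}
```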


order_items_file = open('/Users/itversity/Research/data/retail_db/order_items/part-00000')
order_items_raw =   # read the entire file into one string
order_items = order_items_raw.splitlines()

def get_order_revenue(order_items):
  order_revenue = {}
  for order_item in order_items:
    order_item_order_id = int(order_item.split(',')[1])
    order_item_subtotal = float(order_item.split(',')[4])
    if order_revenue.get(order_item_order_id):
      order_revenue[order_item_order_id] = order_revenue.get(order_item_order_id) + order_item_subtotal
    else:
      order_revenue[order_item_order_id] = order_item_subtotal
  return order_revenue

order_revenue = get_order_revenue(order_items)

Processing Data using Pandas

Let us understand how we can use Pandas Data Frame APIs to compute revenue for each order id.

  • We need to ensure pandas is installed.
    • To validate whether pandas is installed or not, we can use pip3 list.
    • If pandas is not installed, we can use the pip command to install it.
    • As I want to use Python 3 and it is not the default, I am using pip3 to install pandas - pip3 install pandas
  • Read order_items data from files into list.
    • We can use pandas APIs such as read_csv to read the data.
    • We can also specify schema while reading data using read_csv.
import pandas as pd

order_items = pd.read_csv(
  '/Users/itversity/Research/data/retail_db/order_items/part-00000',
  names=["order_item_id", "order_item_order_id", "order_item_product_id",
         "order_item_quantity", "order_item_subtotal", "order_item_product_price"]
)

# Projecting
order_items[["order_item_order_id", "order_item_subtotal"]]

# Filtering
order_items[order_items.order_item_order_id == 2]
  • Process data by grouping on order_item_order_id and compute revenue for each order_id.
# Aggregation
order_items.groupby("order_item_order_id")["order_item_subtotal"]. \
  sum()

# Equivalent aggregation using agg
order_items.groupby("order_item_order_id")["order_item_subtotal"]. \
  agg('sum')
  • We can also write a Data Frame to a file using APIs such as to_csv.
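For example, the aggregated revenue can be written to a file like this (the sample data and the output path /tmp/order_revenue.csv are assumptions for illustration):

```python
import pandas as pd

# Toy order_items data standing in for the retail_db files
order_items = pd.DataFrame({
    'order_item_order_id': [1, 2, 2],
    'order_item_subtotal': [100.0, 150.0, 150.0]
})

order_revenue = order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    sum(). \
    reset_index(name='order_revenue')

# Write the Data Frame to a CSV file (path is hypothetical)
order_revenue.to_csv('/tmp/order_revenue.csv', index=False)
```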

Externalizing Properties

We should avoid hard coding information related to connecting to databases, file or directory paths etc.

  • We will be connecting to different databases when the application is deployed in different environments such as Development, UAT, Production etc.
  • Also file or directory paths might be different.
  • If we hard code these details, then we will not be able to deploy certified code in higher environments such as Production.

Let us use PyCharm for development. We need to ensure the right version of the Python interpreter is used in PyCharm; in our case it is Python 3.x.

  • Create new project in PyCharm
  • Create config and src directories
  • Install configparser using pip. It can be installed using PyCharm or CLI.
  • Create a properties file with the input file path.
  • Define properties for dev, uat and prod. Load the properties of corresponding environment at run time.
import configparser as cp

props = cp.RawConfigParser()'config/')  # file name truncated in the source; point this at your properties file
props.get('dev', 'input_dir')
  • We typically pass the environment as run time argument (dev or uat or prod).
  • To read run time arguments we need to use sys.
import configparser as cp, sys

props = cp.RawConfigParser()'config/')  # file name truncated in the source; point this at your properties file
props.get(sys.argv[1], 'input_dir')
  • I will share the code in a GitHub repository, with a README file, after the session to give you the gist of GitHub.

Focus Areas for Data Engineers

Here are the topics one should know in Python as a programming language to be a successful Cloud based Data Engineer.

  • Lambda Functions
  • Collections and Map Reduce APIs - to get general idea about Collections and Map Reduce Paradigm.
  • pandas - Popular module to process data at lower volumes.
  • pyspark - Popular module to process data at scale.
  • kafka - To develop producers and consumers for Kafka Topic
  • boto3 - To programmatically interact with AWS
  • JDBC Programming - To develop applications by connecting to Databases
  • Developing Airflow DAGs
  • and more

You can see all the articles on this topic here.
