Data Engineers - Setting up the Development Environment

As part of this session we will see what is required to set up the environment to build Data Engineering applications. I will also explain why each piece is required.

Before getting into details about setting up the development environment, we will also discuss the topics below.

  • Quick Recap of Program Details. You can get complete details by going to this post.
  • Q&A about Program

Here is the agenda for setting up the Development Environment for Data Engineering.

  • Operating System - Linux based (preferably Ubuntu) or Mac
    • Virtual Box and Virtual Machine (Requires 16 GB RAM and i7 Quad Core)
    • Dual boot or Linux Only (Requires 4 GB RAM and i3, Recommended 8 GB RAM and at least i5)
    • Use Mac without Linux
  • Overview of setup tools - brew for Mac, and snap as well as apt for Ubuntu.
  • Configure passwordless login
  • Overview of primary programming language of the program - Python
  • Setup of Development Tools for Python based application development (PyCharm and Jupyter). PyCharm is mandatory.
  • Review of git based tools for Code Versioning
  • Setup of Data Sets for practice - git, Kaggle and other relevant platforms
  • Overview of Scala and IntelliJ for Scala based application development
  • Overview of apex.oracle.com to practice SQL

In the next couple of weeks, make sure you set up all of the above tools and validate them. If you run into any issues, please reply to this topic so that it will be useful for future reference. You can also follow up over the Slack channel. We will also see other technologies as we get into the main course:

  • Apache Spark
  • Apache Kafka
  • AWS EMR and other AWS Analytics Tools
  • Databricks
  • Airflow

and many other tools, depending on individual capability.

Before getting into the details let me also walk you through the support process.

  • Our forums. If you are reading this then you are already in our forums.
  • Slack for live updates and follow ups or escalations.
  • We want to use our forums as the primary support channel for 2 reasons.
    • Free Slack has a limitation of 10,000 messages.
    • Forums can become a knowledge base for future reference.
  • We will make sure you folks are comfortable with both of these.
  • Live Sessions are typically conducted using Zoom and streamed to YouTube.
  • It will be better for you to join Zoom sessions to interact with me directly.

Operating Systems

Linux (Ubuntu) or Mac based Desktop or Laptop.

Why Ubuntu or Mac?

Traditionally we are used to Windows based desktops for development. But applications are deployed on Linux based environments.

  • Using Linux or Mac based Desktops or Laptops will make you a lot more productive.
  • Most of the Data Engineering tools work seamlessly with Linux or Mac but not Windows. Examples are Hadoop, Kafka, Spark etc.
  • If you are already using a Windows based desktop, then either switch to Ubuntu, set up Dual Boot, or, if you have a laptop with at least 16 GB RAM and a Quad Core processor, set up an Ubuntu based Virtual Machine using Virtual Box.
  • Mac users will be able to use their laptop directly.

Setup Ubuntu Desktop on Windows

If you have a high end laptop with at least 16 GB RAM, you can consider setting up Virtual Box and then an Ubuntu based Virtual Machine with Desktop.

  • Set up Virtual Box
  • Create Virtual Host
  • Download Ubuntu 18.04 LTS ISO
  • Setup Virtual Machine
  • Install Virtual Box VM Tools

You can follow these 2 videos to understand the step by step process.

Setup Virtual Box and Setup Virtual Host

Setup Ubuntu based Virtual Machine with Desktop

Overview of Setup Tools

Let us understand the tools that are typically used to install software on Ubuntu or Mac based environments. We should get used to installing software using the command line rather than GUI based tools.

Software Installation Life Cycle

Here are the details about Software Installation Life Cycle.

  • Download Software
  • Install Software
  • Configure Software
  • Start and Manage Services (if applicable)

Command line tools provide a lot more flexibility to manage the Software Installation Life Cycle.

Apple Mac

On Mac, we typically use brew to install and manage software.

Click here to get instructions to install brew on Mac.

Here are some of the important commands.

# Searching for packages
brew search wget

# Getting information about packages
brew info wget

# Installing packages
brew install wget

Ubuntu LTS

When it comes to Ubuntu, make sure to use an LTS version rather than the latest release. We can use either apt or snap to install software packages in an Ubuntu based environment.

Both apt and snap are available in Ubuntu out of the box.

Here are some of the important commands using apt.

# Install software (wget is only an example package; apt needs root privileges)
sudo apt install wget

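Here are some of the important commands using snap.

# Search for software
snap find pycharm

# Install software (snaps such as PyCharm require classic confinement)
snap install pycharm-community --classic

# Get details about software
snap info pycharm-community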

Configure Password Less Login

Using SSH passwordless (key based) login makes us more productive, because we often connect to remote servers to:

  • Troubleshoot the issues
  • Deploy the applications
  • Manage the environments

Here are some of the important commands with respect to Password Less Login.

# To generate private key and public key
ssh-keygen

# To copy public key to the remote machine
ssh-copy-id USER@REMOTE_HOST

# To connect to remote machine by using custom key other than id_rsa
ssh -i LOCATION_OF_PRIVATE_KEY USER@REMOTE_HOST

Advantages

Here are the advantages of using Password Less Login.

  • No need to remember passwords; at most, you remember only the passphrase of the key.
  • Copy files to remote machines from scripts using command line tools such as scp and rsync.
  • Ability to run commands on multiple remote machines using scripts.
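As a rough illustration of the last point, here is a minimal Python sketch that runs the same command on several machines over SSH. It assumes passwordless login is already configured for each host. The function names and host names are hypothetical and used only for illustration. The `BatchMode=yes` option is a standard OpenSSH setting that makes `ssh` fail instead of prompting for a password.

```python
import shlex
import subprocess


def build_ssh_command(host, remote_command, key_path=None):
    """Build the argument list for running one command on one host over SSH."""
    # BatchMode=yes: fail instead of prompting when key based login is not set up
    args = ["ssh", "-o", "BatchMode=yes"]
    if key_path is not None:
        args += ["-i", key_path]  # custom private key, same as `ssh -i`
    args += [host, remote_command]
    return args


def run_on_hosts(hosts, remote_command, key_path=None, timeout=30):
    """Run the same command on each host and collect (return code, output)."""
    results = {}
    for host in hosts:
        completed = subprocess.run(
            build_ssh_command(host, remote_command, key_path),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results[host] = (completed.returncode, completed.stdout.strip())
    return results


if __name__ == "__main__":
    # Hypothetical host names; replace with machines that have your public key.
    hosts = ["user@host1.example.com", "user@host2.example.com"]
    for host in hosts:
        # Only print the commands here; call run_on_hosts(hosts, "uptime") to execute.
        print(shlex.join(build_ssh_command(host, "uptime")))
```

The script only prints the commands it would run, so it is safe to try without any remote machine. Calling `run_on_hosts` executes them and needs the `ssh` client installed locally.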

Primary Programming Language — Python

We will be using Python as the primary programming language for the program.

  • Extensively used in Data Engineering as well as Data Science
  • Robust and easy to use APIs to interact with cloud based platforms (e.g., boto3 to work with AWS)
  • Workflow tools such as Airflow use a Python based approach to build DAGs
  • Rich visualization APIs with modules such as matplotlib

Setup Development Tools — Python

We need to understand and set up the required tools to develop applications using Python.

  • python CLI - to run basic Python code and explore APIs.
  • pip - to install packages. We can run sudo apt -y install python3-pip to set up pip for Python 3 on an Ubuntu based Desktop.
  • Anaconda or Jupyter - to run Python code in an interactive fashion. These are extremely popular in Data Science based Model Development.
  • PyCharm - a popular IDE used to develop Web or Data Engineering applications with Python as the programming language.

We can set up PyCharm using snap. Here are some of the advantages of using PyCharm.

  • Interactive Terminal
  • Ability to refactor the code
  • Modularize the project
  • Validate the applications using Wizards
  • Integration with specialized plugins for Web or Mobile Development
  • Ability to manage code with Code Versioning tools

Install PyCharm on Mac

Here are the commands to install PyCharm on Mac.

brew search pycharm

# GUI based software is generally provided as casks by brew
# pycharm-ce is the cask that needs to be installed.
brew install --cask pycharm-ce

Install PyCharm on Ubuntu

Here are the commands to install PyCharm on Ubuntu.

snap find pycharm

# PyCharm community edition is provided as pycharm-community in snap store
snap install pycharm-community --classic

Code Versioning using Git

In actual projects we develop applications as a team. Git acts as the code versioning tool and also has features to collaborate with other developers in the team.

  • git is an open source code versioning tool.
  • GitHub, Bitbucket and GitLab are popular git based platforms.
  • Validate whether git is already installed.
  • If it is not available, use brew on Mac and snap or apt on Ubuntu to install it.
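One way to check whether git (or any other command line tool) is available is sketched below in Python. This is only an illustrative helper, not part of the program's scripts. The function name `tool_version` and the list of tools are assumptions, and it relies on each tool accepting a `--version` flag.

```python
import shutil
import subprocess


def tool_version(tool, version_flag="--version"):
    """Return the first line printed by `<tool> --version`, or None if not on PATH."""
    if shutil.which(tool) is None:
        return None
    completed = subprocess.run(
        [tool, version_flag], capture_output=True, text=True
    )
    # Some tools print their version to stderr instead of stdout
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Tools mentioned in this post; extend the list as you install more.
    for tool in ["git", "python3", "pip3"]:
        version = tool_version(tool)
        print(f"{tool}: {version if version is not None else 'NOT FOUND'}")
```

Any tool reported as NOT FOUND can then be installed with brew, apt or snap as described above.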

Data Sets for Practice

We need to have appropriate Data Sets for Practice.

  • We can use several public data sets available on AWS S3, Kaggle etc.
  • We also provide some data sets as part of our GitHub repository.
  • You can also take up generating high volume Data Sets as an exercise, which helps you get comfortable with programming.

Click here to get data sets from our GitHub Repository.
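For the exercise of generating your own data sets, a minimal Python sketch could look like the following. The column names, value ranges, and the `generate_orders` function are made up for illustration and do not mirror the data sets in the repository.

```python
import csv
import os
import random
import tempfile
from datetime import date, timedelta


def generate_orders(path, row_count, seed=42):
    """Write row_count synthetic order records to a CSV file at path."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    statuses = ["NEW", "PROCESSING", "COMPLETE", "CANCELLED"]
    start = date(2020, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "order_date", "customer_id", "status", "amount"])
        for order_id in range(1, row_count + 1):
            writer.writerow([
                order_id,
                (start + timedelta(days=rng.randrange(365))).isoformat(),
                rng.randint(1, 1000),
                rng.choice(statuses),
                f"{rng.uniform(5, 500):.2f}",
            ])


if __name__ == "__main__":
    # Written to the system temp directory; change the path and row count as needed.
    target = os.path.join(tempfile.gettempdir(), "orders.csv")
    generate_orders(target, 100_000)
    print(f"Wrote {target}")
```

Increasing `row_count` is a simple way to produce larger files to practice with.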

Overview of Scala and IntelliJ

Scala is another popular programming language to build large scale Data Engineering applications using Kafka and Spark.

  • Better to learn both Python and Scala as programming languages.
  • Spark and Kafka were originally developed using Scala.
  • It is a popular functional programming language that runs on the JVM, so it belongs to the Java family.
  • We use IntelliJ or Eclipse as IDE for Scala based application development.

Install IntelliJ on Mac

Here are the commands to install IntelliJ on Mac.

brew search intellij

# GUI based software is generally provided as casks by brew
# intellij-idea-ce is the cask that needs to be installed.
brew install --cask intellij-idea-ce

Install IntelliJ on Ubuntu

Here are the commands to install IntelliJ on Ubuntu.

snap find intellij

# IntelliJ community edition is provided as intellij-idea-community in snap store
snap install intellij-idea-community --classic

Importance of SQL

SQL is one of the top three skills required by any IT Professional. Oracle SQL is among the most popular and advanced SQL implementations available in the market.

  • Use https://apex.oracle.com to practice Oracle SQL through a free web interface.
  • You need to learn DDL, DML and DQL.
  • As part of SQL (DQL), you need to explore the pre-defined functions and clauses such as FROM, WHERE, JOIN, GROUP BY, HAVING, SELECT, ORDER BY etc.
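To see DDL, DML and DQL together, here is a small sketch that uses Python's built-in sqlite3 module rather than Oracle, so it runs without any account or installation. The tables and rows are invented for illustration. The clauses work the same way on apex.oracle.com, although Oracle uses different data type names (for example NUMBER and VARCHAR2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define the tables
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        status TEXT,
        amount REAL
    );
""")

# DML: insert rows
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben"), (3, "Chen")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 1, "COMPLETE", 120.0),
        (2, 1, "COMPLETE", 80.0),
        (3, 2, "COMPLETE", 50.0),
        (4, 2, "CANCELLED", 300.0),
        (5, 3, "COMPLETE", 40.0),
    ],
)

# DQL: revenue per customer from completed orders, keeping totals of at least 50
query = """
    SELECT c.name, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.status = 'COMPLETE'
    GROUP BY c.name
    HAVING SUM(o.amount) >= 50
    ORDER BY revenue DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Asha', 200.0), ('Ben', 50.0)]
```

WHERE filters individual rows before grouping, while HAVING filters the grouped totals, which is why the cancelled order and the customer with only 40 in revenue do not appear.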

Scripts - Setup Development Environment

We can run the scripts below to set up the development environment. You need to have either a Mac or an Ubuntu based Desktop environment. Make sure to validate that the setup is done properly. Detailed validation steps will be covered in subsequent sessions.

Script for Mac

Once brew is set up, one can run this script to set up the environment.

brew install --cask pycharm-ce
cd
git clone https://www.github.com/dgadiraju/data
brew install --cask intellij-idea-ce

Script for Ubuntu

We use apt or snap to set up software in Ubuntu.

snap install pycharm-community --classic
# apt needs root privileges
sudo apt -y install python3-pip
sudo apt -y install git
cd
git clone https://www.github.com/dgadiraju/data
snap install intellij-idea-community --classic

You can see all the articles on this topic from here.
