As part of this session, we will see what is required to set up the environment to build Data Engineering applications. I will also explain why each tool is required.
Before getting into the details of setting up the development environment, we will also discuss the following topics.
- Quick Recap of Program Details. You can get complete details by going to this post.
- Q&A about Program
Here is the agenda for setting up the Development Environment for Data Engineering.
- Operating System - Linux based preferably Ubuntu or Mac
- Virtual Box and Virtual Machine (Requires 16 GB RAM and i7 Quad Core)
- Dual boot or Linux Only (Requires 4 GB RAM and i3, Recommended 8 GB RAM and at least i5)
- Use Mac without Linux
- Overview of setup tools - brew for Mac and snap as well as apt for Ubuntu.
- Configure passwordless login
- Overview of primary programming language of the program - Python
- Setup of Development Tools for Python based application development (PyCharm and Jupyter). PyCharm is mandatory.
- Review of git based tools for Code Versioning
- Setup of Data Sets for practice - git, Kaggle and other relevant platforms
- Overview of Scala and IntelliJ for Scala based application development
- Overview of apex.oracle.com to practice SQL
In the next couple of weeks, make sure you set up and validate all the above tools. If you run into any issues, please reply to this topic so that it will be useful for future reference. You can also follow up over the Slack channel.
We will also see other technologies as we get into the main course.
- Apache Spark
- Apache Kafka
- AWS EMR and other AWS Analytics Tools
- Many other tools, depending on individual capability
Before getting into the details let me also walk you through the support process.
- Our forums. If you are reading this then you are already in our forums.
- Slack for live updates and follow ups or escalations.
- We want to use our forums as the primary support channel for 2 reasons.
- The free Slack plan has a limit of 10,000 messages.
- Forums can become knowledge base for future reference.
- We will make sure you folks are comfortable with both of these.
- Live Sessions are typically conducted using Zoom and streamed to YouTube.
- It will be better for you to join Zoom sessions to interact with me directly.
Linux (Ubuntu) or Mac based Desktop or Laptop.
Why Ubuntu or Mac?
Traditionally we are used to Windows based desktops for development. But applications are deployed on Linux based environments.
- Using Linux or Mac based desktops or laptops will make you a lot more productive.
- Most Data Engineering tools, such as Hadoop, Kafka and Spark, work seamlessly on Linux or Mac but not on Windows.
- If you are already using a Windows based desktop, either switch to Ubuntu or set up dual boot. If you have a laptop with at least 16 GB RAM and a quad core processor, you can instead set up an Ubuntu based virtual machine using Virtual Box.
- Mac users will be able to use their laptop directly.
Setup Ubuntu Desktop on Windows
If you have a high end laptop with at least 16 GB RAM (the requirement stated in the agenda), you can consider setting up Virtual Box and then an Ubuntu based virtual machine with desktop.
- Set up Virtual Box
- Create Virtual Host
- Download Ubuntu 18.04 LTS ISO
- Setup Virtual Machine
- Install Virtual Box VM Tools
You can follow these 2 videos to understand the step by step process.
Setup Virtual Box and Setup Virtual Host
Setup Ubuntu based Virtual Machine with Desktop
Overview of Setup Tools
Let us understand the tools that are typically used to install software on Ubuntu or Mac based environments. We should get used to installing software from the command line rather than with GUI based tools.
Software Installation Life Cycle
Here are the details about Software Installation Life Cycle.
- Download Software
- Install Software
- Configure Software
- Start and Manage Services (if applicable)
Command line tools provide a lot more flexibility to manage the Software Installation Life Cycle.
On Mac, we typically use brew to install and manage software.
Click here to get instructions to install brew on Mac.
Here are some of the important commands.
# Searching for packages
brew search wget
# Getting information about packages
brew info wget
# Installing packages
brew install wget
When it comes to Ubuntu, make sure to use an LTS version rather than the latest release. We can use either apt or snap to install software packages in Ubuntu based environments. Both apt and snap are available in Ubuntu out of the box.
Here are some of the important commands using apt.
# Install software
apt install
Here are some of the important commands using snap.
# Search for software
snap find
# Install software
snap install
# Get details about software
snap info
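Before using any of the commands above, it helps to know which package managers are actually present on the machine. The following is a minimal sketch in Python (the primary language of the program), assuming only the standard library. The function name detect_package_managers is hypothetical and chosen for illustration.

```python
import shutil

# Package managers discussed in this session; extend the tuple as needed.
PACKAGE_MANAGERS = ("brew", "apt", "snap")


def detect_package_managers(names=PACKAGE_MANAGERS):
    """Return a dict mapping each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in detect_package_managers().items():
        print(f"{name}: {path if path else 'not found'}")
```

On a Mac you would typically expect only brew to be found, and on Ubuntu apt and snap.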
Configure Password Less Login
We often connect to remote servers, and SSH passwordless login makes us a bit more productive when we do. Typical tasks are:
- Troubleshoot issues
- Deploy applications
- Manage environments
Here are some of the important commands with respect to Password Less Login.
# To generate private key and public key
ssh-keygen
# To copy public key to the remote machine
ssh-copy-id
# To connect to remote machine by using custom key other than id_rsa
ssh -i LOCATION_OF_PRIVATE_KEY
Here are the advantages of using Password Less Login.
- No need to remember passwords. At most, you remember only the passphrase for your key.
- Copy files to remote machines using command line tools such as
- Ability to run commands on multiple remote machines using scripts.
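The last advantage, running commands on multiple remote machines using scripts, can be sketched as below. This is an illustration under stated assumptions, not a prescribed approach: the function names and the host names are hypothetical, and passwordless login is assumed to be configured already. The BatchMode=yes option makes ssh fail instead of prompting for a password, which suits scripts. By default the sketch only prints the commands (dry run).

```python
import shlex
import subprocess


def build_ssh_command(host, remote_command, key_path=None):
    """Build the argument list for running one command on one host over SSH."""
    # BatchMode=yes: never prompt for a password; fail if key based login does not work.
    args = ["ssh", "-o", "BatchMode=yes"]
    if key_path:
        # Use a custom private key instead of the default one.
        args += ["-i", key_path]
    args += [host, remote_command]
    return args


def run_on_hosts(hosts, remote_command, key_path=None, dry_run=True):
    """Run the command on every host, or only print it when dry_run is True."""
    results = {}
    for host in hosts:
        args = build_ssh_command(host, remote_command, key_path)
        if dry_run:
            print(shlex.join(args))
            results[host] = None
        else:
            completed = subprocess.run(args, capture_output=True, text=True)
            results[host] = completed.returncode
    return results


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    run_on_hosts(["user@host1.example.com", "user@host2.example.com"], "uptime")
```

Passing dry_run=False actually connects to each host and records the exit code of the remote command.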
Primary Programming Language — Python
We will be using Python as the primary programming language for the program.
- Extensively used in Data Engineering as well as Data Science
- Robust and easy to use APIs to interact with cloud based platforms (e.g. boto3 to work with AWS)
- Workflow tools such as Airflow use a Python based approach to build DAGs
- Rich visualization APIs with modules such as matplotlib
Setup Development Tools — Python
We need to understand and set up the required tools to develop applications using Python.
- python CLI: to run basic Python code and explore APIs.
- pip: to install packages. We can run apt -y install python3-pip to set up pip for Python 3 on an Ubuntu based desktop.
- Anaconda or Jupyter: to run Python code in an interactive fashion. These are extremely popular in Data Science based model development.
- PyCharm: a popular IDE used to develop Web or Data Engineering applications with Python as the programming language.
We can set up PyCharm using snap on Ubuntu or brew on Mac. Here are some of the advantages of using PyCharm.
- Interactive Terminal
- Ability to refactor the code
- Modularize the project
- Validate the applications using Wizards
- Integration with specialized plugins for Web or Mobile Development
- Ability to manage code with Code Versioning tools
Install PyCharm on Mac
Here are the commands to install PyCharm on Mac.
brew search pycharm
# GUI based software is generally provided as casks by brew
# pycharm-ce is the cask that needs to be installed.
# Current Homebrew uses "brew install --cask"; older releases used "brew cask install".
brew install --cask pycharm-ce
Install PyCharm on Ubuntu
Here are the commands to install PyCharm on Ubuntu.
snap find pycharm
# PyCharm community edition is provided as pycharm-community in snap store
snap install pycharm-community --classic
Code Versioning using Git
In actual projects, we develop applications as a team. Git acts as the code versioning tool and also has features to collaborate with other developers in the team.
- git is an open source code versioning tool.
- GitHub, Bitbucket and GitLab are popular Git hosting platforms.
- Validate whether you have git installed.
- If it is not available, use brew on Mac and apt on Ubuntu to install it.
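The validation step above can also be scripted. Below is a minimal sketch, assuming only the Python standard library. The helper name tool_version is hypothetical. It reports the first line printed by the tool's --version flag, which git supports.

```python
import shutil
import subprocess


def tool_version(tool):
    """Return the first line printed by `<tool> --version`, or None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    completed = subprocess.run([tool, "--version"], capture_output=True, text=True)
    # Some tools print their version to stderr, so fall back to it when stdout is empty.
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = tool_version("git")
    print(version if version is not None else "git is not installed")
```

If the function returns None for git, install it using brew on Mac or apt on Ubuntu and run the check again.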
Data Sets for Practice
We need to have appropriate Data Sets for Practice.
- We can use several public data sets available on AWS S3, Kaggle etc.
- We also provide some data sets as part of our GitHub repository.
- You can also take up generating large volume data sets as an exercise to get comfortable with programming.
Click here to get data sets from our GitHub Repository.
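As a starting point for the exercise of generating your own data sets, here is a minimal sketch using only the Python standard library. Everything about the data is an assumption made for illustration: the function name generate_orders, the column names and the value ranges are hypothetical and do not come from the data sets in our repository.

```python
import csv
import random


def generate_orders(path, row_count, seed=42):
    """Write row_count synthetic order rows (plus a header) to a CSV file and return row_count."""
    rng = random.Random(seed)  # seeded so that repeated runs produce the same file
    statuses = ["NEW", "PROCESSING", "COMPLETE", "CANCELLED"]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["order_id", "customer_id", "amount", "status"])
        for order_id in range(1, row_count + 1):
            writer.writerow([
                order_id,
                rng.randint(1, 1000),               # hypothetical customer id range
                round(rng.uniform(5.0, 500.0), 2),  # hypothetical order amount
                rng.choice(statuses),
            ])
    return row_count
```

For example, generate_orders("orders.csv", 1_000_000) writes one million rows. Increase the count gradually to see how your tools behave as the volume grows.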
Overview of Scala and IntelliJ
Scala is another popular programming language to build large scale Data Engineering applications using Kafka and Spark.
- Better to learn both Python and Scala as programming languages.
- Spark and Kafka were originally developed using Scala.
- It is one of the most popular functional programming languages and belongs to the Java family (it runs on the JVM).
- We use IntelliJ or Eclipse as IDE for Scala based application development.
Install IntelliJ on Mac
Here are the commands to install IntelliJ on Mac.
brew search intellij
# GUI based software is generally provided as casks by brew
# intellij-idea-ce is the cask that needs to be installed.
# Current Homebrew uses "brew install --cask"; older releases used "brew cask install".
brew install --cask intellij-idea-ce
Install IntelliJ on Ubuntu
Here are the commands to install IntelliJ on Ubuntu.
snap find intellij
# IntelliJ community edition is provided as intellij-idea-community in snap store
snap install intellij-idea-community --classic
Importance of SQL
SQL is one of the top three skills required by any IT professional. Oracle SQL is among the most popular and feature rich SQL implementations available in the market.
- Use https://apex.oracle.com to learn SQL on Oracle through a free web interface.
- You need to learn DDL, DML and DQL.
- As part of SQL (DQL), you need to explore all pre-defined functions and clauses such as
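To make the three categories concrete, here is a minimal sketch that runs one DDL, two DML and one DQL statement. Note the assumption: it uses SQLite through Python's built-in sqlite3 module instead of Oracle, purely so that it runs locally without any setup. The syntax of these basic statements is similar, but Oracle specific functions and data types differ. The orders table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database (an assumption for convenience; the program itself uses Oracle).
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# DDL (Data Definition Language): define the structure.
cursor.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)

# DML (Data Manipulation Language): insert and update rows.
cursor.executemany(
    "INSERT INTO orders (order_id, status, amount) VALUES (?, ?, ?)",
    [(1, "NEW", 100.0), (2, "COMPLETE", 250.0), (3, "COMPLETE", 50.0)],
)
cursor.execute("UPDATE orders SET status = 'PROCESSING' WHERE order_id = 1")
connection.commit()

# DQL (Data Query Language): read data back using functions and clauses.
cursor.execute(
    "SELECT status, COUNT(*), SUM(amount) FROM orders GROUP BY status ORDER BY status"
)
summary = cursor.fetchall()
print(summary)
connection.close()
```

The query combines aggregate functions (COUNT, SUM) with the GROUP BY and ORDER BY clauses, which is the kind of exploration the DQL point above asks for.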
Scripts - Setup Development Environment
We can run the scripts below to set up the development environment. You need to have either a Mac or an Ubuntu based desktop environment. Make sure to validate that the setup is done properly. Detailed validation steps will be covered in subsequent sessions.
Script for Mac
Once brew is set up, one can run this script to set up the environment.
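# Current Homebrew uses "brew install --cask"; older releases used "brew cask install".
brew install --cask pycharm-ce
# Switch to the home directory before cloning the data sets
cd
git clone https://www.github.com/dgadiraju/data
brew install --cask intellij-idea-ce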
Script for Ubuntu
We use apt and snap to set up software on Ubuntu.
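# Run these commands as root, or prefix the apt and snap commands with sudo
snap install pycharm-community --classic
apt -y install python3-pip
apt -y install git
# Switch to the home directory before cloning the data sets
cd
git clone https://www.github.com/dgadiraju/data
snap install intellij-idea-community --classic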
You can see all the articles on this topic here.