Data Engineering Projects

EDIT: Added 5th project

Hi guys,

A lot of us learn Big Data technologies and data analytics tools and languages by going through tutorials, but in my opinion applying them to a project helps us understand the nuances of the technology and tools better.

Below is a list of 5 Data Engineering projects that I found on the internet. The projects are coding challenges for the Insight Data Engineering program. The challenges can be completed using plain Python code, or you can use Big Data technologies. Hope you find them useful.

  1. Twitter Challenge:
  2. NASA fansite analytics challenge:
  3. Digital Wallet:
  4. Venmo Challenge:
  5. Anomaly Detection:

** The projects are on GitHub and publicly available. If you are using a project for any commercial purpose, don’t forget to credit the original author and cite the URL.
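
To give a rough idea of what the plain Python route can look like, here is a minimal sketch that counts the most frequent hosts in a few log lines. The log layout (host as the first whitespace-separated field), the sample lines, and the function name top_hosts are all assumptions made up for illustration; they are not taken from any of the challenges above.

```python
from collections import Counter


def top_hosts(log_lines, n=3):
    """Return the n most frequent hosts as (host, count) pairs.

    Assumes a hypothetical layout where the host is the first
    whitespace-separated field of each line.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Made-up sample data, for illustration only.
sample = [
    "host-a GET /index.html 200",
    "host-b GET /about.html 200",
    "host-a GET /missing.html 404",
]

print(top_hosts(sample, n=2))
```

The same counting logic could later be moved to a Big Data tool once the input no longer fits comfortably on one machine.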


Thanks for sharing. I was looking for something like this.

Excellent!!
Thanks for sharing!!

Hi, thanks for these details, it's a great help. I am new to Hadoop. Could I please get the code for these projects so that I can learn and check whether my work is correct? Can anyone help by providing simple projects along with answers? Your help will be greatly appreciated.

I have completed the NASA fansite analytics challenge in Python. You can take a look at it here.

Nice, pramod. This is my first time seeing this blog, and it is helpful.

Hi pramodvspk and all,

In a real-time Spark big data solution, which is better: RDD or DataFrame/Dataset?

I still see some companies using core RDDs for cleansing and transformation. Is there any harm in using RDDs, given that DataFrames are better optimized than RDDs?

Suresh Selvaraj