Transition to Data Engineer using the Big Data ecosystem


#1

Are you a working professional with experience in one of the roles below who wants to transition to Data Engineer?

  • Mainframes Developer
  • ETL Developer using technologies like Informatica, Ab Initio, DataStage, SAS etc.
  • Data Warehouse Developer
  • Database Developer
  • Application Developers should have an idea of Data Engineering tools and technologies, but it need not be a better career choice for them.

Here is the plan to become a Data Engineer using the Big Data ecosystem:

  • Good understanding of Linux commands and the ability to understand as well as develop shell scripts
  • Expertise in writing high-quality, efficient SQL
  • Good understanding of Data Modeling - both normalized data models and Dimensional Modeling
  • Good core programming skills in any programming language - preferably Python, Scala or Java (Object Oriented concepts are not that important)
  • Expertise in Spark - Data Frame operations, Spark SQL. One should be able to develop Scala, Python or Java based applications using Spark APIs
  • SQL based tools in Big Data - Spark SQL, Hive, Impala, Presto etc.
  • Ability to build batch data pipelines using a programming language and Spark, with scheduling tools such as Azkaban, Airflow or any other enterprise scheduler
  • High-level understanding of NoSQL technologies such as HBase, Cassandra, MongoDB etc., with expertise in one of them
  • Real-time data ingestion using tools like Kafka, integrated with Spark Streaming to apply rules in real time and derive streaming insights
  • Good knowledge of Amazon EMR and other analytics services such as Kinesis, Athena etc.
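
As a small illustration of the SQL and Dimensional Modeling points in the plan, here is a minimal sketch of a star schema (one fact table joined to one dimension table) with an aggregate query. It uses Python's built-in `sqlite3` module only so that it runs anywhere. The table names (`dim_product`, `fact_sales`), columns, and data are hypothetical, and the same query shape would apply in Hive or Spark SQL.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one dimension table and one fact table.
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 40.0);
""")

# Join the fact table to the dimension and aggregate per category.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

conn.close()
print(rows)
```

The fact table holds the measures (`amount`) and the dimension holds the descriptive attributes (`category`), which is the core idea behind dimensional modeling.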

Please add any further information in your replies.


#2

Dear @dgadiraju sir,

Thanks a lot for providing a crystal-clear transition plan to become a Data Engineer using the Big Data ecosystem.
You’re absolutely right on every point, but I would like to suggest one change, as below:

On the core programming point in the plan, I think we need good skills in any two of Python, Scala and Java. One language alone does not help much in a job search, as most Spark projects are in Scala and Java, and some are in Python as well. Employers also expect not just core programming but Object Oriented (OOP) and functional programming, because only then can we implement complex real-time scenarios. Limited knowledge is fine for freshers, but for experienced software engineers the market is keen on these critical skills. As for Python, it is good to have for scaling your profile in the future.

Below are my additions to the above list:

  • Spark architecture (intermediate & advanced knowledge) & different design patterns for batch & stream processing.

  • Complex file formats like JSON, Avro & Parquet.

  • Best Practices & Optimization techniques with Spark, Kafka, Hive, HBase (or any NoSQL).

  • Data Modeling in NoSQL databases.


#3

Can you name some of the design patterns you can think of?


#4

@viswanath.raju,

Please find below:

  1. Spark Batch Design Patterns:
  • ETL Data Pipeline patterns: CDC, SCD Type I & II.
  • MapReduce Design patterns (Summarization, Filtering, Data Organization, Join, Metapatterns).
  2. Spark Streaming Design Patterns:
  • Windowing & time-based window aggregations, rolling windows.
  • Back pressure
  • Streaming Data ingestion from SQL & NoSQL
  • Anomaly & fraud detection
  • Complex Event Processing (CEP)
  • Real-time & Near Real-time
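
To make the windowing pattern from the list above concrete, here is a minimal plain-Python sketch of a tumbling (fixed-size, non-overlapping) time window count. It is not Spark code, only an illustration of the idea. The event data and the helper names `window_start` and `tumbling_counts` are hypothetical.

```python
from collections import defaultdict


def window_start(ts, size):
    """Return the start of the fixed-size window that contains timestamp ts."""
    return ts - (ts % size)


def tumbling_counts(events, size):
    """Count events per (window start, key) over non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        counts[(window_start(ts, size), key)] += 1
    return dict(counts)


# Hypothetical (epoch_seconds, user) click events.
events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]

# 60-second tumbling windows: [0, 60), [60, 120), [120, 180), ...
result = tumbling_counts(events, size=60)
print(result)
```

In Spark itself this grouping is usually written with the `window` function from `pyspark.sql.functions` rather than by hand, but the bucketing logic is the same: each event is assigned to the window whose start is the timestamp rounded down to a multiple of the window size.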

We don’t need to learn them all; any three would be enough to understand the concept, and we can make time for the other patterns later.
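
As one example of picking a single pattern to start with, here is a minimal plain-Python sketch of SCD Type II from the batch list: when a tracked attribute changes, the current row is closed with an end date and a new version is appended, so history is preserved. The row layout (`id`, `city`, `start_date`, `end_date`) and the function name `apply_scd2` are hypothetical, and a real pipeline would do this with Spark joins or a merge rather than Python lists.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Return a new dimension with SCD Type II changes applied.

    A row with end_date None is the current version for its id.
    """
    # Copy rows so the input dimension is left untouched.
    result = [dict(row) for row in dimension]
    current = {row["id"]: row for row in result if row["end_date"] is None}

    for rec in incoming:
        existing = current.get(rec["id"])
        if existing is not None and existing["city"] == rec["city"]:
            continue  # nothing changed, keep the current version open
        if existing is not None:
            existing["end_date"] = today  # close the old version
        new_row = {
            "id": rec["id"],
            "city": rec["city"],
            "start_date": today,
            "end_date": None,
        }
        result.append(new_row)
        current[rec["id"]] = new_row
    return result


# Hypothetical data: customer 1 moves city, customer 2 is new.
dimension = [
    {"id": 1, "city": "Pune", "start_date": date(2020, 1, 1), "end_date": None},
]
incoming = [{"id": 1, "city": "Hyderabad"}, {"id": 2, "city": "Chennai"}]

history = apply_scd2(dimension, incoming, today=date(2021, 6, 1))
for row in history:
    print(row)
```

SCD Type I would instead overwrite `city` in place and keep no history, which is why Type II is the one worth practising first.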


#5

How do we gain experience in these skills, especially on Spark projects? I assume anyone coming from an application development background already has NoSQL, SQL and Java skills?
I need your help or suggestions to get a good hold on Spark, Hive, architectures…


#6

For someone moving to the Big Data world from the roles mentioned above, this seems to be an overwhelming task.


#7

Hi @venkatwilliams,

Answers below inline:

You can find many Big Data projects in GitHub repos and a few good blogs. You can practice them in the Itversity labs.

As you said, anyone coming from an application development background is expected to have at least minimal knowledge of Java (or a similar language with OOP basics), and SQL is basic. So NoSQL is the one new thing to learn.

Start with the basics, then increase your pace until you become an expert. Just one step at a time. :slight_smile:


#8

It would be great to prepare a course on real-world production scenarios. Even though we can practice on the datasets, you need to know the intricacies to get through interviews.


#9

@shubh.mshr,

I will not deny it; it actually is overwhelming. Year by year, the industry expects more skills from Big Data Engineers. Luckily for you, we kept it as simple as possible so that we would not scare anyone. Believe me, this transition is possible and it works.

If it still feels overwhelming, there is a simple solution: break it down into pieces. First learn according to Durga sir’s transition plan. Once you are an expert in those topics, move on to my additions to the plan. Otherwise, leave them aside and go ahead and give it a try. I am just letting you know that there are a few more advanced technologies ahead, that’s all.

All the best.


#10

Well, thanks for your input. I started learning Hadoop 3 months ago, and it took me a lot of time to grasp the conceptual details of distributed programming. I am from an ETL background and good at Python and shell scripting. I recently started with Spark, and it is fairly easy once you understand what is going on behind the scenes. It would be great if you could provide insight into some sample interview scenarios. I know it is a lot to ask, but I believe it will help a lot of people.
Thanks


#11

@shubh.mshr,

Yes, initially it takes a lot of time to understand; once you do, you will learn more. Get good hands-on experience with Spark, it is very important. I could provide sample interview scenarios, but then you would stop learning :stuck_out_tongue:

Just kidding, I will share some, don’t worry.


#12

Can you share a few GitHub projects… and Spark project blogs…


#13

I will post them in a separate thread and share the link here.


#14

Thanks, Ravi and Durga Sir. This is an excellent compilation. I cleared CCA-175 recently and am now trying to understand Spark architecture etc. in detail. There are no free meals; only a hard and honest approach can drive you forward.
I found the Databricks Spark Summits to be a good resource, as they come from the creators of Spark and contain so much information.


#15

I am planning to conduct a live workshop on Spark architecture very soon.