Handling CDC with Sqoop

cloudera
sqoop
clouderaquickstart

#1

Hey guys,
Does anyone know how to handle CDC data from structured data sources like Oracle or SQL Server using Apache Sqoop, whether the destination is HDFS or Hive? I'm looking for a methodology similar to what the usual ETL tools (SSIS, Informatica, …) provide.

Thanks in advance,
Osama


#2

@osamab

Implementing CDC (Change Data Capture) in Hadoop (HDFS or Hive) is not impossible, but there are a few challenges.

  1. It is possible only with Hive; doing it directly on HDFS is not, since HDFS files are write-once and updates cannot be applied in place (I'm pretty sure you know why :slight_smile: )
  2. CDC into Hive works only in batch mode as part of the data ingestion process, i.e. at specific intervals ranging from minutes to hours; the longer the interval in a day, the better (see the Sqoop sketch after this list).
  3. If you need this in near real time (within a few seconds or minutes), Apache Sqoop is not the best choice; go with Kafka/Flink/Spark Streaming instead.
  4. The next big thing is Hive data modeling; you need to take care of a lot of things (a Hive sketch also follows below):
    i. the Hive table should be stored as ORC and bucketed
    ii. Hive transactions (ACID) should be enabled
    iii. transaction compactions & tombstones need housekeeping
  5. An Oozie workflow to handle scheduling, data ingestion, Hive queries, & other housekeeping tasks
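
To make points 2 and 3 concrete, here is a minimal sketch of an interval-based Sqoop incremental import. The connection URL, table, and column names (CUSTOMERS, CUSTOMER_ID, LAST_UPD_TS) are hypothetical placeholders; adjust them to your source schema.

```
# Minimal sketch (hypothetical host, table, and column names):
# - "--incremental lastmodified" picks up rows whose LAST_UPD_TS is newer
#   than "--last-value" (the watermark from the previous run)
# - "--merge-key" collapses multiple changes to the same CUSTOMER_ID
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
  --username etl_user -P \
  --table CUSTOMERS \
  --incremental lastmodified \
  --check-column LAST_UPD_TS \
  --last-value "2019-01-01 00:00:00" \
  --target-dir /staging/customers \
  --merge-key CUSTOMER_ID \
  --num-mappers 4

# Alternatively, a saved "sqoop job" remembers the last-value watermark
# between runs, which fits the interval-based batch schedule described above.
```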

Since you asked for a methodology similar to ETL tools, I have outlined the comprehensive solution above.
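
And for the Hive data modeling part (point 4), a rough sketch of the target table and the merge step, assuming Hive 2.2+ with ACID enabled; the table and column names are again hypothetical, and customers_staging is assumed to be a table defined over the Sqoop target directory above.

```
-- Sketch only (hypothetical names); run e.g. via "beeline -f" as one step of
-- the batch workflow. Requires Hive ACID (hive.support.concurrency=true and
-- hive.txn.manager set to DbTxnManager) and Hive 2.2+ for MERGE.

-- Target table: ORC, bucketed, transactional (points 4.i / 4.ii)
CREATE TABLE IF NOT EXISTS customers (
  customer_id BIGINT,
  name        STRING,
  last_upd_ts TIMESTAMP
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Apply the staged delta: update changed rows, insert new ones
MERGE INTO customers t
USING customers_staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET name = s.name, last_upd_ts = s.last_upd_ts
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.last_upd_ts);

-- Compactions (point 4.iii) run in the background once the metastore
-- compactor is enabled (hive.compactor.initiator.on plus worker threads).
```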

Let me know if you need any clarifications.


#3

Thanks @ravi.tejarockon, it seems to require a lot of work. I'll have a look at it, or maybe use another tool like StreamSets or something similar.
Thanks for your help explaining this.


#4

@osamab,

Yes, that's a lot of work to do… these steps are followed by only a few companies, since most don't have a strong team and the experience, along with other technical limitations; most of them are using only Hadoop and have only recently started with Spark.

Anyway, I should thank you for mentioning StreamSets; I checked out this tool and it was quite impressive. Other alternatives to StreamSets are Apache NiFi & Hortonworks DataFlow (HDF).


#5

@ravi.tejarockon / @osamab:

Just out of curiosity, can't general ETL tools such as Informatica, Pentaho, etc. handle CDC for Hadoop? Why do we need Apache NiFi or Hortonworks DataFlow?

I come from an SAP background, and the SAP BI tool can handle CDC/delta management efficiently and accurately as batch jobs. I'm trying to understand how Apache NiFi/Hortonworks DataFlow differ from traditional ETL tools.

More info/insights appreciated.
Thanks
Venkat


#6

Hi Venkat,
I don't know about SAP, but Informatica's big data solution is still in its early stages and still has some issues; the same goes for SSIS. You'll need to twist and squeeze it to the max to handle a huge amount of data. It's as if you have a 1300cc car and want to go from 0 to 100 km/h in 3 seconds: you'll reach 100 km/h, but not in 3 seconds :slight_smile:

I would love to hear from @ravi.tejarockon and the others about their experience as well.

Thanks,
Osama


#7

@osamab has already explained about Informatica, and I don't have experience with that tool.
Coming to Pentaho, it is suitable for ETL activities in Hadoop, but I don't know at what scale it can succeed. We need to remember that Informatica and Pentaho are enterprise tools and are not open by design, in the sense that they were built around their vendor's and their clients' overall requirements. Open source systems (StreamSets, NiFi, HDF), on the other hand, are public and easy to customize or leverage for our needs, whereas enterprise systems always come with constraints. Most companies are showing interest in open source so that software costs go down and can be controlled more easily.

I know SAP very well, as I'm working on a few advanced SAP systems, and all this CDC/delta management is built in. But we need to remember one thing: these systems are not designed for a distributed system architecture, so whether these enterprise ETL tools can cater to Hadoop-scale data ingestion is still an open question.

That's why people are always interested in using open source tools; you will never run out of options.


#8

@ravi.tejarockon:

Completely agree with you, Ravi, but do you think Apache NiFi can do all the "transformations" that traditional ETL tools can, or at least most of them? I know it handles extraction and loading, but my focus is on transformations: data massaging, blocking special characters, splitting one record into multiple records, etc., to name a few.

I spoke with someone working with Informatica recently, and he also felt that OSS ETL tools will take over from traditional ETL tools for distributed computing because of the licensing cost factor.

Let's do more brainstorming on this, and I will share info when it is available.

Thanks
Venkat


#9

@avr8082,

Yes, it supports transformations on the fly as well, and it is perfectly suitable not only for RDBMS-style databases but also for streaming data sources. All we have to do is unleash the power of NiFi and it will do it.

He is right, because this is the era of OSS tools; everyone wants to reduce operational costs, licensing, etc.
Using traditional or enterprise tools with this ecosystem is not well suited to all cases.
By the way, NiFi excels at handling IoT-style streams.

In my experience, many Silicon Valley companies and other big companies are always thinking about disruption rather than the traditional approach.