Spark/JDBC/HDFS/Hive Auto incremental load


#1

I got this question in an interview and haven't been able to figure it out yet, so I am posting it here.

Question:
Data in a SQL database changes every 20 minutes. Build an efficient pipeline whose end result is an up-to-date Hive table and Spark DataFrame.

My approach was to run a Sqoop incremental job every 20 minutes to ingest data from the SQL database into HDFS. But I got stuck on the Hive side: the table is not ACID compliant (this was a given condition), so how can the changes be applied? Overwriting the whole table is not a good option because it is slow.
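To make the requirement concrete, here is a minimal plain-Python sketch of the merge semantics I am after: combine the existing data with the newly ingested rows and keep only the latest version of each record. The column names `id` and `modified_date` and the sample rows are assumptions for illustration only; the real table may use different keys.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def reconcile(base: Iterable[Row], delta: Iterable[Row],
              key: str = "id", ts: str = "modified_date") -> List[Row]:
    """Return one row per key: the one with the greatest timestamp."""
    latest: Dict[object, Row] = {}
    # Delta rows come after base rows, so on a timestamp tie the delta wins.
    for row in list(base) + list(delta):
        current = latest.get(row[key])
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical sample data: id 2 was updated, id 3 is new.
base = [
    {"id": 1, "name": "a", "modified_date": "2024-01-01 10:00:00"},
    {"id": 2, "name": "b", "modified_date": "2024-01-01 10:00:00"},
]
delta = [
    {"id": 2, "name": "b2", "modified_date": "2024-01-01 10:20:00"},
    {"id": 3, "name": "c", "modified_date": "2024-01-01 10:20:00"},
]
reconciled = reconcile(base, delta)
```

The open question is how to get this result into a non-ACID Hive table every 20 minutes without rewriting everything.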

How can we achieve the same thing in a Spark DataFrame without reloading all the data?
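One direction I have considered, as a rough sketch only: read just the changed rows over JDBC by passing a subquery as the `dbtable` option, then union them with the DataFrame that is already loaded and keep the newest row per key. The table name `orders`, the columns `id` and `modified_date`, and the helper names below are all hypothetical, and I am assuming the source table has a reliable last-modified column. I am not sure this is the most efficient way.

```python
from datetime import datetime


def incremental_dbtable(table: str, ts_col: str, last_value: datetime) -> str:
    """Build a subquery for the JDBC `dbtable` option that selects
    only the rows modified after `last_value`."""
    stamp = last_value.strftime("%Y-%m-%d %H:%M:%S")
    return f"(SELECT * FROM {table} WHERE {ts_col} > '{stamp}') incr"


def load_delta(spark, jdbc_url: str, table: str, ts_col: str,
               last_value: datetime):
    """Read only the changed rows; `spark` is an existing SparkSession."""
    return (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", incremental_dbtable(table, ts_col, last_value))
        .load()
    )


def merge_latest(base_df, delta_df, key: str, ts_col: str):
    """Union the old and new rows and keep the newest row per key."""
    # Imported here so the query helper above works without PySpark installed.
    from pyspark.sql import Window, functions as F

    w = Window.partitionBy(key).orderBy(F.col(ts_col).desc())
    return (
        base_df.unionByName(delta_df)
        .withColumn("_rn", F.row_number().over(w))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )


# Hypothetical values: last successful load finished at 10:00.
query = incremental_dbtable("orders", "modified_date",
                            datetime(2024, 1, 1, 10, 0, 0))
```

This only pulls the delta from the database, but the union and window still touch the full base DataFrame on every run, and rows deleted in the source are not detected at all.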

If somebody can give some guidance here, it would be a great help. I am mainly looking for an efficient solution for Spark.

For now, I have only found this link:
https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/