Thanks in Advance …
Hi Mohitkumar, Typically the source database should have an update date/timestamp column so you can identify which records were recently changed. During the Sqoop import to Hive, the last fetched date/timestamp should be stored in a log table; use that value on the next import so that you pick up only the recent changes. Refer to https://www.youtube.com/watch?v=ntSK_oJtWlQ for a reference on incremental loads.
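The workflow above can be sketched roughly as follows. This is only an illustration: the connection string, table name (`orders`), timestamp column (`updated_at`), and the file used as a "log table" are all assumed, not taken from the thread.

```shell
# Sketch only: names and paths below are illustrative assumptions.
# Read the last successful fetch time (the stored high-water mark),
# defaulting to the epoch on the very first run.
LAST_VALUE=$(cat /var/log/sqoop/orders_last_value 2>/dev/null || echo "1970-01-01 00:00:00")

# Import only rows whose update timestamp is newer than the stored value.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail \
  --username retail_user -P \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "$LAST_VALUE" \
  --merge-key order_id \
  --target-dir /user/hive/warehouse/orders

# On success, record the new high-water mark for the next run.
date '+%F %T' > /var/log/sqoop/orders_last_value
```

In practice the high-water mark would live in an audit/log table rather than a flat file; a Sqoop job with `sqoop job --create` can also track `--last-value` automatically in its metastore.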
Hi gnanaprakasam,
First of all, thanks for your efforts.
I will just go through it and let you know.
This question is more related to Hive than bigdata-labs, hence I have changed the category to Apache Hive. Apache Hive is a subcategory of Big Data.
You can change the category by clicking on edit (the pencil icon) beside the topic title.
Exactly …
Is there any other way, e.g. by using HBase, Hive, etc.?
While importing with Sqoop we have --incremental lastmodified, which imports rows once any of them are updated. In the same way, if your base table grows by newly inserted rows, we have --incremental append.
Hope you got this.
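To make the contrast concrete, here is a sketch of the two incremental modes side by side. Connection details, table name, columns, and last values are illustrative assumptions only.

```shell
# Append mode: picks up newly INSERTED rows, tracked by a monotonically
# increasing key (illustrative names throughout).
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail \
  --username retail_user -P \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 10000

# lastmodified mode: picks up rows UPDATED since the last run, tracked by
# a timestamp column.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail \
  --username retail_user -P \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2024-01-01 00:00:00"
```

Note the mode name is `lastmodified` (one word) in Sqoop's CLI.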
Dear All,
Please go through the scenario completely.
I am not asking how to insert or import new rows into the Hive table.
I am asking how to import only the newly updated records, not the newly inserted ones!
I think if we look at the above two tables closely, we will get the complete idea.
Note that both of the above tables have the same row count.
@mohitkumar we cannot update data directly in Hive. We need a periodic merge strategy.
If the table is small:
- Import the whole table on a regular basis with the overwrite option in Sqoop
If the table is big, you have to develop a merge strategy:
- Get the updated or newly inserted data into a stage table (using an incremental load or a where condition on the timestamp column)
- Perform a full outer join between the staged data and the original data on the primary key column
- Replace the original data with the result of the full outer join
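The merge step above can be sketched in HiveQL. This is a minimal illustration: the table names (`orders_base`, `orders_stage`, `orders_merged`) and columns are assumptions, and `COALESCE` prefers the staged (newer) row whenever the primary key matches.

```sql
-- Assumed schema: order_id is the primary key; orders_stage holds the
-- incrementally imported (updated or new) rows.
INSERT OVERWRITE TABLE orders_merged
SELECT
  COALESCE(s.order_id,     b.order_id)     AS order_id,
  COALESCE(s.order_status, b.order_status) AS order_status,
  COALESCE(s.updated_at,   b.updated_at)   AS updated_at
FROM orders_base b
FULL OUTER JOIN orders_stage s
  ON b.order_id = s.order_id;

-- Finally, swap the merged data back in as the new base, e.g.:
-- INSERT OVERWRITE TABLE orders_base SELECT * FROM orders_merged;
```

Rows only in the base table come through unchanged, rows only in the stage table are the inserts, and rows in both take the staged (updated) values.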
Another approach is to use HBase and create an external table in Hive on top of it; since HBase upserts by row key, re-imported rows simply overwrite the old versions.
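A minimal sketch of the HBase-backed approach, assuming an existing HBase table named `orders` with a column family `info` (names are illustrative):

```sql
-- Hive external table over HBase: updates land in HBase keyed by order_id,
-- so Hive always reads the latest version of each row.
CREATE EXTERNAL TABLE orders_hbase (
  order_id     INT,
  order_status STRING,
  updated_at   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:order_status,info:updated_at"
)
TBLPROPERTIES ("hbase.table.name" = "orders");
```

With this layout, each incremental import writes into HBase, and no separate merge step is needed on the Hive side.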