Thanks in Advance …
Hi Mohitkumar, typically the source database should have an update date/timestamp column so you can identify which records changed recently. During the Sqoop import into Hive, store the last fetched date/timestamp in a log table, and use it on the next import so that you pick up only the recent changes. Refer to https://www.youtube.com/watch?v=ntSK_oJtWlQ for a reference on incremental loads.
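A minimal sketch of that workflow, assuming a source table orders with an updated_at timestamp column and a flat file standing in for the log table (all names here are placeholders, not from the original post):

```shell
# Read the last fetched timestamp (the "high-water mark") saved by the previous run.
LAST_VALUE=$(cat /var/sqoop/orders.last_value)

# Import only rows changed since then. Note: Sqoop 1 does not support
# lastmodified mode together with --hive-import, so the data lands in an
# HDFS directory that backs a Hive external table instead.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hive/warehouse/orders_stage \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "$LAST_VALUE" \
  --merge-key order_id

# Record the new high-water mark for the next run.
date '+%Y-%m-%d %H:%M:%S' > /var/sqoop/orders.last_value
```

The --merge-key option makes Sqoop collapse the old and new versions of each updated row into one record.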
First of all, thanks for your efforts.
I will just go through it and let you know.
This question is more related to Hive than bigdata-labs, hence I have changed the category to Apache Hive, which is a sub-category of Big Data.
We can change the category by clicking the edit (pencil) icon beside the topic title.
Is there any other way, e.g. by using HBase, Hive, etc.?
While importing with Sqoop we have --incremental lastmodified, which re-imports any rows that have been updated since the last run; likewise, if your base table only grows by newly inserted rows, we have --incremental append.
Hope you got this.
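To make the difference between the two modes concrete, here is a hedged sketch (connection details, table and column names are placeholders):

```shell
# --incremental append: picks up only NEW rows, tracked by a monotonically
# increasing key such as an auto-increment id.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table orders --target-dir /data/orders \
  --incremental append --check-column order_id --last-value 100000

# --incremental lastmodified: picks up rows UPDATED since the last recorded
# value of a timestamp column; --merge-key collapses the old and new
# versions of each updated row into one record.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table orders --target-dir /data/orders \
  --incremental lastmodified --check-column updated_at \
  --last-value "2019-01-01 00:00:00" --merge-key order_id
```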
Please go through the scenario completely.
I am not asking how to insert or import new rows into the Hive table.
I am asking how to import only the newly updated records, not the newly inserted ones!
I think if we look at the above two tables closely, we will get the complete picture.
Note that both of the above tables have the same row count.
@mohitkumar we cannot update data directly in Hive; we need a periodic merge strategy.
If the table is small
- Import the whole table on a regular basis with the --hive-overwrite option in Sqoop
If the table is big, you have to develop merge strategy
- Get the updated or newly inserted data into a stage table (using an incremental load or a where condition on the timestamp column)
- Perform a full outer join between the staged data and the original data on the primary key column
- Replace the original data with the result of that full outer join
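The last two steps could look roughly like this in HiveQL, a sketch assuming a base table orders, a stage table orders_stage holding the changed rows, and a primary key order_id (all names are placeholders); COALESCE prefers the staged, i.e. newer, value whenever one exists:

```shell
hive -e "
INSERT OVERWRITE TABLE orders_merged
SELECT COALESCE(s.order_id,   o.order_id)   AS order_id,
       COALESCE(s.amount,     o.amount)     AS amount,
       COALESCE(s.updated_at, o.updated_at) AS updated_at
FROM orders o
FULL OUTER JOIN orders_stage s
  ON o.order_id = s.order_id;
"
```

Rows present only in the base table pass through unchanged, rows present in both take the staged version, and rows present only in the stage table come in as new.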
Another approach is to use HBase and create external table in hive.
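A sketch of that approach (table, column, and column-family names are placeholders): HBase handles the row-level upserts, and Hive queries the same data through an external table, so no merge step is needed:

```shell
hive -e "
CREATE EXTERNAL TABLE orders_hbase (order_id INT, amount DOUBLE, updated_at STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:amount,d:updated_at')
TBLPROPERTIES ('hbase.table.name' = 'orders');
"
```

Updates written to the HBase table (e.g. via put) become visible to Hive queries right away.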