What's the difference between "incremental-append" and "append"?


#1

The following is what I did:

Original table on MySQL has 5 records:
mysql> select * from departments_new;
±--------------±----------------±--------------------+
| department_id | department_name | created_date |
±--------------±----------------±--------------------+
| 10 | physicss | 2018-01-31 22:24:18 |
| 11 | chemistry | 2018-01-31 22:24:18 |
| 12 | math | 2018-01-31 22:24:18 |
| 13 | science | 2018-01-31 22:24:18 |
| 14 | engineering | 2018-01-31 22:24:18 |

After the initial import: original table on Hive has 5 records:

[paslechoix@gw01 ~]$ hdfs dfs -cat departments_new/*
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0

Now add more records to the mysql table and it becomes:
mysql> select * from departments_new;
±--------------±-------------------±--------------------+
| department_id | department_name | created_date |
±--------------±-------------------±--------------------+
| 10 | physicss | 2018-01-31 22:24:18 |
| 11 | chemistry | 2018-01-31 22:24:18 |
| 12 | math | 2018-01-31 22:24:18 |
| 13 | science | 2018-01-31 22:24:18 |
| 14 | engineering | 2018-01-31 22:24:18 |
| 110 | Civil | 2018-01-31 22:32:41 |
| 111 | Mechanical | 2018-01-31 22:32:49 |
| 112 | Automobile | 2018-01-31 22:32:49 |
| 113 | Pharma | 2018-01-31 22:32:49 |
| 114 | social engineering | 2018-01-31 22:32:52 |
±--------------±-------------------±--------------------+

Now begin the game:

  1. incremental append:
    sqoop import -m 1
    –connect=“jdbc:mysql://ms.itversity.com/retail_export”
    –username=retail_user
    –password=itversity
    –table=departments_new
    –check-column “department_id”
    –incremental append

Result:
[paslechoix@gw01 ~]$ hdfs dfs -cat departments_new/*
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0
110,Civil,2018-01-31 22:32:41.0
111,Mechanical,2018-01-31 22:32:49.0
112,Automobile,2018-01-31 22:32:49.0
113,Pharma,2018-01-31 22:32:49.0
114,social engineering,2018-01-31 22:32:52.0

This looks like just appending the mysql table to the existing HDFS. We have 5 + 10 = 15 records now on HDFS

  1. Append
    sqoop import -m 1
    –connect=“jdbc:mysql://ms.itversity.com/retail_export”
    –username=retail_user
    –password=itversity
    –table=departments_new
    –append

On top of the 15 records on HDFS, now I got 15 + 10 = 25 records.
[paslechoix@gw01 ~]$ hdfs dfs -cat departments_new/*
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0
110,Civil,2018-01-31 22:32:41.0
111,Mechanical,2018-01-31 22:32:49.0
112,Automobile,2018-01-31 22:32:49.0
113,Pharma,2018-01-31 22:32:49.0
114,social engineering,2018-01-31 22:32:52.0
10,physicss,2018-01-31 22:24:18.0
11,chemistry,2018-01-31 22:24:18.0
12,math,2018-01-31 22:24:18.0
13,science,2018-01-31 22:24:18.0
14,engineering,2018-01-31 22:24:18.0
110,Civil,2018-01-31 22:32:41.0
111,Mechanical,2018-01-31 22:32:49.0
112,Automobile,2018-01-31 22:32:49.0
113,Pharma,2018-01-31 22:32:49.0
114,social engineering,2018-01-31 22:32:52.0

From above, I don’t see difference between incremental append and append, it would be appreciated if someone can explain to me? Maybe using a different scenario?

Thanks.


#2

An incremental append would add data to your file only if it’s not already present there. But in case of an append, it would append whatever data is passed to the query and may add duplicates.

That is why incremental append asks for a parameter (last-modified or append). Choosing last-modified would add only updated records and choosing append would add all newly added records


#3

In the incremental append , u did not provide last-value. So it behaved like a normal append.


#4

–append would add duplicates to the dataset while --incremental append does not append duplicates.