How to evenly distribute records to mapper during sqoop incremental import


#1

Dear Team,

During sqoop incremental import, upfront we may not know how may records added / modified.

After execution, the mapper output are not evenly distributed.

Will you be able to provide tips to mention the ideal number of mapper and to distribute records evenly across mapper.

Sample have given two mapper details
file size 47 MB - duration time 27 sec CPU time 14670 ms = 0.2445 Minutes
file size 188 MB - duration time 42 mins CPU time 41600 ms = 0.6933 Minutes

Thanks in advance.
Raj.


#2

The above scenario happens when the source data doesnt have a primary key and you explicitly set --num-mappers
try using --split-by <your_desired_column>(This method is useful if primary keys are not evenly distributed) you can decide the column on which it should split its import and distribute evenly with the available mappers.
For Reference: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/using_sqoop_to_move_data_into_hive.html


#3

Thanks for your valuable input Ravi.

Actually this table has primary key and as pointed our we have specified required number of mapper.

One point I have forget to mention I was using --direct option to move data between Oracle and Hadoop.

Without --direct below query will be executed.
SELECT * FROM sometable WHERE id >= lo AND id < hi

With —direct below query will be executed.
SELECT *
FROM sometable
WHERE rowid >= dbms_rowid.rowid_create(1, 893, 1, 279, 0) AND
rowid <= dbms_rowid.rowid_create(1, 893, 1, 286, 32767)

Due to above the records in mapper are not evenly distributed (For Incremental import)
Say if I already imported 1 million records and during incremental import say 1 Lakh record added and 1 lakh modified. How to distributed this 2 Lakhs records among different mapper…

Thanks in advance.