Regarding split by command



I want to know what is the use of --split-by command in sqoop?


Definition from Sqoop Documentation:
–split-by Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.

Additional Points:

  1. Column should be indexed
  2. If no index, each thread has to do full table scan
  3. If indexed, it will be search the index and find the data
  4. It should not have null values
  5. If null values there , sqoop ignore those values
  6. –split-by also can be used for NON-PK columns (for ex: order_status in orders table) . To split data on NON-PK columns, we should use additional property below:
    Disadvantage, with splitting on NON-PK column, sometimes system generates many files than expected.

The below script will import “order_items_nopk” table’s into warehouse-dir using 4 files (default --num-mappers 4) which are into diff.file sizes (skewed) because there is no primary key on this table means, no index on column “order_item_order_id”. To avoid this --split-by can be used, so all data copied into 4 files with even sizes.

sqoop import
–connect jdbc:mysql://
–username retail_user
–password itversity
–table order_items_nopk
–warehouse-dir /user/vanampudi/sqoop_import/retail_db
–split-by order_item_order_id

Sqoop script ref to point # 6:
sqoop import
–connect jdbc:mysql://
–username retail_user
–password itversity
–table orders
–warehouse-dir /user/vanampudi/sqoop_import/retail_db
–split-by order_status

Hope this helps,