Sqoop Import - Using Split By

Let us revisit the standard sqoop import to understand how data is splitted using primary key.

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/training/sqoop_import/retail_db \
  --delete-target-dir

Let us understand what happens when we run typical sqoop import and when --split-by should be used.

  • When ever we run sqoop import with number of mappers greater than 1, a bounding vals query will run.
  • Bounding Vals Query is run on primary key field to get min and max value of it.
  • Using the min and max split size is computed.
  • If there is no primary key or unique key, import will fail unless number of mappers is set to 1 or specify a field using split-by.
  • It is better practice to use an indexed field which do not contain null values as part of split-by

Let us see an example how we can specify non primary key field to compute splits using --split-by.

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/training/sqoop_import/retail_db \
  --delete-target-dir \
  --split-by order_item_order_id

Practice hive on state of the art Big Data cluster - https://labs.itversity.com
You can sign up for our courses on Udemy using $10 coupons - Udemy Coupons - Big Data Courses


Hi sir,

Number of rows in order_item_order_id is about 172198, but in the query it is showing 68883. May I know what is that I am missing from below query

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table order_items \
  --warehouse-dir /user/training/sqoop_import/retail_db \
  --delete-target-dir \
  --split-by order_item_order_id