Sqoop - "Split Size"

  • Connecting to the MySQL server nn01.itversity.com:3306, database retail_db
  • Table departments (department_id, department_name) has 7 distinct department_id values:
    2
    3
    4
    5
    6
    7
    100
  • Executing a Sqoop import with num-mappers=2
  • Here is the output of the Sqoop import command:

17/02/14 18:06:15 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(department_id), MAX(department_id) FROM departments
17/02/14 18:06:15 INFO db.IntegerSplitter: Split size: 49; Num splits: 2 from: 2 to: 100
17/02/14 18:06:16 INFO mapreduce.JobSubmitter: number of splits:2


  • Given that Sqoop sets the ‘Split size’ to 49 and the number of splits to 2, what range of values will each of the 2 splits cover?
    Split 1
    = (Minimum value of department_id) TO (Minimum value of department_id + Split size)
    = 2 TO (2 + 49)
    = 2 TO 51
    Is this correct?

    Split 2
    = (Upper bound of previous split) TO (Upper bound of previous split + Split size)
    = 51 TO (51 + 49)
    = 51 TO 100
    Is this correct?
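
The arithmetic above can be checked with a small sketch. It is not Sqoop's code. It assumes that the split size is the integer quotient of (max - min) by the number of splits, which matches the "Split size: 49" log line, and that only the last split includes its upper bound, so the boundary value 51 is not read twice. The variable names are made up for illustration.

```python
# Values taken from the log line: from: 2 to: 100, Num splits: 2
min_id, max_id, num_splits = 2, 100, 2
department_ids = [2, 3, 4, 5, 6, 7, 100]

# Assumption: split size is the integer quotient, matching "Split size: 49".
split_size = (max_id - min_id) // num_splits

# (max - min) divides evenly here, so each boundary is min + k * split_size.
boundaries = [min_id + k * split_size for k in range(num_splits + 1)]

splits = []
for k in range(num_splits):
    lo, hi = boundaries[k], boundaries[k + 1]
    is_last = k == num_splits - 1
    # Assumption: only the last split includes its upper bound.
    rows = [d for d in department_ids
            if lo <= d and (d <= hi if is_last else d < hi)]
    splits.append((lo, hi, rows))

for lo, hi, rows in splits:
    print(lo, hi, rows)
```

Under these assumptions the two ranges are 2 to 51 and 51 to 100, as computed in the question. Because the department_id values are skewed, six of the seven rows fall into the first split and only the row with id 100 falls into the second.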


A similar question relating to the ‘Split size’

  • I am using Sqoop to import the orders table from MySQL into HDFS. The MySQL orders table has 68883 records.

  • The Sqoop command specifies 3 mappers
    sqoop import \
      --connect="jdbc:mysql://nn01.itversity.com:3306/retail_db" \
      --username=retail_dba \
      --password=itversity \
      --table=orders \
      --num-mappers=3 \
      --split-by=order_id \
      --columns=order_id \
      --target-dir=/user/ganesh1146/orders

  • In the log that the Sqoop command prints to the screen, I see Split size: 22960:
    17/02/15 11:24:17 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(order_id), MAX(order_id) FROM orders
    17/02/15 11:24:17 INFO db.IntegerSplitter: Split size: 22960; Num splits: 3 from: 1 to: 68883
    17/02/15 11:24:17 INFO mapreduce.JobSubmitter: number of splits:3

  • When I look at the files on HDFS, I see 3 files (as expected), but each has 22961 records:
    [ganesh1146@gw01 ~]$ hadoop fs -cat /user/ganesh1146/orders/part-m-00000 | wc -l
    22961
    [ganesh1146@gw01 ~]$ hadoop fs -cat /user/ganesh1146/orders/part-m-00001 | wc -l
    22961
    [ganesh1146@gw01 ~]$ hadoop fs -cat /user/ganesh1146/orders/part-m-00002 | wc -l
    22961

Question: Does anyone understand why the Sqoop command output shows Split size: 22960 instead of Split size: 22961?
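
One possible explanation can be sketched in a few lines. The sketch is an illustration, not Sqoop's source. It assumes that the logged split size is the integer quotient of (max - min) by the number of splits, that the remainder of that division is handed out one id at a time to the first splits, and that only the last split includes its upper bound. It also assumes order_id has no gaps, which is consistent with 68883 records spanning ids 1 to 68883. The function name `split_row_counts` is hypothetical.

```python
def split_row_counts(min_val, max_val, num_splits):
    """Return (split_size, boundaries, rows per split) for a dense integer key."""
    span = max_val - min_val
    split_size = span // num_splits  # assumed to be the value the log reports
    remainder = span % num_splits    # leftover ids, assumed to go to the first splits

    boundaries = [min_val]
    for i in range(num_splits):
        step = split_size + (1 if i < remainder else 0)
        boundaries.append(boundaries[-1] + step)

    counts = []
    for i in range(num_splits):
        lo, hi = boundaries[i], boundaries[i + 1]
        # Assumption: only the last split includes its upper bound.
        counts.append(hi - lo + 1 if i == num_splits - 1 else hi - lo)
    return split_size, boundaries, counts


split_size, boundaries, counts = split_row_counts(1, 68883, 3)
print(split_size, boundaries, counts)
```

With these assumptions, 68882 / 3 gives 22960 with a remainder of 2. The first two splits each take one extra id, and the last split gains one row by including order_id 68883, so all three files end up with 22961 records even though the log reports 22960.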