How is data split into part files in Sqoop?


Hi there,
I have a question about how Sqoop partitions data into part files when the data is skewed. Any help clarifying this would be appreciated.

Let’s say this is my departments table, with department_id as the primary key.
hive> select * from departments;
2 Fitness
3 Footwear
4 Apparel
5 Golf
6 Outdoors
7 Fan Shop
If I run sqoop import with -m 1, only one part file is generated, and it contains all the records.

Next, I ran the command without specifying the number of mappers. Sqoop uses 4 mappers by default, and it created 4 part files in HDFS. Below is how the records were distributed across the part files.

[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00000
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00001
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00002
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00003
7,Fan Shop

Here MIN(department_id) = 2, MAX(department_id) = 8, and 4 mappers are used by default.
By my calculation, each mapper should get (8-2)/4 = 1.5 records, so I don’t see how the data gets distributed. I can’t work out why part-m-00000 got two records, part-m-00001 and part-m-00002 got only one each, and part-m-00003 again got two.
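
To make the arithmetic concrete, here is a minimal Python sketch. It assumes that Sqoop divides the value range of the split column (from MIN to MAX) into equal-width intervals, one per mapper, instead of dividing the number of records. The function name `assign_to_splits` and the exact boundary handling are hypothetical and only illustrate the idea; Sqoop's real integer splitter may round boundaries differently.

```python
from fractions import Fraction


def assign_to_splits(ids, lo, hi, num_mappers=4):
    """Bucket ids into num_mappers equal-width key ranges over [lo, hi].

    Illustrative only: assumes hi > lo and is not Sqoop's actual code.
    """
    # Width of each key range. This is a span of key values, not a record count.
    width = Fraction(hi - lo, num_mappers)
    buckets = [[] for _ in range(num_mappers)]
    for key in ids:
        # Index of the half-open range [lo + i*width, lo + (i+1)*width) holding key.
        # The last range is closed, so hi itself lands in the final bucket.
        i = min(int((key - lo) / width), num_mappers - 1)
        buckets[i].append(key)
    return buckets


department_ids = [2, 3, 4, 5, 6, 7]  # the ids shown in the table above

# Upper bound 7 (the largest id actually listed): width 1.25
print(assign_to_splits(department_ids, lo=2, hi=7))  # [[2, 3], [4], [5], [6, 7]]

# Upper bound 8 (the value quoted in the calculation): width 1.5
print(assign_to_splits(department_ids, lo=2, hi=8))  # [[2, 3], [4], [5, 6], [7]]
```

Under this assumption, 1.5 (or 1.25) is the width of each mapper's key range, so a mapper can legitimately receive one or two rows. The 2, 1, 1, 2 pattern matches an upper bound of 7 and not 8. That is only an inference from the sketch, not something the output above confirms.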