Sqoop import: data distributed unevenly

Hi,

I have run the command below:

sqoop import \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--table departments \
--target-dir /user/hadoopuser17/departments

The expected output is that the data should be distributed evenly across the files, but I got the output below.

[hadoopuser17@gw01 ~]$ hadoop fs -ls /user/hadoopuser17/departments
Found 5 items
-rw-r--r-- 3 hadoopuser17 hdfs 0 2017-02-16 01:51 /user/hadoopuser17/departments/_SUCCESS
-rw-r--r-- 3 hadoopuser17 hdfs 64 2017-02-16 01:51 /user/hadoopuser17/departments/part-m-00000
-rw-r--r-- 3 hadoopuser17 hdfs 0 2017-02-16 01:51 /user/hadoopuser17/departments/part-m-00001
-rw-r--r-- 3 hadoopuser17 hdfs 0 2017-02-16 01:51 /user/hadoopuser17/departments/part-m-00002
-rw-r--r-- 3 hadoopuser17 hdfs 20 2017-02-16 01:51 /user/hadoopuser17/departments/part-m-00003

Here the data is distributed into only 2 files. Why are zero-byte files created?

It depends on the number of mappers and on the min and max values computed for the map tasks.
By default 4 mappers run, so you get 4 files. The min/max split put the data into only 2 of the mappers because there are not enough records to be distributed across all 4 mappers.

Check the min and max values in the log to see how the data was distributed.
You can also manage this by explicitly using the --boundary-query option.
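For reference, a sketch of how those options could be added to the import above. --split-by and --boundary-query are standard Sqoop import options; department_id is assumed here to be the key column of the departments table.

sqoop import \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--table departments \
--split-by department_id \
--boundary-query "SELECT MIN(department_id), MAX(department_id) FROM departments" \
--target-dir /user/hadoopuser17/departments

The boundary query controls the min and max values Sqoop uses to compute the per-mapper ranges, so changing it changes how rows are spread across the output files.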

Agreed, but I have the below records in the departments table.

2,Development
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
100,Hadoop Training

By default it should use 4 mappers, so 3 files should have 2 records each and 1 record should go to the 4th file…

Please correct me if my understanding is wrong.

I don't have the environment right now to execute it, otherwise I would have checked.
It actually depends on the min and max values computed over the split column: the range between them is divided into pieces, and each piece of the range goes to one mapper.
In this case, based on the min and max values of your primary/unique key column, the data ends up divided across only 2 mappers. Try inserting one more record with a key much larger than usual, like 8000 or 99999, and you will see it act as an outlier: that record will be the only one in the last mapper's output, and the middle mappers will still be empty.

I would recommend going through the ITVERSITY tutorial videos; you will see this example there, and perhaps that will clear your doubt.

Since the range of keys is from 2 to 100 and the number of mappers is 4, Sqoop will split the range from 2 to 100 into 4 parts. That is, keys in the range 2-25 will be imported by the first mapper, 25-50 by the second mapper, 50-75 by the third mapper and 75-100 by the fourth mapper.

That is why your first mapper output contains 6 of your records, since they are in the range 2-25, and the last mapper contains 1 record, since it is in the range 75-100. Your 2nd and 3rd mapper outputs are empty.

[Note: the numbers are just a rough estimate; the exact boundaries might be 1 or 2 off.]
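To make that concrete, here is a rough Python sketch of the equal-width range idea (an approximation for illustration, not Sqoop's actual IntegerSplitter code):

# Approximate Sqoop-style range splitting for an integer split-by column.
def split_ranges(min_val, max_val, num_mappers):
    # Equal-width boundary points between the min and max key values.
    step = (max_val - min_val) / num_mappers
    return [(min_val + i * step, min_val + (i + 1) * step)
            for i in range(num_mappers)]

keys = [2, 3, 4, 5, 6, 7, 100]   # department_id values from the post above
ranges = split_ranges(2, 100, 4)
for i, (lo, hi) in enumerate(ranges):
    last = (i == len(ranges) - 1)   # only the last range includes its upper bound
    rows = [k for k in keys if lo <= k < hi or (last and k == hi)]
    print(f"part-m-0000{i}: [{lo}, {hi}{']' if last else ')'} -> {rows}")

Running this gives boundaries at 2, 26.5, 51, 75.5 and 100, which is why the six small keys all land in the first file and key 100 lands in the fourth, leaving the middle two files empty.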


@Avinash_Parida @pramodvspk @itversity
Guys, I'm having the same doubt. Could you please help me out here?

@pramodvspk Just wanted to know how the splits can be [2-25], [25-50], [50-75] and [75-100] only? I thought they would be computed this way: min=2, max=100 => (100-2)/4 = 24.5 key values per mapper,
and that is what resulted in all the records ending up in the first map output except department 100, which was placed in the last map output.

But neither of our explanations holds in the case below, which is why I need clarification.
I have the below records in the table:

2,Development
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop

After performing the sqoop import I ended up with the output below.
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00000
2,Fitness
3,Footwear
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00001
4,Apparel
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00002
5,Golf
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00003
6,Outdoors
7,Fan Shop

Could you please explain this scenario?
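For what it's worth, the same equal-width range idea from earlier in the thread seems to explain this output. A small sketch with this post's numbers (min=2, max=7, 4 mappers):

# Boundaries for min=2, max=7 and 4 mappers: step = (7 - 2) / 4 = 1.25
step = (7 - 2) / 4
bounds = [2 + i * step for i in range(5)]
print(bounds)   # [2.0, 3.25, 4.5, 5.75, 7.0]
# [2.0, 3.25)  -> keys 2, 3   (part-m-00000)
# [3.25, 4.5)  -> key 4       (part-m-00001)
# [4.5, 5.75)  -> key 5       (part-m-00002)
# [5.75, 7.0]  -> keys 6, 7   (part-m-00003)

With no outlier key, the equal-width slices happen to line up well with the data, so the files come out almost evenly filled.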

@iamteja @Avinash_Parida No, as Avinash has already stated, that kind of equal data distribution will not happen unless you explicitly specify --split-by and --boundary-query.

But if you look at the purpose of running more mappers, you should only do that when the size of the imported data is quite big. As you have very few records, the suggestion would be to run it with one mapper.
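For a very small table like this one, a single-mapper import could look like the sketch below (reusing the connection details from the first post; --num-mappers, or -m, is the standard Sqoop option):

sqoop import \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--table departments \
--num-mappers 1 \
--target-dir /user/hadoopuser17/departments

With one mapper, all records land in a single part-m-00000 file and no split computation is needed; Sqoop also does not require a primary key or split column in that case.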

@saurabhbhanwala
Thanks for the reply.

I got the point that equal data distribution will not happen unless we specify --split-by or a boundary query. But I am still wondering how the data is split among the mappers when the split-by column or the boundary query results in uneven distribution.
Let's say the split-by column has min value 2 and max value 12500, and m=8. How is the data going to be distributed among the 8 mappers? Can you please describe the record split?

Pardon me if you think I still didn't get the point. Maybe you can provide an example.

If m is 8, Sqoop will divide the total key range into 8 exclusive chunks of equal width, and each mapper will import its part. E.g. with min = 0 and max = 1600, each mapper covers 200 key values. For this we need a primary key or a --split-by on an indexed column. You can also give --split-by on a non-indexed column, but that will not guarantee equal distribution of data across the files.
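Applying the same arithmetic to the numbers asked about above (min=2, max=12500, m=8), a quick sketch:

# Key range 2..12500 split across 8 mappers: step = (12500 - 2) / 8 = 1562.25
step = (12500 - 2) / 8
bounds = [2 + i * step for i in range(9)]
for i in range(8):
    print(f"mapper {i}: keys in [{bounds[i]:.2f}, {bounds[i + 1]:.2f})")

Each mapper covers an equal-width slice of key values, but the number of records per file depends on how many keys actually fall in each slice, so the files are only even when the keys are spread uniformly.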

@N_Chakote
Thanks for the reply.

I understand this case. But I am concerned about the case where (max value - min value) is not divisible by the number of mappers. How will the records be distributed in that case?

In such a case one or two of the mappers will end up with one record more or less than the others.
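As a concrete sketch of the non-divisible case, take min=0, max=10 and 4 mappers, assuming one record per integer key:

# (10 - 0) / 4 = 2.5, so the boundaries are 0, 2.5, 5, 7.5, 10
min_v, max_v, m = 0, 10, 4
step = (max_v - min_v) / m
bounds = [min_v + i * step for i in range(m + 1)]
keys = list(range(min_v, max_v + 1))   # one record per integer key
for i in range(m):
    last = (i == m - 1)                # only the last range includes max_v
    rows = [k for k in keys
            if bounds[i] <= k < bounds[i + 1] or (last and k == bounds[i + 1])]
    print(f"mapper {i}: {len(rows)} records -> {rows}")

This prints 3, 2, 3 and 3 records for the four mappers: the fractional boundaries mean some ranges contain one more integer key than others, which is the +/- 1 record difference described above.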