Sqoop import related doubt


#1

Does changing the number of mappers in the control arguments change the output for each new number of mappers? If so, how do we choose the correct number of mappers during the exam?

Also, how should we decide whether the number of mappers needs to be more than one, and in that case, how do we select a primary key for split-by? Will this information be provided during the certification? I believe the describe-table command run through sqoop eval could be used to identify the primary key on which we can split. Correct me if my approach is wrong.
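
What I have in mind is something like the following, with placeholder connection details (the host, database, credentials, and table name are all made up):

    # Inspect the table definition to spot a likely primary/split key.
    sqoop eval \
      --connect jdbc:mysql://dbhost/retail_db \
      --username dbuser \
      --password dbpass \
      --query "DESCRIBE orders"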


#2

I think the output would vary, because the number of mappers decides how the data is divided between the mappers: each mapper writes its own part file, so the number of output files (and which rows land in which file) changes, even though the total data imported stays the same. As for how to choose the number of mappers, I am not sure, but I have read that it depends on the number of cores or CPUs in the cluster. I may be wrong, so I would like to hear from someone if that is incorrect.
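
As a rough sketch of what I mean (the connection details and table name are made up), each mapper writes its own part file under the target directory:

    # With -m 4, the import runs four parallel map tasks and the target
    # directory ends up with part files part-m-00000 through part-m-00003.
    sqoop import \
      --connect jdbc:mysql://dbhost/retail_db \
      --username dbuser \
      --password dbpass \
      --table orders \
      --target-dir /user/hive/warehouse/orders \
      -m 4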

Coming to the second part of the question, about selecting the key to split on: a column with unique values and no null values is a good candidate. Maybe the following will help (see the sketch after the list).
Guidelines for choosing a split-by column:

  1. The column should ideally be indexed.
  2. The column should be densely populated.
  3. The values in the column should be evenly distributed.
  4. It should not contain null values.
  5. Splitting on a string column should be avoided where possible.
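
For example, an import along these lines (the connection string, table, and column names are all made up):

    # Split the import on a numeric, non-null, evenly distributed column
    # and run it with four parallel mappers.
    sqoop import \
      --connect jdbc:mysql://dbhost/retail_db \
      --username dbuser \
      --password dbpass \
      --table order_items \
      --target-dir /user/hive/warehouse/order_items \
      --split-by order_item_id \
      --num-mappers 4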

I am not aware of anything related to the exam itself, as I have not taken it yet.

Thanks


#3

Thanks Bharat. I figured out that in most cases, if the requirement is to filter rows on a column, it is a safe bet to use the where clause together with the table name while importing the data to HDFS. In the remaining cases, where a custom transformation is needed, the primary key can be found with a describe-table query through sqoop eval, and we can then split on that key and provide an adequate number of mappers.
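
For reference, this is the kind of filtered import I mean (the connection string, table, and filter column are placeholders):

    # Only rows matching the --where condition are imported to HDFS.
    sqoop import \
      --connect jdbc:mysql://dbhost/retail_db \
      --username dbuser \
      --password dbpass \
      --table orders \
      --where "order_status = 'CLOSED'" \
      --target-dir /user/hive/warehouse/closed_orders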