Need a solution for the question below; Sqoop or Spark in Scala (preferably) can be used.




Connect to the MySQL database on the itversity labs using Sqoop and import data with employer_name, case_status and count. Make sure the data is sorted by employer_name in ascending order and by count in descending order.

Data Description

A MySQL instance is running on a remote node. You will find a table that contains 3,002,373 rows of H-1B data.

MySQL database information:

Installation on the node

Database name: h1b_db

Username: h1b_user

Password: itversity

Table name: h1b_data

Output Requirements

Place the H-1B data in files in an HDFS directory


Replace whoami with your OS user name

Use text file format and tab (\t) as delimiter

Hint: You can use Spark with JDBC or Sqoop import with query

Output should contain employer name, case status and count

End of Problem


Hopefully, this should do the job in Sqoop:

sqoop import \
  --connect jdbc:mysql:// \
  --username h1b_user \
  --password itversity \
  --target-dir /user/whoami/problem20/solution/ \
  --num-mappers 4 \
  --fields-terminated-by "\t" \
  --query "select employer_name, case_status, count from h1b_data where \$CONDITIONS ORDER BY count DESC, employer_name" \
  --split-by employer_name

I'm not sure of the DB's primary key constraints, so I took --split-by employer_name.
When --query is used, --split-by is required.
Also, I don't think NULL characters are present in employer_name, so I ignored that.
Text file format is the default, so no action is required there.
The \t delimiter condition is provided.
Also, note the \ before $CONDITIONS: without it the shell would expand the variable inside the double quotes. (The forum formatting was stripping the backslash after submit.)

PS: I'm a learner too; if this is wrong please correct it, as I didn't execute the command.


It should be grouped by employer name and case status to get the count:

--query "select employer_name, case_status, count(1) as count from h1b_data where \$CONDITIONS group by employer_name, case_status order by employer_name, count desc"
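For completeness, the same result could be produced with Spark over JDBC, the other approach the hint mentions. A minimal sketch in Scala, assuming a Spark cluster with the MySQL JDBC driver on the classpath; the MySQL host name is a placeholder you'd replace with the labs host, and coalesce(1) is there so the global sort order survives the write:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object H1bExport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("H1b Export").getOrCreate()

    // Read the source table over JDBC using the credentials from the problem.
    // <mysql-host> is a placeholder -- use the actual host of the labs MySQL node.
    val h1b = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://<mysql-host>:3306/h1b_db")
      .option("dbtable", "h1b_data")
      .option("user", "h1b_user")
      .option("password", "itversity")
      .load()

    // Group by employer and status, count, then sort:
    // employer_name ascending, count descending.
    val result = h1b
      .groupBy("employer_name", "case_status")
      .count() // yields a column literally named "count"
      .orderBy(col("employer_name").asc, col("count").desc)

    // Write as tab-delimited text (CSV writer with a "\t" separator).
    // A single output file preserves the global ordering.
    result
      .coalesce(1)
      .write
      .option("sep", "\t")
      .csv("/user/whoami/problem20/solution") // replace whoami with your user

    spark.stop()
  }
}
```

One caveat with the Sqoop version above: with 4 mappers each mapper sorts only its own split, so each output file is sorted but the files are not globally ordered; the single-file Spark write avoids that.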


Oh, my bad. I didn't check the table description, so I didn't take count(1) into consideration.

:slight_smile: @vikas.gonti glad you were of help.