Need solution for below question, can use Sqoop / Spark in Scala (preferably)

cca-175
apache-spark
sqoop
scala
mysql

#1

Instructions

Connect to the MySQL database on the itversity labs using sqoop and import data with employer_name, case_status and count. Make sure data is sorted by employer_name in ascending order and by count in descending order

Data Description

A MySQL instance is running on a remote node ms.itversity.com in the instance. You will find a table that contains 3002373 rows of h1b data

MySQL database information:

Installation on the node ms.itversity.com

Database name is h1b_db

Username: h1b_user

Password: itversity

Table name h1b_data

Output Requirements

Place the h1b related data in files in HDFS directory

/user/whoami/problem20/solution/

Replace whoami with your OS user name

Use text file format and tab (\t) as delimiter

Hint: You can use Spark with JDBC or Sqoop import with query

Output should contain employer name, case status and count

End of Problem


#2

Hopefully, this should do the job in sqoop,

sqoop import
–connect jdbc:mysql://ms.itversity.com/h1b_db
–username h1b_user
–password itversity
–target-dir /user/whoami/problem20/solution/
–num-mappers 4
–fields-terminated-by “\t”
–query “select employer_name, case_status, count from h1b_data and $CONDITIONS ORDER BY count DESC, employer_name”
–split-by employer_name

Not sure of the DB (primary key constraints) so took split-by employer_name.
For the query, split-by is required.
Also, don’t think NULL characters are present in employer_name. so ignored it.
The default is text file format - so no action required.
\t delimiter condition is provided.
Also there is a \ before $. Some code is being formatted after submit.

PS: I’m a learner too if wrong please correct as I didn’t execute the command.


#3

it should be grouped by name and status to get count

–query “select employer_name, case_status, count(1) as count from h1b_data where $CONDITIONS group by employer_name, case_status ORDER BY employer_name, count desc”


#4

Oh, my bad as I didn’t check the table desc, I didn’t take count(1) into consideration.

:slight_smile: @vikas.gonti glad you were of help.