Solution to


My solutions to all 20 questions are here. Please feel free to comment or suggest different approaches.

I have one question on problem 14: how can we make sure the NA values are represented as database NULLs?



Hi vikas,

Re: problem 14 (representing NA as null), this worked for me:

sqoop export \
  --connect jdbc:mysql:// \
  --username h1b_user \
  --password itversity \
  --table h1b_data_bspeer15 \
  --export-dir /public/h1b/h1b_data_to_be_exported \
  --input-fields-terminated-by "\001" \
  --input-null-string "NA" \
  --null-string "null"

(As far as I know, --null-string only applies to imports, so for this export the option that actually matters should be --input-null-string.)
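The way I understand --input-null-string: during export, any field whose value exactly equals the given token is sent to the database as SQL NULL instead of the literal string. A small Python illustration of that substitution (my own sketch of the idea, not Sqoop's actual code; the \001 delimiter matches the --input-fields-terminated-by setting above, and the sample record is hypothetical):

```python
NULL_TOKEN = "NA"   # value passed via --input-null-string
DELIMITER = "\x01"  # matches --input-fields-terminated-by "\001"

def parse_export_record(line):
    """Split one line from the export dir, mapping the null token to None
    (None plays the role of SQL NULL here)."""
    return [None if field == NULL_TOKEN else field
            for field in line.split(DELIMITER)]

# Hypothetical three-field record whose last field is "NA":
record = DELIMITER.join(["12345", "CERTIFIED", "NA"])
print(parse_export_record(record))  # ['12345', 'CERTIFIED', None]
```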

Verify in mysql shell.

mysql> select count(*) from h1b_data_bspeer15 where longitude is null;
+----------+
| count(*) |
+----------+
|   107229 |
+----------+

In spark-shell I got the same count

val test_h1b = sc.textFile("/public/h1b/h1b_data_to_be_exported")

test_h1b.filter(rec => rec.split("\001")(9) == "NA").count

res43: Long = 107229


I do notice some strange behavior with this solution.

In the source data there are 30 records with prevailing_wage = 0.0

val test_h1b = sc.textFile("/public/h1b/h1b_data_to_be_exported")
test_h1b.filter(rec => rec.split("\001")(6).toFloat == 0.0).count

res46: Long = 30

After running the sqoop statement above, go into mysql and run:

select id, prevailing_wage from h1b_data_bspeer15 where prevailing_wage = 'NA';

You will get back exactly the 30 records where prevailing_wage = 0.

Very strange.
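My best guess is that this is MySQL's implicit string-to-number conversion at work: since prevailing_wage is a numeric column, MySQL converts the string 'NA' to a number before comparing, and a string with no leading numeric prefix converts to 0 (MySQL issues a warning but not an error). So prevailing_wage = 'NA' behaves like prevailing_wage = 0 and matches those 30 rows. A rough Python sketch of that conversion rule (my own approximation of MySQL's documented coercion, not MySQL code):

```python
import re

def mysql_numeric_cast(s):
    """Approximate MySQL's string-to-number coercion: take the leading
    numeric prefix of the string, or 0 if there is none."""
    m = re.match(r"\s*[-+]?(\d+(\.\d*)?|\.\d+)", s)
    return float(m.group()) if m else 0.0

# 'NA' has no numeric prefix, so it coerces to 0, which is why
# prevailing_wage = 'NA' matches the rows where prevailing_wage = 0.
print(mysql_numeric_cast("NA"))       # 0.0
print(mysql_numeric_cast("65000.5"))  # 65000.5
```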