If you can solve these problems.. you may be ready for CCA-175 . Give it a shot!



If it is just one table with simple condition… I would go with “–where” clause


I meant in case if you have to convert from string to toFloat ( = just dont treat it as string)


@rajegaurav try this to see the 2 decimal points.
cast(order_item_product_price) as DECIMAL(10,2))


Thanks Swapna. It worked.


Problem 4 - Can Some one please tell me what mistake I am making. I am using --autoreset-to-one-mapper so there is no need to use --split-by

[shalinisarathykc@gw03 ~]$ sqoop import
–connect jdbc:mysql://ms.itversity.com:3306/retail_export
–username retail_user --password itversity
–query “select ORDER_STATUS,count(order_id)as ORDER_COUNT from shalini_orders where $CONDITIONS group by order_status order by count(order_id) desc”
–target-dir /user/shalinisarathykc/jay/problem4/avro-snappy
Warning: /usr/hdp/ does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/06/04 12:49:20 INFO sqoop.Sqoop: Running Sqoop version:
18/06/04 12:49:20 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
When importing query results in parallel, you must specify --split-by.
Try --help for usage instructions.


In problem 6, what is the use of order table? All the required details can be obtained from order_items table alone.


In your solution, you are taking count. Is that what is needed?
I believe by Order Total, the Op wants us to find the total revenue per order.


Again, what is the use of importing products data in problem 7?
The output can be achieved using orderitems table alone


Thank You . Problem updated based on feedback.


Q1. What does this statement mean: “To handle output as sequence… RDD(K,V) . K and V should be string”

Q2. Where do we need to use Hive Context in exam?

Thanks in advance !!


Hi, I am not sure if you are still facing it but i would like to help you.
You are not supposed to put “WHERE $CONDITIONS” when you are planning for serial import i.e., –autoreset-to-one-mapper. Its actually used for parallel imports and get replaced by min(splitby) and max(splitby) values for parallel imports dynamically during runtime. For more info, http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism


Q1 -
mappedRDD.saveAsSequenceFile(, Some(classOf[compressionCodecClass]))

mappedRDD(K,V) ==> K is a string. V is a string
K = custid.toString
V = fname+"|"+lname+"|"+address

Q2 -
Use hc in sparkshell if you want to read hive table directly (instead of using its hdfs path) using sql


Thanks mate for explaining.


Thank you that makes sense.


Can someone please review and confirm that code and output of problem #6 is as per problem statement.

sqoop import
–connect jdbc:mysql://quickstart.cloudera:3306/retail_db
–username root
–password cloudera
–num-mappers 1
–fields-terminated-by “\t”
–table orders
–warehouse-dir /scala_files/problem6

sqoop import
–connect jdbc:mysql://quickstart.cloudera:3306/retail_db
–username root
–password cloudera
–num-mappers 1
–fields-terminated-by “\t”
–table order_items
–warehouse-dir /scala_files/problem6

val ordersRDD = sc.textFile("/scala_files/problem6/orders")
val orderItemsRDD = sc.textFile("/scala_files/problem6/order_items")


val ordersDF = ordersRDD.map( o => (o.split("\t",-1)(0).toInt,o.split("\t",-1)(3).toString)).toDF(“order_id”,“order_status”)
val orderItemsDF = orderItemsRDD.map ( o => (o.split("\t",-1)(0).toInt, o.split("\t",-1)(1).toInt,o.split("\t",-1)(5).toFloat,o.split("\t",-1)(4).toFloat )).toDF(“order_items_id”,“order_id”,“product_price”,“order_subtotal”)



val ordersTotalDF = sqlContext.sql("select o.order_id, sum(oi.order_subtotal) as order_total " +
"from orders o, order_items oi " +
"where o.order_id = oi.order_id " +
"and order_status IN ( ‘CLOSED’,‘COMPLETE’) " +
“group by o.order_id”)


val opDF = sqlContext.sql("select ot.order_id, oi.order_items_id, oi.product_price, oi.order_subtotal, ot.order_total " +
"from orders_total ot, order_items oi " +
"where ot.order_id = oi.order_id ")


sqlContext.sql(“select * from jay_mock_orderdetails limit 10”).show
|order_id|order_items_id|product_price|order_subtotal| order_total|
| 231| 567| 49.98| 49.98|559.9300193786621|
| 231| 568| 129.99| 129.99|559.9300193786621|
| 231| 569| 39.99| 79.98|559.9300193786621|
| 231| 570| 299.98| 299.98|559.9300193786621|
| 431| 1072| 59.99| 239.96|639.9400177001953|
| 431| 1073| 399.98| 399.98|639.9400177001953|
| 831| 2072| 399.98| 399.98|579.9500122070312|
| 831| 2073| 59.99| 179.97|579.9500122070312|
| 1031| 2573| 129.99| 129.99|579.9800109863281|
| 1031| 2574| 50.0| 100.0|579.9800109863281|


Can someone please post scala code for problem 8 in dataframes and sparksql. I have dataframe with all the information but struggling to output it as sequence file. Thanks in advance !!


Hi Om,
Yes I know that, but I am not sure what it actually asks for. I changed it to sum(order
_item_subtotal) but no point in that, as order_item_subtotal is already been asked in the query, so sum(order_item_subtotal) does give the same output as order_item_subtotal since it is present in the group by clause.

If you have any suggestions. please say.



For problem 4 I do see to use hive table, why you are directly using mysql and do sqoop import?


Yes I know, it can be done from hive table easily, but if you follow the previous question, we have no where created the table in hive, we have done that in mysql. Hope you got my point.



For problem4 how to validate whether it is compressed with “snappy”? My understanding is output files in hdfs has to be .snappy.avro… But in my case it is not… My output show only .avro