If you can solve these problems, you may be ready for CCA-175. Give it a shot!


#44

Note: Avro/Snappy does not add a ".snappy" suffix to the output file name.
One way to find out whether compression really worked is to write the output with and without compression to two different directories and compare the file sizes.
My understanding is that gzip has no effect on Avro; only Snappy is used with Avro.
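For example, one way to run that comparison with the databricks spark-avro library (the DataFrame df and the paths here are placeholders):

import com.databricks.spark.avro._

// Write once uncompressed...
sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")
df.write.avro("/user/cert/compare/avro_plain")

// ...and once with Snappy, then compare with: hadoop fs -du -h /user/cert/compare
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")
df.write.avro("/user/cert/compare/avro_snappy")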


#45

PROBLEM 8

val order_items = sqlContext.read.parquet("/user/princefayaz84/jay/problem7/order_items")
order_items.registerTempTable("order_items")

import com.databricks.spark.avro._
val products = sqlContext.read.avro("/user/princefayaz84/jay/problem7/products")
products.registerTempTable("products")

val product_tab = sqlContext.sql("select p.product_id,p.product_price,count(order_item_id) order_count "+
"from products p "+
"left outer join order_items oi "+
"on p.product_id = oi.order_item_product_id "+
"group by p.product_id,p.product_price "+
"order by order_count desc"
)

product_tab.rdd.map(rec => rec.mkString("|")).map(rec => (rec.split("|")(0).toString,rec)).coalesce(1).saveAsSequenceFile("/user/princefayaz84/jay/problem8/tab_seq_file_tab")

hadoop fs -cat /user/princefayaz84/jay/problem8/tab_seq_file_tab/part-00000 |head -n 5

SEQorg.apache.hadoop.io.Text org.apache.hadoop.io.Text

val seqfile = sc.sequenceFile("/user/princefayaz84/jay/problem8/tab_seq_file_tab",classOf[org.apache.hadoop.io.Text],classOf[org.apache.hadoop.io.Text])

seqfile.take(10).foreach(println)

Hi All,

I am able to save the sequence file, but when I read it back from spark-shell using seqfile.take(10).foreach(println), it throws the error below:

ERROR TaskSetManager: Task 0.0 in stage 61.0 (TID 1276) had a not serializable result: org.apache.hadoop.io.Text
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.Text, value: 3)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (3,304|299.99|0))
- element of array (index: 0)
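In case anyone else hits this: org.apache.hadoop.io.Text is a Hadoop Writable and is not Java-serializable, so any action that ships it back to the driver (take, collect) fails. One common workaround (a sketch, not the only fix) is to convert the Writables to Strings on the executors before collecting:

val seqfile = sc.sequenceFile("/user/princefayaz84/jay/problem8/tab_seq_file_tab",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])

// map Text -> String first, so only serializable values reach the driver
seqfile.map { case (k, v) => (k.toString, v.toString) }.take(10).foreach(println)

Separately, note that rec.split("|") treats "|" as a regex alternation matching the empty string, so element (0) is just the first character (hence the key "3" in the stack trace); rec.split("\\|")(0) would give the whole first field.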


#46

For Problem 6, I believe we need to find the order_total (the combined total price for the order). When you group by all of order_id, order_item_id, order_item_product_price, and order_item_subtotal, you still get the same value. Instead, use the query below:

select oi.order_item_order_id ORDER_ID, order_item_id, order_item_product_price product_price, order_item_subtotal order_subtotal, order_total
from order_items oi,
(select order_item_order_id, cast(sum(order_item_subtotal) as decimal(10,2)) order_total from order_items group by order_item_order_id) k,
orders o
where k.order_item_order_id = oi.order_item_order_id and
o.order_id = oi.order_item_order_id and
o.order_status in ('CLOSED', 'COMPLETE');

Or, if we are using Hive, we can generate order_total with a window function (sum over partition by):

select o.order_id, oi.order_item_id, oi.order_item_product_price,
round(oi.order_item_subtotal, 2) order_subtotal,
round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2) order_total
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE', 'CLOSED')

Correct me if I am wrong.


#47

Hi Guys,
I was trying to solve the problems mentioned here. My concern is that when I use coalesce to limit the number of output files while saving Snappy-compressed Avro files, the spark shell gets stuck and I see a lot of exceptions and retries. Did anyone face this kind of problem? Is coalesce not supported with Snappy and Avro?
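For reference, this is roughly the pattern I am attempting (the DataFrame df and the output path are placeholders):

import com.databricks.spark.avro._

sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")
// coalesce(1) funnels all data through a single partition before the write
df.coalesce(1).write.avro("/user/cert/problemX/avro_snappy")

One thing worth noting: coalesce itself is codec-agnostic, but on a large dataset coalesce(1) leaves a single task writing everything, which can look like a hang even when nothing is actually wrong.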

Thanks in advance


#48

Hi,

For Problem 6, the question asks for order_total = the total price for a particular order, so the group-by column should be only order_id.

The approach below should work:

select a.o_id ORDER_ID, b.o_i_id ORDER_ITEM_ID,
cast(b.p_price as decimal(10,2)) PRODUCT_PRICE,
cast(b.total as decimal(10,2)) ORDER_SUBTOTAL,
sum(cast(b.total as decimal(10,2))) over (partition by a.o_id) ORDER_TOTAL
from ord a, item b
where a.o_id = b.o_i_id and order_status in ('CLOSED', 'COMPLETE')