CCA-175 Spark and Hadoop Developer Practice Tests: Assignment 6: CCA 175 Model Exam - Sqoop, Hadoop and Spark: Problem 3 using Scala

I bought the Udemy practice test cca-175-spark-and-hadoop-developer-practice-tests.

Assignment 6: CCA 175 Model Exam - Sqoop, Hadoop and Spark: Problem 3

My count does not match the count in the given answer. Can you please help me figure out what is wrong with my solution?

Has anyone solved this? What count did you get? My solution is given below. Please help.

======================

Problem Statement

Get the number of orders submitted by each customer for the month of January 2014.

Data Description

Data is available in the local file system at /data/retail_db. Customer information is in the customers directory and order information is in the orders directory.

Fields in customers: customer_id, customer_fname, customer_lname, customer_email, customer_password, customer_street, customer_city, customer_state, customer_zipcode

Fields in orders: order_id, order_date, order_customer_id, order_status

Cluster information:

  • Login to designated Gateway Node.

  • Username and Password are provided as part of the lab page.

  • If you are using your own cluster, make sure to login to gateway node.

Output Requirements

Place the imported data in the HDFS directory
/user/whoami/cca175_practice_test1/problem3/data/order_count_by_customer

Replace whoami with your OS user name.

  • Make sure output is saved only in one file

  • Data should be in JSON and should not be compressed

  • Output should contain customer_fname, customer_lname and order_count for the month of January, 2014.

  • customer_fname and customer_lname should be of type string and order_count should be integer or long or bigint.

  • Data should be sorted in descending order by order_count, then by customer_fname and then by customer_lname

End of Problem

There should be only one value to check, the record count, and it should be 4696.
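For reference, the single-file, uncompressed JSON requirement maps to a coalesce(1) before the write plus the JSON writer. A minimal sketch, where result is a placeholder for the final aggregated DataFrame and the path is the one from the problem statement:

// Minimal sketch: write a DataFrame as a single uncompressed JSON file
result.coalesce(1)                       // collapse to one output partition, hence one file
  .write
  .option("compression", "none")         // JSON output is uncompressed by default; stated explicitly
  .mode("overwrite")
  .json("/user/whoami/cca175_practice_test1/problem3/data/order_count_by_customer")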

========= My Solution ===========================

//problem 3
// Note: spark-shell auto-imports spark.implicits._ and org.apache.spark.sql.functions._;
// they are listed here so the snippet is self-contained outside the shell.
import scala.io.Source
import org.apache.spark.sql.functions._
import spark.implicits._

// Read the customers file from the local file system into a list of lines
val customersRaw = Source.fromFile("/data/retail_db/customers/part-00000").getLines().toList

val customerRDD = sc.parallelize(customersRaw)

// Parse out (customer_id, first name, last name)
val customersMap = customerRDD.map(rec => {
  val c = rec.split(",")
  (c(0).toInt, c(1), c(2))
})

val customerDF = customersMap.toDF("cust_id", "cust_fname", "cust_lname")

customerDF.createOrReplaceTempView("customer_tb")

// Read the orders file and parse out (order_customer_id, yyyy-MM-dd portion of order_date)
val ordersRaw = Source.fromFile("/data/retail_db/orders/part-00000").getLines().toList
val ordersRDD = sc.parallelize(ordersRaw)

val ordersMap = ordersRDD.map(rec => {
  val o = rec.split(",")
  (o(2).toInt, o(1).substring(0, 10))
})

val orderDF = ordersMap.toDF("order_cust_id", "order_date")

orderDF.select(col("order_cust_id"), to_date(col("order_date")).alias("order_dt")).createOrReplaceTempView("order_tb")

val janOrdersCountByCust = spark.sql("select cust_fname, cust_lname, count(*) as order_count from customer_tb join order_tb on (customer_tb.cust_id = order_tb.order_cust_id) where order_dt like '2014-01-%' group by cust_fname, cust_lname order by order_count desc, cust_fname, cust_lname")

janOrdersCountByCust.coalesce(1).write.format("json").mode("overwrite").save("/user/jyothi_v/cca175_practice_test1/problem3/data/order_count_by_customer") // count: 2596

My answer count: 2596

===================================================================



Hello,

I think your mistake is that you are not including customer_id in your query's GROUP BY: customers who share the same first and last name get merged into a single group.

scala> sqlContext.sql("select customer_fname, "+
     | "customer_lname, count(*) order_count from customers cust inner "+
     | "join orders ord on cust.customer_id = ord.order_customer_id "+
     | "where order_date between '2014-01-01' and '2014-01-31' "+
     | "group by customer_fname, customer_lname").count
res6: Long = 2596

On the other hand, if you also group by customer_id (the primary key of the customers table), the number of records is quite different:

scala> sqlContext.sql("select customer_id, customer_fname, "+
     | "customer_lname, count(*) order_count from customers cust inner "+
     | "join orders ord on cust.customer_id = ord.order_customer_id "+
     | "where order_date between '2014-01-01' and '2014-01-31' "+
     | "group by customer_id, customer_fname, customer_lname "+
     | "order by order_count desc, customer_fname, customer_lname").count
res2: Long = 4696

I hope this helps you get to the expected answer.

Yes, you are right. Since the expected output has only first name, last name and order count, I grouped by only those columns, which merged distinct customers who share the same name.
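For completeness, a sketch of the original solution with the fix applied: it reuses the customer_tb and order_tb temp views from above, adds cust_id to the GROUP BY, and aliases the columns to the required customer_fname/customer_lname names so the output still has only the three required columns. The variable name janOrdersCountByCustFixed is illustrative, and I have not re-run this against the dataset:

// Group by cust_id as well so distinct customers sharing a name are not merged;
// project only the three required columns and write a single uncompressed JSON file.
val janOrdersCountByCustFixed = spark.sql("select cust_fname as customer_fname, cust_lname as customer_lname, count(*) as order_count " +
  "from customer_tb join order_tb on (customer_tb.cust_id = order_tb.order_cust_id) " +
  "where order_dt like '2014-01-%' " +
  "group by cust_id, cust_fname, cust_lname " +
  "order by order_count desc, customer_fname, customer_lname")

janOrdersCountByCustFixed.coalesce(1).write.format("json").mode("overwrite").save("/user/jyothi_v/cca175_practice_test1/problem3/data/order_count_by_customer") // count: 4696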