Successfully cleared the CCA175 Hadoop exam - May 25

Congratulations!!! How did you connect to the Hive metastore table?

var hc = new org.apache.spark.sql.hive.HiveContext(sc); this is what worked for me in the exam.

I am happy and relieved that I passed the exam today on my second attempt. 8/9.
Thank you very much to Durga for the Udemy course, the labs, and the support.
Itversity Labs is awesome! Subscribe and practice. Don't hesitate to spend on this; it is 200% worth it.

Thanks to Arun's blogs. The drill exercises on various formats really help. Practice them many times.
Practice, practice, practice – all kinds of input formats, output formats, and compression techniques.

Note: If you check my previous post, I failed this exam about a month ago due to various hiccups. Read those stories so that you don't repeat my mistakes.
3 main tips - lessons learnt from my first-attempt failure:
Tip 1: Use sqoop eval to check MySQL table details (don't try the mysql client directly; it may not work. It did not work for me during my first attempt. This time I just relied on sqoop eval; it was smooth and easy). See the sketch after these tips.
Tip 2: Don't ever delete a target directory in HDFS… if you have to delete one for a valid reason, be very careful and don't delete the input source directory by mistake (that's what I did during my first attempt).
Tip 3: If you are stuck on something, move on to the next problem and try to nail the easy ones 100%. You can revisit the unfinished problem later if you have time. But don't struggle with one problem for a long time.
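For reference, a minimal sqoop eval sketch (the host, database, and credentials below are placeholders, not actual exam values):

sqoop eval \
  --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
  --username retail_user \
  --password somepassword \
  --query "SELECT * FROM orders LIMIT 5"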

On Exam Experience (8 correct, 1 incorrect)
I would say the exam was simple and straightforward. Others have said the same thing. No complicated, tricky questions. Just pay attention to the input/output/format details mentioned in each problem. It is very unlikely you will get problems on Kafka/real-time (but just brush up… you never know).

Commands to remember:
spark-shell --master yarn --packages com.databricks:spark-avro_2.10:2.0.1 --> this is very, very important. Just memorize it. If you don't, you might get the Avro questions wrong (see the sketch after this list).
hive --> you need this to get the HDFS path of a table by using "describe formatted". But you won't perform any processing here; you will mostly use spark-shell for your problems.
sqoop eval/import/export
hadoop fs -ls -R xxxxxxx
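For example, a minimal sketch of reading and writing Avro after launching with that package (the paths are made up for illustration):

spark-shell --master yarn --packages com.databricks:spark-avro_2.10:2.0.1

// then, inside spark-shell (Spark 1.x)
val df = sqlContext.read.format("com.databricks.spark.avro").load("/hypothetical/input/orders_avro")
df.write.format("com.databricks.spark.avro").save("/hypothetical/output/orders_avro_copy")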

Some high-level topics you may want to strengthen your understanding of and practice:

#sqoop import, export, eval. Practice sqoop import using --where and --columns (a sketch follows this list).
#Reading HDFS files (various formats): do minor transformations and filtering, use only selected columns, and save the output in various formats in HDFS.
#Reading the Hive metastore and tables: do some filtering and save the result in various output formats in HDFS.
#Practice RDD to DF. Register the DF as a temp table, then run a query on the temp table to produce output, assign it to a var, and use that var (a DF) to save the output in HDFS. Sometimes doing it the SQL way is a lot easier and faster than struggling with various RDD functions for transformations (mainly around grouping, joining, etc.); see the sketch after this list.
#Know when to use v(0) vs v._1: v(0) indexes into an RDD/DF row, while v._1, v._2, v._1._1, etc. access tuple fields in a map (a tiny illustration follows this list).
#You have to actually type the commands/code and practice repeatedly so that you can ace the exam.
#While reviewing, don't just read the commands and code. Type and practice.
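A minimal sqoop import sketch using --where and --columns (connection details and paths are placeholders):

sqoop import \
  --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
  --username retail_user \
  --password somepassword \
  --table orders \
  --columns "order_id,order_date,order_status" \
  --where "order_status = 'COMPLETE'" \
  --target-dir /hypothetical/output/orders_complete \
  --as-textfile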
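A minimal Spark 1.x sketch of the RDD-to-DF-to-temp-table flow (the path and column layout are assumptions for illustration):

import sqlContext.implicits._
case class Order(orderId: Int, orderDate: String, orderStatus: String)
val ordersRDD = sc.textFile("/hypothetical/input/orders")
val ordersDF = ordersRDD.map(_.split(",")).map(a => Order(a(0).toInt, a(1), a(3))).toDF()
ordersDF.registerTempTable("orders")
// run SQL on the temp table, assign the result to a var, then save it to HDFS
val result = sqlContext.sql("SELECT order_status, count(*) AS cnt FROM orders GROUP BY order_status")
result.write.json("/hypothetical/output/order_counts_json")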
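And a tiny illustration of v(0) vs v._1 (column positions are hypothetical):

val lines = sc.textFile("/hypothetical/input/orders")
val fields = lines.map(v => v.split(","))      // RDD[Array[String]]
val pairs = fields.map(v => (v(3), 1))         // v(3): index into an Array (or a Row)
val counts = pairs.reduceByKey(_ + _)          // RDD[(String, Int)]
val flipped = counts.map(v => (v._2, v._1))    // v._1, v._2: positions in a tuple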


I have created 9 problems for you to practice. Try them out.

Good Luck to folks preparing for this exam. Best Wishes!




Hi jayshawusa,

Thanks for your inputs and experience with the CCA175 exam. Could you please tell me:

  • whether we need to be good at both Scala and Python for working on RDD-related problems
  • whether we get the output format for all the problems in the exam

Thanks in advance

That's correct. Get "hc" using "sc" as you described.
Use "hc" to select records from the Hive table and assign the result to a var:
var orderDF = hc.sql("select * from orders")

Or

You can use "describe formatted dbname.tablename" in the hive shell,
get the corresponding HDFS path of the table, and then:
var ordersRDD = sc.textFile("/blahblah/dbname/tablename.xxxx")
(I did it this way in the exam.)

Just Scala is good enough. I didn't know a single line of Python.
I just used Scala, Spark RDD & DF, and Hive SQL.

They just care about the output matching the expected results.


Hey jayshawusa,

You have mentioned using === for filtering. May I know in what scenario we use === for filtering?

Scala is good enough… They clearly state the output format/location etc. No confusion.


For example, if you have to read Avro data from HDFS, and it contains multiple columns, and you are asked to filter records based on one column, then instead of converting the DF into an RDD and applying a condition, you can filter directly on the DF using ===.
This is my understanding. Feel free to explore other ways. Thx
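For instance, a minimal sketch (the package version, paths, and column name are assumptions):

// launched with: spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
val df = sqlContext.read.format("com.databricks.spark.avro").load("/hypothetical/input/orders_avro")
val completed = df.filter(df("order_status") === "COMPLETE")   // === builds a Column equality condition
completed.write.format("com.databricks.spark.avro").save("/hypothetical/output/orders_completed")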

Hi… I haven't understood the aggregateByKey function and ranking much. Is that fine from an exam perspective?

You can always solve such problems using Spark SQL with GROUP BY, for example min/max/count, etc.
(instead of RDD/DF functions like aggregateByKey, reduceByKey, groupBy).

I think reduceByKey is easier.
You can try learning aggregateByKey with its 2 parameter lists… but as I mentioned above, you can also solve these with SQL. See the sketch below.
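To make that concrete, a minimal sketch of the alternatives (the paths and column positions are made up; the temp table "orders" is assumed to be registered as in the earlier sketch):

// SQL way: GROUP BY on a registered temp table
val byStatus = sqlContext.sql("SELECT order_status, count(*) AS cnt FROM orders GROUP BY order_status")

// RDD way with reduceByKey
val counts = sc.textFile("/hypothetical/input/orders").map(_.split(",")).map(a => (a(3), 1)).reduceByKey(_ + _)

// aggregateByKey takes 2 parameter lists: a zero value, then (seqOp, combOp)
val revenuePairs = sc.textFile("/hypothetical/input/order_items").map(_.split(",")).map(a => (a(1), a(4).toDouble))
val maxRevenue = revenuePairs.aggregateByKey(0.0)(
  (acc, v) => math.max(acc, v),   // seqOp: fold one value into the per-partition accumulator
  (a, b) => math.max(a, b))       // combOp: merge accumulators across partitions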

Ranking is a good-to-know concept. If you get this question in the exam, there is no workaround other than using an actual ranking function… but relative to other topics, this one may come up less often in the exam…
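If it does come up, the SQL route via HiveContext might look like this (the table and columns are hypothetical; window functions in Spark 1.x require HiveContext):

val ranked = hc.sql("SELECT product_category_id, product_name, product_price, rank() OVER (PARTITION BY product_category_id ORDER BY product_price DESC) AS rnk FROM products")
val top3 = ranked.filter(ranked("rnk") <= 3)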

If you are subscribed to Durga's Udemy course… it explains the above concepts well.


@jayshawusa This is so helpful. Thank you so much. I was really breaking my head over this.
I am referring to the Itversity YouTube videos, but the concept of aggregateByKey with 2 parameter lists seems a bit tricky.
Anyway, I will start with and focus on Spark SQL for this. Thanks again.

Hi jayshawusa,

Congratulations for the success and thank you for sharing tips!!

A quick question: will it give a different output vis-a-vis hc or textFile if we use sqlContext to fetch records from the orders table?

Something like: sqlContext.sql("SELECT * FROM orders")?

Thanks!

Whether you use hc, a temp table, or RDD functions… as long as you treat the data types properly, you should not have issues.


@jayshawusa … Can you please share what MySQL connect string you used with --connect in the sqoop import command, and how to get the host name? Also, can you please share your mail ID? I have a few doubts.

Hi @jayshawusa, is it possible to store an RDD to Hive, and how?

Thanks for the explanation. I would probably use Spark SQL for filtering the data, as I am more comfortable with it.