I took the CCA 175 exam on 06-Oct-2019 (8 correct out of 9)

I just want to share my personal experience, so that it may be useful to those who are planning to take CCA 175.

Preparation: if you want to take the exam, at least 1 to 2 months of serious preparation is required.

I wanted to take the exam almost 10 months ago but did not prepare seriously. I did learn Python well in that time, which helped me when I started preparing seriously for the last 3 months.

Materials: I went through a lot of YouTube tutorials on HDFS, Python, PySpark2, etc., but finally stuck with the following ITVersity materials.

CCA 175 Spark and Hadoop Developer using Python

Apache Spark 2 using Python 3

Practice :

I registered for “ITVersity Labs” and set up the “Cloudera quickstart vm” on my laptop, practiced in both environments, and got to understand them better.

Once I was ready with my preparation, I booked the CCA 175 exam date and then practiced a few mock tests to understand time management.

Exam Experience: I got 1 Sqoop import, 1 Sqoop export, and 7 Spark questions in the exam.

I had camera issues initially, which took an extra 15 minutes to fix.

I completed all 9 questions, verified the results once again, and thought all the answers were right.

I attempted the Spark questions in pyspark2 with Spark SQL. One question gave me trouble with both Spark SQL and DataFrames, so I skipped it, completed all the remaining questions, and finally solved it using RDDs. (I had thought about leaving it out, since I had completed the other 8.) But the results said that one of those other 8 was not correct. So we should finish all the questions, in whichever form works: Spark SQL, DataFrames, or RDDs.

**Some important points**

  • Read each question twice before starting, then choose whichever of Spark SQL, DataFrames, or RDDs lets you finish it well.
  • Don’t spend time on a failed question; first finish all the remaining questions, then come back to the failed one. Budget 10 minutes per question.
  • Use Sublime Text and open one tab per question (9 tabs for 9 questions); this way you can review your answers before closing the exam.
  • Open multiple command prompts, e.g. for Hive, SQL, Spark, and HDFS.
  • Check the target paths and the sample data at the target.
  • Before closing/submitting the exam, review the questions and your answers once again.

Thanks to ITVersity for the material and the labs.


Congrats… Thanks for sharing the info

Hi @Rajeswara_Mukara,
Congratulations.
Where did you take the mock tests from? Also, did you find the Cloudera QuickStart VM different from the ITVersity labs?

Hi Rumanshi,

I tried the Udemy mock tests, Arun's blog, and some questions from the ITVersity discussion forum.

There is not much difference between the ITVersity labs and the Cloudera QuickStart VM, but in the VM you can handle Avro files and run some commands easily, without extra parameters/commands.


Hi Raj,

Congratulations… Do we have to submit the code anywhere as part of the exam, or do we execute the code in the REPL and save the final results in the given path? Please clarify.

Also, I am practicing on the ITVersity labs. Do we really have to set up the Cloudera QuickStart VM to practice? Is there any advantage to setting up a separate environment?

You said you prepared seriously for 3 months… Can you share some details? For example, did you spend most of those 3 months on practice assignments, on YouTube videos, or on something else?

I completed the Udemy course “CCA 175 Scala & Hadoop developer certification course” by ITVersity. I am currently working on the Udemy mock tests for CCA 175 by ITVersity. Once I complete these tests, I will move on to Arun's blog.

How much do the problems vary between the Udemy mock tests and the questions on Arun's blog?

BTW, how did you know that one question was wrong while solving the problems in the exam?

I appreciate your time addressing the above.


Hi Raj,

In the Udemy mock tests, some problems ask you to write the output to one text file, 4 text files, or one JSON file. I am able to do this using RDDs. Can you please help with how to define the number of output files when writing output from a DataFrame?

[FYI: I have not looked at the solutions given with the answers yet. After solving all the problems, I will verify my answers.]

Jyothi - I believe you can use the coalesce function to define the number of splits in the output. I will try this out and confirm.

Yes, Bala… I tried with repartition and it is working fine.

But there is a new problem: since the data gets shuffled, the output is not ordered.

nullStockNamesDF.repartition(1).write.mode("overwrite").format("text").option("header", "false").save("/user/<os_user_name>/spark_practice/problem4/data/no_stock_names_df")

I solved the sorting issue with:
nullStockNamesDF.repartition(1).sort().write.mode("overwrite").format("text").option("header", "false").save("/user/jyothi_v/spark_practice/problem4/data/no_stock_names_df")

I also tried coalesce(1) to preserve the sort order, and it is working fine:
nullStockNamesDF.coalesce(1).write.mode("overwrite").format("text").option("header", "false").save("/user/jyothi_v/spark_practice/problem4/data/no_stock_names_df")
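
To make the trade-off discussed above concrete, here is a minimal PySpark-style sketch of one way to control the number of output files while keeping each file sorted. The helper name `write_sorted_files` and its parameters are hypothetical and only for illustration; `repartition`, `sortWithinPartitions`, and the `write.mode(...).format(...).save(...)` chain are standard DataFrame methods. The idea is to sort after the shuffle rather than before it, so the shuffle cannot disturb the order.

```python
def write_sorted_files(df, path, sort_cols, num_files=1, fmt="text"):
    """Write `df` to `path` in up to `num_files` part files, each sorted by `sort_cols`.

    `df` is assumed to be a Spark DataFrame; nothing here creates a SparkSession.
    """
    (df.repartition(num_files)             # full shuffle into num_files partitions
       .sortWithinPartitions(*sort_cols)   # sort after the shuffle so the order survives
       .write.mode("overwrite")
       .format(fmt)
       .save(path))


# Possible usage in a pyspark2 shell (column name and path are placeholders):
#   write_sorted_files(nullStockNamesDF, "/user/<os_user_name>/some/output/dir", ["some_column"])
```

With `num_files=1` the single output file is fully sorted. With more than one file, each file is sorted internally, but the rows are not globally ordered across files.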

Hi Jyothi,

You can execute the code in the REPL (it is better to keep a backup of each solution in Sublime, so you can review the questions and solutions at the end of the exam).

On the ITVersity lab you need extra commands to handle Avro files; those are not required on the Cloudera QuickStart VM. With the VM you can also analyze the configuration files for Hadoop, Spark, etc.

I have a DataStage background, so I practiced all my functional scenarios in pyspark2, both to get the certification and to start work on my upcoming project. If it is only for the certification, just go through the 2 ITVersity materials and the mock tests, and go through the ITVersity blog; you will get a lot of suggestions and scenarios.

Just try both and see the difference between the mock tests and Arun's blog; Arun's blog is like a textbook or all-in-one :)

During the exam, if you see any errors while executing commands, or files are not generated at the target location, your solution is wrong. After the exam, the results are published by question number; if you remember the question and your solution, you can evaluate it.


Thanks Raj… After the Udemy practice tests, I will go for Arun's blog…

Hi Rajesh, or anyone who has taken the exam,

One clarification: do we get command-line help for the syntax/methods available on a class in the exam console?
Example: Source has a fromFile function. If I type scala.io.Source. in the ITVersity lab environment, it lists the available functions. Do we get something similar in the exam console too?
Please clarify.


Hi Jyothi,

I don't have much exposure to Scala, but we do get help for spark/sqoop/hdfs commands, etc. in the REPL.


Hi Rajesh,

Did you make use of that help during the exam?


Hi Rajesh, [Others],

I solved the problem below… I need help if anyone has ideas for other ways to solve it, e.g. using aggregate functions with RDDs; I want to learn different approaches to be confident… Can you please share?

Get top 5 performing stocks by volume per day using NYSE End of Day trade data.

Data Description

NYSE End of Day stock data is available in HDFS under /public/nyse_all/nyse_data

stockticker should be of type string.

tradedate and volume should be of type long or bigint.

Rest of the fields should be of type float or double.

Cluster information:

  • Login to designated Gateway Node.

  • Username and Password are provided as part of the lab page.

  • If you are using your own cluster, make sure to login to gateway node.

Output Requirements

Place the imported data in the HDFS directory

Replace whoami with your OS user name

  • Make sure the output is saved in two files.

  • Data should be in Parquet and should be compressed using snappy

  • Output should contain all the 7 fields from NYSE End Of Day Trade Data for top 5 by volume.

  • Data should be sorted by tradedate in ascending order and then volume in descending.

  • stockticker should be of type string.

  • tradedate and volume should be of type long or bigint.

============= Solution ===============

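// Read every file under the directory as (path, content) pairs
val nyseData = sc.wholeTextFiles("/public/nyse_all/nyse_data")

// Split each file's content into individual lines
val nyseDaily = nyseData.flatMap(rec => rec._2.split("\r?\n"))

// The map body was cut off in the post; this tuple is reconstructed from the
// column names in toDF below and the types given in the problem statement
val nyseDailyMap = nyseDaily.map(rec => {
  val st = rec.split(",")
  (st(0), st(1).toLong, st(2).toFloat, st(3).toFloat, st(4).toFloat, st(5).toFloat, st(6).toLong)
})

// toDF relies on spark.implicits._, which spark-shell imports automatically
val nyseDataDF = nyseDailyMap.toDF("stockticker", "tradedate", "openprice", "closeprice", "highprice", "lowprice", "volume")

// The query below reads from nyseDataTB, so the DataFrame must be registered under that name
nyseDataDF.createOrReplaceTempView("nyseDataTB")

// Rank the stocks by volume within each trade date and keep the top 5
val nyseDataSort = spark.sql("select stockticker,tradedate,openprice,closeprice,highprice,lowprice,volume from (select stockticker,tradedate,openprice,closeprice,highprice,lowprice,volume, rank() over (partition by tradedate order by volume desc) rank_no from nyseDataTB where volume != 0) where rank_no<=5")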
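
As one possible answer to the request for an RDD-based alternative, here is a minimal sketch in Python (the language the original poster used for the exam) of the two functions that `aggregateByKey` needs to keep a running top 5 per trade date. The names `TOP_N`, `seq_op`, `comb_op`, and `parse` are illustrative, not taken from any course material, and the field order simply follows the column names used in the thread. Note one difference from the `rank()` query: this version keeps exactly five rows per date even when volumes tie, whereas `rank() <= 5` can return more.

```python
import heapq

TOP_N = 5  # the problem statement asks for the top 5 stocks per trade date


def parse(line):
    """Split one CSV line into the seven typed fields used in the thread."""
    st = line.split(",")
    return (st[0], int(st[1]), float(st[2]), float(st[3]),
            float(st[4]), float(st[5]), int(st[6]))


def seq_op(top, record):
    """Fold one record into the running top-N list for its trade date."""
    return heapq.nlargest(TOP_N, top + [record], key=lambda r: r[6])


def comb_op(left, right):
    """Merge the top-N lists built for the same date on two partitions."""
    return heapq.nlargest(TOP_N, left + right, key=lambda r: r[6])


# With a live SparkContext `sc` (not created here), these could be wired up as:
#   top5 = (sc.textFile("/public/nyse_all/nyse_data")
#             .map(parse)
#             .filter(lambda r: r[6] != 0)
#             .keyBy(lambda r: r[1])                 # key by tradedate
#             .aggregateByKey([], seq_op, comb_op)   # (tradedate, top-5 records)
#             .flatMap(lambda kv: kv[1]))
```

Because each key only ever holds a list of at most five records, this avoids collecting a whole day's trades in memory the way `groupByKey` would.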