Passed CCA 175 on 16th Jan 2021

Hello everyone. I am happy to share that I cleared the CCA 175 exam with a score of 9/9 (100%).

Here are some comments:

  • Did it in Scala using spark-shell. I just ran spark-shell with yarn as the master, and nothing more. You don't need to set the --packages param to import Avro dependencies; it is automatically included in the spark-shell session, and you just need to use “import” in the code.
  • I only used the terminal with 2 tabs (spark-shell and hdfs). I wrote my code directly in the shell, without a text editor, and solved all the questions without restarting the spark-shell session. (All this was my own experience; I'm not saying it is the best way to do it.)
  • Everything was solved using Spark SQL.
  • Before the exam, just make sure that you are comfortable reading and writing data in all the different formats and compressions (see the sketch after this list). In my case, the formats were text (using different delimiters), Parquet, Avro, and ORC. The only compression was Snappy.
  • 2 questions used Hive: the first one just used a Hive table as the input source; the second one wrote data into a non-existent database/table.
  • When writing sorted data, make sure you coalesce it into 1 single partition.
  • Spark built-in functions: concat and substring.
  • 1 simple question used a join.
  • None of the text input data had a header, so be ready to deal with “_c” columns.
  • All questions included an example of the output. None of them had a header either.
  • 1 or 2 questions did not mention the format of the input data. In those cases, I used the hdfs “tail” command to see what the data looked like. Both were text.
  • I finished everything in 1h15min, and took 30 minutes to review it, doing simple validations using Spark code and hdfs commands.
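
To make these bullets concrete, here is a rough sketch of the kinds of reads and writes involved. All paths, table names, column names, and delimiters below are placeholders, not the actual exam data:

    // Read delimited text with no header; Spark names the columns _c0, _c1, ...
    val raw = spark.read
      .option("sep", "\t")                       // placeholder delimiter
      .csv("/placeholder/input/orders")

    // Rename the _c columns so the rest of the code is readable
    val orders = raw.toDF("order_id", "customer_id", "amount")    // placeholder names

    // A simple join plus the concat/substring built-ins
    val customers = spark.read.parquet("/placeholder/input/customers")
    val result = orders.join(customers, Seq("customer_id"))
      .selectExpr(
        "order_id",
        "concat(first_name, ' ', last_name) as full_name",
        "substring(order_id, 1, 4) as order_prefix")

    // When the output must be sorted, coalesce to a single partition before writing
    result.orderBy("order_id")
      .coalesce(1)
      .write
      .option("compression", "snappy")
      .parquet("/placeholder/output/parquet")

    // Same pattern for the other formats
    result.write.option("compression", "snappy").orc("/placeholder/output/orc")
    result.write.format("avro").save("/placeholder/output/avro")  // on older CDH the format name may be "com.databricks.spark.avro"

    // Hive: read an existing table, then write into a database/table that doesn't exist yet
    val src = spark.sql("SELECT * FROM some_db.some_table")       // placeholder table
    spark.sql("CREATE DATABASE IF NOT EXISTS result_db")
    src.write.saveAsTable("result_db.result_table")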

Hope my experience helps you.

Good luck!

Congratulations, and thanks for sharing your experience in detail.
In the exam, do they mention whether the CSV data has a header or not? While practicing, I have made a habit of checking the data before writing code. I am currently struggling to complete the practice tests in 2 hours. Can you please share your suggestions so that I can reduce the time I need to solve them?

Also, please tell me which text editor they provide, so that I can practice with it. The intention is to reuse some of the generic syntax across different solutions.

There was no mention of CSV headers on my exam. I manually verified whether a header existed, and didn't find one in any question.
For the output format, every question has an example of the output data, so you can see whether there is a header or not. Again, for me, none of the questions asked for headers in the output.

The easiest way to verify whether a file has a header is simply println(spark.read.textFile("/placeholder/input").first), filling in your input path.
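
And if a file does turn out to have a header, the csv reader's header option handles it; a small sketch (the path and delimiter are placeholders):

    val df = spark.read
      .option("sep", ",")            // placeholder delimiter
      .option("header", "true")      // first line becomes the column names
      .csv("/placeholder/input")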

About the text editor, I honestly don't remember the name, but I recommend using Sublime to study.

Thanks a lot for the explanation and for sharing the command :)
In the exam, do we need to store the program/solution we created too, or do they just expect the output of the question in the respective path/table given in the question?

You don't need to store your code, only the results in the specified directories.

@bru when you say you used spark-shell, do you mean spark2-shell?

Can someone please guide me on this?

When you say you used spark-shell, do you mean spark2-shell? The spark.sql commands work only in spark2-shell and not in spark-shell, so I wanted to know if I am missing something here.

Hi, I am not sure I understood your point. You just need to run the spark-shell command in the console to start a Spark shell with Spark version 2.4 on it.

Thanks for the response.

In the Itversity lab, when I start with spark-shell, it connects me to Spark version 1.6.3. This version supports sc (SparkContext) but not the spark.sql commands.

When I start with spark2-shell --master yarn --conf spark.ui.port=0 --num-executors 10 --executor-memory 3G --executor-cores 2, it connects me to Spark version 2.3.0.2.6.5.0-292. This version supports spark.sql but not sc (SparkContext).

  1. So the question I was trying to ask is: if I log in using spark-shell, then I can't use spark.sql commands. Is this specific to the Itversity environment, and are you saying that in the exam we can access both sc and spark.sql commands using spark-shell?

  2. Another question, if you could help, it would be great: do you have sample code for reading a sequence file using spark.sql or the spark.read option?

Don't worry: in the test environment, you simply need to run spark-shell --master yarn, and the session will be Spark 2 with SQL working well.
As for sequence files, I didn't see them on the exam, and for sample code, better to ask Google than me!
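
For reference, a minimal sketch of the usual RDD-based approach (sequence files aren't a spark.read source; the key/value types and path here are assumptions):

    import org.apache.hadoop.io.Text
    import spark.implicits._   // already in scope in spark-shell

    // Assuming Text keys and Text values; copy to Strings because Hadoop reuses the Writable objects
    val rdd = sc.sequenceFile("/placeholder/seqfile", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }

    // Convert to a DataFrame if you want Spark SQL on top
    val df = rdd.toDF("key", "value")
    df.show(5, false)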

spark2-shell is just a shell script.
You can type this command on the gateway node:
which spark2-shell
It will show you that it's just a script file in a /bin/ folder.
If you look inside the file, you'll see that it just sets the Spark version to 2 and then runs the spark-shell command.
So when you type spark2-shell, you are starting the spark-shell command with the Spark version set to 2.

I think there are just the spark-shell, spark-sql, and pyspark CLIs that come with Spark. There's no specific spark2-shell for Spark version 2.

Yes, when you run spark-shell in the exam, you'll have a SparkSession object named spark and a SparkContext object named sc.
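
You can sanity-check both handles as soon as the shell starts; a quick illustrative check:

    println(spark.version)        // the SparkSession is pre-created
    println(sc.applicationId)     // so is the SparkContext
    spark.sql("SELECT 1").show()  // and Spark SQL works out of the box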

During the exam, do they provide a text editor?

I haven't taken the test. According to this post, a text editor is provided.

Thanks, Dilip, for the detailed explanation.

@bru congrats, sir. I want to ask you one question: how did you know the file was in text format by using hdfs “tail”?

How much time did it take you to complete the course, and are any programming languages required? I was in Hadoop administration, but I want to start learning for the CCA 175 examination.

Hi, I have a few questions…

  1. How did you read data from the Hive table?
    Was the Hive metastore integrated with Spark SQL,
    or did you have to launch the Hive shell itself?

  2. Did you try pressing Tab after a function to get the function definition in spark-shell?

  3. Can we open multiple tabs in the terminal?

  4. Do we need to import the Databricks library for Avro data, or is it already integrated?

  5. Was documentation provided?