Hello everyone. I am happy to share that I cleared the CCA 175 exam with a score of 9/9 (100%).
Here are some comments:
- Did it in Scala using spark-shell. I just ran spark-shell with yarn as master, and nothing more. You don't need to set the --packages param to import the Avro dependencies. They are automatically included in the spark-shell session, and you just need to use “import” in the code.
- I only used the terminal with 2 tabs (spark-shell and hdfs). I wrote my code directly in the shell, without a text editor, and solved all the questions without restarting the spark-shell session. (All this was my own experience; I'm not saying it is the best way to do it.)
- Solved everything using Spark SQL.
- Before the exam, just make sure that you are comfortable reading and writing data in all the different formats and compressions. In my case, the formats were text (using different delimiters), Parquet, Avro, and ORC. Compression was only Snappy.
- 2 questions used Hive. The first one just used a Hive table as the input source. The second one asked to write data into a non-existent database/table.
- When writing sorted data, make sure you coalesce it into a single partition.
- spark built-in functions: concat and substring
- 1 simple question using join
- None of the text input data had a header, so be ready to deal with “_c” columns (_c0, _c1, ...).
- All questions had an example of the output. None of them had a header either.
- 1 or 2 questions did not mention the format of the input data. In this case, I used the hdfs “tail” command to see what the data looked like. Both were text.
- I finished everything in 1h15min, and took 30 minutes to review it, doing simple validations using Spark code and hdfs commands.
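
To make a few of the points above concrete, here is a minimal practice sketch for spark-shell. It is not taken from the exam: the scratch directory, the sample rows, the column names, and the sketch_db.order_summary table are all made up for illustration. It assumes the `spark` session that spark-shell creates for you. It covers headerless delimited text with “_c” columns, a simple join, concat and substring, a sorted write coalesced into one partition with Snappy compression, and a write into a database/table that does not exist yet.

```scala
// Intended for spark-shell, where `spark` (a SparkSession) already exists.
import org.apache.spark.sql.functions.{col, concat, lit, substring}

// Hypothetical scratch directory; pick any path you are allowed to write to.
val base = "/tmp/cca175_sketch"

// Create two tiny headerless, pipe-delimited text inputs to practice on.
// The rows are invented sample data.
spark.sql(
  """SELECT * FROM VALUES
    |  (1, 'Ana', 'Silva'),
    |  (2, 'Bob', 'Stone')
    |AS t(id, first_name, last_name)""".stripMargin)
  .write.mode("overwrite").option("sep", "|").csv(s"$base/customers")

spark.sql(
  """SELECT * FROM VALUES
    |  (10, 1, '2014-07-01'),
    |  (11, 2, '2013-12-25'),
    |  (12, 1, '2014-01-15')
    |AS t(order_id, customer_id, order_date)""".stripMargin)
  .write.mode("overwrite").option("sep", "|").csv(s"$base/orders")

// Text input without a header: columns come back as _c0, _c1, ... (all strings),
// so rename and cast them explicitly.
val customers = spark.read.option("sep", "|").csv(s"$base/customers")
  .select(
    col("_c0").cast("int").as("customer_id"),
    col("_c1").as("first_name"),
    col("_c2").as("last_name"))

val orders = spark.read.option("sep", "|").csv(s"$base/orders")
  .select(
    col("_c0").cast("int").as("order_id"),
    col("_c1").cast("int").as("customer_id"),
    col("_c2").as("order_date"))

// Simple join plus the two built-in functions mentioned above.
// substring positions are 1-based.
val result = orders.join(customers, "customer_id")
  .select(
    col("order_id"),
    concat(col("first_name"), lit(" "), col("last_name")).as("full_name"),
    substring(col("order_date"), 1, 4).as("order_year"))
  .orderBy(col("order_id"))

// Sorted output coalesced into a single partition, written as Snappy-compressed Parquet.
result.coalesce(1).write.mode("overwrite")
  .option("compression", "snappy")
  .parquet(s"$base/output_parquet")

// Writing into a database/table that does not exist yet (hypothetical names).
spark.sql("CREATE DATABASE IF NOT EXISTS sketch_db")
result.write.mode("overwrite").saveAsTable("sketch_db.order_summary")

// Quick validation: read the output back and look at it.
val check = spark.read.parquet(s"$base/output_parquet")
check.show()
```

For the review step, the same kind of read-back check pairs well with hdfs commands such as `hdfs dfs -ls` on the output directory to confirm that a single data file was written.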
Hope my experience is going to help you.