Passed CCA175 on August 19, 2020, finished in 90 min

I passed my CCA175 exam on 2020-08-19. Many thanks to Durga and this lab.

[Exam Outcome] :
I actually spent 30 minutes working with the monitor on a caps-lock-not-working problem, but I still passed the exam with 7/9 (because I didn't have any time left to check my answers). My point is not that I am good;
my point is that Durga's class and practice examples are really enough.

[About the exam]:

1. All questions can be solved with Spark SQL, i.e. inside spark.sql("…"). I used the pure DataFrame API only for reading and writing (and for supplying the schema to the data); I did all transformations in spark.sql().

2. I did not use any other pyspark parameters such as --executor-cores. I only used pyspark --master yarn, and in my case that was enough to solve every problem.

3. The test environment doesn't allow Ctrl+C / Ctrl+V, but you can open two terminals (one for hadoop fs -cat and the other for Spark), and there is a text editor (I forgot its name) you can use. I wrote all my code in it and pasted it into the terminal.

4. I don't think there were any overly hard transformations or overly hard SQL. The hardest parts, I think, were just joining tables, grouping data, and basic WHERE clauses. In my exam, not even subqueries or window functions were needed (but better to prepare for them in case you need them).

5. Knowing how to read and write files in different formats and with different compression codecs is very, very important, and so is supplying the correct schema for the data.

6. In my exam there were two questions related to Hive tables: one was saving a DataFrame to a Hive table in a specific file format and compression, and the other was reading a table and processing it.

7. I had a problem with upper and lower case during the exam (and I am sure it was not my fault, because I could switch between upper and lower case smoothly in the PSI chat box, which also runs in Chrome…). If you hit
this problem, you are not alone: contact the exam monitor, and he or she will report the problem to Cloudera.

8. Remember to prepare one form of government-issued photo ID (e.g. driver's licence, passport); for the details, please look on the PSI website. I strongly suggest preparing a good camera so you can show a clear photo of your ID for PSI to authenticate your identity.

[How I Prepared and Preparation Steps]

  1. I took the class on Udemy. I am not sure whether it is also on YouTube, but his class is really good enough.

  2. I used the itversity lab and practiced Exercises 01–06 under this forum topic: Certifications => Spark Exercises (http://discuss.itversity.com/c/certifications/spark-exercises).

  3. I wrote my own PySpark notes. I really went to the Spark documentation and tried to understand each function I was using. For example, you need to know what spark.read gives you and what spark.read.option() does. Try to understand the basic classes and functions (if you understand them, you will memorize them).

  4. I did the six exercises again, without looking at the answers, tried to learn what I didn't know, and saved it to my notes. Try to improve your ability to read and save files across different formats and compressions.

  5. Practice all the examples in https://github.com/dgadiraju/itversity-books/tree/master/CCA175%20Scala

The above is all my experience; it won't necessarily work 100% for you.
That's all. Good luck to whoever is going to take the exam, and thanks Durga again!



Hi,
Congratulations on passing the exam. I purchased the training materials as well, but not the labs yet. The training material is based on Spark 1.6. What is the Spark version in the ITVersity lab? 2.4?

Thanks
Liangbin

I used pyspark2 in the itversity lab and did all my exercises with it, and its version is enough.
I checked today (8/30): the version is Spark 2.3.0.2.6.5.0-292.

Hi,
Thanks for the reply. I eventually signed up for his lab, and it is worth it. Now, after going through his training and reading your post again, I have several more questions, and I hope you can spare some time.

  1. When you write a DataFrame to a CSV file, do you include the header?
  2. Are you able to write to the same folder multiple times? I guess you can. If so, how did you check whether the files were compressed?
  3. You mentioned Ctrl+C / Ctrl+V is disabled; how did you copy code from the text editor to the console? Select and right-click on the console? Does the text editor have an auto-completion function?
  4. You mentioned caps lock didn't work. Did you try the on-screen keyboard?

Sorry, I have never taken the exam, so I have a lot of questions. Thanks in advance.

Liangbin

Congrats bro… Could you please help me with the steps to prepare for the CCA 175 certification? I have no idea where to start or where to find the study content.

The study content I used is on Udemy: https://www.udemy.com/course/cca-175-spark-and-hadoop-developer-python-pyspark/. It is Durga sir's class, and it uses the itversity lab for teaching (I also used this lab to practice).
If your budget is limited, try to find his tutorials on his YouTube channel (Durga sir is a really nice guy, again).

After going through his class, all the preparation steps are in my post above, in the preparation-steps section.

1. As far as I can remember, no question specifically asked me to save a file with a header (since the majority of my questions asked for Parquet output, where the header question doesn't arise), but I really recommend knowing how to save (and read) CSV with a header in case you face such a question in the test.

2. Yes, you can. The whole test environment is under your control: you can use write.mode("overwrite"), or you can simply hadoop fs -rm the files directly. The sad part is that I didn't have any time to check my answers, because I spent 30 minutes reporting the environment problem. As for checking whether files are compressed: the safest and most accurate way is to read them back with the matching compression settings and show them, but there is also a quick trick: save the data with different compression codecs and use hadoop fs -ls to compare the file name extensions.

3. Yes, as you said: select the text, then right-click on the console to copy and paste. And no, it doesn't have an auto-completion function; it is a plain text editor.

4. No, I didn't have any time to look for an on-screen keyboard in the test environment. I just wasted a lot of time telling my exam monitor, and he said he would report it to Cloudera.

No need to say sorry; feel free to ask any question about the exam.

Thanks so much for your reply. I really appreciate it.
More questions:

  1. Just to clarify, you didn't include the header when exporting CSV, right?
  2. To export a DataFrame to a text file, df.write.csv() is good enough, right? Did you use the old saveAsTextFile() method?
  3. Last year some people hit a "UnicodeEncodeError" when exporting text files; did you have the same issue?
  4. For Avro, did you start Spark with: pyspark2 --master yarn --packages com.databricks:spark-avro_2.11:4.0.0?
  5. For reading and writing Avro, is it: spark.read.format('com.databricks.spark.avro').load('…') and df.write.format('com.databricks.spark.avro').save('…')?
  6. Did they ask you to read any file from the local file system?
  7. Is the Spark 2.4 documentation provided to you during the exam?

Thanks again

1. Sorry, I really cannot remember whether I included the header when the question didn't specifically require me to do so.

2. Really good question. From my experience, yes, it's good enough. I remember one question told me to store the outcome in "textfile" format with the separator "\t". I used df.write.csv() to save it, which means the extension of the HDFS files was 'csv', not 'text', but the outcome for that question was Pass. It really depends on how Cloudera grades it officially, but they gave me a pass for that question.

3. No; luckily for me, I didn't face that issue.

4. Good question! No, I am sure I didn't use --packages com.databricks:spark-avro_2.11:4.0.0. I launched "pyspark2 --master yarn" once and used it throughout the exam, but I could still save a file as Avro using df.write.format('avro'). (My exam had one question asking me to save a DataFrame as Avro, but it didn't give me an Avro package name, so I intuitively assumed I didn't need to include one.) I passed that question eventually.

5. Same as answer 4: no, I didn't use com.databricks.spark.avro, just format('avro').
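For context, a hedged sketch of how the Avro package flags usually differ by Spark version (the version numbers below are illustrative, not what the exam uses; the exam cluster evidently had Avro support preconfigured, which is why no --packages flag was needed):

```shell
# Spark < 2.4: Avro support came from the external Databricks package
pyspark2 --master yarn --packages com.databricks:spark-avro_2.11:4.0.0

# Spark 2.4+: Avro ships as Spark's own external module instead
pyspark2 --master yarn --packages org.apache.spark:spark-avro_2.11:2.4.0

# Either way, the short format name then works inside the session:
#   df.write.format("avro").save("...")
#   spark.read.format("avro").load("...")
```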

6. No, I didn't face any question asking me to read a file from the local file system.

7. Yes, but I could not find the Hive documentation. I did use the Spark documentation during the exam: there was a question about saving a file as a Hive table in Parquet format, so I looked up df.write.format("parquet").saveAsTable(). But I recommend not searching the documentation too often, because it wastes a lot of your time.

Thanks so much for the detailed reply. I feel much more confident now :slight_smile:

One more question about Hive: I realized that when we use df.write.saveAsTable(), the rows in the Hive table will not be sorted, even though the DataFrame is already sorted. Did your question ask you to sort the table? Did you specify anything in .option()? So far the only way I have figured out is to create the table with a schema and insert into it with ORDER BY in SQL.

No, my question didn't ask me to do that. How about sorting the DataFrame and then df.coalesce(1).write.saveAsTable()? I am not sure whether it will work.

It works. It is way easier than my approach. Thanks so much.

Friendly reminder: try to understand this function, because it also works with file formats other than Hive tables, and you can find it in the Spark exercises (http://discuss.itversity.com/c/certifications/spark-exercises).

Sure. Thanks. It is easier than setting the shuffle partitions.

Congrats man!!
How were you able to code without using caps lock?

It worked sometimes, so when I needed an upper-case character, I basically copied it from an earlier occurrence…

Hi, congratulations. I am planning to take this exam next week. Could you please clarify my questions below?
1. When you saved data as Parquet/ORC, did you set the compression option to "uncompressed"? If the question doesn't mention anything about compression, we should not compress it, right? Because if we don't set the compression option, ORC/Parquet uses Snappy by default.
2. While saving data to a text file, did you include the header, or did you just save the data without the header column?
3. When you were dealing with SQL joins, did you get any duplicate records? If yes, did you use distinct to resolve them? Can we use distinct if the question doesn't specify anything about handling duplicates?
4. Do they mention the number of records expected in the outcome for each question?
Please clarify my questions; your answers would be very helpful. Thanks in advance.

Hi Liangbin_Chen, how do you save sorted data into a MySQL table? The spark.sql() approach is useful when we save data into a Hive table, right?

  1. Right, I would not compress it unless the question asked me to do so.
  2. I really cannot remember whether I included the header when the question didn't specifically require me to do so.
  3. No, I didn't face any duplicate records. I don't know whether we can use distinct if they don't specify anything, but as far as I remember, the join questions in the exam were super simple, so I didn't think too much about it.
  4. No, they didn't mention the expected number of records.