Passed CCA 175 - 11 Feb 2021

Passed CCA 175 with 8/9 score.

The VM is a bit slow, so take a second to make sure your logic and syntax are correct before performing any action on a dataframe. Practice reading and saving to different file formats with different compressions. There was a question where I read from 2 files, joined them and saved the result back to HDFS. For any complex transformations I used a temp view and Spark SQL.
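
As a rough sketch of that kind of question (the paths, schemas and join key below are hypothetical, just to illustrate the pattern):
# read two delimited files, join them, and run the aggregation through a temp view
orders = spark.read.csv("/user/cert/q1/orders").toDF("order_id", "customer_id", "amount")
customers = spark.read.csv("/user/cert/q1/customers").toDF("customer_id", "name")
orders.join(customers, "customer_id").createOrReplaceTempView("joined")
result = spark.sql("SELECT name, SUM(CAST(amount AS DOUBLE)) AS total FROM joined GROUP BY name")
result.write.mode("overwrite").parquet("/user/cert/q1/solution")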

There was a problem while using parquet-tools and avro-tools to verify the final files. I tried passing the HDFS path with and without the namenode URI and it didn't work. Finally I copied one of the files I needed to verify to the local filesystem and used parquet-tools and avro-tools with the local filesystem path, and that worked. I'm sure there was some mistake in the way I was passing the HDFS file path, but anyway I found a way around the issue.
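
The workaround looked roughly like this (the file and directory names here are hypothetical):
# copy one part file from HDFS to the local filesystem, then verify it locally
hdfs dfs -get /user/cert/q1/part-00000.gz.parquet /tmp/
parquet-tools cat --json /tmp/part-00000.gz.parquet
# same idea for avro output
hdfs dfs -get /user/cert/q2/part-00000.avro /tmp/
avro-tools tojson /tmp/part-00000.avro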

For saving to text file format I used the concat_ws function to concatenate all the columns into a single column. This works even if the columns are of numeric data types.
from pyspark.sql.functions import concat_ws
result = df.select(concat_ws("\t", df.col1, df.col2))

I practiced a lot. Before I started answering any question, I made sure I understood all the key requirements of the answer. I made a mental note of all the steps I needed to answer the question. Just be organized in your approach.

Thanks Mr. Durga and Itversity team for the lab environment and the video lessons.



Congrats sir, I want to ask you a question. What do you mean by this statement: "There was a problem while using the parquet-tools and avro-tools"? Can't we read/save avro files without pyspark --packages org.apache.spark:blabla?

Thanks sir

Congratulations @ddileepkumar
I have a few queries. Generally I am comfortable writing an entire command on a single line in Notepad, but the same is not possible in the lab. I am just checking whether you used \ at the end of every line or the editor allows it.
Also, do we have to call any package for avro or is it already loaded?

For the exam you can use the pyspark commands spark.read.parquet() or spark.read.format("avro").load() to read the files. parquet-tools and avro-tools are external tools used to read the contents of these files for verification purposes. Also, for the exam you don't need to download external packages by calling pyspark with the --packages argument. Avro and Parquet are supported in the exam Spark environment.

I wrote all of my commands on a single line; I didn't use \. I'm fairly certain you can type \ and break a command into multiple lines in the exam terminal. The avro jars are already integrated into the exam Spark environment, so we don't need to download any packages.
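
For example, something like this works in the pyspark shell (the path is hypothetical):
df = spark.read.format("csv") \
    .option("sep", ",") \
    .load("/user/cert/q1/data")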

Hi @ddileepkumar,
Could you please share the below details if possible?
A code snippet to read data and write it in Avro format, and how to validate it using avro-tools.
The same for the Parquet format too.
It would be very helpful for us.

Thanks,
Sam

Thanks @ddileepkumar. Which text editor is provided in the exam, so that I can practice with it?
Also, do we need to read different data for every question, or is one dataset used across several questions? The reason I ask is that creating a dataframe by reading data also consumes a lot of time, and I am worried that I may not be able to solve all the questions if that is the case.

On the exam, to read avro and parquet files:
df = spark.read.format("avro").load("pathtofile")
df = spark.read.parquet("pathtofile")

To read parquet files with parquet-tools:
parquet-tools cat --json hdfs://user/cert/q1/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

To read avro files with avro-tools:
avro-tools tojson hdfs://user/cert/q1/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.avro

I'm sorry, I didn't notice which text editor was available. I just used the terminal. I think someone has mentioned the text editor in one of the other CCA-175 success posts.

Each question has its own separate dataset. Each question has its own folder in HDFS with the input data stored in it. One needs to be very comfortable with reading files of different formats, compressions and delimiters. You can practice in the lab environment.
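
For example, a few read patterns worth practicing (the paths are hypothetical; gzip-compressed text inputs are decompressed transparently on read):
df1 = spark.read.option("sep", "\t").csv("/user/cert/q2/tab_delimited")
df2 = spark.read.parquet("/user/cert/q3/parquet_data")
df3 = spark.read.format("avro").load("/user/cert/q4/avro_data")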

In the exam the VM might be a bit slow compared to the lab, so there may be some delay when you perform an action.

Hi,
Thanks a lot for the reply. I am using the ITversity lab for practice.
While writing an Avro file on the labs, I am getting the below error.
Step 1) Connect with the below command:
pyspark2 --master yarn --conf spark.ui.port=34566 --packages com.databricks:spark-avro_2.11:4.0.0
Step 2) Try to save data in Avro format, but I get the below error:

pyspark.sql.utils.AnalysisException: u'Failed to find data source: avro

inn = "/public/retail_db/orders/part-00000"
out = "/user/samadhan/Solutions/problem1"
df = spark.read.format("csv").option("sep", ",").option("header", "false").load(inn)
df.write.mode("overwrite").option("compression", "snappy").format("avro").save(out)
Could you please share a code snippet that is working for you, or share details on how to resolve this error?

Also, I am unable to execute the parquet-tools command in the labs to verify file contents. Please share some details.

Thanks,
Sam

Congratulations,
I have a question: when you use concat_ws, it does not let you save the file with a header named like that,
and if I change the header, Spark does not accept writing the dataframe with a header that contains "\t".
Do you have a solution to this issue?
Thanks

In the lab we are using Spark 2.3 and not Spark 2.4.
In the lab, with Spark 2.3, start pyspark by passing the packages argument:
pyspark2 --master yarn --conf spark.ui.port=34566 --packages com.databricks:spark-avro_2.11:4.0.0

To read an avro file:
df = spark.read.format("com.databricks.spark.avro").load("pathtofile")

To write an avro file:
df.write.format("com.databricks.spark.avro").save("pathtofile")

I don’t think parquet-tools is installed in labs.

According to the pyspark documentation there is no option to save a header if you are using the df.write.text() function.

It's somewhat vague whether the csv file format is accepted as a text file format. Some people who have passed the exam used plain csv when asked to save in text file format, and they've mentioned that they passed, so the exam evaluators seem to be okay with that. But that's just my understanding.

This was my approach going into the exam:
If the exam asks to save in text file format with a delimiter mentioned, then I'll use the df.write.csv() function and save as a csv file.
If the exam just asks to save in text file format without any mention of a delimiter, then I'll concat the columns with the concat_ws function and save using the df.write.text() function.
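
In code, the two cases would look roughly like this (the path, delimiter and dataframe name are hypothetical):
from pyspark.sql.functions import concat_ws
# case 1: a delimiter is specified in the question - save as csv with that separator
df.write.mode("overwrite").option("sep", "\t").csv("/user/cert/q5/solution")
# case 2: no delimiter specified - concatenate all columns and save as plain text
df.select(concat_ws(",", *df.columns)).write.mode("overwrite").text("/user/cert/q5/solution")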


Hi,
Can you please share a code snippet of this using Spark 2.3?
I tried but it is not working. It would be a great help for us.

Thanks,
Sam

@ddileepkumar,
How did you write the code? Using the Unix terminal or the Sublime Text editor?
Could you please advise.

I used the terminal.

Here is a screenshot for reading and writing avro files in the lab.

Could it be the apostrophe? I’ve only used quotes before.

Thanks @ddileepkumar. Please help with the below queries.

1) In the exam, do they ask to create database tables with a create statement and load the data?
2) Are the tables in the questions similar to what we use in ITversity, e.g. orders, customers, products etc., or different? If they are different, we may have to spend more time understanding the structure and join conditions.
3) Do they provide the schema (fields and data types) for all the source data? E.g. I got the below practice question from somewhere; I don't know the joining column, so I need to analyze the data to derive the joining column and the output they expect.

Find the top 10 products which have made the highest revenue. The products path in HDFS is … and the order_items path in HDFS is …

In the exam there was a question where the database did not exist, and I had to create the database before using the saveAsTable command. You don't need to create the table with a create statement; you just need to create the database and then use saveAsTable.
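
A minimal sketch of that (the database and table names here are hypothetical):
# create the database first, then save the dataframe as a table inside it
spark.sql("CREATE DATABASE IF NOT EXISTS certdb")
df.write.mode("overwrite").saveAsTable("certdb.results")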

Yes, the tables are similar.

Yes, the question provides the schema of the input data.

Hi Dileep,
Congratulations on getting certified. I wanted to know a few more things about your prep and the syllabus.
1) Did you solve model questions more, or read through the concepts more?
2) Could you please link me to the course that helped you with the latest concepts relevant to the exam syllabus.
3) Are Spark 1.6 or RDDs necessary for the exam?
4) To what extent do we use Scala/Python to solve the questions? So far I have only come across Spark SQL and the DF APIs.