Passed CCA 175 with 8/9 score.
The VM is a bit slow, so take a second to make sure your logic and syntax are correct before performing any action on a DataFrame. Practice reading and saving to different file formats with different compression codecs. There was a question where I read from two files, joined them, and saved the result back to HDFS. For any complex transformations I used a temp view and Spark SQL.
There was a problem while using parquet-tools and avro-tools to verify the final files. I tried passing the HDFS path both with and without the namenode URI, and neither worked. Finally I copied one of the files I needed to verify to the local filesystem and ran parquet-tools and avro-tools against the local path, and that worked. I'm sure there was some mistake in the way I was passing the HDFS file path, but either way I found a way around the issue.
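The workaround looks roughly like this (the HDFS and local paths are illustrative; the `part-*` file names will differ on your run):

```shell
# Copy one output file out of HDFS, then inspect it with the local-path tools.
hdfs dfs -get /user/exam/output_parquet/part-00000.parquet /tmp/check.parquet
parquet-tools meta /tmp/check.parquet

hdfs dfs -get /user/exam/output_avro/part-00000.avro /tmp/check.avro
avro-tools getmeta /tmp/check.avro
```

`parquet-tools meta` shows the schema and compression codec, and `avro-tools getmeta` shows the Avro schema and codec, which is usually all you need to confirm the output matches the question's requirements.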
For saving to the text file format I used the concat_ws function to concatenate all the columns into a single column. This works even if some of the columns are of numeric data types:
from pyspark.sql.functions import concat_ws
result = df.select(concat_ws("\t", df.col1, df.col2))
I practiced a lot. Before I started answering any question, I made sure I understood all of its key requirements and made a mental note of the steps I would need. Just be organized in your approach.
Thanks to Mr. Durga and the Itversity team for the lab environment and the video lessons.