Cleared CCA 175 on 14th Jan 2018 - Passed all the questions


#21

Congratulations to harshilchhajed12.

I am wondering if you can please share what types of questions were in your exam. For example, how many for Sqoop, Spark, Kafka, Flume, etc.?

Especially for Flume: someone from Cloudera said Flume is effectively obsolete now, and that matches my experience at work, where we don’t use Flume at all.

Thanks.


#22

@harshilchhajed12

When you invoked spark-shell, were you able to handle avro files directly? Or did you have to import a databricks package for dealing with avro?


#23

@rajeshkancharla:

The avro packages are generally integrated in the Cloudera VM. If you want to import them explicitly, you can do so; there is no harm:
import com.databricks.spark.avro._

Or, if they give the avro package as part of the question, then you can start spark-shell using that package like below:
spark-shell --master yarn --conf spark.ui.port=12345 \
--num-executors 10 \
--executor-cores 2 \
--executor-memory 3G \
--packages com.databricks:spark-avro_2.10:2.0.1

Either way works the same.
Hope this helps.
Thanks
Venkat
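
For those preferring pyspark, a roughly equivalent launch (reusing the package version from the command above; adjust the port and resources for your own cluster) would be:

pyspark --master yarn --conf spark.ui.port=12345 --packages com.databricks:spark-avro_2.10:2.0.1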


#24

harshilchhajed12 congrats!!!

Can I get your personal email so we can exchange some knowledge regarding one of the questions that I had during the exam?
Here is my email so you can ping me: hosni.akrmi@gmail.com. Thanks!


#25

Does the Udemy course need to be followed, or can I go ahead with the YouTube videos "Hadoop and Spark Developer as per revised syllabus"? I have access to both. Please suggest.


#26

Using spark-shell:

spark-shell --master yarn --conf spark.ui.port=15674 --packages com.databricks:spark-avro_2.10:2.0.1

If all the packages are loaded without any error, you can use the commands below:

import com.databricks.spark.avro._
val ordersDF = sqlContext.read.avro("/user///orders/part-00000.avro")
or
val ordersDF = sqlContext.read.format("com.databricks.spark.avro").load("/user///orders/part-00000.avro")

After executing either of those commands you will have an ordersDF of type DataFrame.
Use the HDFS location of the file.

You can refer to this link: https://github.com/databricks/spark-avro
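
For pyspark users, a rough equivalent of the read above would be the sketch below. It assumes pyspark was started with the same com.databricks spark-avro package, and the path is only a placeholder for your own HDFS location:

# sqlContext is pre-created in the pyspark shell; the path below is a placeholder
ordersDF = sqlContext.read.format("com.databricks.spark.avro") \
    .load("/user/<your-username>/orders/part-00000.avro")
ordersDF.show(5)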


#27

During the certification, should we use spark-shell directly, or should we set driver-memory, executor-memory, and master as yarn, and then num-executors and executor-cores as well?


#28

@harshilchhajed12, congratulations. I am going to take the exam soon. Can you explain what kind of ambiguity you faced in the exam, what kind of information was missing during your exam, and how we can find the correct information?

Also, when starting spark-shell, is the command below enough?

 spark-shell --master yarn  (or is --conf spark.ui.port=12345 or anything else needed?)

And do we need to set "spark.sql.shuffle.partitions" to make the cluster run better?

Appreciated!!

xuyoumi@gmail.com

Thanks


#30

Congratulations @harshilchhajed12, glad to see an aspirant who has got all questions right.

I understand that you have used spark-shell.
For aspirants like me who use pyspark, do you know how avro format compression is achieved (outside of sqoop import/export)?


#31

How can you do the same thing in Python?


#32

Well, I found the answer to my own question.
In pyspark,
I tried writing an avro file without specifying any compression, and I found this in my logs:

18/03/20 10:26:16 INFO AvroRelation: using snappy for Avro output

Using 'uncompressed', I found this:

18/03/20 10:28:33 INFO AvroRelation: writing Avro out uncompressed

So, I infer that snappy is the default compression used when writing avro files. You can check your logs and you will see what's happening for you too.

NOTE: the uncompressed avro files were considerably larger than the default avro files (snappy compressed).


#33

Could you write the code you used to get that log?


#34

Hi,

It's the usual way of saving a DataFrame as an avro file.

Without explicitly specifying compression:

result_DF.coalesce(1).write.format('com.databricks.spark.avro'). \
save('/user/snehaananthan/problem2/solution')

The above line led to the log:

18/03/20 10:26:16 INFO AvroRelation: using snappy for Avro output

And,

sqlContext.setConf('spark.sql.avro.compression.codec', 'uncompressed')
result_DF.coalesce(1).write.format('com.databricks.spark.avro'). \
save('/user/snehaananthan/problem2/solution-uncompressed')

led to:

18/03/20 10:28:33 INFO AvroRelation: writing Avro out uncompressed

If you are using itversity labs, you should be able to find this line in the console output, just after you run the line of code.


#35

@sneha_ananthan thank you very much.

I have one more question.

Is your second script:

sqlContext.setConf('spark.sql.avro.compression.codec', 'uncompressed')
result_DF.coalesce(1).write.format('com.databricks.spark.avro'). \
save('/user/snehaananthan/problem2/solution-uncompressed')

equivalent to:
result_DF.write.format('com.databricks.spark.avro').save("/loudacre/orderItem/order_Item_AVRO_SNAPPY", compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")

If not, could you explain why?


#36

Hi,

I'm afraid the DataFrameWriter's save method does not allow a 'compressionCodecClass' argument. Hence,

sqlContext.setConf('spark.sql.avro.compression.codec', 'uncompressed')
result_DF.coalesce(1).write.format('com.databricks.spark.avro'). \
save('/user/snehaananthan/problem2/solution-uncompressed')

controlling 'spark.sql.avro.compression.codec' through sqlContext.setConf is the way to choose the avro compression for us.
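
As a minimal pyspark sketch of the same idea (the output path is a placeholder, and it assumes the databricks spark-avro package is already loaded), you can set the codec before writing:

# choose 'snappy', 'deflate', or 'uncompressed' for the avro output
sqlContext.setConf('spark.sql.avro.compression.codec', 'snappy')
result_DF.coalesce(1).write.format('com.databricks.spark.avro'). \
save('/user/<your-username>/problem2/solution-snappy')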