Cleared CCA 175 on 24 July

Hi All,
Today I cleared the CCA Spark and Hadoop Developer (CCA175) certification with a score of 8/9.
A very special and big thanks to Durga Gadiraju, @itversity and the people on the discussion forum for solving all my queries. Itversity is one of the best places to learn big data technologies. As others have said earlier, there were 2 Sqoop and 7 Spark questions, and all of them were easy.
Some people always try to use DataFrames and Spark SQL for every problem, but in my opinion that is not always the best choice, because creating a DataFrame from a text file requires a good amount of coding. Apart from that, we can also create a DataFrame from text files using the Databricks spark-csv API with its various options (I haven't tried it during the exam).
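To give a rough idea of the "manual" way (the path and the Order fields here are just placeholders, not from the exam):

// parse the text file yourself, then convert to a DataFrame via a case class
case class Order(orderId: Int, orderStatus: String)
val ordersDF = sc.textFile("/user/whoever/orders").map(_.split(",")).map(r => Order(r(0).toInt, r(3))).toDF()
ordersDF.registerTempTable("orders")
sqlContext.sql("SELECT orderStatus, count(1) FROM orders GROUP BY orderStatus").show()

The spark-csv way needs far less code; there is an example of it further down in this thread.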

Sign up for the state-of-the-art 13 node cluster at https://labs.itversity.com if you do not have a proper environment for practice. It has all the resources one needs to prepare for the certification.

Feel free to ask questions.

I hope this helps you, but you still have to verify everything before using it; I am not responsible if something goes wrong.

============== Hadoop Related ============
import org.apache.hadoop.io._ => {IntWritable, NullWritable, FloatWritable, Text, DoubleWritable}
import org.apache.hadoop.io.compress._ => {SnappyCodec, GzipCodec}

import org.apache.hadoop.mapred._ => {TextInputFormat, TextOutputFormat, SequenceFileInputFormat, SequenceFileOutputFormat, KeyValueTextInputFormat} OLD API
import org.apache.hadoop.mapreduce.lib.input._ NEW API
import org.apache.hadoop.mapreduce.lib.output._ NEW API
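Just to show where these typically get used (a sketch with placeholder paths, not exam content):

// save a plain text RDD compressed with Gzip
val rdd = sc.textFile("/user/whoever/input")
rdd.saveAsTextFile("/user/whoever/result/text_gzip", classOf[org.apache.hadoop.io.compress.GzipCodec])

// save (key, value) pairs as a Snappy-compressed sequence file
rdd.map(line => (line.split(",")(0), line)).
  saveAsSequenceFile("/user/whoever/result/seq_snappy", Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))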

=========== Save As New API Hadoop File and Retrieval ============
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
conf.set("textinputformat.record.delimiter", "\u0001") // custom record delimiter; pass this conf as the last argument to newAPIHadoopFile for it to take effect

// newRDD is assumed to be an RDD of (IntWritable, Text) pairs
newRDD.saveAsNewAPIHadoopFile("/user/vishvaspatel34/result/deptNewAPIHadoopFile", classOf[IntWritable], classOf[Text], classOf[TextOutputFormat[IntWritable,Text]])
val newAPIRDD = sc.newAPIHadoopFile("/user/vishvaspatel34/result/deptNewAPIHadoopFile", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
newAPIRDD.map(x => x._2.toString).collect().foreach(println)

In the above I stored the key as IntWritable, but I have to read it back as LongWritable, because TextInputFormat always produces the byte offset (a LongWritable) as the key.

============= SaveAsObjectFile =============
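A minimal sketch for this one (placeholder path):

// saveAsObjectFile writes the RDD as a SequenceFile of serialized objects
val nums = sc.parallelize(1 to 10)
nums.saveAsObjectFile("/user/vishvaspatel34/result/numsObjectFile")

// read it back, giving the element type explicitly
val numsBack = sc.objectFile[Int]("/user/vishvaspatel34/result/numsObjectFile")
numsBack.collect().foreach(println)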

============ Scala Related ===============
import scala.math._

========== DataBricks ==================== com.databricks:spark-csv_2.10:1.5.0, com.databricks:spark-avro_2.10:2.1.0
import com.databricks.spark.avro._
import com.databricks.spark.csv._

============== Spark Related =============
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel => {StorageLevel.DISK_ONLY, StorageLevel.MEMORY_ONLY}
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode._ => {Append, ErrorIfExists, Ignore, Overwrite}
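A quick sketch of how StorageLevel and SaveMode come into play (placeholder paths):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode

// cache an RDD on disk only
val lines = sc.textFile("/user/vishvaspatel34/input")
lines.persist(StorageLevel.DISK_ONLY)

// overwrite the target directory when writing a DataFrame as parquet
val df = lines.map(l => (l.length, l)).toDF("len", "line")
df.write.mode(SaveMode.Overwrite).parquet("/user/vishvaspatel34/result/parquet")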

------------ Sql Context ------------------
sqlContext.setConf("spark.sql.shuffle.partitions", "2")
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy") => {uncompressed, snappy, gzip, lzo}
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy") => {uncompressed, snappy, deflate}
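For example, once the parquet codec is set, any parquet write from that sqlContext picks it up (DataFrame and path here are just made up):

sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
val demoDF = sc.parallelize(Seq((1, "CLOSED"), (2, "COMPLETE"))).toDF("order_id", "order_status")
demoDF.write.parquet("/user/vishvaspatel34/result/orders_parquet_gzip")
// the part files should end in .gz.parquet (with "snappy" they end in .snappy.parquet)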

============= Java Related ================
import java.util._ => {Properties}
import java.util.Properties
val prop = new Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "password")
prop.setProperty("url", "jdbc:mysql://localhost:3306/retail_db")
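Note that sqlContext.read.jdbc() takes the url as its own argument, so the 'url' property above is optional there. A sketch (the table name is just an example):

import java.util.Properties
val prop = new Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "password")

// the MySQL JDBC driver has to be on the classpath (e.g. spark-shell --driver-class-path /path/to/mysql-connector-java.jar)
val ordersDF = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/retail_db", "orders", prop)
ordersDF.show()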


Congratulations vishvaspatel. Were you able to solve all spark questions using databricks API?

Hi @BVSKanth
As I said, I haven't tried it in the exam. But in a normal scenario you can bring in the package using this command:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
It will give you a dataframe with column names C0, C1, … and you can use sqlContext.sql() on top of that.
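Something along these lines (path and column numbers are placeholders):

// started with: spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "false").
  load("/user/whoever/orders")   // columns come out as C0, C1, C2, ...

df.registerTempTable("orders")
sqlContext.sql("SELECT C3, count(1) FROM orders GROUP BY C3").show()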


Congrats. Do you write code in spark-shell, or do you have to create jars and submit them to the cluster?

@Pratha I used 3 terminals, one each for spark-shell, mysql and hadoop commands. Apart from that I had the Sublime text editor open, but didn't use it much. I would recommend using another screen if you have a laptop with a small screen.

Thanks Vishvas. So you don't have to create a jar and submit the jobs to the cluster?
Running commands from spark-shell is fine?

Yes, running commands in spark-shell is fine.


Hi Vishvas,
I am not very good at Scala. Are DataFrames and Spark SQL enough to solve the problems? Do we need any codecs? Are there any questions on Flume? And what about joins, are they easy, is a plain join enough, or do we need more options? My exam is in 2 days and I am in a bit of a dilemma, can you clear this up for me?

You should know Spark DataFrames and Spark SQL, as you will read and write data in different file formats, and Hive data with different compression codecs. The sparkContext alone doesn't support parquet, avro, json, orc, reading tables, and more. I am adding some content in this post, please check it. There were no questions on Flume or Spark Streaming. Sorry, but I will not disclose any questions.
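For example (paths and table name are just placeholders), all of these go through sqlContext rather than sc:

val parquetDF = sqlContext.read.parquet("/user/vishvaspatel34/input/parquet")
val jsonDF    = sqlContext.read.json("/user/vishvaspatel34/input/json")
val avroDF    = sqlContext.read.format("com.databricks.spark.avro").load("/user/vishvaspatel34/input/avro")
val orcDF     = sqlContext.read.orc("/user/vishvaspatel34/input/orc")   // needs Hive support (the sqlContext in spark-shell is normally a HiveContext)
val tableDF   = sqlContext.read.table("retail_db.orders")               // hypothetical Hive table

// writing is symmetric
parquetDF.write.json("/user/vishvaspatel34/result/json")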

Yeah, thank you Vishvas, this should be fine. If possible, just clarify a bit about the joins or whether I can take them easy. And thank you for your fast response.

The join questions were easy. If you think a question is complicated, leave it for later and don't waste your time on it. To pass the exam you need 7/9, so even if you are not able to do the join question it is still fine.

Hi Vishvas, why have you mentioned so many imports? I thought most of them were included in spark-shell by default.

Hi, all of these libraries are included with spark-shell (except com.databricks.spark._), but you still have to import them before using the classes from those jars.
Sometimes it is handy to have the imports ready to paste directly into spark-shell instead of specifying them on the command line.


Hi, how did you verify your answers in the CCA exam? Because if I make a mistake that does not throw an error, I will get invalid data which I would then keep transforming further.
For example, if I join 2 DFs without mentioning the column names, I would get invalid data.

Always check the input to see whether it is a text file, and also check the output, even when it is in a different file format, using:
hadoop fs -ls <file_path> | head
hadoop fs -ls <file_path> | tail
If the file extension shows the format and compression that were asked for, then it is fine. If not, then see the content I mentioned above.
For joining two dataframes, if you do not specify the join condition then it will give you an error, so there is no chance of getting invalid data.

Suppose you are creating a dataframe directly without specifying a case class, then you can do this:
val df = sc.textFile("path").map(_.split(",")).map(x => (x(0).toInt, x(1))).toDF()
The above command makes a DF that shows up as org.apache.spark.sql.DataFrame = [_1: int, _2: string]
Here what you can do is
val newDF = df.select(col("_1").alias("Column1"), col("_2").alias("Column2"))
and then join the two dataframes.
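Continuing that snippet, a second dataframe can be renamed the same way and then joined on the matching column (a sketch, paths are placeholders):

val df2 = sc.textFile("path2").map(_.split(",")).map(x => (x(0).toInt, x(1))).toDF()
val newDF2 = df2.select(col("_1").alias("Column1"), col("_2").alias("OtherColumn"))

val joined = newDF.join(newDF2, newDF("Column1") === newDF2("Column1"))
joined.show()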


@vishvaspatel34 - is it possible for you to share your email id?

Vishvaspatel, can you tell me your email id, or else ping me at sagar.bunny2@gmail.com? I have some doubts, can you clarify them?

Hi, what version of the Cloudera VM do we get when we appear for the new syllabus exam?

Sorry, I found the answer to this on the Cloudera website:

Each user is given their own CDH5 (currently 5.10.0) cluster pre-loaded with Spark 1.6, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others (See a full list). In addition the cluster also comes with Python (2.6, 2.7, and 3.4), Perl 5.10, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, and NetBeans.

Hi Vishvas, can you please share your email id? My id is wijay10789@gmail.com