Cleared CCA 175 on 24 July

#9

You should know Spark DataFrames and Spark SQL, as you will read and write data in different file formats and Hive data with different compression codecs, because sparkContext alone doesn’t support Parquet, Avro, JSON, ORC, reading tables, and many more. I am uploading some content in a post; please check it. There were no questions on Flume or Spark Streaming. Sorry, but I will not disclose any questions.
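
For example, a minimal sketch of the kind of DataFrame / Spark SQL reads and writes meant here (paths are just illustrative; it assumes the CDH spark-shell, where sqlContext is a HiveContext, and the spark-avro jar being on the classpath):

val parquetDF = sqlContext.read.parquet("/user/cert/input_parquet")   // read Parquet
val jsonDF = sqlContext.read.json("/user/cert/input_json")            // read JSON
val hiveDF = sqlContext.sql("select * from default.orders")           // read a Hive table

// write with a compression codec, e.g. Parquet + snappy, and Avro via the databricks data source
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
parquetDF.write.parquet("/user/cert/output_parquet_snappy")
jsonDF.write.format("com.databricks.spark.avro").save("/user/cert/output_avro")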

0 Likes

#10

Yeah, thank you Vishvas, this should be fine. If it’s possible, could you just clarify a bit about joins? Otherwise take it easy, and thank you for your fast response.

0 Likes

#11

The join question was easy. If you think it is complicated, leave it for later and don’t waste your time. To pass the exam you should have 7/9, so even if you are not able to do the join question it is still fine.

0 Likes

#12

Hi Vishvas, why have you mentioned a lot of header files? I thought most of them were included in the spark-shell by default.

0 Likes

#13

Hi, all library files are included in spark-shell (except com.databricks.spark._), but you have to import them before using that library jar.
Sometimes it is handy to import them directly in spark-shell instead of passing them on the command line.
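
For example, a typical set of imports once inside spark-shell (just an illustrative selection; the databricks one still needs its jar on the classpath):

import org.apache.spark.sql.functions._      // col, avg, countDistinct, ...
import org.apache.spark.sql.types._          // StructType, StructField, IntegerType, ...
import com.databricks.spark.avro._           // only after the spark-avro jar is available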

2 Likes

#14

Hi, how did you verify your answers in CCA? Because if I make a mistake which does not throw an error, I will get invalid data which I would then be transforming further.
For example, if I join 2 DFs without the column name mentioned, I would get invalid data.

0 Likes

#15

Always check the input, e.g. whether it is a text file, and also check the output, even when those are in a different file format, by:
hadoop fs -ls <file_path> | head
hadoop fs -ls <file_path> | tail
If you find the file extension and compression are actually the same as required, then it is fine. But if they are not, then check the content as mentioned above.
If you join two DataFrames without specifying the join condition, it will give you a compile-time error, so there is no chance of getting invalid data.

Suppose you are making a DataFrame directly without specifying a case class; then you can do this:
val df = sc.textFile("path").map(_.split(",")).map(x => (x(0).toInt, x(1))).toDF()
The above command makes a DF and it will show as org.apache.spark.sql.DataFrame = [_1: int, _2: string]
Here what you can do is
import org.apache.spark.sql.functions.col
val newDF = df.select(col("_1").alias("Column1"), col("_2").alias("Column2"))
and then join the two DataFrames.
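
For example, a small sketch of that last step, joining on the renamed column (the second DataFrame and its column names here are just illustrative):

val otherDF = sc.textFile("path2").map(_.split(",")).map(x => (x(0).toInt, x(1))).toDF("Column1", "Column3")
val joinedDF = newDF.join(otherDF, "Column1")    // inner join on the shared key column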

1 Like

#16

@vishvaspatel34 - is it possible for you to share your email id?

0 Likes

#17

Vishvaspatel, can you tell me your email id, or else ping me at sagar.bunny2@gmail.com? I have some doubts; can you clarify them?

0 Likes

#18

Hi, what version of the Cloudera VM do we get when we appear for the new syllabus exam?

0 Likes

#19

Sorry - I found the answer for this on the Cloudera website:

Each user is given their own CDH5 (currently 5.10.0) cluster pre-loaded with Spark 1.6, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others (See a full list). In addition the cluster also comes with Python (2.6, 2.7, and 3.4), Perl 5.10, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, and NetBeans.

0 Likes

#20

Hey Vishvas, can you please share your email id? My id is wijay10789@gmail.com

0 Likes

#21

Sorry guys for the late reply. My email id: vpatel19@binghamton.edu
You can shoot me an email at this id.

Can anyone help me with this problem? I need to solve it urgently.
I want to convert the given table-structured data into column-formatted data.
Input File:

Key,V1,V2
A,1,2
B,2,3

OuputFile:

Key,Type,Value
A,V1,1
A,V2,2
B,V1,2
B,V2,3

How can I convert row-formatted data into column-formatted data using RDD and DataFrame?
Also vice versa, from the given column-formatted data (Output File) back into row-formatted data (Input File).
Guys, I need the answer urgently.

0 Likes

#22

Answers for the above problem.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

/**
 * Created by Vishvas on 8/4/17.
 */
object result {
  def main(args: Array[String]): Unit = {
    val map = scala.collection.immutable.Map(("master", "yarn"), ("deploy-mode", "cluster"), ("driver-cores", "4"), ("num-executors", "16"), ("executor-memory", "8G"), ("driver-memory", "8G"))
    val conf = new SparkConf().setAppName("rowToColumnFormat").setAll(map)
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    /**
     * Read the original text file and
     * separate the header from the data.
     *
     * Input:
     * Key,V1,V2
     * A,1,2
     * B,2,3
     *
     * Output:
     * Key,Type,Value
     * A,V1,1
     * A,V2,2
     * B,V1,2
     * B,V2,3
     */
    val rdd = sc.textFile("/user/vishvaspatel34/original.txt")   // read the text file
    val header = rdd.first()                                     // header line
    val data = rdd.mapPartitionsWithIndex((index, itr) => if (index == 0) itr.drop(1) else itr) // only data, no header
    val l = header.split(",").toList                             // header converted into a list
    val bl = sc.broadcast(l.slice(1, l.size))                    // broadcast the columns after the key
    val preResultRDD = data.map(_.split(",")).flatMap(x => (bl.value zip x.slice(1, x.size)).map(y => (x(0), y))).map(x => x._1 + "," + x._2.productIterator.mkString(","))
    val schemaRDD = sc.parallelize(List("Key,Type,Value"))
    val resultRDD = schemaRDD.union(preResultRDD)
    resultRDD.collect().foreach(println)

    /**
     * Read the column-oriented text file using the
     * databricks spark-csv data source, convert it into a DF,
     * and pivot it back to the row format.
     *
     * Input:
     * Key,Type,Value
     * A,V1,1
     * A,V2,2
     * B,V1,2
     * B,V2,3
     *
     * Output:
     * Key,V1,V2
     * A,1,2
     * B,2,3
     *
     * Pass your input file
     */
    import sqlContext.implicits._
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    val df = sqlContext.read.format("csv").option("inferSchema", "true").option("header", "true").load("/user/vishvaspatel34/columnFormatted.txt")
    df.groupBy(df.schema.toList(0).name).pivot(df.schema.toList(1).name).sum(df.schema.toList(2).name).show()
  }
}

1 Like

#23

I have a lot of doubts about the CCA175 exam environment, like where we have to save our code and how to run the code we have written. Is it possible to write code only in Scala, even for the Python questions?

0 Likes

#24

You don’t have to save your code anywhere. They just care about the output; as long as the output is correct you are good to go. The questions are not specific about the method: they neither ask you to do it a particular way nor give you a template. You can use any language in which you are comfortable.

0 Likes

#25

Thanks for the material Vishvas,

Quick question on the spark-avro version: com.databricks:spark-avro_2.10:1.5.0 -> is this version correct?

I was using this one in the lab: com.databricks:spark-avro_2.10:2.0.1

0 Likes

#26

Thank you. Yes, you are right, I made a mistake there. In the exam you don’t have to mention --packages, as the Avro library is already included in Cloudera CDH; you just have to import it.
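
For example, a minimal sketch of that in the exam’s spark-shell, assuming the bundled spark-avro version provides the .avro() read/write shortcuts (paths are illustrative):

import com.databricks.spark.avro._
val avroDF = sqlContext.read.avro("/user/cert/orders_avro")
avroDF.write.avro("/user/cert/orders_avro_copy")
// equivalent long form that needs only the jar on the classpath, not the import:
// sqlContext.read.format("com.databricks.spark.avro").load("/user/cert/orders_avro")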

0 Likes

#27

Hi all,

I’m practising using the Cloudera VM 5.13.0 and I am planning to take the test soon. I know that the Avro package is already installed, but what about the CSV package?

Do we need to launch pyspark with --packages com.databricks:spark-csv_2.10:1.5.0, or will it already be installed?

Please help.

0 Likes

#28

Same question as sabeerph asked.

Can someone answer this?

I tried to import avro:

import com.databricks.spark.avro_2.10;
error: no module

Is this the way to import? Or do we have to use --packages <> while launching pyspark2 first and then use this import?
I appreciate it if someone can post the exact command for both Avro and CSV.
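
For reference, the two-step workflow being asked about would look roughly like this in the Scala spark-shell (a sketch only; package versions taken from earlier in the thread, path illustrative):

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.5.0
// the same --packages flag works when launching pyspark

// then, inside the shell:
import com.databricks.spark.avro._
val csvDF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/user/cert/input_csv")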

Thanks

0 Likes