Input data supplied in /public/randomtextwriter (in HDFS) looks suspect


Regarding the final exercise (on Spark SQL) in the Udemy course for HDPCD Spark Certification: the data in /public/randomtextwriter (in HDFS) seems suspect. Am I missing something?

Nov '17
Details - Duration 20 minutes
Data is available in HDFS /public/randomtextwriter
Get word count for the input data using space as delimiter (for each word, we need to get how many times it is repeated in the entire input data set)
Number of executors should be 10
Executor memory should be 3 GB
Executor cores should be 20 in total (2 per executor)
Number of output files should be 8
Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro_2.10, version -> 2.0.1
Target Directory: /user/<YOUR_USER_ID>/solutions/solution05/wordcount
Target File Format: Avro
Target fields: word, count
Compression: N/A or default
Validation

My solution was as follows:

/* spark2-shell --master yarn --num-executors 10 --executor-cores 2 --executor-memory 3GB --packages com.databricks:spark-avro_2.11:4.0.0 */

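// Read the input as plain text, one record per line
val inputFileContentRDD = sc.textFile("/public/randomtextwriter")
// Split each line on a space and pair every word with an initial count of 1
val flatMapWordsRDD = inputFileContentRDD.flatMap(line => line.split(" ")).map(word => (word, 1))

//flatMapWordsRDD.take(10).foreach(println)

// Sum the counts per word, then name the columns as the exercise requires
val wordCount = flatMapWordsRDD.reduceByKey((total, value) => total + value)
val wordCountDF = wordCount.toDF("word", "count")

// Reduce to 8 partitions and write the result in Avro format
wordCountDF.coalesce(8).write.format("com.databricks.spark.avro").save("/user/vramakrishnan3/solutions/solutions05/wordcount")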
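
As a sanity check that the split-and-count logic itself behaves as expected, independent of the input data, the same idea can be run in plain Scala on a tiny in-memory sample. This is only a sketch: the object name `WordCountSketch` and the sample lines are made up for illustration and are not part of the exercise.

```scala
object WordCountSketch {
  // Split every line on a single space and count how often each token occurs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, not taken from /public/randomtextwriter
    val sample = Seq("alpha beta alpha", "beta gamma")
    wordCount(sample).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

If this gives sensible counts on readable text, the odd rows in the real output would more likely come from how the input is being read than from the counting step.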

To validate:

scala> spark.read.format("com.databricks.spark.avro").load("/user/vramakrishnan3/solutions/solutions05/wordcount").show(50)
+--------------------+-----+
| word|count|
+--------------------+-----+
| ?>unexplicit| 2|
|*���Uu+��<…| 1|
| �
diminutively| 42|
|�������2�"���&�
…| 1|
|XWConfervales| 1|
|%XWConfervales| 1|
|����h$_� ��/��r…| 1|
|*���Uu+��<▒▒…| 1|
|X����g
�L…| 1|
|����0�pp��Ď�z�…| 2|
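
One thing that might be worth checking, as an assumption rather than a confirmed diagnosis: the binary-looking words could mean the files are not plain text. Hadoop SequenceFiles start with the ASCII bytes `SEQ`, so a rough check is to look at the first three bytes of one data file. The sketch below assumes it is pasted into the same spark2-shell session (so `sc` exists) and that the directory contains at least one non-empty file.

```scala
// Assumes a running spark2-shell; `sc` is the SparkContext provided by the shell.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Keep only regular files with at least 3 bytes (skips markers such as _SUCCESS).
val candidates = fs.listStatus(new Path("/public/randomtextwriter")).filter(s => s.isFile && s.getLen >= 3)
val firstFile = candidates.head.getPath

// Read the first three bytes of that file.
val header = new Array[Byte](3)
val in = fs.open(firstFile)
try in.readFully(header) finally in.close()

// "SEQ" at the start would suggest a SequenceFile rather than plain text.
val looksLikeSequenceFile = new String(header, "US-ASCII") == "SEQ"
println(s"$firstFile starts with SEQ: $looksLikeSequenceFile")
```

If the check prints true, reading the data with `sc.sequenceFile` instead of `sc.textFile` may be the appropriate route; the key and value types to use there would still need to be confirmed against the file header.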
