My first practice exercise in Python failed. Can anyone help, please? Thank you very much.


Problem Scenario 70 : Write a Spark application in Python
that reads a file “Content.txt” (on HDFS), does a word count, and saves the results in a directory called “problem85” (on HDFS).
The file has the following content:
Hello this is
This is
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce

Solution :

source file:
[paslechoix@gw03 ~]$ hdfs dfs -ls p84
Found 1 items
-rw-r--r-- 3 paslechoix hdfs 143 2018-02-14 07:04 p84/Content.txt

Step 1 : Create an application with following code and store it in

# Import SparkContext and SparkConf

from pyspark import SparkContext, SparkConf

# Create configuration object and set App name

conf = SparkConf().setAppName("CCA 175 Problem 84")
sc = SparkContext(conf=conf)
#load data from hdfs
contentRDD = sc.textFile("p84/Content.txt")
#keep only the non-empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)
#Split line based on space
words = nonempty_lines.flatMap(lambda x: x.split(" "))
#Do the word count, then swap to (count, word) and sort by count, descending
wordcounts = words.map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x+y) \
    .map(lambda x: (x[1],x[0])).sortByKey(False)
#Save the result to the HDFS directory named in the problem statement
wordcounts.saveAsTextFile("problem85")
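
In case it helps with debugging, here is a rough plain-Python sketch (no Spark needed) of what the pipeline above is supposed to compute, so the job's output can be compared against known counts. The sample lines are copied from the problem statement; the blank line, `local_word_count` and `split_with_empty_separator` are my own additions for illustration, not part of the exercise. It also shows that splitting on an empty string raises an error in Python, which is one likely reason the script fails.

```python
from collections import Counter

# Sample lines copied from the problem statement (the real file is on HDFS).
SAMPLE_LINES = [
    "Hello this is",
    "This is",
    "",  # assumed blank line, only here to exercise the non-empty filter
    "Apache Spark Training",
    "This is Spark Learning Session",
    "Spark is faster than MapReduce",
]


def local_word_count(lines):
    """Plain-Python stand-in for filter -> flatMap -> map -> reduceByKey -> sort."""
    nonempty = [line for line in lines if len(line) > 0]
    words = [word for line in nonempty for word in line.split(" ")]
    counts = Counter(words)
    # Build (count, word) pairs and sort by count, highest first,
    # mirroring the swap followed by sortByKey(False) in the Spark code.
    pairs = [(n, word) for word, n in counts.items()]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


def split_with_empty_separator(line):
    """Return the error message Python gives for line.split('')."""
    try:
        return line.split("")
    except ValueError as err:
        return str(err)
```

Calling `local_word_count(SAMPLE_LINES)` gives the (count, word) pairs to compare with the files written under the output directory. Note that the split is case-sensitive, so "This" and "this" are counted separately.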