Input path error: where is the issue?

apache-spark

#1

Any suggestions? The third line shows an error that the input path does not exist. It read the file initially, so why is it showing an error now?

Cloudera VM
Spark 1.6.0

18/05/28 20:22:30 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context available as sc (master = local[*], app id = local-23w344w).
18/05/28 20:22:41 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.

scala> val textFile = sc.textFile("/home/cloudera/Documents/ard.txt")
textFile: org.apache.spark.rdd.RDD[String] = /home/cloudera/Documents/ard.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1))
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:29

scala> counts.reduceByKey(+).saveAsTextFile("/home/cloudera/Documents/1.txt")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/home/cloudera/Documents/ard.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
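Note that sc.textFile is lazy: nothing actually reads the file until an action such as saveAsTextFile runs, which is why the first two lines appear to succeed and the error only surfaces on the third. The trace also shows the path being resolved as hdfs://quickstart.cloudera:8020/home/cloudera/Documents/ard.txt, i.e. against the VM's default filesystem (HDFS), not the local disk. A minimal sketch, assuming the file really is on the VM's local disk, where a file:// prefix forces a local read:

// Read from the local filesystem instead of the cluster's default filesystem (HDFS)
val textFile = sc.textFile("file:///home/cloudera/Documents/ard.txt")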

@Sunil_Itversity could you help?


#2

Hi @bgdata1141

Reduce takes two elements and produces a third by applying a function to the two parameters.

Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you are trying to achieve is applying a function (sum, in this case) to any two parameters it receives, hence the syntax:

reduceByKey(_ + _)

Change your code to

counts.reduceByKey(_ + _).saveAsTextFile("/home/cloudera/ard")
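For comparison, here is a minimal sketch (the sample data is made up) showing that the placeholder form is just shorthand for an explicit lambda:

// Hypothetical pairs; reduceByKey sums the values per key
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val explicit = pairs.reduceByKey((x, y) => x + y)  // named parameters
val shorthand = pairs.reduceByKey(_ + _)           // placeholder syntax, same result

Also note that saveAsTextFile writes a directory of part files rather than a single file, which is presumably why the output path above is a plain directory name like /home/cloudera/ard instead of 1.txt.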

In my case, you can see the output in the screenshots below.


#3