org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/public/retail_db/order_items/part-00000


#1

Can someone please help me find the root cause of, and fix, the issue below?
I am getting the following error when running via sbt "run-main DailyRevenue". When I run the same main-function code in the spark-shell, it runs successfully and I can see the results. I am not sure why I am getting "input path does not exist".

18/03/05 22:49:32 INFO SparkContext: Created broadcast 1 from textFile at DailyRevenue.scala:10
[error] (run-main-0) org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/public/retail_db/order_items/part-00000
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/public/retail_db/order_items/part-00000
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1438)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$join$2.apply(PairRDDFunctions.scala:651)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$join$2.apply(PairRDDFunctions.scala:651)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:650)
at DailyRevenue$.main(DailyRevenue.scala:32)
at DailyRevenue.main(DailyRevenue.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
18/03/05 22:49:33 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:73)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
18/03/05 22:49:33 ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:176)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
18/03/05 22:49:33 INFO SparkUI: Stopped Spark web UI at http://172.16.1.109:4044
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
[error] (compile:runMain) Nonzero exit code: 1
[error] Total time: 5 s, completed Mar 5, 2018 10:49:33 PM
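
A quick way to confirm which filesystem an unqualified path such as /public/retail_db/order_items/part-00000 resolves to is to print fs.defaultFS from the SparkContext's Hadoop configuration — a minimal diagnostic sketch, assuming the Spark 1.6.x API visible in the logs:

// Under "sbt run-main" with master=local this typically prints file:///,
// which matches the file:/public/... prefix in the error above;
// spark-shell on the cluster gateway instead picks up hdfs://... from
// the Hadoop configuration directory on its classpath.
println(sc.hadoopConfiguration.get("fs.defaultFS"))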


#2

Below is the code that produces the error mentioned above.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object DailyRevenue {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Daily Revenue")
    val sc = new SparkContext(conf)

    val orders = sc.textFile("/public/retail_db/orders/part-00000")
    val orderItems = sc.textFile("/public/retail_db/order_items/part-00000")

    val ordersCompletedCount = sc.accumulator(0, "Orders Completed Count")
    val ordersNonCompletedCount = sc.accumulator(0, "Orders not Completed Count")

    // Keep only COMPLETE/CLOSED orders, counting both outcomes via accumulators.
    val ordersFiltered = orders.filter(order => {
      val orderStatus = order.split(",")(3)
      val isCompletedOrder = orderStatus == "COMPLETE" || orderStatus == "CLOSED"
      if (isCompletedOrder)
        ordersCompletedCount += 1
      else
        ordersNonCompletedCount += 1
      isCompletedOrder
    })

    // (order_id, order_date)
    val ordersMap = ordersFiltered.map(order => {
      val o = order.split(",")
      (o(0).toInt, o(1))
    })

    // (order_id, (product_id, order_item_subtotal))
    val orderItemsMap = orderItems.map(orderItem => {
      val oi = orderItem.split(",")
      (oi(1).toInt, (oi(2).toInt, oi(4).toFloat))
    })

    val ordersJoin = ordersMap.join(orderItemsMap)
    // ((order_date, product_id), order_item_subtotal)
    val ordersJoinMap = ordersJoin.map(rec => ((rec._2._1, rec._2._2._1), rec._2._2._2))
    val dailyRevenuePerProductId = ordersJoinMap.reduceByKey((total, revenue) => total + revenue)

    // Products are read from the local filesystem, then parallelized into an RDD.
    import scala.io.Source
    val productsRaw = Source.fromFile("/data/retail_db/products/part-00000").getLines.toList
    val products = sc.parallelize(productsRaw)
    val productsMap = products.map(product => (product.split(",")(0).toInt, product.split(",")(2)))

    // (product_id, (order_date, revenue)) joined with (product_id, product_name)
    val dailyRevenuePerProductIdMap = dailyRevenuePerProductId.map(rec => (rec._1._2, (rec._1._1, rec._2)))
    val dailyRevenuePerProductJoin = dailyRevenuePerProductIdMap.join(productsMap)
    // Sort by order_date ascending, then revenue descending.
    val dailyRevenuePerProductSorted = dailyRevenuePerProductJoin.
      map(rec => ((rec._2._1._1, -rec._2._1._2), (rec._2._1._1, rec._2._1._2, rec._2._2))).
      sortByKey()

    val dailyRevenuePerProduct = dailyRevenuePerProductSorted.
      map(rec => rec._2._1 + "," + rec._2._2 + "," + rec._2._3)
    dailyRevenuePerProduct.take(10).foreach(println)
  }
}
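
One more local-filesystem dependency worth noting in this code: products are read with scala.io.Source from /data/retail_db/products/part-00000 on the local disk, so they are subject to the same local-vs-HDFS path resolution as the textFile calls. A hedged alternative is to read them as a distributed file — this sketch assumes the products dataset is also available in HDFS under /public/retail_db/products, which this thread does not confirm:

// Hypothetical: read products straight from HDFS, removing the
// dependency on the local /data directory layout.
val productsMap = sc.
  textFile("hdfs://nn01.itversity.com:8020/public/retail_db/products/part-00000").
  map(product => (product.split(",")(0).toInt, product.split(",")(2)))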


#3

After I changed the input path to hdfs://nn01.itversity.com:8020/public/retail_db/order_items/, it now fails while creating the final output directory in HDFS when run via sbt "run-main".
If I run the same commands through the spark-shell, it is able to create the output directory.

18/03/06 02:03:21 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/03/06 02:03:21 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/03/06 02:03:21 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/03/06 02:03:21 ERROR FileOutputCommitter: Mkdirs failed to create file:/user/daily_revenue_txt_scala/_temporary/0
18/03/06 02:03:21 INFO SparkContext: Starting job: saveAsTextFile at DailyRevenue.scala:46
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 6 (map at DailyRevenue.scala:27)
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 5 (map at DailyRevenue.scala:23)
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 10 (map at DailyRevenue.scala:32)
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 13 (map at DailyRevenue.scala:38)
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 14 (map at DailyRevenue.scala:40)
18/03/06 02:03:21 INFO DAGScheduler: Registering RDD 18 (map at DailyRevenue.scala:42)
18/03/06 02:03:21 INFO DAGScheduler: Got job 0 (saveAsTextFile at DailyRevenue.scala:46) with 1 output partitions
18/03/06 02:03:21 INFO DAGScheduler: Final stage: ResultStage 6 (saveAsTextFile at DailyRevenue.scala:46)
18/03/06 02:03:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
18/03/06 02:03:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)
18/03/06 02:03:21 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[13] at map at DailyRevenue.scala:38), which has no missing parents
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.4 KB, free 510.9 MB)
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1543.0 B, free 510.9 MB)
18/03/06 02:03:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:55834 (size: 1543.0 B, free: 511.1 MB)
18/03/06 02:03:21 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[13] at map at DailyRevenue.scala:38)
18/03/06 02:03:21 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
18/03/06 02:03:21 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[6] at map at DailyRevenue.scala:27), which has no missing parents
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.8 KB, free 510.9 MB)
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.3 KB, free 510.9 MB)
18/03/06 02:03:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:55834 (size: 2.3 KB, free: 511.1 MB)
18/03/06 02:03:21 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[6] at map at DailyRevenue.scala:27)
18/03/06 02:03:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/03/06 02:03:21 WARN TaskSetManager: Stage 3 contains a task of very large size (174 KB). The maximum recommended task size is 100 KB.
18/03/06 02:03:21 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 178914 bytes)
18/03/06 02:03:21 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[5] at map at DailyRevenue.scala:23), which has no missing parents
18/03/06 02:03:21 INFO Executor: Running task 0.0 in stage 3.0 (TID 0)
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 4.5 KB, free 510.9 MB)
18/03/06 02:03:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.6 KB, free 510.9 MB)
18/03/06 02:03:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:55834 (size: 2.6 KB, free: 511.1 MB)
18/03/06 02:03:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:21 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[5] at map at DailyRevenue.scala:23)
18/03/06 02:03:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/03/06 02:03:22 INFO Executor: Finished task 0.0 in stage 3.0 (TID 0). 1158 bytes result sent to driver
18/03/06 02:03:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, localhost, partition 0,ANY, 2161 bytes)
18/03/06 02:03:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 1)
18/03/06 02:03:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 0) in 103 ms on localhost (1/1)
18/03/06 02:03:22 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
18/03/06 02:03:22 INFO DAGScheduler: ShuffleMapStage 3 (map at DailyRevenue.scala:38) finished in 0.119 s
18/03/06 02:03:22 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:22 INFO DAGScheduler: running: Set(ShuffleMapStage 0, ShuffleMapStage 1)
18/03/06 02:03:22 INFO DAGScheduler: waiting: Set(ShuffleMapStage 5, ShuffleMapStage 2, ResultStage 6, ShuffleMapStage 4)
18/03/06 02:03:22 INFO DAGScheduler: failed: Set()
18/03/06 02:03:22 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/public/retail_db/order_items/part-00000:0+5408880
18/03/06 02:03:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:55834 in memory (size: 1543.0 B, free: 511.1 MB)
18/03/06 02:03:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 1). 2253 bytes result sent to driver
18/03/06 02:03:22 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,ANY, 2156 bytes)
18/03/06 02:03:22 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/03/06 02:03:22 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 1) in 788 ms on localhost (1/1)
18/03/06 02:03:22 INFO DAGScheduler: ShuffleMapStage 0 (map at DailyRevenue.scala:27) finished in 0.863 s
18/03/06 02:03:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/03/06 02:03:22 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:22 INFO DAGScheduler: running: Set(ShuffleMapStage 1)
18/03/06 02:03:22 INFO DAGScheduler: waiting: Set(ShuffleMapStage 5, ShuffleMapStage 2, ResultStage 6, ShuffleMapStage 4)
18/03/06 02:03:22 INFO DAGScheduler: failed: Set()
18/03/06 02:03:22 INFO HadoopRDD: Input split: hdfs://nn01.itversity.com:8020/public/retail_db/orders/part-00000:0+2999944
18/03/06 02:03:22 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:55834 in memory (size: 2.3 KB, free: 511.1 MB)
18/03/06 02:03:22 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2293 bytes result sent to driver
18/03/06 02:03:22 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 149 ms on localhost (1/1)
18/03/06 02:03:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/03/06 02:03:22 INFO DAGScheduler: ShuffleMapStage 1 (map at DailyRevenue.scala:23) finished in 1.007 s
18/03/06 02:03:22 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:22 INFO DAGScheduler: running: Set()
18/03/06 02:03:22 INFO DAGScheduler: waiting: Set(ShuffleMapStage 5, ShuffleMapStage 2, ResultStage 6, ShuffleMapStage 4)
18/03/06 02:03:22 INFO DAGScheduler: failed: Set()
18/03/06 02:03:22 INFO DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[10] at map at DailyRevenue.scala:32), which has no missing parents
18/03/06 02:03:22 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.5 KB, free 510.9 MB)
18/03/06 02:03:22 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1963.0 B, free 510.9 MB)
18/03/06 02:03:22 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:55834 (size: 1963.0 B, free: 511.1 MB)
18/03/06 02:03:22 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[10] at map at DailyRevenue.scala:32)
18/03/06 02:03:22 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
18/03/06 02:03:22 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, partition 0,PROCESS_LOCAL, 1956 bytes)
18/03/06 02:03:22 INFO Executor: Running task 0.0 in stage 2.0 (TID 3)
18/03/06 02:03:22 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
18/03/06 02:03:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
18/03/06 02:03:24 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:55834 in memory (size: 2.6 KB, free: 511.1 MB)
18/03/06 02:03:24 INFO Executor: Finished task 0.0 in stage 2.0 (TID 3). 1374 bytes result sent to driver
18/03/06 02:03:24 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 1661 ms on localhost (1/1)
18/03/06 02:03:24 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
18/03/06 02:03:24 INFO DAGScheduler: ShuffleMapStage 2 (map at DailyRevenue.scala:32) finished in 1.661 s
18/03/06 02:03:24 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:24 INFO DAGScheduler: running: Set()
18/03/06 02:03:24 INFO DAGScheduler: waiting: Set(ShuffleMapStage 5, ResultStage 6, ShuffleMapStage 4)
18/03/06 02:03:24 INFO DAGScheduler: failed: Set()
18/03/06 02:03:24 INFO DAGScheduler: Submitting ShuffleMapStage 4 (MapPartitionsRDD[14] at map at DailyRevenue.scala:40), which has no missing parents
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.9 KB, free 510.9 MB)
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1793.0 B, free 510.9 MB)
18/03/06 02:03:24 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:55834 (size: 1793.0 B, free: 511.1 MB)
18/03/06 02:03:24 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:24 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[14] at map at DailyRevenue.scala:40)
18/03/06 02:03:24 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/03/06 02:03:24 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,NODE_LOCAL, 1883 bytes)
18/03/06 02:03:24 INFO Executor: Running task 0.0 in stage 4.0 (TID 4)
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/03/06 02:03:24 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 1374 bytes result sent to driver
18/03/06 02:03:24 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 97 ms on localhost (1/1)
18/03/06 02:03:24 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
18/03/06 02:03:24 INFO DAGScheduler: ShuffleMapStage 4 (map at DailyRevenue.scala:40) finished in 0.098 s
18/03/06 02:03:24 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:24 INFO DAGScheduler: running: Set()
18/03/06 02:03:24 INFO DAGScheduler: waiting: Set(ShuffleMapStage 5, ResultStage 6)
18/03/06 02:03:24 INFO DAGScheduler: failed: Set()
18/03/06 02:03:24 INFO DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[18] at map at DailyRevenue.scala:42), which has no missing parents
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 3.9 KB, free 510.9 MB)
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.1 KB, free 510.9 MB)
18/03/06 02:03:24 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:55834 (size: 2.1 KB, free: 511.1 MB)
18/03/06 02:03:24 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:24 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[18] at map at DailyRevenue.scala:42)
18/03/06 02:03:24 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
18/03/06 02:03:24 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, localhost, partition 0,PROCESS_LOCAL, 1956 bytes)
18/03/06 02:03:24 INFO Executor: Running task 0.0 in stage 5.0 (TID 5)
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/03/06 02:03:24 INFO Executor: Finished task 0.0 in stage 5.0 (TID 5). 1374 bytes result sent to driver
18/03/06 02:03:24 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 5) in 189 ms on localhost (1/1)
18/03/06 02:03:24 INFO DAGScheduler: ShuffleMapStage 5 (map at DailyRevenue.scala:42) finished in 0.189 s
18/03/06 02:03:24 INFO DAGScheduler: looking for newly runnable stages
18/03/06 02:03:24 INFO DAGScheduler: running: Set()
18/03/06 02:03:24 INFO DAGScheduler: waiting: Set(ResultStage 6)
18/03/06 02:03:24 INFO DAGScheduler: failed: Set()
18/03/06 02:03:24 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
18/03/06 02:03:24 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[21] at saveAsTextFile at DailyRevenue.scala:46), which has no missing parents
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 49.3 KB, free 510.8 MB)
18/03/06 02:03:24 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 17.3 KB, free 510.8 MB)
18/03/06 02:03:24 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:55834 (size: 17.3 KB, free: 511.1 MB)
18/03/06 02:03:24 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
18/03/06 02:03:24 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (MapPartitionsRDD[21] at saveAsTextFile at DailyRevenue.scala:46)
18/03/06 02:03:24 INFO TaskSchedulerImpl: Adding task set 6.0 with 1 tasks
18/03/06 02:03:24 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 6, localhost, partition 0,NODE_LOCAL, 1894 bytes)
18/03/06 02:03:24 INFO Executor: Running task 0.0 in stage 6.0 (TID 6)
18/03/06 02:03:24 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
18/03/06 02:03:24 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
18/03/06 02:03:24 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
18/03/06 02:03:24 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/03/06 02:03:24 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/03/06 02:03:25 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.io.IOException: Mkdirs failed to create file:/user/daily_revenue_txt_scala/_temporary/0/_temporary/attempt_201803060203_0006_m_000000_6
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1191)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/06 02:03:25 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 6, localhost): java.io.IOException: Mkdirs failed to create file:/user/daily_revenue_txt_scala/_temporary/0/_temporary/attempt_201803060203_0006_m_000000_6


#4

@rnit Change the output path to HDFS as well.
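
That is, the Mkdirs failure above is the same default-filesystem problem on the write side: /user/daily_revenue_txt_scala is being resolved as file:/user/..., which the sbt process cannot create. A minimal sketch of the fix, using the namenode URI from the logs (the target directory mirrors the one in the error; in practice it may need to live under your HDFS home directory, where you have write permissions):

// Fully qualify the output path so local mode writes to HDFS,
// not the local filesystem.
dailyRevenuePerProduct.
  saveAsTextFile("hdfs://nn01.itversity.com:8020/user/daily_revenue_txt_scala")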


#5

I'm facing a similar issue as well.

Hi, I've configured winutils.exe to support Spark on a Windows machine. While trying to develop a sample Spark application in IntelliJ IDEA, I repeatedly get "input file path doesn't exist", even though I have the file both locally and in HDFS. The file name is "one.txt".
I know the syntax for loading a local file is scala.io.Source.fromFile("filepath"),
and for an HDFS location it is sc.textFile("filepath").

In IntelliJ IDEA I load the file as follows:

// Load the file from an HDFS location.
val data = sc.textFile("hdfs://nn01.itversity.com:8020/user/saikiranchepuri35/one.txt")

// Alternatively, load from the local filesystem and parallelize:
// val data1 = Source.fromFile("/home/saikiranchepuri35/one.txt").getLines().toList
// val data = sc.parallelize(data1)

val rdd1 = data.flatMap(_.split(",")).map(word => (word, 1))

rdd1.collect.foreach(println)

Can anyone please help me out with how to give the correct path so this works? I've been facing this issue from morning to evening, and going through the videos did not resolve it.

Exception in thread "main" java.io.FileNotFoundException: \home\saikiranchepuri35\one.txt (The system cannot find the path specified)
at java.io.FileInputStream.open0(Native Method)
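
The exception itself is a hint: java.io.FileInputStream suggests the run used the Source.fromFile variant, and on Windows the Unix-style path /home/saikiranchepuri35/one.txt is resolved against the current drive as \home\saikiranchepuri35\one.txt, which does not exist. A hedged sketch of paths that should resolve on Windows — the drive letter and folders below are assumptions, so adjust them to wherever one.txt actually lives:

// Hypothetical local Windows path for scala.io.Source:
// val data1 = Source.fromFile("C:/Users/saikiran/one.txt").getLines().toList

// With sc.textFile, a fully qualified URI avoids depending on fs.defaultFS:
// val local = sc.textFile("file:///C:/Users/saikiran/one.txt")
// val hdfs  = sc.textFile("hdfs://nn01.itversity.com:8020/user/saikiranchepuri35/one.txt")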

