Unable to access local file using PySpark


#1

Hi team,

This is the first time I have configured Python and Spark on my system (Windows 10, 64-bit, 16 GB RAM). I am trying to access a file placed on my local desktop through PySpark code.
I am getting an error like:

py4j.protocol.Py4JJavaError: An error occurred while calling o18.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Admin/Desktop/sparkRDD.txt

The command executed is:

sc.textFile("C:\Users\Admin\Desktop\sparkRDD.txt").first()
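For reference, a fuller sketch of what I am running (the explicit SparkContext setup shown here is only assumed; in the PySpark shell sc already exists):

from pyspark import SparkContext

# In the pyspark shell sc is created automatically; this line is only needed in a standalone script
sc = SparkContext(master="local[*]", appName="localFileTest")

# Raw string so the backslashes in the Windows path are not treated as escape sequences
rdd = sc.textFile(r"C:\Users\Admin\Desktop\sparkRDD.txt")
print(rdd.first())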

I have also attached the screenshot of the error.

Please let me know if there is something that I have missed in my process.

Thanks and Regards,
Siddhant Chowdhury


#2

I am facing the same issue too. Can someone please answer this? I am trying Exercise 02 of Durga Sir's list, where it is specifically mentioned to load from local.

orders=sc.textFile("/home/shalinisarathykc/retail_db/orders") ----> not working (data is available)
orders=sc.textFile("/public/retail_db/orders") -----> works

PS: I am able to load from HDFS, but I want to know what the issue is in loading data from local.
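The data is definitely present on the local file system. A quick check along these lines (just a sketch; the path is copied from above) confirms it from the same Python session:

import os

# Plain OS-level check on the node where the driver/shell is running
path = "/home/shalinisarathykc/retail_db/orders"
print(os.path.exists(path))
print(os.listdir(path) if os.path.isdir(path) else "not a directory")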


#3

@siddhant @shalinisarathykc

To read a file that is on the local file system, we need to use the command below:

sc.textFile("file:///path/to/filename").first()
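For example, with the paths from the two posts above (illustrative only; adjust to your own file locations):

# Windows local file: forward slashes work inside the file:// URI
sc.textFile("file:///C:/Users/Admin/Desktop/sparkRDD.txt").first()

# Linux local file (a directory of part files also works)
sc.textFile("file:///home/shalinisarathykc/retail_db/orders").first()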

Thanks & Regards,
Sunil Abhishek


#4

orders=sc.textFile("file:///home/shalinisarathykc/retail_db/orders").first()
I used this query, and I still get the same problem.

Caused by: java.io.FileNotFoundException: File file:/home/shalinisarathykc/retail_db/orders/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:624)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:850)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:614)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:348)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:782)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
… 1 more
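One thing I am wondering (not sure if it applies to this lab environment): if the shell is running on YARN, a file:// path is opened by the executors on the worker nodes, so the file would have to exist at the same path on every node rather than only on the gateway where I typed the command. A sketch of forcing everything onto one machine with local mode (path copied from above):

pyspark --master local[*]

# Inside that shell, the read resolves against the gateway node's own file system
orders = sc.textFile("file:///home/shalinisarathykc/retail_db/orders")
orders.first()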