spark.read.json error: Path does not exist: file:/public/retail_db_json/orders

python3
Python 3.6.8 (default, Aug 7 2019, 17:28:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Testing").master('local').getOrCreate()
20/08/13 17:57:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
20/08/13 17:57:07 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.

df = spark.read.json("/public/retail_db_json/orders")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/resanmail/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 300, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/resanmail/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/resanmail/.local/lib/python3.6/site-packages/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Path does not exist: file:/public/retail_db_json/orders;

Why does it throw a "Path does not exist" error?
The same call works when I try it in pyspark2.

Hi @resanmail,

spark.read.json("/public/retail_db_json/orders")

This line is resolved against your local file system, because you launched PySpark in local mode, but the path you gave is an HDFS location. That is why your code fails. You have to give the local path instead, which is "/data/retail_db_json/orders".
Please try the code below:

spark.read.json("/data/retail_db_json/orders")
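
To see why the original call failed: a path with no scheme is resolved against the default file system, which is the local one (file:) when PySpark runs in local mode without a Hadoop configuration, as the "file:/public/..." in the error message shows. The sketch below only illustrates that rule. `resolve_scheme` is a hypothetical helper, not part of PySpark, and `namenode:8020` is a placeholder host and port, not a real address from the lab.

```python
from urllib.parse import urlparse


def resolve_scheme(path, default_fs="file"):
    """Return the file system scheme a path would be read from.

    Hypothetical helper for illustration only. A path without a scheme
    falls back to the default file system, assumed here to be the local
    one ("file"), as in local mode with no Hadoop configuration loaded.
    """
    scheme = urlparse(path).scheme
    return scheme if scheme else default_fs


# No scheme: falls back to the default (local) file system.
print(resolve_scheme("/public/retail_db_json/orders"))  # file

# Explicit local path.
print(resolve_scheme("file:///data/retail_db_json/orders"))  # file

# Explicit HDFS path (placeholder NameNode host and port).
print(resolve_scheme("hdfs://namenode:8020/public/retail_db_json/orders"))  # hdfs
```

Writing the scheme out in the call itself, for example spark.read.json("file:///data/retail_db_json/orders"), removes the dependence on whatever the default file system happens to be.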

Also, a pyspark shell is already available on our lab, so please practice there. Importing the pyspark module in a plain Python shell is not a good idea.

Thank you.

Thanks. I probably missed the part of the video about copying the data files to the local file system.

BTW, for CCA175, shouldn't I try it in Python mode so that I can complete it as a script?