Error when running Spark to read content from a file (the first program to validate Spark with PyCharm)

#1

The code is:
from pyspark import SparkConf, SparkContext
sc = SparkContext('local', 'Spark Demo')
print(sc.textFile("C:\deckofcards.txt").first())

When I run the file, I get this error:

C:\Users\aakdar\PycharmProjects\FirstProject\venv\Scripts\python.exe C:/Users/aakdar/PycharmProjects/FirstProject/Spark.py
Traceback (most recent call last):
File "C:/Users/aakdar/PycharmProjects/FirstProject/Spark.py", line 2, in <module>
sc = SparkContext('local', 'Spark Demo')
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\java_gateway.py", line 48, in launch_gateway
SPARK_HOME = os.environ["SPARK_HOME"]
File "C:\Users\aakdar\PycharmProjects\FirstProject\venv\lib\os.py", line 425, in __getitem__
return self.data[key.upper()]
KeyError: 'SPARK_HOME'

Process finished with exit code 1

Please advise.


#2

Please follow the blog below to develop a PySpark program using PyCharm on Windows 10.
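The KeyError means the SPARK_HOME environment variable is not set in the environment PyCharm uses to run the script. A minimal sketch of what the setup typically involves (the install paths and the py4j zip name are assumptions; match them to your own Spark download):

import os
import sys

# Assumed install locations; adjust to your machine.
os.environ["SPARK_HOME"] = "C:\\spark-1.6.3-bin-hadoop2.6"
os.environ["HADOOP_HOME"] = "C:\\winutils"  # folder containing bin\winutils.exe

# Make the PySpark sources bundled with the Spark download importable.
sys.path.append(os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.append(os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.9-src.zip"))

from pyspark import SparkContext

sc = SparkContext("local", "Spark Demo")

Alternatively, set SPARK_HOME under Run > Edit Configurations > Environment variables in PyCharm.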


#3

I already followed all the steps from the Udemy course.
I get the same error.
Any advice?


#4

I set the environment variables inside PyCharm.
The error changed:
C:\Users\aakdar\PycharmProjects\FirstProject\venv\Scripts\python.exe C:/Users/aakdar/PycharmProjects/FirstProject/Spark.py
Traceback (most recent call last):
File "C:/Users/aakdar/PycharmProjects/FirstProject/Spark.py", line 2, in <module>
sc = SparkContext(master="local", appName="Spark Demo")
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\context.py", line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "C:\spark-1.6.3-bin-hadoop2.6\python\pyspark\java_gateway.py", line 79, in launch_gateway
proc = Popen(command, stdin=PIPE, env=env)
File "C:\Python27\Lib\subprocess.py", line 390, in __init__
errread, errwrite)
File "C:\Python27\Lib\subprocess.py", line 640, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

Process finished with exit code 1
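[Error 2] from Popen usually means the launcher script PySpark tries to run (spark-submit.cmd under SPARK_HOME\bin) could not be found. A quick sanity check you can run in the same interpreter (standard library only; nothing assumed beyond the usual Spark layout):

import os

# Print the values PySpark will use to launch the JVM.
spark_home = os.environ.get("SPARK_HOME", "<not set>")
print("SPARK_HOME: " + spark_home)
print("spark-submit.cmd present: " + str(os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd"))))
print("JAVA_HOME: " + os.environ.get("JAVA_HOME", "<not set>"))

If the script is present but the error persists, check that JAVA_HOME points at a JDK and that java.exe is on the PATH.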


#5

To read a local file you can use the approach below; you have to use double backslashes:
rdd = sc.textFile("C:\\Users\\username\\Desktop\\sample.txt")
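A raw string or forward slashes avoid the escaping problem as well (the path here is just an example):

# Raw string: backslashes are not treated as escape characters.
rdd = sc.textFile(r"C:\Users\username\Desktop\sample.txt")
# Forward slashes also work on Windows.
rdd = sc.textFile("C:/Users/username/Desktop/sample.txt")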


#6

The problem is resolved. I changed the JVM memory settings (-Xms and -Xmx).
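For anyone else hitting JVM memory problems when launching from PyCharm, one way to pass heap settings is through the PYSPARK_SUBMIT_ARGS environment variable before the context is created; a sketch (the 2g value is just an example):

import os

# Must be set before SparkContext is created; "pyspark-shell" must come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g pyspark-shell"

from pyspark import SparkContext

sc = SparkContext("local", "Spark Demo")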


#7

Hi,
I followed all the steps but installed Spark 2 with Python 3.6. Everything is fine and pyspark works, but when I try to read the file it gives this error: ModuleNotFoundError: No module named 'resource'.
Code:
sc.textFile("C:\\Users\\Parvez\\Documents\\SparkAndPython\\deckofcards.txt").first()

Error:

File "C:\Users\Parvez\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Parvez\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Parvez\Documents\SparkAndPython\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
2019-03-06 01:27:12 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
… 14 more
2019-03-06 01:27:12 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
(stack trace identical to the exception above)


#8

@Mohammad_Parvez Can you install the previous version of Spark (2.3 instead of 2.4)? It is an issue with the latest version of PySpark: Spark 2.4.0's Python worker imports the Unix-only resource module, which does not exist on Windows.


#9

@annapurna that is not correct; it should work with 2.4 as well. Please check whether the code works on your Windows PC with Spark 2.3 or not.


#10

I think on Windows you need to provide the path with \\ rather than \ between directories.

We recommend using Ubuntu on your Windows 10 machine to practice PySpark. You can set it up using Windows Subsystem for Linux.


#11

@Mohammad_Parvez It works fine in Spark 2.3. Please refer to the screenshot below.



#12

Thanks, I installed Spark 2.3. It's working fine.


#13

I changed the path and tried again, but it was not working in Spark 2.4. It's working fine in Spark 2.3.


#14

Is uninstalling the 2.4 version and installing 2.3 the only way to solve this issue? I tried printing: print(sc.textFile("C:\deckofcards.txt").first()) and got the same issue: ModuleNotFoundError: No module named 'resource'.


#15

I tried with double backslashes as well; still the same problem.
print(sc.textFile("C:\\deckofcards.txt").first()) - ModuleNotFoundError: No module named 'resource'


#16

I'm having this issue; can anyone help?

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
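
Class file major version 55 corresponds to Java 11, and Spark 2.x only supports Java 8, so the JVM being picked up at launch is too new. A sketch of pinning JAVA_HOME to a JDK 8 install before creating the SparkContext (the path is an assumption; use your actual JDK 8 location):

import os

# Point Spark at a JDK 8 install; the JVM is chosen when the gateway launches.
# Example path below; adjust to your machine.
os.environ["JAVA_HOME"] = "C:\\Program Files\\Java\\jdk1.8.0_202"
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark import SparkContext

sc = SparkContext("local", "Spark Demo")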
