Importing of Pandas and numpy fails in lab environment


I am trying to execute PySpark code in the itversity lab environment. My code uses pandas and numpy, but when I submit the job, an error is thrown saying the import of the pandas module failed.

So I added a small function that imports the dependencies inside an RDD operation:

from pyspark import SparkContext

def import_dependencies(x):
    import pandas as pd
    import numpy as np
    return x

sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: import_dependencies(x)).collect()

But even this failed. How can I submit a PySpark job along with its dependencies?

Can anyone help me out?



It does not work that way. @kmln will get you the solution.


Thanks Dgadiraju. I will approach him.


Hello Ananth,

You can add the dependencies by packaging them as a zip file before you import them. Copy the zipped modules into your HDFS directory and pass the path as a parameter:

sc.addPyFile("hdfs path/")

We can proceed to import the modules normally after this. You can find the source for all the popular Python modules in their Git repositories.
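Under the hood, `addPyFile` ships the zip to each worker and puts it on `sys.path`, and Python can import directly from a zip archive. A minimal plain-Python sketch of that mechanism (no Spark needed; the module name `mymod` is hypothetical, standing in for a zipped dependency):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zipped module, standing in for a zipped third-party dependency.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # 'mymod' is a hypothetical module used only for illustration.
    zf.writestr("mymod.py", "def answer():\n    return 42\n")

# Python imports straight from a zip archive once it is on sys.path --
# this is the same mechanism sc.addPyFile() relies on for each worker.
sys.path.insert(0, zip_path)

import mymod
print(mymod.answer())  # 42
```

Note that this works only for pure-Python modules; packages with compiled extensions must match the workers' platform and Python version.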



Thanks Koushik, but doesn't the pandas module need to be installed on the worker nodes using pip?



We can do that, but that's not an optimal solution. It is always better to ship whatever third-party modules you are using with the job, so that there are no issues with dependencies.


Thanks Koushik for the quick reply.