Importing pandas and NumPy fails in lab environment


#1

I am trying to execute PySpark code in the itversity lab environment. My code uses pandas and NumPy, but when I submit the job, an error is thrown saying the import of the pandas module failed.

So I added a small function that imports the dependencies through an RDD:

from pyspark import SparkContext

def import_dependencies(x):
    # Attempt the imports on whichever executor runs this task
    import pandas as pd
    import numpy as np
    return x

sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(import_dependencies).collect()

But even this failed. How do I submit a PySpark job with dependencies?

Can anyone help me out?

Thanks
Ananthu


#2

It does not work that way. @kmln will get you the solution.


#3

Thanks Dgadiraju. I will approach him.


#4

Hello Ananth,

You can add the dependencies as a zip file before you import them. Copy the zipped modules into your HDFS directory and pass the path as a parameter:

sc.addPyFile("hdfs path/module.zip")

You can import the modules normally after this. The source files for most popular Python modules are available in their Git repositories.
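For illustration, here is a minimal end-to-end sketch, assuming a hypothetical pure-Python module mymodule that you have already zipped and uploaded to HDFS (the HDFS path and the process function are placeholders, not real names):

from pyspark import SparkContext

sc = SparkContext(appName="pandas_job")

# Ship the zipped module to every executor. The HDFS path and the
# module name (mymodule) are hypothetical placeholders.
sc.addPyFile("hdfs:///user/your_user/mymodule.zip")

# Once addPyFile has run, the zip is on the executors' search path,
# so the import works inside tasks as well as on the driver.
import mymodule

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: mymodule.process(x)).collect())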

Regards,
Koushik


#5

Thanks Koushik, but doesn’t the pandas module need to be installed on the worker nodes using pip?

Regards
Ananth


#6

We can do that, but that’s not an optimal solution. It is always better to pass along whatever third-party modules you are using, so that there are no issues with dependencies. See the sketch below.
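As a sketch of that approach, the dependencies can also be attached when the SparkContext is created, using its pyFiles parameter; the zip name below is a hypothetical placeholder:

from pyspark import SparkContext

# pyFiles ships each listed archive to every executor at startup;
# deps/mymodule.zip is a hypothetical local path.
sc = SparkContext(appName="job_with_deps", pyFiles=["deps/mymodule.zip"])

The equivalent at submission time is spark-submit’s --py-files option, e.g. spark-submit --py-files deps/mymodule.zip my_job.py.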


#7

Thanks Koushik for the quick reply.

Regards
Ananthu