Reading HDFS files from python code



How do I read HDFS files from Python code? I use the method below to read from the local file system, but it fails when given an HDFS file path.

import pickle

localFileSystem = '/home/classic/data/'  ## For reading the file from the local file system
with open("".join([localFileSystem, 'mappingFile.pkl']), mode='rb') as fp:
    pd_mappingFile = pickle.load(fp)

hdfsFileSystem = '/user/classic/data/'  ## The same open() call fails for this HDFS path

How do I read an HDFS file from Python code?



You need to use PySpark. Where are you trying this?
PySpark is simply Python running on the Spark engine. It requires setting up Spark, and here are the instructions


Hi Dgadiraju,

Thanks for your reply.

Reading CSV, JSON, text files, JDBC sources, and Parquet through Spark SQL is available out of the box. But how do I read a pickle file from HDFS? Python's open() method cannot read HDFS paths. Should I use a Hadoop library to read pickle files that have been uploaded to HDFS?

Thanks and Regards,


@koushikmln might be able to answer your question…


Hello Ananth,

You have to read the files as binary files using the Spark API (sc.binaryFiles) and then use pickle to parse each file individually.

Code would look something like:
rdd = sc.binaryFiles(inputDirectory)

Deserializing the pickle files into an RDD of Python objects:
import pickle
from io import BytesIO
rddMap = rdd.values().map(lambda p: pickle.load(BytesIO(p)))

You can parse each record using map as shown above.
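Outside Spark, the per-record parsing works the same way: each value produced by binaryFiles is the raw byte content of one file, and pickle can deserialize it from an in-memory buffer. A minimal local sketch of that step (the sample dict here is made up purely for illustration):

```python
import pickle
from io import BytesIO

# Simulate the raw bytes of one file, as sc.binaryFiles would deliver them.
raw_bytes = pickle.dumps({"id": 1, "name": "classic"})

# This mirrors the lambda above: wrap the bytes in a file-like buffer
# and let pickle deserialize the object it contains.
record = pickle.load(BytesIO(raw_bytes))
print(record["name"])  # → classic
```

Equivalently, pickle.loads(raw_bytes) works directly on the bytes without the BytesIO wrapper.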

pickle is part of the Python standard library, so it should work out of the box. If your code depends on third-party modules, you will need to install them on the worker nodes, or alternatively pass them as a zip using the --py-files option when running the spark command.