Reading data from a Linux folder for Spark Streaming?


#1

For Spark Streaming, we use the streamingContext.textFileStream(directory) method.
The Spark documentation says that directory must be on a Hadoop-compatible filesystem. An HDFS directory can be given as hdfs://dir/subdir/

What should the URL for directory be if I want to read data for a Spark stream from a Linux folder?
Please advise.


#2

@abdullah7786 Currently, it is not possible to stream data from a plain local file system. Spark streams data from HDFS-compatible file systems, such as S3, through StreamingContext.textFileStream().

Hope this helps!


#3

@venkatreddy-amalla Thanks! Can there be a workaround, like exposing the folder through SMB or FTP?


#4

@abdullah7786 There could be a workaround, but I’m not sure.

Do you mind explaining what you are ultimately trying to achieve?


#5

To refer to the Linux file system, prefix your folder path with “file://” (e.g. file:///home/workspace/data). Also, make sure the file that you copy or move into the folder has a current timestamp. A copied file often keeps its old timestamp, in which case the program will not recognize it. As per the Spark documentation, it is recommended to create the file directly in the folder that you are polling. I had a similar issue and resolved it that way.
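
In case it helps, here is a rough Python sketch of the two points above: building the file:// URL for a local folder, and getting a file into the polled folder with a current timestamp. The function names and the staging folder are my own placeholders, not anything from Spark. I am assuming the staging folder is on the same filesystem as the polled folder, so that the final rename happens in one step. The last function uses the PySpark DStream API and needs pyspark installed; treat it as an outline rather than a tested job.

```python
import os
import shutil
from pathlib import Path


def to_file_uri(folder):
    """Return a file:// URI for a local folder, e.g. file:///home/workspace/data."""
    return Path(folder).absolute().as_uri()


def drop_with_fresh_timestamp(src, watch_dir, staging_dir):
    """Copy src into watch_dir so that it arrives with a current modification time.

    staging_dir is a hypothetical scratch folder, assumed to be on the same
    filesystem as watch_dir so that the final rename is a single step.
    """
    src = Path(src)
    staged = Path(staging_dir) / src.name
    shutil.copyfile(src, staged)   # copy the contents only, not the old metadata
    os.utime(staged, None)         # set access and modification time to "now"
    dest = Path(watch_dir) / src.name
    os.replace(staged, dest)       # a rename keeps the fresh modification time
    return dest


def start_local_dir_stream(watch_dir, batch_seconds=10):
    """Print the lines of each new file that appears in watch_dir."""
    # Imported here so the helpers above still work without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "LocalDirStream")
    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.textFileStream(to_file_uri(watch_dir))
    lines.pprint()
    ssc.start()
    ssc.awaitTermination()
```

With the stream running on file:///home/workspace/data, a call such as drop_with_fresh_timestamp("/tmp/input.txt", "/home/workspace/data", "/home/workspace/staging") should make the file show up in a following batch (the /tmp and staging paths are only examples).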