Bulk read .txt files into an RDD and write them pipe separated - help needed



Dear friends.

I have a question on reading text files in bulk from a folder and making a single RDD. Can anyone please help?

Here is the problem statement -

  1. I have a weather folder where daily weather data is stored as .txt files, comma separated.
  2. I have to bulk read all 365 files from the folder, create an RDD, and do some transformations.
  3. I have to save the resultant file in HDFS as pipe (|) separated.

I was trying to bulk read using a wildcard -
lines = open("/home/arindamb/log_files/*").read().splitlines()

But it is giving the following error -

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '/home/arindamb/log_files/*'

But I do have the files in my folder -

/home/arindamb/log_files
[arindamb@gw02 log_files]$ ls -lrt
total 12
-rw-r--r-- 1 arindamb students 277 Jun 27 03:41 log1.txt
-rw-r--r-- 1 arindamb students 277 Jun 27 03:42 log2.txt
-rw-r--r-- 1 arindamb students 277 Jun 27 03:42 log3.txt
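
From what I have read, sc.textFile accepts wildcards (unlike the plain Python open() above) and would give me a single RDD of lines across all the files. Below is only my guess at a sketch, assuming I run it in the pyspark shell where the SparkContext is already available as sc, and that the file:// prefix is needed for a local path - please correct me if this is wrong:

# Assumption: pyspark shell, so the SparkContext already exists as sc.
# sc.textFile understands glob patterns, so every .txt file in the folder
# becomes part of one RDD of lines.
lines = sc.textFile("file:///home/arindamb/log_files/*.txt")

# Each line is comma separated; split it into fields for the transformations.
fields = lines.map(lambda line: line.split(","))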

Also, please let me know how to save the file with a pipe delimiter.
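
For the pipe-delimited output, my rough idea is to join the fields back with "|" and write with saveAsTextFile to an HDFS path (the path below is just an example) - not sure whether this is the right approach:

# Join each record's fields with a pipe and save the result to HDFS as text.
# /user/arindamb/weather_pipe is only an example output directory.
piped = fields.map(lambda cols: "|".join(cols))
piped.saveAsTextFile("/user/arindamb/weather_pipe")
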
Thanks in advance.
Arindam
