Why do RDDs created from different paths produce the same result?


#1

In lecture 63 of the CCA 175 certification course, I noticed the following:
val orders = sc.textFile("/public/retail_db/orders")

However, “/public/retail_db/orders” is NOT a file; it’s a folder. The actual file path is:
"/public/retail_db/orders/part-00000"

Yet the above command still generates the same result as
val orders = sc.textFile("/public/retail_db/orders/part-00000")

Can someone help explain this? Thanks


#2

@paslechoix:

Both give the same result here because the directory /public/retail_db/orders contains only one file, named part-00000. Spark can also read the data if you give it the parent directory of the data files.
The data was copied into this orders directory that way to simulate how MapReduce (MR) creates files in HDFS.
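
To illustrate the point above, here is a minimal sketch that can be pasted into spark-shell. It does not use the real /public/retail_db/orders data: the temp directory, the app name, and the two sample lines are made up for illustration, and it assumes Spark is running in local mode so that file: paths are readable.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell this returns the existing context; otherwise it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[1]").setAppName("textFileDirDemo"))

// Hypothetical stand-in for the orders folder: one directory holding one part file.
val singleDir = Files.createTempDirectory("orders_single")
val partFile = singleDir.resolve("part-00000")
Files.write(partFile, "1,first\n2,second\n".getBytes("UTF-8"))

// Read once via the directory path and once via the file path.
val fromDir = sc.textFile(singleDir.toUri.toString).collect()
val fromFile = sc.textFile(partFile.toUri.toString).collect()

// With a single file in the directory, both reads return the same lines.
val sameResult = fromDir.sameElements(fromFile)
```
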
Make sense?

Thanks
Venkat


#3

What if there are multiple files in the folder? Which one will be picked up if the file name is not specified?

Thanks


#4

@paslechoix:

It will read all the available files in the directory into one RDD.
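
A small sketch of this, again for spark-shell in local mode. The temp directory, the app name, and the file contents are invented for illustration and are not the course data.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell this returns the existing context; otherwise it starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[1]").setAppName("textFileMultiDemo"))

// Hypothetical directory with two part files, as a MapReduce job might leave behind.
val multiDir = Files.createTempDirectory("orders_multi")
Files.write(multiDir.resolve("part-00000"), "1,first\n2,second\n".getBytes("UTF-8"))
Files.write(multiDir.resolve("part-00001"), "3,third\n".getBytes("UTF-8"))

// Passing only the directory path reads the lines of every file in it.
val allLines = sc.textFile(multiDir.toUri.toString).collect()

// Sort before comparing, since the order across files should not be relied on.
val sortedLines = allLines.sorted
```

Here sortedLines holds the lines from both part files, not just one of them.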

Thanks
Venkat


#5

Thanks. Then the method should really be called sc.textFolder instead of sc.textFile. There is no such method, but the name textFile is misleading, or at least not accurate.


#6

@paslechoix:

Agreed.

Thanks
Venkat