Need suggestions on joining multiple RDDs

Using Apache Spark 1.6 with Scala.


I have monthly folders for each year as given below:



All these folders contain Avro files for the individual months.

In the delta (daily) processing, the expectation is to read each incoming record and search for the corresponding record in these folders by key. If it is found, pull that record and attach it to the delta record.

Instead of reading from all directories, I need to pick only the latest 6 months' directories to find the corresponding record.

One way is to create 6 RDDs by loading the last 6 months' directories individually and do a UNION.

Then run a join between that UNION RDD and the delta records.
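A minimal sketch of that union-then-join idea, assuming both the delta and the history data are already keyed pair RDDs. The names `HistoryJoin`, `attachHistory`, and `loadMonth` are hypothetical; `loadMonth` stands in for whatever Avro reader is in use, and the left outer join is an assumption so that delta records without a match are kept.

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HistoryJoin {
  // Hypothetical helper: unions the per-month RDDs and joins them to the delta.
  // `loadMonth` is an assumed function that reads one month directory into a
  // keyed RDD; how the Avro files are actually read is left to the caller.
  def attachHistory[K: ClassTag, D: ClassTag, H: ClassTag](
      sc: SparkContext,
      delta: RDD[(K, D)],
      monthDirs: Seq[String],
      loadMonth: String => RDD[(K, H)]): RDD[(K, (D, Option[H]))] = {
    require(monthDirs.nonEmpty, "need at least one month directory")
    // One UnionRDD over all months instead of chaining rdd1.union(rdd2)...
    val history: RDD[(K, H)] = sc.union(monthDirs.map(loadMonth))
    // Assumption: keep every delta record; None means no match in the 6 months.
    delta.leftOuterJoin(history)
  }
}
```

If the same key can appear in more than one month, this join returns one row per match, so some de-duplication (for example keeping only the newest month per key) may be needed before or after the join.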

Can someone suggest the best way to handle this scenario?

@vikasmishra you need not read all the files. Instead, you can get the latest 6 months and their years with the Calendar class (java.util.Calendar, which is usable from Scala), store them as key-value pairs in a list, and convert that list to an RDD. Then you can join on it, so that only records whose month is among the months in that RDD are matched. This way, whenever you run the job, it picks the latest 6 months along with the right year.
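A rough sketch of the Calendar part of that suggestion, in plain Scala with no Spark dependency. The names `LastMonths` and `lastMonths` are hypothetical, and counting the current month as the first of the six is an assumption.

```scala
import java.util.Calendar

object LastMonths {
  // Returns (year, month) pairs for the `n` most recent months, newest first.
  // The month of `from` counts as the first one; months are 1-based.
  def lastMonths(n: Int, from: Calendar = Calendar.getInstance()): Seq[(Int, Int)] =
    (0 until n).map { back =>
      // Work on a copy so the caller's Calendar is not modified.
      val c = from.clone().asInstanceOf[Calendar]
      // Pin to the 1st so stepping back from e.g. the 31st cannot skip a month.
      c.set(Calendar.DAY_OF_MONTH, 1)
      c.add(Calendar.MONTH, -back)
      // Calendar.MONTH is 0-based, hence the + 1.
      (c.get(Calendar.YEAR), c.get(Calendar.MONTH) + 1)
    }
}
```

For a run dated 15 February 2016, `LastMonths.lastMonths(6, from)` gives (2016,2), (2016,1), (2015,12), (2015,11), (2015,10), (2015,9), so the year boundary is handled. The resulting list can be turned into directory paths for whatever folder layout is in use, or passed to `sc.parallelize` if an RDD of months is wanted.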