I am using Apache Spark 1.6 with Scala.
I have monthly folders for each year as given below:
Each of these folders contains the Avro files for a single month.
In the daily (delta) processing, the expectation is to read each incoming record and search these folders for the corresponding record by key. If one is found, that record should be pulled and attached to the delta record.
Instead of reading from all the directories, I need to search only the latest 6 monthly directories for the corresponding record.
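To make the "latest 6 months" selection concrete, here is a rough sketch of how the folder paths could be computed in plain Scala. The layout `<baseDir>/<yyyy>/<MM>`, the base path `/data/records` and the names `LastMonths` and `monthDirs` are all hypothetical; the format string would have to be adapted to the real folder names.

```scala
object LastMonths {
  /** Paths of the `n` most recent monthly folders, newest first,
    * ending at the given (year, month). Assumes a hypothetical
    * layout of <baseDir>/<yyyy>/<MM>.
    */
  def monthDirs(baseDir: String, year: Int, month: Int, n: Int): Seq[String] = {
    // Count months linearly so that stepping back across a year boundary is simple.
    val current = year * 12 + (month - 1)
    (0 until n).map { back =>
      val m = current - back
      f"$baseDir/${m / 12}%04d/${m % 12 + 1}%02d"
    }
  }
}

// Example: the six folders ending at February 2016.
val dirs = LastMonths.monthDirs("/data/records", 2016, 2, 6)
// dirs: /data/records/2016/02, /data/records/2016/01, /data/records/2015/12,
//       /data/records/2015/11, /data/records/2015/10, /data/records/2015/09
```

Counting months as a single integer avoids special-casing January when going back into the previous year.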
One way is to create 6 RDDs by loading each of the last 6 monthly directories individually and taking their UNION.
I would then join that UNION RDD with the delta records.
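As a possible alternative to six separate loads plus a UNION, the sketch below passes all the folder paths to a single `load` call and then joins the result with the delta data. It assumes the DataFrame API with the spark-avro package (`com.databricks.spark.avro`) on the classpath, and that both datasets share a key column; the column name `key` and the names `HistoryLookup`, `loadHistory` and `attachHistory` are hypothetical. I have not verified this against my actual data.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object HistoryLookup {
  // Load every given folder into one DataFrame in a single call.
  // DataFrameReader.load(paths: String*) is available from Spark 1.6.0.
  // Assumes the spark-avro data source is on the classpath.
  def loadHistory(sqlContext: SQLContext, paths: Seq[String]): DataFrame =
    sqlContext.read.format("com.databricks.spark.avro").load(paths: _*)

  // Left outer join: every delta record is kept, and the history
  // columns are null when no matching key exists.
  // keyCol is a hypothetical column name present in both DataFrames.
  def attachHistory(delta: DataFrame, history: DataFrame, keyCol: String = "key"): DataFrame =
    delta.join(history, Seq(keyCol), "left_outer")
}
```

If the same key can appear in more than one month, this join would return one row per match, so the history would probably need to be reduced to the latest record per key first.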
Can someone suggest a better way to handle this scenario?