Multiple files MapReduce input

I am working on a MapReduce project (like the word count example) with some changes. In my case I have many files to process when I run the program, and I want each map to take one of the files and process it separately from the others, i.e. I want the output for each file to be independent of the other files' output.

I tried to use:

Path filesPath = new Path("file1.txt,file2.txt,file3.txt");

MultipleInputs.addInputPath(job, filesPath, TextInputFormat.class, Map.class);
but the output I get mixes all the files' output together, and if a word appears in more than one file, it is counted only once, which is what I don't want. I want the word count for each file separately.

So how can I do this?

If I put the files in a directory, will each file be processed independently?


What is the size of the largest file?

If a file is larger than the split size, you cannot ensure that the whole file is processed by a single mapper.


I am not an expert, but I think we need to create a custom input format that overrides the isSplitable() method; this prevents the input file from being split, so one file will be processed by one mapper.

Then, to send each mapper's output to a separate reducer, I guess we need to create a custom partitioner.
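A rough sketch of that idea, assuming the new-style org.apache.hadoop.mapreduce API and that the mapper emits composite keys of the form "filename\tword" (the class names here are made up for illustration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Input format that refuses to split files: each file becomes exactly
// one split, and therefore goes to exactly one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// Partitioner that routes all records from the same file to the same
// reducer, assuming keys look like "filename\tword".
class FileNamePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String fileName = key.toString().split("\t", 2)[0];
        return (fileName.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

You would register these on the job with job.setInputFormatClass(WholeFileTextInputFormat.class) and job.setPartitionerClass(FileNamePartitioner.class); note that files only land on truly separate reducers if the number of reduce tasks is at least the number of files.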


I solved the problem by passing the file name with each word, which guarantees that each file is processed independently.

This is how to get the file name:

The required import:
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

Code in the map:
FileSplit split = (FileSplit) context.getInputSplit();

// to get the file name with extension
String filename = split.getPath().getName();
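The effect of that composite key can be simulated in plain Java, outside Hadoop: once each emitted key carries its file name, the same word from different files counts separately. The file names and contents below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositeKeyDemo {
    // Mimics the map step: emit one "filename\tword" key per word, so
    // identical words from different files remain distinct keys.
    static List<String> mapFile(String fileName, String contents) {
        List<String> keys = new ArrayList<>();
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) {
                keys.add(fileName + "\t" + word);
            }
        }
        return keys;
    }

    // Mimics the reduce step: sum a 1 for each occurrence of a key.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        emitted.addAll(mapFile("file1.txt", "cat dog cat"));
        emitted.addAll(mapFile("file2.txt", "cat bird"));
        Map<String, Integer> counts = reduce(emitted);
        // "cat" is counted per file, not globally.
        System.out.println(counts.get("file1.txt\tcat")); // 2
        System.out.println(counts.get("file2.txt\tcat")); // 1
    }
}
```

In the real job the output keys then read like "file1.txt cat 2" and "file2.txt cat 1", which is exactly the per-file count the question asked for.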