Duplicate a folder in HDFS

I have a folder in HDFS that receives new files every day. I want to duplicate it so that whenever a new file arrives in the original folder, it is also copied (synced) to the duplicate folder.

Basically, I want to sync one folder with another in HDFS.

How can we achieve that in Hadoop?

I might want to use a shell script if possible.


Write a file watcher script. Here you go:

In the original folder (one-time setup), create an empty file named flist.

The script should be run against the original folder. Here is the logic:

  1. List the files in the original folder into flist_tmp.
  2. Diff flist against flist_tmp.
  3. For each line in the diff output (that is, each new file):
     • copy the file to the duplicate folder
     • if the copy succeeds, add the file name to flist

Since this is a file watcher script, it has to run continuously.
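
The steps above could be sketched roughly like this. It is only a sketch under a few assumptions: the paths in SRC_DIR and DST_DIR are made up, the function names are mine, flist is kept on the local file system rather than inside the HDFS folder (so it does not show up in its own listing), and `hdfs dfs -ls -C` needs a reasonably recent Hadoop (2.8 or later, as far as I know).

```shell
#!/usr/bin/env bash
# Sketch of the flist-based watcher. All paths below are hypothetical.
SRC_DIR="${SRC_DIR:-/data/incoming}"        # original HDFS folder
DST_DIR="${DST_DIR:-/data/incoming_copy}"   # duplicate HDFS folder
FLIST="${FLIST:-$HOME/.hdfs_sync_flist}"    # local list of paths already copied

# Step 1: print the paths currently in the source folder, one per line, sorted.
list_source() {
    hdfs dfs -ls -C "$SRC_DIR" | sort
}

# One pass: copy every path that is listed now but not yet recorded in flist.
sync_once() {
    touch "$FLIST"
    seen="$(mktemp)"
    now="$(mktemp)"
    sort "$FLIST" > "$seen"
    list_source > "$now"

    # Step 2: comm -13 keeps only the lines that appear in $now but not in $seen.
    comm -13 "$seen" "$now" | while IFS= read -r path; do
        [ -n "$path" ] || continue
        # Step 3: copy, and record the path only if the copy succeeded.
        if hdfs dfs -cp "$path" "$DST_DIR/"; then
            printf '%s\n' "$path" >> "$FLIST"
        fi
    done

    rm -f "$seen" "$now"
}

# Run continuously, polling every N seconds (default 60).
watch_loop() {
    while true; do
        sync_once
        sleep "${1:-60}"
    done
}
```

To use it, source the script and call `watch_loop 300` (or run `sync_once` from cron instead of looping). A failed copy is simply retried on the next pass, because the path never made it into flist.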

I am not sure whether Hadoop has a built-in utility for this; I am just sharing my thoughts.

If your source can land the files on the local file system, then you can configure a Flume agent as spooling directory (source) -> memory or another appropriate channel -> HDFS (sink).

But I see that you mentioned an HDFS source, and I am not sure whether Flume can handle that.
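
For the local-file-system case, the Flume setup might look something like the sketch below. The agent name (agent1), the component names, the spool directory and the HDFS path are all placeholders I made up; the helper just writes the config file so you can inspect it before starting an agent.

```shell
#!/usr/bin/env bash
# Writes a minimal Flume agent config:
# spooling-directory source -> memory channel -> HDFS sink.
# agent1, src1, ch1, sink1 and both paths are hypothetical.
write_flume_conf() {
    conf_file="$1"
    cat > "$conf_file" <<'EOF'
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: pick up files dropped into a local directory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/spool/incoming
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory

# Sink: write the events to HDFS as plain data
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/incoming_copy
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1
EOF
}

# Then start the agent with something like:
#   flume-ng agent --name agent1 --conf conf --conf-file spool-to-hdfs.conf
```

Keep in mind that Flume ships the file contents as events, so what lands in HDFS will not keep the original file names. If you need an exact copy of each file, the watcher approach is probably the better fit.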