We need to migrate data from an ODS (plus some social media and web analytics data, etc.) into Hadoop, for which we need to create a cluster. The details are below:

  1. The cluster will run Cloudera Enterprise edition, deployed on Azure.
  2. The expected initial data volume is 7.5 TB (including a replication factor of 3 and 20% overhead).
  3. The incremental load is expected to be 1 GB/day.
  4. We are also thinking of running Sqoop, Hive, Oozie, Flume, Spark, Kafka, and HBase.
  5. The initial workload will mainly be data import and ETL (Spark).
  6. Later, there could be some analytics use cases involving classification, recommendation algorithms, etc.
  7. Based on Durga Sir’s videos, I have come up with the following sizing (for the production environment).
    NN = NameNode, JN = JournalNode, RM = ResourceManager, ZK = ZooKeeper, CM = Cloudera Manager

| Node Type | Disk in TB (7200 RPM) | RAM (GB) | Cores |
|---|---|---|---|
| NN + JN + RM + ZK | 1 (OS) + 2 (FSImage & edit logs) + 1 (JN) + 1 (ZK) | 32 | 14 |
| Standby NN + JN | Same as NN | 32 | 14 |
| Edge + CM | 1 | 14 | 4 |
| Cloudera Director node | 1 | 14 | 4 |
| Data Nodes, 4 nodes (4 x 3 TB; 3 disks of 1 TB per node) | 3 | 32 | 8 |

(Also, one of the DNs will be a JN as well.)
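
For reference, the capacity arithmetic behind these numbers can be sketched roughly as below. This is only a back-of-the-envelope check, not a sizing tool. It assumes that the 20% overhead is applied on top of the replicated size, that the 1 GB/day figure is raw (un-replicated) data, that the data nodes give 4 x 3 TB = 12 TB in total, and that 1 TB = 1024 GB. The function names are made up for illustration.

```python
# Rough HDFS capacity check. All defaults are assumptions taken from this post.
GB_PER_TB = 1024  # assumption: binary units


def raw_data_tb(provisioned_tb, replication=3, overhead=0.20):
    """Un-replicated data size implied by a provisioned (on-disk) volume."""
    return provisioned_tb / (replication * (1 + overhead))


def on_disk_growth_gb(raw_gb_per_day, replication=3, overhead=0.20):
    """Daily on-disk growth once replication and overhead are applied."""
    return raw_gb_per_day * replication * (1 + overhead)


def days_until_full(capacity_tb, used_tb, growth_gb_per_day):
    """Days until the remaining capacity is consumed at a constant growth rate."""
    free_gb = (capacity_tb - used_tb) * GB_PER_TB
    return free_gb / growth_gb_per_day


if __name__ == "__main__":
    growth = on_disk_growth_gb(1)  # 1 GB/day of raw incremental load
    print(f"raw data: {raw_data_tb(7.5):.2f} TB")                        # raw data: 2.08 TB
    print(f"on-disk growth: {growth:.2f} GB/day")                        # on-disk growth: 3.60 GB/day
    print(f"days until full: {days_until_full(12, 7.5, growth):.0f}")    # days until full: 1280
```

The last figure is an upper bound, since in practice the disks should not be planned to fill up completely.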


  1. Can anyone please confirm whether I need to change anything?
  2. Is it mandatory to have a separate RM node in prod? If yes, what should its configuration be?
  3. Also, please suggest what I should change to set up a Dev environment as well.


