We have a requirement to migrate data from an ODS (plus some social media, web analytics, etc.) into Hadoop, for which we need to create a cluster. Please find the details below:
- It will be the Cloudera Enterprise edition, deployed on Azure.
- Initial expected data volume is 7.5 TB. This already includes a replication factor of 3 and 20% overhead, so it corresponds to roughly 2 TB of raw data (see the sanity-check sketch after the table).
- Incremental load is expected to be 1 GB/day
- We are planning to include Sqoop, Hive, Oozie, Flume, Spark, Kafka, and HBase as well.
- The initial workload will mainly be data import and ETL (Spark); a minimal sketch of the kind of job we have in mind is at the end of this post.
- Later, there could be some analytics use cases involving classification, recommendation algorithms, etc.
- Based on Durga Sir's videos, I have come up with the following sizing (for the production environment).
NN = NameNode, JN = JournalNode, RM = ResourceManager, ZK = ZooKeeper, CM = Cloudera Manager

| Node Type | Disks (TB, 7200 RPM) | RAM (GB) | Cores |
|---|---|---|---|
| NN + JN + RM + ZK | 1 (OS) + 2 (FsImage & edit logs) + 1 (JN) + 1 (ZK) | 32 | 14 |
| Standby NN + JN | Same as NN | 32 | 14 |
| Edge + CM | 1 | 14 | 4 |
| Cloudera Director node | 1 | 14 | 4 |
| Data Nodes (3 x 1 TB disks per node; 43 TB total) | 43 | 32 | 8 |

(One of the Data Nodes will also act as a JN.)
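To sanity-check the figures above, here is a minimal sizing sketch in plain Python. All inputs are taken from this post; the 3x replication and 20% overhead are the assumptions stated earlier, not Cloudera-official numbers:

```python
# Sanity check of the sizing figures quoted in this post.

REPLICATION = 3      # HDFS replication factor
OVERHEAD = 1.20      # 20% overhead for temp/intermediate data

# Back out the raw volume behind the 7.5 TB figure.
raw_data_tb = 7.5 / (REPLICATION * OVERHEAD)
print(f"Raw data behind 7.5 TB: {raw_data_tb:.2f} TB")        # ~2.08 TB

# Yearly on-disk growth implied by the 1 GB/day incremental load.
daily_gb = 1
yearly_on_disk_tb = daily_gb * 365 * REPLICATION * OVERHEAD / 1024
print(f"Yearly growth on disk: {yearly_on_disk_tb:.2f} TB")   # ~1.28 TB

# Data-node count implied by 43 TB total at 3 x 1 TB disks per node.
disks_per_node, disk_tb = 3, 1
nodes = 43 / (disks_per_node * disk_tb)
print(f"Implied data-node count: {nodes:.1f}")                # ~14.3 nodes
```

So the 43 TB data-node tier leaves plenty of headroom over the initial 7.5 TB plus a few years of incremental growth, if my arithmetic is right.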
- Can anyone please confirm whether I need to change anything?
- Is it mandatory to have a separate RM node in production? If yes, what should its configuration be?
- Also, please suggest what I should change to set up a dev environment as well.
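For context, here is a minimal sketch of the kind of daily import-and-ETL job we expect to run (PySpark). The paths, table name, and column names are placeholders for illustration, not our real configuration:

```python
# Minimal daily ETL sketch: read a day's ODS extract (landed via Sqoop),
# clean it, and write partitioned Parquet to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ods-daily-etl").getOrCreate()

# Assumption: Sqoop has already landed the daily extract as CSV under
# /landing/ods/orders/<yyyy-mm-dd>/ (hypothetical path).
raw = spark.read.option("header", "true").csv("/landing/ods/orders/2017-08-01/")

cleaned = (
    raw.dropDuplicates(["order_id"])                       # hypothetical key column
       .withColumn("amount", F.col("amount").cast("double"))
       .withColumn("load_date", F.lit("2017-08-01"))
)

# Partition by load date so each day's incremental load appends cleanly.
(cleaned.write
        .mode("append")
        .partitionBy("load_date")
        .parquet("/warehouse/ods/orders/"))                # hypothetical target

spark.stop()
```

In a real run the load date would come in as a job argument (e.g. from an Oozie coordinator), but this should convey the shape of the initial workload.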