Help on Hadoop in production environments


I am a traditional data warehouse developer (with something of an architect role as well), and our company is planning to shift part of our workload to Hadoop (using Hive for now) on the Azure platform. The team may take my input before arriving at a decision, so I am gathering all the help I can. Can anyone help me with the following doubts?

  1. How many nodes does a production cluster generally have? I understand it all depends on the data volume, but if someone can share how much production data they have and how many nodes of what memory size they are using, I will get a fair idea from it.
  2. I am looking for use cases of Hive vs. Spark in actual production environments. Do the two technologies coexist in a production environment? If yes, what kinds of transformations are fine through HiveQL, and which cases are handled through Spark SQL?
  3. In Oracle and other databases we have the concept of a PL/SQL package, where we can bundle multiple queries/procedures and call them from a UNIX script. For Hive queries, what process is used to package and automate query processing in actual production environments?



It’s a good thing that you are migrating from a proprietary data warehouse solution to Hadoop.

Let us discuss your doubts:

For a production environment there is no universal node count; you size the cluster from a few inputs: total raw data volume, the HDFS replication factor (3 by default), usable disk per node, and some headroom (say 20–30%) for intermediate data and growth. A common ballpark: nodes ≈ (raw data × replication × overhead) ÷ usable storage per node.
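To make that ballpark concrete, here is a small sketch of the arithmetic. The figures (50 TB of raw data, 12 TB of usable disk per data node, 30% overhead) are purely illustrative, not a recommendation:

```python
# Rough cluster-sizing sketch. All numbers are illustrative assumptions;
# plug in your own data volume and hardware specs.

def estimate_nodes(raw_data_tb, usable_disk_per_node_tb,
                   replication=3, overhead=1.3):
    """Nodes needed = raw data x HDFS replication x overhead factor,
    divided by usable disk per node, rounded up."""
    needed_tb = raw_data_tb * replication * overhead
    return int(-(-needed_tb // usable_disk_per_node_tb))  # ceiling division

# e.g. 50 TB raw data, 12 TB usable disk per data node:
# 50 * 3 * 1.3 = 195 TB needed -> 17 data nodes
print(estimate_nodes(50, 12))
```

Memory sizing follows a similar "work backwards from the workload" logic (YARN container sizes × concurrent containers per node), so real deployments vary widely.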

Yes, they both co-exist in the same cluster and run seamlessly without any issues; they even share the same metastore, so each engine can see the other's tables.
HiveQL is not deprecated, but Spark SQL has become the more common choice for new development. In practice, scheduled batch ETL often stays in HiveQL, while interactive analysis, iterative transformations, and anything that mixes SQL with programming tends to go to Spark SQL. Note that both engines target OLAP-style analytics; neither is meant for OLTP workloads.
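To show what the coexistence looks like in code, here is a minimal PySpark sketch, assuming a cluster where Spark is built with Hive support and can reach the Hive metastore (hive-site.xml on the classpath); the table name `sales.orders` is hypothetical:

```python
# Minimal sketch: Spark SQL querying a Hive-managed table.
# Assumes Spark with Hive support; the table name is hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-coexistence-demo")
         .enableHiveSupport()   # attach to the existing Hive metastore
         .getOrCreate())

# The same table your Hive jobs load with HiveQL is queryable here:
df = spark.sql("SELECT order_date, SUM(amount) AS total "
               "FROM sales.orders GROUP BY order_date")
df.show()
```

The key design point is the shared metastore: Hive batch jobs and Spark jobs can read and write the same tables without any copying between systems.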

Hive does ship a procedural language, HPL/SQL (bundled with Hive since 2.0), which is similar in spirit to a PL/SQL package, but it is not widely used in production. The common pattern is to keep queries in .hql files and drive them from UNIX scripts under a scheduler such as cron, Oozie, or Airflow. My suggestion is to use Spark on your Hive data, so you get both SQL and full programming; existing Hive queries can usually be reused in Spark SQL with little change.
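As a sketch of the ".hql file + wrapper" pattern, here is how several queries can be packaged into one script file and then submitted in a single `hive -f` call. The file, table, and variable names are hypothetical, and the actual submission line is shown commented out since it assumes the `hive` CLI on the PATH:

```python
# Sketch: package multiple Hive statements into one .hql file,
# which a wrapper script / scheduler submits as a unit.
# All names here are hypothetical examples.
import subprocess

queries = [
    "SET hive.exec.dynamic.partition=true;",
    "INSERT OVERWRITE TABLE stg.daily_sales "
    "SELECT * FROM raw.sales WHERE dt='${hivevar:run_dt}';",
    "ANALYZE TABLE stg.daily_sales COMPUTE STATISTICS;",
]

with open("daily_load.hql", "w") as f:
    f.write("\n".join(queries) + "\n")

# In production the wrapper would then run something like:
# subprocess.run(["hive", "--hivevar", "run_dt=2024-01-01",
#                 "-f", "daily_load.hql"], check=True)
```

This mirrors the PL/SQL-package-called-from-UNIX workflow: the .hql file plays the role of the package body, and `--hivevar` passes runtime parameters into the queries.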