I am a traditional data warehouse developer (with something of an architect role as well), and our company is planning to shift part of our workload to Hadoop (using Hive for now) on the Azure platform. My inputs may be considered by the team before a decision is made, so I am gathering all the help I can. I would appreciate answers to the following questions:
- How many nodes does a cluster generally have in a production environment? I understand it all depends on data volume, but if someone can share how much production data they have, and how many nodes of what memory size they use in prod, I will get a fair idea from it.
- I am looking for use cases of Hive vs. Spark in actual production environments. Do the two technologies coexist in a production environment? If yes, what kinds of transformations are handled through HiveQL, and which cases are handled through Spark SQL?
- In Oracle and other databases, we have the concept of a PL/SQL package, where we can bundle multiple queries/procedures and call them from a UNIX script. For Hive queries, what process is used to package and automate query processing in actual production environments?
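To make the last question concrete, here is a sketch of the kind of pattern I imagine replacing our PL/SQL-package-plus-script setup: keep related HiveQL statements in one `.hql` file and drive it from a shell script via Beeline. The database, table, and file names below are made up for illustration; `beeline -f` and `--hivevar` substitution are the real mechanisms I have read about. The script only echoes the final command instead of executing it, since this is just a sketch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a "package" of related Hive statements into a single .hql file.
# Table/database names here are hypothetical.
cat > daily_load.hql <<'HQL'
-- Values passed with --hivevar are referenced as ${hivevar:name}
USE ${hivevar:target_db};
INSERT OVERWRITE TABLE sales_agg PARTITION (load_date='${hivevar:run_date}')
SELECT region, SUM(amount)
FROM sales
WHERE load_date = '${hivevar:run_date}'
GROUP BY region;
HQL

# Build the command a scheduler (cron, Oozie, Azure Data Factory, etc.)
# would run. beeline -f executes the whole file against HiveServer2;
# -u is the JDBC connection URL (placeholder host shown).
run_date=2024-01-31
cmd="beeline -u jdbc:hive2://localhost:10000 \
  --hivevar target_db=dw --hivevar run_date=${run_date} \
  -f daily_load.hql"

# Echoed rather than executed, since no cluster is assumed here.
echo "$cmd"
```

Is something along these lines what production teams actually do, or are workflow tools like Oozie/Airflow the more common packaging layer?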