Converting HQL's to Spark SQL or Scala


#1

For the last few days I have been trying to convert my HQLs, which are written for batch loading (ETL processes).

Mostly they do an insert into a target table, but both the source (multiple) and target (single) tables have more than 50 columns, with lots of joins.

Can someone suggest how to write those HQLs in Spark SQL or Scala?


#2

You have to use sqlContext.sql and pass most of the Hive queries to it as-is. However, Spark SQL lags behind Hive, and some Hive queries might not work.
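
For illustration, here is a minimal Scala sketch of that approach; the database, table, and column names are made up, and it assumes the Spark 1.x HiveContext API (in Spark 2.x the same query would go through a SparkSession with Hive support enabled).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HqlRunner {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HQL via Spark SQL"))
    // HiveContext talks to the Hive metastore, so existing databases and tables are visible
    val sqlContext = new HiveContext(sc)

    // A typical batch-load HQL can usually be passed through unchanged
    sqlContext.sql(
      """INSERT INTO TABLE target_db.orders_summary
        |SELECT o.order_id, c.customer_id, SUM(oi.subtotal) AS revenue
        |FROM source_db.orders o
        |JOIN source_db.order_items oi ON o.order_id = oi.order_item_order_id
        |JOIN source_db.customers c ON o.order_customer_id = c.customer_id
        |GROUP BY o.order_id, c.customer_id""".stripMargin)

    sc.stop()
  }
}
```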


#3

Thank you sir for your kind reply.

Now, instead of using sqlContext, if I use the existing Hive queries as-is and set the execution engine to Spark instead of MR, will it be the same?


#4

Typically you should build a wrapper around the SQL/Hive queries, either with shell scripting or with Scala or Python using the relevant APIs.

Here is the typical development life cycle for running Hive queries via Spark. This approach can be validated on our state-of-the-art cluster - https://labs.itversity.com

  • Understand data
  • Launch spark-sql or Hive and come up with the queries
  • Embed those queries in a Python or Scala application with Spark dependencies (typically development should be done using an IDE on your PC); see the sketch after this list
  • Build as application and ship it to the cluster (in our case - gateway node on https://labs.itversity.com)
  • Run or Schedule using spark-submit
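
To make the middle steps concrete, here is a rough Scala sketch of such a wrapper application; the database, table, and class names are hypothetical, and it again assumes the Spark 1.x HiveContext API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Thin wrapper that runs a sequence of existing HQL statements in order.
// Database names and the run date come in as arguments from spark-submit.
object DailyLoad {
  def main(args: Array[String]): Unit = {
    val Array(sourceDb, targetDb, runDate) = args

    val sc = new SparkContext(new SparkConf().setAppName(s"Daily load $runDate"))
    val sqlContext = new HiveContext(sc)

    // Placeholders for the real batch-load HQLs, lightly parameterized
    val statements = Seq(
      s"""INSERT OVERWRITE TABLE $targetDb.orders_stage
         |SELECT order_id, order_customer_id, order_status
         |FROM $sourceDb.orders
         |WHERE order_date = '$runDate'""".stripMargin,
      s"""INSERT INTO TABLE $targetDb.daily_status_summary
         |SELECT order_status, count(1) AS order_count
         |FROM $targetDb.orders_stage
         |GROUP BY order_status""".stripMargin
    )

    statements.foreach(sqlContext.sql)
    sc.stop()
  }
}
```

After packaging it (for example with sbt package), it could be submitted from the gateway node with something like spark-submit --master yarn --class DailyLoad daily-load.jar source_db target_db 2017-07-01, and then scheduled with whatever scheduler you use.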

You can also script around Hive or spark-sql directly, but that is not recommended, as it is not a reliable practice.

I hope this answers your question.


Building Spark applications using Spark SQL is extensively covered as part of our Udemy courses.
