Apache Spark Python - Spark Metastore - Exploring Spark Catalog

Let us get an overview of Spark Catalog to manage Spark Metastore tables as well as temporary views.

  • Let us say spark is of type SparkSession. It has an attribute called catalog, which is of type pyspark.sql.catalog.Catalog.

  • We can access catalog using spark.catalog.

  • We can create permanent tables or temporary views on top of the data in a Data Frame.

  • Metadata such as table names, column names, and data types for permanent tables or views is stored in the Metastore. We can access this metadata using spark.catalog, which is exposed as part of the SparkSession object.

  • spark.catalog also provides details about the temporary views that are created. Metadata of these temporary views is not stored in the Spark Metastore.

  • Permanent tables are typically created under databases in the Spark Metastore. If a database is not specified, the tables will be created in the default database.

  • There are several methods available as part of spark.catalog. We will explore them in later topics.

  • Following are some of the tasks that can be performed using the spark.catalog object (see the sketch after this list):

    • Check current database and switch to different databases.

    • Create permanent table in metastore.

    • Create or drop temporary views.

    • Register functions.

  • All of the above tasks can also be performed by passing SQL-style commands to spark.sql, as shown later in this topic.
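
Here is a minimal sketch of the tasks listed above, assuming the SparkSession created later in this Notebook (the spark variable). The database name itversity_demo, the sample Data Frame, and the function name initcap_name are placeholders for illustration and may not exist in your environment.

# Check the current database and switch to a different one
spark.catalog.currentDatabase()                          # e.g. 'default'
spark.sql('CREATE DATABASE IF NOT EXISTS itversity_demo')
spark.catalog.setCurrentDatabase('itversity_demo')

# Create a permanent table in the Metastore on top of a Data Frame
df = spark.createDataFrame([(1, 'Scott'), (2, 'Donald')], ['id', 'name'])
df.write.saveAsTable('users')

# Create and drop a temporary view
df.createOrReplaceTempView('users_v')
spark.catalog.listTables()                               # lists both users and users_v
spark.catalog.dropTempView('users_v')

# Register a function so that it can be used in SQL
# (in Spark 2.x this can also be done via spark.catalog.registerFunction)
spark.udf.register('initcap_name', lambda s: s.title())
spark.sql("SELECT initcap_name('donald') AS name").show()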

Let us start the Spark session for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession
import getpass

# Use the OS user name to keep the warehouse directory and app name unique per user
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()
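
Once the session is up, we can quickly confirm that the catalog attribute is available and check the current database. This is just a quick check; the exact output depends on your environment.

# spark.catalog is an instance of pyspark.sql.catalog.Catalog
print(type(spark.catalog))
print(spark.catalog.currentDatabase())   # 'default' unless you switch databases
print(spark.catalog.listDatabases())     # databases visible in this Metastore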

If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
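
Whichever CLI you use, the catalog-related tasks can also be expressed as plain SQL statements. The sketch below runs them through spark.sql from Pyspark; the database and table names (itversity_demo, users, users_v) are the same placeholders used earlier and may not exist in your environment. The same statements can be typed directly at the spark2-sql prompt.

# SQL-style equivalents of the catalog tasks
spark.sql('SHOW DATABASES').show()
spark.sql('USE itversity_demo')          # switch to the placeholder database
spark.sql('CREATE TABLE IF NOT EXISTS users (id INT, name STRING)')
spark.sql('SHOW TABLES').show()
spark.sql('DROP VIEW IF EXISTS users_v')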


Conclusion

Spark’s catalog is a powerful component of the SparkSession that simplifies the management of both Metastore tables and temporary views within Spark. By enabling seamless interaction with the Metastore, it lets users manage permanent tables and views, as well as temporary views, directly from their Spark applications. Whether you are manipulating data programmatically in Python, working in the interactive Scala shell, or executing Spark SQL commands, the catalog provides a unified interface that streamlines database operations. Combined with Spark’s ability to integrate with various data sources and its scalable architecture, this makes it an indispensable tool for data engineers and developers working with large-scale data processing and analytics.