Let us get an overview of Spark Catalog to manage Spark Metastore tables as well as temporary views.
- Let us say `spark` is of type `SparkSession`. There is an attribute as part of `spark` called `catalog`, and it is of type `pyspark.sql.catalog.Catalog`.
- We can access the catalog using `spark.catalog`.
- We can permanently or temporarily create tables or views on top of data in a Data Frame.
- Metadata such as table names, column names, data types, etc. for the permanent tables or views will be stored in the Metastore. We can access the metadata using `spark.catalog`, which is exposed as part of the SparkSession object.
- `spark.catalog` also provides the details related to temporary views that are being created. Metadata of these temporary views will not be stored in the Spark Metastore.
- Permanent tables are typically created using databases in the Spark Metastore. If not specified, the tables will be created in the default database.
- There are several methods that are part of `spark.catalog`. We will explore them in later topics.
- Following are some of the tasks that can be performed using the `spark.catalog` object (a sketch follows the SparkSession creation below):
  - Check the current database and switch to different databases.
  - Create permanent tables in the metastore.
  - Create or drop temporary views.
  - Register functions.

All of the above tasks can also be performed using SQL style commands passed to `spark.sql`.
Let us start the Spark session for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.
```python
from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

# Create a SparkSession with Hive support so that tables are managed in the Spark Metastore
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()
```
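Now that the `spark` session is available, here is a minimal sketch of the tasks listed above, using both the `spark.catalog` and `spark.sql` APIs. The database name, the `users` table, the `users_v` view, and the `initcap_name` function are hypothetical names chosen only for illustration.

```python
# A minimal sketch of common catalog tasks; the database, table, view and
# function names below are hypothetical and used only for illustration.

# Check the current database and switch to a different one
print(spark.catalog.currentDatabase())
spark.sql(f'CREATE DATABASE IF NOT EXISTS {username}_demo_db')
spark.catalog.setCurrentDatabase(f'{username}_demo_db')

# Create a permanent table in the metastore on top of a Data Frame
df = spark.createDataFrame([(1, 'Scott'), (2, 'Donald')], ['id', 'name'])
df.write.saveAsTable('users', mode='overwrite')

# Create and drop a temporary view (its metadata is not stored in the metastore)
df.createOrReplaceTempView('users_v')
print(spark.catalog.listTables())   # lists the permanent table as well as the temp view
spark.catalog.dropTempView('users_v')

# Register a function so that it can be used in SQL style queries
spark.udf.register('initcap_name', lambda name: name.capitalize())
spark.sql('SELECT id, initcap_name(name) AS name FROM users').show()
```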
If you are going to use CLIs, you can use Spark SQL via one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using PySpark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
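For example, once inside the PySpark shell, a couple of quick checks (a sketch, using the `spark` session the shell creates for you) confirm that the warehouse directory was picked up and show which database is current:

```python
# Quick sanity checks in the pyspark2 shell; spark is created by the shell itself
spark.conf.get('spark.sql.warehouse.dir')          # warehouse directory in use
spark.catalog.currentDatabase()                     # current database
[db.name for db in spark.catalog.listDatabases()]   # databases visible in the metastore
```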
Conclusion
Spark’s catalog is a powerful component of the SparkSession that simplifies the management of both metastore and temporary data entities within Spark. By enabling seamless interaction with the metastore database, it lets users manage permanent tables and views, as well as temporary views, directly from their Spark applications. Whether you are manipulating data programmatically in Python, using interactive Scala shells, or executing Spark SQL commands, the catalog provides a unified interface that streamlines database operations. Combined with Spark’s ability to integrate with various data sources and its scalable architecture, this makes it an indispensable tool for data engineers and developers working on large-scale data processing and analytics.