Apache Spark Python - Spark Metastore - Creating Metastore Tables using catalog

Data Frames can be written into Metastore tables using APIs such as saveAsTable and insertInto, which are exposed through the write attribute of a Data Frame.

Create Database

To create a database named demo_db in the metastore, we can run the following SQL statement:

spark.sql("CREATE DATABASE demo_db")
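The statement above fails if demo_db already exists. A minimal sketch of an idempotent variant, which also switches the current database so that unqualified table names like dual below resolve to demo_db instead of default (setCurrentDatabase is part of the same catalog API):

# Create the database only if it is not already there
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Optional: make demo_db the current database for unqualified table names
spark.catalog.setCurrentDatabase("demo_db")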

List Databases

To list all the databases in the metastore, we can use the following API:

spark.catalog.listDatabases()
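listDatabases returns a list of Database objects rather than plain strings. A small sketch for pulling out just the names and confirming demo_db is present:

# Extract the name field from each Database object
db_names = [db.name for db in spark.catalog.listDatabases()]
print(db_names)

# demo_db should appear after the CREATE DATABASE above
assert "demo_db" in db_names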

Create Data Frame and Table

To create a Data Frame with a single column named dummy holding the value X, and then create a table named dual from it, we can execute the following steps:

l = [("X", )]
df = spark.createDataFrame(l, schema="dummy STRING")
df.write.saveAsTable("dual", mode='overwrite')
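saveAsTable creates the table in the metastore and writes the data in one step. To verify the result, a quick check using the catalog and spark.table:

# The new table should show up in the current database
print(spark.catalog.listTables())

# Read the table back through the metastore
spark.table("dual").show()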

Insert Data into Table

insertInto writes into a table that must already exist. Since saveAsTable above already created dual with data in it, drop that table first, then recreate it empty through the catalog and insert the Data Frame:

spark.sql("DROP TABLE dual")
schema = df.schema
spark.catalog.createTable('dual', schema=schema)
df.write.insertInto('dual')
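Note that insertInto appends by default and matches columns by position, not by name. A short sketch of both behaviors (overwrite is a documented parameter of insertInto):

# Running insertInto again appends, doubling the rows
df.write.insertInto('dual')
spark.table('dual').show()

# Pass overwrite=True to replace the existing data instead
df.write.insertInto('dual', overwrite=True)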

Drop Table and Database

To drop the table dual and then the database demo_db, we can execute the following SQL statements (a database can only be dropped once it is empty, unless CASCADE is appended to drop it along with its tables):

spark.sql("DROP TABLE dual")
spark.sql("DROP DATABASE demo_db")


Hands-On Tasks

  1. Create a database named demo_db in the metastore.
  2. List all databases in the metastore.
  3. Create a Data Frame with one column dummy and value X, and create a table named dual with this Data Frame.
  4. Insert the Data Frame into the dual table.
  5. Drop the dual table and the demo_db database.
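Putting the five tasks together, a minimal end-to-end sketch; the appName value is arbitrary, and enableHiveSupport assumes Spark was built with Hive support so tables persist in the metastore:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("metastore-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# Task 1: create the database
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

# Task 2: list databases
print([db.name for db in spark.catalog.listDatabases()])

# Task 3: single-row Data Frame saved as the table dual
df = spark.createDataFrame([("X",)], schema="dummy STRING")
df.write.saveAsTable("dual", mode="overwrite")

# Task 4: append the same row into the existing table
df.write.insertInto("dual")
spark.table("dual").show()

# Task 5: clean up the table and the database
spark.sql("DROP TABLE dual")
spark.sql("DROP DATABASE demo_db")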

Conclusion

In this article, we learned how to create, insert data into, and drop Metastore tables using the catalog in Spark. Practice these tasks to enhance your understanding and feel free to engage with the community for further learning.