Apache Spark Python - Spark Metastore - Create Partitioned Tables

We can also create partitioned tables as part of Spark Metastore tables. There are some challenges in creating partitioned tables directly using spark.catalog.createTable. However, if the directory layout already matches that of a partitioned table, with data in place, we can create the partitioned table over that location. Let us create a partitioned table for orders, partitioned by order_month.

Spark Session Initialization

To start a Spark session, we configure properties such as the UI port and the warehouse directory, and enable Hive support. Enabling Hive support is what allows Spark SQL to read from and write to the metastore.

from pyspark.sql import SparkSession
import getpass

# Derive the username; getpass.getuser() is one common way to do this.
username = getpass.getuser()

spark = SparkSession.builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()


Tasks

Let us perform tasks related to partitioned tables.

  1. Read data from a file into a data frame.
  2. Add an additional column that will be used for partitioning the data.
  3. Write the data into the target location for creating the table.
  4. Create a partitioned table using the location where the data is stored and validate it.
  5. Recover partitions by running MSCK REPAIR TABLE using spark.sql or the spark.catalog.recoverPartitions method.

Conclusion

In this article, we discussed how to create partitioned tables using Spark Metastore Tables. By following the step-by-step guide and explanations provided, you should now be able to create partitioned tables in Spark and perform related tasks efficiently. Make sure to practice and engage with the community for further learning and enhancement of your skills.