Apache Spark Python - Spark Metastore - Define Schema for Tables using StructType

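When we create a table using spark.catalog.createTable or spark.catalog.createExternalTable, we need to specify its schema. Note that createExternalTable has been deprecated since Spark 2.2; spark.catalog.createTable with a path is the recommended replacement.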

  • The schema can be inferred, or we can pass it explicitly as a StructType object while creating the table.

  • StructType takes a list of objects of type StructField.

  • StructField is built using a column name and data type. All the data types are available under pyspark.sql.types.

  • We need to pass the table name and schema for spark.catalog.createTable.

  • We have to pass the path along with the table name and schema for spark.catalog.createExternalTable.

Tasks

Let us perform tasks to create an empty table using spark.catalog.createTable or using spark.catalog.createExternalTable.

  1. Create a database {username}_hr_db and a table employees with the following fields. We create the database first and then the table.
    • employee_id of type Integer
    • first_name of type String
    • last_name of type String
    • salary of type Float
    • nationality of type String
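import getpass

# Assumes an active SparkSession named spark, as in a notebook or the pyspark shell.
# Create a per-user database and make it the current one.
username = getpass.getuser()
spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_hr_db")
spark.catalog.setCurrentDatabase(f"{username}_hr_db")

from pyspark.sql.types import StructField, StructType, IntegerType, StringType, FloatType

# Define the schema: one StructField per column, in table column order.
employeesSchema = StructType([
    StructField("employee_id", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("salary", FloatType()),
    StructField("nationality", StringType())
])

# No path is given, so this creates an empty managed table.
spark.catalog.createTable("employees", schema=employeesSchema)

# Verify that the table exists and check its columns and metadata.
spark.catalog.listTables()
spark.catalog.listColumns('employees')
spark.sql('DESCRIBE FORMATTED employees').show(100, truncate=False)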

Conclusion

In this article, we learned how to define the schema for a table using StructType in PySpark. By following the task above, you can create a table with a specified schema and inspect its metadata. Practice these concepts to improve your understanding, and don’t hesitate to engage with the community for further learning.