Apache Spark Python - Spark Metastore - Inferring Schema for Tables

In this section, we will cover the key concepts related to inferring schema for tables in Spark SQL.

Schema Inference

A schema can be inferred from an existing DataFrame, or defined explicitly as a StructType object and passed in while creating the table.

# Example: define an explicit two-column schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),   # column name, data type, nullable
    StructField("age", IntegerType(), True)
])
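The schema above can then be applied when reading data. Below is a minimal sketch; the Spark session name spark and the file path /tmp/people.csv are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaDemo").getOrCreate()

# Apply the explicit schema while reading; Spark skips schema inference.
df = spark.read.csv("/tmp/people.csv", schema=schema, sep=",", header=True)

# Alternatively, let Spark infer the schema from the data itself.
inferred_df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)
inferred_df.printSchema()

Passing an explicit schema is faster for large files, since inferSchema requires an extra scan of the data to derive column types.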

StructType

StructType takes a list of objects of type StructField to define the schema.

StructField

StructField is built from a column name, a data type from pyspark.sql.types (such as StringType or IntegerType), and an optional nullable flag, which defaults to True.
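Since DataFrame.schema returns a StructType built from StructField objects, a schema inferred from one dataset can be inspected field by field and reused. A minimal sketch, assuming the inferred_df and schema from the earlier example:

# DataFrame.schema is itself a StructType of StructField objects.
for field in inferred_df.schema.fields:
    print(field.name, field.dataType, field.nullable)

# The same StructType can be passed on when creating another DataFrame or table.
another_df = spark.createDataFrame([("Alice", 30)], schema=schema)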


Hands-On Tasks

Here are some hands-on tasks you can perform to apply the concepts discussed:

  1. Create a database by the name {username}_airtraffic.
  2. Create an external table for airport-codes.txt:
    • Data has a header.
    • Fields in each record are delimited by a tab character.
    • Pass options such as sep, header, and inferSchema to define the schema.

A minimal sketch of both tasks follows; the username value and the file path are assumptions you will need to adapt to your environment.
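username = "itversity"  # hypothetical username; replace with your own

# Task 1: create the database.
spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_airtraffic")
spark.sql(f"USE {username}_airtraffic")

# Task 2: create an external table for airport-codes.txt.
# The path below is an assumption; point it at wherever the file lives.
path = "/public/airtraffic_all/airport-codes"
spark.catalog.createTable(
    "airport_codes",
    path=path,              # supplying a path makes the table external
    source="csv",
    sep="\t",               # fields are delimited by a tab character
    header="true",          # data has a header row
    inferSchema="true"      # let Spark derive the column types
)

Because a path is supplied, Spark registers the table as external: dropping it removes only the metadata from the metastore, not the underlying data files.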

Conclusion

In this article, we discussed how to infer and define schemas for tables in Spark SQL. By working through the example and hands-on tasks, you can gain practical experience defining schemas and creating external tables. Practice these concepts to deepen your understanding of Spark SQL.