In this section, we will cover the key concepts related to inferring the schema for tables in Spark SQL.
Schema Inference
The schema can be inferred automatically from the data, or it can be defined explicitly as a StructType
object and passed while creating the table.
# Example: defining a schema explicitly
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),  # column name, data type, nullable
    StructField("age", IntegerType(), True)
])
StructType takes a list of StructField objects to define the schema.
StructField is built from a column name, a data type from pyspark.sql.types, and a flag indicating whether the column is nullable.
Hands-On Tasks
Here are some hands-on tasks you can perform to apply the concepts discussed:
- Create a database by the name {username}_airtraffic.
- Create an external table for airport-codes.txt:
  - Data has a header.
  - Fields in each record are delimited by a tab character.
  - Pass options such as sep, header, and inferSchema to define the schema.
To create the database and the external table, you can follow a sketch along the lines of the one below.
Conclusion
In this article, we discussed how to infer and explicitly define schemas for tables in Spark SQL. By working through the example and the hands-on tasks above, you can gain practical experience defining schemas and creating external tables. Practice these concepts to deepen your understanding of Spark SQL.