Apache Spark Python - Data Processing Overview - Inferring Schema

In this section, we will understand how to quickly infer the schema of a dataset using a single file and then apply this schema to read the entire dataset. This approach is particularly useful when dealing with large datasets where inferring the schema for every file would be computationally expensive.

Spark Session Initialization

To start processing data, we need to initialize a Spark Session. This object will allow us to interact with the Spark cluster and perform various operations.

from pyspark.sql import SparkSession

import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Inferring Schema from a Single File

Instead of inferring the schema from the entire dataset, we can infer the schema from a single file and apply it to the rest of the dataset.

airlines_part_00000 = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       )

airlines_schema = airlines_part_00000.schema

Applying the Inferred Schema to the Entire Dataset

Using the schema inferred from a single file, we can read the entire dataset efficiently.

airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        header=True
       )

Data Preview and Validation

We can print the schema and preview the data to ensure everything is loaded correctly.

airlines.printSchema()
airlines.show()
airlines.count()

Advantages of This Approach

  • Efficiency: Inferring the schema from a single file is faster than processing the entire dataset.
  • Consistency: Ensures that the schema is consistent across all files if the data structure is uniform.
  • Scalability: Ideal for large datasets where reading all files to infer the schema would be impractical.

Watch the video tutorial here

Conclusion

In this article, we explored an efficient way to infer the schema of a large dataset in Apache Spark using a single file and applying it to the entire dataset. This method improves performance and ensures consistency in data processing. By following the steps provided, you can handle large datasets more efficiently in your data processing workflows.

Engage with the community to share your experiences and learn more advanced data processing techniques in Apache Spark.