Apache Spark Python - Data Processing Overview - Previewing Airlines Data

In this section, we will preview the airlines data to gain a better understanding of its structure and content.

Spark Session Initialization

To start processing data, we need to initialize a Spark Session. This object will allow us to interact with the Spark cluster and perform various operations.

import getpass

from pyspark.sql import SparkSession

# The username is used in the warehouse path and app name below; this
# assumes your OS user matches your user on the cluster.
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

Reading Airlines Data

We will read the airlines data in two steps: first infer the schema from a single part file (inferring it across the entire dataset would be expensive), then apply that schema while reading all of the part files.

airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema

airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        header=True
       )

Data Preview

We can print the schema of the DataFrame using airlines.printSchema() and preview the data with airlines.show(100, truncate=False), which displays the first 100 records without truncating column values. This allows us to inspect both the structure and the content of the dataset.

Data Transformation

We can perform standard transformations on our data using various DataFrame and column functions. This step is crucial for data preparation and analysis.

Data Validation

To ensure data quality, we can check for duplicates by comparing airlines.count() with airlines.distinct().count(); if the two numbers differ, the dataset contains duplicate rows. Removing duplicates is essential for maintaining clean and reliable data.


Hands-On Tasks

  1. Initialize a Spark Session in your environment.
  2. Read the airlines data from the provided path.
  3. Print the schema of the airlines DataFrame.
  4. Preview the first 100 records of the airlines data without truncating.

Conclusion

In this article, we explored the process of previewing airlines data using Apache Spark. By following the steps provided and practicing hands-on tasks, you can enhance your skills in data processing and analysis. Remember to engage with the community for further learning opportunities.