Apache Spark Python - Data Processing Overview - Understanding airlines data

In this article, we will explore how to read and understand the structure of airlines data stored in text files using Apache Spark. This involves identifying key aspects such as the presence of a header and the field delimiter used in the data.

Spark Session Initialization

To start processing data, we need to initialize a Spark Session. This object will allow us to interact with the Spark cluster and perform various operations.

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

# spark.ui.port=0 lets YARN assign a random free port for the Spark UI,
# which avoids port clashes when several users share the same cluster.
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()
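Once the session is created, a quick sanity check confirms that it is live and that Hive support is wired up. This is a minimal sketch, not part of the original flow:

# Verify the session started and Hive support is available.
print(spark.version)                # Spark version the cluster is running
spark.sql('SHOW databases').show()  # Lists Hive databases if Hive support is enabled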

If you prefer to work from the CLI, you can launch Spark using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Previewing the Data

First, let’s confirm that the file exists in HDFS and check its size, to ensure we are accessing the correct file.

%%sh

hdfs dfs -ls -h /public/airlines_all/airlines/part-00000

Next, we will read one of the files using spark.read.text to preview the data and understand its structure.

airlines = spark.read.text("/public/airlines_all/airlines/part-00000")
airlines.show(truncate=False)

Analyzing the Data Structure

By previewing the data, we can determine the following (the sketch after this list shows how to check both programmatically):

  • Whether the files include a header row.
  • Which field delimiter the data uses.
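A quick programmatic check can back up the visual inspection. This is a minimal sketch: the value column is what spark.read.text produces, and the candidate delimiters below are illustrative guesses, not confirmed by the source.

# Grab the first line of the file; spark.read.text exposes it as the 'value' column.
first_line = airlines.first().value
print(first_line)  # If this shows column names instead of data, a header is present

# Count candidate delimiters in the first line to guess the field separator.
for candidate in [',', '\t', '|']:
    print(repr(candidate), first_line.count(candidate))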

Reading the Data with Appropriate Options

Once we have identified that the data includes a header and the fields are delimited by commas, we can use spark.read.csv with the appropriate options to read the entire dataset. Inferring the schema from a single part file and then applying it to the full dataset avoids the cost of scanning every file for type inference.

airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema

airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        header=True
       )

Data Preview and Validation

We can print the schema and preview the data to ensure everything is loaded correctly.

airlines.printSchema()
airlines.show()
airlines.count()
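With the schema applied, individual columns can be projected directly. A minimal sketch, assuming the dataset carries standard airlines on-time columns such as Year and Month; verify the exact names against the printSchema output above:

# Project a couple of columns to confirm values line up with the schema.
airlines. \
    select('Year', 'Month'). \
    show(5)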

Understanding the Data Structure and Format

  • Data with Header: The data includes a header, which is handled with the header=True option in the spark.read.csv method.
  • Field Delimiter: The fields are delimited by a comma, which is the default delimiter in the spark.read.csv method (see the sketch below).
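Both observations map directly onto spark.read.csv options. In this minimal sketch the sep option is spelled out even though ',' is the default, purely to make the delimiter explicit:

airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part*",
        sep=',',
        header=True
       )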


Hands-On Tasks

  1. Read the airlines data using spark.read.text.
  2. Analyze the header and field delimiter in the data.
  3. Use the appropriate options in spark.read.csv to read the data.

Conclusion

In this article, we discussed how to read and understand the structure of airlines data stored in text files using Apache Spark. By identifying key aspects such as the presence of a header and the field delimiter, we can read and process the data efficiently. Practice these concepts to enhance your data processing skills.

Engage with the community to share your experiences and learn more advanced data processing techniques in Apache Spark.