Apache Spark Python - Transformations - Analyze Datasets for Joins

In this tutorial, we will analyze the two datasets that will be used for performing joins: the January 2008 air traffic data and the airport codes dataset.

  • We will use the January 2008 air traffic data, which has all relevant flight details such as departure and arrival information.
  • As part of the analysis, we might also want additional airport metadata such as city and state. That information is available in the airport codes dataset.
  • Let us read and review both the January 2008 air traffic data and the airport codes dataset. First, let us start the Spark session for this notebook so that we can execute the code provided.
from pyspark.sql import SparkSession

import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()
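
If you do not have access to a YARN cluster, a minimal local-mode session is enough to follow along. This is only a sketch; the warehouse directory below is a placeholder path, and Hive support is omitted.

from pyspark.sql import SparkSession

# Local-mode alternative (sketch) for environments without a YARN cluster.
# The warehouse directory is a placeholder; adjust it for your setup.
spark = SparkSession. \
    builder. \
    config("spark.sql.warehouse.dir", "/tmp/spark-warehouse"). \
    appName('Python - Joining Data Sets'). \
    master('local[*]'). \
    getOrCreate()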

If you are going to use CLIs, you can run Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

# Use a small number of shuffle partitions, as these datasets are relatively small
spark.conf.set("spark.sql.shuffle.partitions", "2")
airtraffic = spark. \
    read. \
    parquet("/public/airtraffic_all/airtraffic-part/flightmonth=200801")
airtraffic.printSchema()
airtraffic.show()
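
Since the joins will be on airport codes, it can help to preview just the columns that matter for them. This is a sketch; the column names below are assumed from the standard 2008 on-time performance schema, so confirm them against the printSchema output above.

# Preview the columns that will drive the joins (assumed column names)
airtraffic. \
    select("Year", "Month", "Origin", "Dest"). \
    show(5)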
  • We will be using another dataset to get details about airports, such as the state and city for a given airport code.
  • Let us analyze the dataset to confirm whether it has a header and how the data is structured.
airportCodesPath = "/public/airtraffic_all/airport-codes"
spark. \
    read. \
    text(airportCodesPath). \
    show(truncate=False)
  • Data is tab-separated.
  • There is a header for the dataset.
  • The dataset has 4 fields: Country, State, City, IATA.
airportCodesPath = "/public/airtraffic_all/airport-codes"
airportCodes = spark. \
    read. \
    option("sep", "\t"). \
    option("header", True). \
    option("inferSchema", True). \
    csv(airportCodesPath)

airportCodes.show()
airportCodes.printSchema()
airportCodes.count()
  • Get the count of unique records and see if it is the same as the total count.
airportCodes. \
    select("IATA"). \
    distinct(). \
    count()
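
To put the two numbers side by side, here is a small sketch that compares the total record count with the distinct IATA count; if the distinct count is lower, there are duplicates to investigate.

# Compare total records with distinct IATA codes
totalCount = airportCodes.count()
uniqueIATACount = airportCodes. \
    select("IATA"). \
    distinct(). \
    count()

print(f"Total records: {totalCount}, distinct IATA codes: {uniqueIATACount}")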
  • If they are not equal, analyze the data to identify IATA codes that are repeated more than once.
from pyspark.sql.functions import lit, count

duplicateIATACount = airportCodes. \
    groupBy("IATA"). \
    agg(count(lit(1)).alias("iata_count")). \
    filter("iata_count > 1")

duplicateIATACount.show()
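
Rather than hard-coding the duplicated value in the next step, we could also join the duplicate counts back to the original data to see the complete records for every repeated IATA code. This is an alternative sketch of the same analysis.

# Show full records for all IATA codes that appear more than once
airportCodes. \
    join(duplicateIATACount.select("IATA"), on="IATA"). \
    show()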
  • Filter out the duplicates, keeping the most appropriate record for each repeated IATA code and discarding the others.
airportCodes. \
    filter("IATA = 'Big'"). \
    show()
airportCodes. \
    filter("!(State = 'Hawaii' AND IATA = 'Big')"). \
    show()
airportCodes. \
    filter("!(State = 'Hawaii' AND IATA = 'Big')"). \
    count()
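
So that later join examples run against clean data, we can capture the filtered result in a new variable. The name airportCodesClean is chosen only for this sketch.

# Keep a de-duplicated copy of the airport codes for later joins (sketch)
airportCodesClean = airportCodes. \
    filter("!(State = 'Hawaii' AND IATA = 'Big')")

airportCodesClean.count()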
  • Get the number of airports (IATA Codes) for each state in the US. Sort the data in descending order by count.
from pyspark.sql.functions import count, col, lit

airportCountByState = airportCodes. \
    groupBy("Country", "State"). \
    agg(count(lit(1)).alias("IATACount")). \
    orderBy(col("IATACount").desc())

airportCountByState.show(51)
airportCountByState.count()
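
As a preview of the joins covered in the following sections, here is a minimal sketch that enriches each flight's origin airport with city and state details. It assumes the airportCodesClean DataFrame from the earlier sketch and the Origin column from the air traffic schema.

# Join flights to airport metadata on the origin airport code (sketch)
airtraffic. \
    join(airportCodesClean, airtraffic["Origin"] == airportCodesClean["IATA"]). \
    select("Origin", "City", "State", "Country"). \
    show(5)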

Conclusion

In this tutorial, we explored how to analyze two datasets in preparation for joins: the January 2008 air traffic data and the airport codes dataset. Using Spark, we loaded and inspected both datasets, reviewed their structure, handled duplicate IATA codes, and performed aggregations to better understand the airport data. These techniques are essential for anyone looking to perform data analysis and integration in a scalable and efficient manner using Apache Spark. Happy analyzing!