Apache Spark Python - Transformations - Solution - Get Flight Count Per US Airport

This article provides a step-by-step guide to determining the number of flights that departed from each US airport in January 2008 using Spark SQL. It includes explanations, code examples, and hands-on tasks to help you understand and apply the concepts discussed.

Airtraffic Data Analysis

In this section, we will analyze airtraffic data to determine the number of flights that departed from each US airport in January 2008, using Spark SQL to query the data.

Inner Join Operation

An inner join combines rows from two tables based on a related column, keeping only the rows that match on both sides. In this case, we will join the airtraffic data with the airport codes data so that only flights originating from valid US airports are counted.
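
To make the mechanics concrete, here is a minimal sketch of an inner join on two toy DataFrames; the data and column values below are made up purely for illustration.

# Minimal inner join sketch (toy data, purely illustrative)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

flights = spark.createDataFrame([("ORD", 1), ("JFK", 2), ("XXX", 3)], ["Origin", "FlightNum"])
airports = spark.createDataFrame([("ORD", "Chicago"), ("JFK", "New York")], ["IATA", "City"])

# Only rows whose keys match on both sides survive an inner join,
# so the flight with origin 'XXX' is dropped
flights.join(airports, flights["Origin"] == airports["IATA"]).show()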

Hands-On Tasks

Work through the following tasks to apply the concepts discussed in this article. Each task corresponds to a step in the solution code below.

  1. Load the airtraffic data for January 2008.
  2. Load the airport codes data and filter for US airports.
  3. Perform an inner join between airtraffic data and airport codes.
  4. Calculate the number of flights departed from each US airport.
  5. Display the results in descending order of flight count.

Conclusion

In conclusion, this article has provided a comprehensive guide to analyzing the number of flights that departed from each US airport in January 2008 using Spark SQL. By following the steps outlined and practicing the hands-on tasks, you can strengthen your understanding of data analysis techniques and Spark SQL operations.

Solutions - Problem 1

Get the number of flights that departed from each US airport in January 2008.

# Start Spark session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, lit

import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

# Use a small number of shuffle partitions, since we only process one month of data
spark.conf.set("spark.sql.shuffle.partitions", "2")

# Load airtraffic data for January 2008
airtrafficPath = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtrafficPath)
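
# Optional sanity check (not part of the original solution): confirm the
# January 2008 partition loaded and inspect the columns used below
airtraffic.printSchema()
airtraffic.select("Year", "Month", "DayOfMonth", "Origin", "CRSDepTime").show(5)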

# Load airport codes data and filter for US airports
airportCodesPath = "/public/airtraffic_all/airport-codes"
def getValidAirportCodes(airportCodesPath):
    # Read the tab-separated airport codes, keep only US airports, and drop
    # the malformed record where State is 'Hawaii' and IATA is 'Big'
    airportCodes = spark.read. \
        option("sep", "\t"). \
        option("header", True). \
        option("inferSchema", True). \
        csv(airportCodesPath). \
        filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country = 'USA'")
    return airportCodes

airportCodes = getValidAirportCodes(airportCodesPath)
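
# Optional check (not part of the original solution): only US airports
# should remain, and the malformed Hawaii record with IATA 'Big' is gone
airportCodes.show(5)
print(airportCodes.count())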

# Perform inner join between airtraffic data and airport codes;
# Origin is kept in the selection because the aggregation below groups by it
result = airtraffic. \
    join(airportCodes, airportCodes["IATA"] == airtraffic["Origin"]). \
    select("Year", "Month", "DayOfMonth", airportCodes["*"], "Origin", "CRSDepTime")

result.show()

# Calculate the number of flights departed from each US airport,
# sorted in descending order of flight count
result. \
    groupBy("Origin"). \
    agg(count(lit(1)).alias("FlightCount")). \
    orderBy(col("FlightCount").desc()). \
    show()
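
# Since the article frames this as a Spark SQL exercise, the same result can
# also be produced with a SQL query over temporary views. This is a sketch;
# the view names airtraffic_v and airport_codes_v are illustrative, not part
# of the original solution.
airtraffic.createOrReplaceTempView("airtraffic_v")
airportCodes.createOrReplaceTempView("airport_codes_v")

spark.sql("""
    SELECT f.Origin, count(1) AS FlightCount
    FROM airtraffic_v AS f
        JOIN airport_codes_v AS a
            ON f.Origin = a.IATA
    GROUP BY f.Origin
    ORDER BY FlightCount DESC
""").show()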