Apache Spark Python - Basic Transformations - Solution - Problem 2 Aggregations By FlightDate

This article provides a step-by-step guide on how to analyze air traffic data to identify delayed flights. It includes explanations of key concepts, hands-on tasks, and a video tutorial to complement the text.

Grouping Data by Flight Date

In this section, we group the data by flight date using concatenation and padding functions to create a unified flight date format.

groupBy(concat("Year", lit("-"), lpad("Month", 2, "0"), lit("-"), lpad("DayOfMonth", 2, "0")).alias("FlightDate"))

Getting Counts by FlightDate

Here, we filter out cancelled flights, group the data by flight date, and calculate the total number of flights for each day.

agg(count(lit(1)).alias("FlightCount"))

Solutions - Problem 2

Get number of flights which are delayed in departure and number of flights delayed in arrival for each day along with the number of flights departed for each day.

  • Output should contain 4 columns - FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount
  • FlightDate should be of yyyy-MM-dd format.
  • Data should be sorted in ascending order by flightDate

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Reading airtraffic data

airtraffic_path = “/public/airtraffic_all/airtraffic-part/flightmonth=200801”
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()

Watch the video tutorial here

Hands-On Tasks

  1. Filter out cancelled flights.
  2. Group the data by flight date.
  3. Calculate the total number of flights for each day.
  4. Count delayed departures and arrivals for each day.
  5. Sort the data by flight date.

Conclusion

In conclusion, this article walked you through analyzing air traffic data to identify delayed flights. By following the step-by-step guide and performing the hands-on tasks, you can gain a better understanding of flight delays and how to work with such data.