# Apache Spark Python - Basic Transformations - Solution - Problem 2 Aggregations By FlightDate

This article provides a step-by-step guide on how to analyze air traffic data to identify delayed flights. It includes explanations of key concepts, hands-on tasks, and a video tutorial to complement the text.

## Grouping Data by Flight Date

In this section, we group the data by flight date using concatenation and padding functions to create a unified flight date format.

```python
groupBy(concat("Year", lit("-"), lpad("Month", 2, "0"), lit("-"), lpad("DayOfMonth", 2, "0")).alias("FlightDate"))
```

## Getting Counts by FlightDate

Here, we filter out cancelled flights, group the data by flight date, and calculate the total number of flights for each day.

```python
agg(count(lit(1)).alias("FlightCount"))
```

## Solutions - Problem 2

Get the number of flights delayed in departure and the number of flights delayed in arrival for each day, along with the number of flights departed each day.

• Output should contain 4 columns - FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount
• FlightDate should be of yyyy-MM-dd format.
• Data should be sorted in ascending order by FlightDate.

Let us start the Spark context for this Notebook so that we can execute the code provided. Use any one of the following commands depending on whether you want a Spark SQL, Scala, or Python session. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

```shell
spark2-sql \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

```shell
spark2-shell \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

```shell
pyspark2 \
  --master yarn \
  --conf spark.ui.port=0 \
  --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

```python
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"

# Read the data set (stored as Parquet in the lab environment)
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()
```

Watch the video tutorial here