Apache Spark Python - Basic Transformations - Overview of Basic Transformations

Let us define a few problem statements to learn more about the Data Frame APIs. The problems use air traffic data for January 2008, and their solutions cover filtering, aggregations, and sorting.

Problem 1: Total Number of Flights and Delayed Flights

  • Get the total number of flights, along with the number of flights delayed in departure and the number of flights delayed in arrival.

    • The output should contain 3 columns - FlightCount, DepDelayedCount, ArrDelayedCount.

Problem 2: Delayed Flights for Each Day

  • For each day, get the number of flights that departed, the number of flights delayed in departure, and the number of flights delayed in arrival.

    • The output should contain 4 columns - FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount.

    • FlightDate should be in the yyyy-MM-dd format.

    • Data should be sorted in ascending order by FlightDate.

Problem 3: Flights Departed Late but Arrived Early

  • Get all the flights that departed late but arrived early (IsArrDelayed is NO).

    • The output should contain 7 columns - FlightCRSDepTime, UniqueCarrier, FlightNum, Origin, Dest, DepDelay, ArrDelay.

    • FlightCRSDepTime needs to be computed using Year, Month, DayOfMonth, CRSDepTime.

    • FlightCRSDepTime should be displayed using the yyyy-MM-dd HH:mm format.

    • Output should be sorted by FlightCRSDepTime and then by the difference between DepDelay and ArrDelay.

    • Also get the count of such flights.


Conclusion

In this article, we defined three problem statements on air traffic data for January 2008: overall flight and delay counts, flight and delay counts for each day, and flights that departed late but arrived early. Solving them with Spark's Data Frame APIs exercises filtering, aggregations, and sorting.