Apache Spark Python - Transformations - Solution - Get Count of Flights without master data

This article provides a step-by-step guide on how to perform an outer join using Spark SQL to get the total number of flights departed from airports in January 2008 that do not have entries in airport-codes dataset.

Performing Outer Join

In this key concept, we will be performing an outer join between the airtraffic dataset and airport-codes dataset to identify flights without corresponding airport codes.

# Performing left outer join between airtraffic and airportCodes dataframes
airtraffic.join(airportCodes, airtraffic["Origin"] == airportCodes["IATA"], "left")

Filtering Missing Airport Codes

This key concept emphasizes the filtering process to exclude flights with valid airport codes and select only the flights departed from airports with missing codes.

# Filtering out flights with valid airport codes using IATA column
filter("IATA IS NULL")

If you need assistance with executing these steps or have any questions, feel free to engage with the community for further support.

Watch the video tutorial here

Hands-On Tasks

Description of the hands-on tasks. Provide a list of tasks that the reader can perform to apply the concepts discussed in the article.

  1. Navigate to the provided Spark SQL environment or set up your own.
  2. Load the airtraffic and airport-codes datasets into Spark dataframes.
  3. Perform a left outer join on the dataframes and filter out flights with missing airport codes.
  4. Display the resulting flights data and count the total number of flights without valid airport codes.

Conclusion

In this article, we explored the concept of using outer joins in Spark SQL to analyze air traffic data and identify flights departed from airports without corresponding codes. By following the step-by-step instructions and hands-on tasks provided, you can gain a better understanding of how to leverage Spark SQL for data analysis tasks.

Solutions - Problem 5

Get the total number of flights departed from the airports in January 2008 that do not contain entries in airport-codes.

  • This is an example for outer join.
  • We need to get the number of flights from the 2008 January air traffic data which do not have entries in airport-codes.
  • Based on the side of the air traffic data set, we can say left or right. We will be invoking join using air traffic and hence we will use a left outer join.
  • We will be performing a join first and then aggregating to get the number of flights from the concerned airports.
  • In this case, we will get the total number of flights.