In this article, you will learn how to retrieve a list of airports in the US from which flights did not depart in the month of January 2008. The article includes a step-by-step guide using Spark SQL to join two datasets and filter out the relevant information.
Outer Join
An outer join is used to combine rows from two datasets even if the join condition does not match rows in the joining columns. In this case, we will use a left outer join to retrieve airports that are in the airport codes dataset but not in the 2008 January air traffic dataset.
airportCodes.join(airtraffic, col("IATA") == col("Origin"), "left").filter("Origin IS NULL").select(airportCodes["*"], col("Origin")).show()
Hands-On Tasks
- Start the Spark context using the provided code snippet.
- Retrieve the list of airports in the US that did not have flights departing in January 2008.
Conclusion
In conclusion, the article provided a practical example of using Spark SQL to perform a left outer join and filter out relevant information from two datasets. Encourage readers to practice the steps discussed and to engage with the community for further learning.