Apache Spark Python - Transformations - Solution - Get Count of Flights:Airport without data

This article walks through a hands-on task to get the total number of flights per airport that do not have entries in airport-codes. It demonstrates an example of an outer join using Spark SQL with the 2008 January air traffic data. The article includes step-by-step instructions and code examples to help you perform the task efficiently.

Joining Airtraffic Data Set with Airport-Codes

We perform a left outer join between the airtraffic data and airport-codes to get flights per airport.

airtraffic.join(airportCodes, airtraffic["Origin"] == airportCodes["IATA"], "left")

Filtering Out Entries with NULL Airport Codes

Filtering out entries where IATA code is NULL to get only the flights that do not have corresponding entries in airport-codes.

filter("IATA IS NULL")

Aggregating Total Flight Counts per Airport

Grouping by “Origin” airport and aggregating the count of flights to get the total number of flights per airport.

groupBy("Origin").agg(count(lit(1)).alias("FlightCount")).orderBy(col("FlightCount").desc())

Watch the video tutorial here

Hands-On Tasks

  1. Load the 2008 January air traffic data.
  2. Load the airport-codes data.
  3. Perform a left outer join on air traffic data with airport-codes.
  4. Filter out entries with NULL airport codes.
  5. Group by “Origin” airport and aggregate the flight counts.
  6. Display the total number of flights per airport.

Conclusion

In conclusion, this article demonstrated how to get the total number of flights per airport that do not have entries in airport-codes using Spark SQL. It provided step-by-step guidance and code examples for executing the task effectively. Encourage practicing the task to enhance your Spark SQL skills.

Solutions - Problem 6

Get the total number of flights per airport that do not contain entries in airport-codes.

  • This is an example for outer join.

  • We need to get number of flights per airport from the 2008 January airtraffic data which do not have entries in airport-codes.

  • Based on the side of the airtraffic data set, we can say left or right. We will be invoking join using airtraffic and hence we will use left outer join…

  • We will be performing join first and then we will aggregate to get the number of flights from the concerned airports per airport.

  • In this case will get the total number of flights per airport. Let us start spark context for this Notebook so that we can execute the code provided.

You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark.conf.set("spark.sql.shuffle.partitions", "2")
airtrafficPath = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark. \
    read. \
    parquet(airtrafficPath)
airtraffic. \
    select(
        "Year", "Month", "DayOfMonth", 
        "Origin", "Dest", "CRSDepTime"
    ). \
    show()
airtraffic.count()
airportCodesPath = "/public/airtraffic_all/airport-codes"
def getValidAirportCodes(airportCodesPath):
    airportCodes = spark. \
        read. \
        option("sep",

Conclusion

This tutorial demonstrated an efficient method using Spark SQL to determine the total number of flights per airport without entries in the airport-codes dataset. By employing outer joins and aggregation, we processed the January 2008 air traffic data, providing practical insights into data manipulation and analysis using Spark SQL. Practicing these techniques will enhance proficiency in big data processing and analysis.