Apache Spark Python - Transformations - Solution - Get Count of Flights per Airport without Data in airport-codes

This article walks through a hands-on task to get the total number of flights per airport that do not have entries in airport-codes. It demonstrates an example of an outer join using Spark SQL with the 2008 January air traffic data. The article includes step-by-step instructions and code examples to help you perform the task efficiently.

Joining Airtraffic Data Set with Airport-Codes

We perform a left outer join between the airtraffic data and airport-codes to get flights per airport.

airtraffic.join(airportCodes, airtraffic["Origin"] == airportCodes["IATA"], "left")
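The left outer join keeps every airtraffic row and fills the airport-codes columns with NULL where the Origin has no matching IATA code. A minimal plain-Python sketch of that semantics on hypothetical toy rows (the real task uses the Spark DataFrame API shown above):

```python
# Toy stand-ins for the two data sets (hypothetical rows, not the real data)
flights = [{"Origin": "ORD"}, {"Origin": "XNA"}, {"Origin": "ORD"}]
codes = [{"IATA": "ORD"}]

# Left outer join: every flight survives; unmatched flights get IATA = None
iata_set = {row["IATA"] for row in codes}
joined = [
    {**flight, "IATA": flight["Origin"] if flight["Origin"] in iata_set else None}
    for flight in flights
]

print(joined)
# The XNA flight has no entry in the codes, so its IATA is None (NULL in Spark)
```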

Filtering Out Entries with NULL Airport Codes

Filter the joined result to retain only the entries where the IATA code is NULL, i.e. the flights that do not have corresponding entries in airport-codes.

filter("IATA IS NULL")
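In plain Python terms, this filter keeps only the joined rows whose IATA came back as NULL. A small sketch on hypothetical joined rows, with None playing the role of SQL NULL:

```python
# Hypothetical result of the left outer join
joined = [
    {"Origin": "ORD", "IATA": "ORD"},
    {"Origin": "XNA", "IATA": None},
]

# Equivalent of filter("IATA IS NULL"): keep only the unmatched flights
without_codes = [row for row in joined if row["IATA"] is None]

print(without_codes)  # only the XNA flight remains
```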

Aggregating Total Flight Counts per Airport

Grouping by “Origin” airport and aggregating the count of flights to get the total number of flights per airport.
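The aggregation itself is `groupBy("Origin")` followed by `count()` in Spark. Its effect can be illustrated in plain Python with a Counter over hypothetical filtered rows:

```python
from collections import Counter

# Hypothetical flights that survived the IS NULL filter
without_codes = [{"Origin": "XNA"}, {"Origin": "XNA"}, {"Origin": "PSE"}]

# Equivalent of groupBy("Origin").count()
flight_counts = Counter(row["Origin"] for row in without_codes)

print(flight_counts)  # XNA appears twice, PSE once
```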



Hands-On Tasks

  1. Load the 2008 January air traffic data.
  2. Load the airport-codes data.
  3. Perform a left outer join on air traffic data with airport-codes.
  4. Filter out entries with NULL airport codes.
  5. Group by “Origin” airport and aggregate the flight counts.
  6. Display the total number of flights per airport.
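The six steps above can be sketched end-to-end in plain Python on hypothetical toy data; in Spark each step maps to read, join, filter, groupBy and count:

```python
from collections import Counter

# Steps 1-2: toy stand-ins for the two data sets (hypothetical rows)
flights = [{"Origin": o} for o in ["ORD", "XNA", "XNA", "PSE", "ORD"]]
codes = [{"IATA": "ORD"}]

# Step 3: left outer join on Origin == IATA
iata_set = {row["IATA"] for row in codes}
joined = [
    {**f, "IATA": f["Origin"] if f["Origin"] in iata_set else None}
    for f in flights
]

# Step 4: keep rows whose IATA is NULL (no entry in airport-codes)
without_codes = [row for row in joined if row["IATA"] is None]

# Steps 5-6: group by Origin, count, and display
flight_counts = Counter(row["Origin"] for row in without_codes)
for origin, count in flight_counts.items():
    print(origin, count)
# XNA appears twice and PSE once; ORD is excluded because it has a code
```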


In conclusion, this article demonstrated how to get the total number of flights per airport that do not have entries in airport-codes using Spark SQL, with step-by-step guidance and code examples for executing the task effectively. Practice the task to enhance your Spark SQL skills.

Solutions - Problem 6

Get the total number of flights per airport that do not contain entries in airport-codes.

  • This is an example of an outer join.

  • We need to get the number of flights per airport from the 2008 January airtraffic data which do not have entries in airport-codes.

  • Whether we say left or right depends on which side the airtraffic data set is on. We will be invoking the join on airtraffic, and hence we will use a left outer join.

  • We will perform the join first and then aggregate to get the number of flights per airport for the airports concerned.

  • In this case we will get the total number of flights per airport. Let us start the Spark context for this Notebook so that we can execute the code provided.


from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL with one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
spark.conf.set("spark.sql.shuffle.partitions", "2")

airtrafficPath = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark. \
    read. \
    parquet(airtrafficPath)

airtraffic. \
    select(
        "Year", "Month", "DayOfMonth",
        "Origin", "Dest", "CRSDepTime"
    ). \
    show()

airportCodesPath = "/public/airtraffic_all/airport-codes"
def getValidAirportCodes(airportCodesPath):
    # airport-codes is a tab-separated file with a header row
    airportCodes = spark. \
        read. \
        option("sep", "\t"). \
        option("header", True). \
        option("inferSchema", True). \
        csv(airportCodesPath). \
        filter("NOT (State = 'Hawaii' AND IATA = 'Big') AND Country = 'USA'")
    return airportCodes

airportCodes = getValidAirportCodes(airportCodesPath)

This tutorial demonstrated an efficient method using Spark SQL to determine the total number of flights per airport without entries in the airport-codes dataset. By employing outer joins and aggregation, we processed the January 2008 air traffic data, providing practical insights into data manipulation and analysis using Spark SQL. Practicing these techniques will enhance proficiency in big data processing and analysis.