This article explains how to use the IN operator to filter data in Spark SQL against multiple values. It serves as an alternative to a chain of Boolean OR conditions in which a single column is compared against each value with an equality check.
Let us start by understanding the basic concepts and then proceed with hands-on tasks to implement the concepts discussed.
Using the IN Operator in SQL
The IN operator is a convenient way to filter data based on multiple values within a single column. Here’s an example illustrating its usage:
airtraffic.filter("Origin IN ('ORD', 'DFW', 'ATL', 'LAX', 'SFO')").count()
Using the isin Function in the PySpark API
In the DataFrame API style, you can use the isin function to achieve the same result. Here’s an example:
from pyspark.sql.functions import col
airtraffic.filter(col("Origin").isin("ORD", "DFW", "ATL", "LAX", "SFO")).count()
Hands-On Tasks
Let’s perform some hands-on tasks to apply the concepts we discussed:
- Read the data for the month of January 2008:
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()
- Get the count of flights that departed from the major airports ORD, DFW, ATL, LAX, and SFO:
airtraffic.filter("Origin IN ('ORD', 'DFW', 'ATL', 'LAX', 'SFO')").count()
API Style:
from pyspark.sql.functions import col
airtraffic.filter(col("Origin").isin("ORD", "DFW", "ATL", "LAX", "SFO")).count()
- Get the number of flights that departed late on Saturdays and Sundays:
airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))).\
filter("IsDepDelayed = 'YES' AND Cancelled = 0 AND date_format(to_date(FlightDate, 'yyyyMMdd'), 'EEEE') IN ('Saturday', 'Sunday')").count()
API Style:
from pyspark.sql.functions import col, concat, lpad, date_format, to_date
airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))).\
filter((col("IsDepDelayed") == "YES") & (col("Cancelled") == 0) & (date_format(to_date("FlightDate", "yyyyMMdd"), "EEEE").isin("Saturday", "Sunday"))).count()
Conclusion
In this article, we explored how to use the IN operator and the isin function for filtering data in Spark SQL. By understanding and implementing these concepts, readers can efficiently filter data based on specific conditions and column values. I encourage you to practice these tasks and engage with the community for further learning.