Apache Spark Python - Basic Transformations - Using IN Operator or isin Function

This article explains how to use the IN operator when filtering data in Spark SQL against multiple values. It serves as an alternative for Boolean OR where a single column is compared with multiple values using an equal condition.

Let us start by understanding the basic concepts and then proceed with hands-on tasks to implement the concepts discussed.

Using IN Operator in SQL

The IN operator is a convenient way to filter data based on multiple values within a single column. Here’s an example illustrating its usage:

airtraffic.filter("Origin IN ('ORD', 'DFW', 'ATL', 'LAX', 'SFO')").count()

Using isin Function in PySpark API

In the API Style, you can also use the isin function to achieve the same result. Here’s an example using API Style:

from pyspark.sql.functions import col

airtraffic.filter(col("Origin").isin("ORD", "DFW", "ATL", "LAX", "SFO")).count()

Watch the video tutorial here

Hands-On Tasks

Let’s perform some hands-on tasks to apply the concepts we discussed:

  1. Read the data for the month of January 2008:
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()
  1. Get the count of flights departed from major airports - ORD, DFW, ATL, LAX, SFO:
airtraffic.filter("Origin IN ('ORD', 'DFW', 'ATL', 'LAX', 'SFO')").count()

API Style:

from pyspark.sql.functions import col

airtraffic.filter(col("Origin").isin("ORD", "DFW", "ATL", "LAX", "SFO")).count()
  1. Get the number of flights departed late on Sundays and Saturdays:
airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))).\
filter("IsDepDelayed = 'YES' AND Cancelled = 0 AND date_format(to_date(FlightDate, 'yyyyMMdd'), 'EEEE') IN ('Saturday', 'Sunday')").count()

API Style:

from pyspark.sql.functions import col, concat, lpad, date_format, to_date

airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))).\
filter((col("IsDepDelayed") == "YES") & (col("Cancelled") == 0) & (date_format(to_date("FlightDate", "yyyyMMdd"), "EEEE").isin("Saturday", "Sunday"))).count()

Conclusion

In this article, we explored how to use the IN operator or isin function for filtering data in Spark SQL. By understanding and implementing these concepts, readers can efficiently filter data based on specific conditions and column values. I encourage you to practice these tasks and engage with the community for further learning.