Apache Spark Python - Basic Transformations - Boolean Operators

Let us look at how boolean operators are used while filtering data in Spark Data Frames. When a filter has to check more than one condition, often across multiple columns, the conditions are combined using boolean operators such as AND, OR, or both. Here are some examples where boolean operators are used:

  1. Get the count of flights which departed late at origin and reached the destination early or on time.
  2. Get the count of flights which departed early or on time but arrived late by at least 15 minutes.
  3. Get the number of flights which departed late on weekends (Saturdays and Sundays).

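To start, let us set up the Spark session for this article so that we can execute the code provided. The examples assume a Data Frame named airtraffic with columns such as IsDepDelayed, IsArrDelayed, Cancelled, ArrDelay and FlightDate.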
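The setup itself is not shown in this article, so the following is only a minimal sketch of what it might look like. The helper names, the application name, the Parquet format and the path are all assumptions made for illustration, not details taken from the original material.

```python
def create_spark_session(app_name="Boolean Operators"):
    """Return a SparkSession, reusing an active one if it exists."""
    # Imported inside the function so that this sketch can be loaded
    # even on a machine where PySpark is not installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_airtraffic(spark, path):
    """Read the air traffic data into a Data Frame.

    Assumption: the data is stored as Parquet. Use spark.read.csv or
    another reader if your copy of the data is in a different format.
    """
    return spark.read.parquet(path)


# Hypothetical usage; replace the placeholder with your own location:
# spark = create_spark_session()
# airtraffic = load_airtraffic(spark, "path/to/airtraffic/january-2008")
```

Once airtraffic is available, the filters in the following sections can be run as written.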

Filtering Flights That Departed Late at Origin and Reached the Destination Early or On Time

To get the count of flights that departed late at origin and reached the destination early or on time (cancelled flights are excluded):

airtraffic.filter("IsDepDelayed = 'YES' AND IsArrDelayed = 'NO' AND Cancelled = 0").count()

Filtering Flights That Departed Early or On Time But Arrived Late

To get the count of flights that departed early or on time but arrived late by at least 15 minutes:

airtraffic.filter("IsDepDelayed = 'NO' AND ArrDelay >= 15").count()
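The two filters above use SQL-style expression strings. As a rough sketch, the same conditions could be written in API style with col, where & plays the role of AND. The function names are hypothetical and the column names are simply the ones used in the SQL-style filters.

```python
def count_late_departure_on_time_arrival(airtraffic):
    """API-style version of the first filter (late departure, on-time arrival)."""
    # Deferred import so the sketch loads without PySpark installed.
    from pyspark.sql.functions import col
    return airtraffic.filter(
        (col("IsDepDelayed") == "YES")
        & (col("IsArrDelayed") == "NO")
        & (col("Cancelled") == 0)
    ).count()


def count_on_time_departure_late_arrival(airtraffic):
    """API-style version of the second filter (on-time departure, late arrival)."""
    from pyspark.sql.functions import col
    return airtraffic.filter(
        (col("IsDepDelayed") == "NO") & (col("ArrDelay") >= 15)
    ).count()


# Hypothetical usage, assuming airtraffic is already loaded:
# count_late_departure_on_time_arrival(airtraffic)
# count_on_time_departure_late_arrival(airtraffic)
```

Each comparison sits in its own parentheses before being combined with &, which matters in API style because of how Python orders these operators.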

Filtering Flights That Departed Late on Weekends

To get the number of flights that departed late on Saturdays or Sundays:

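from pyspark.sql.functions import col, date_format, to_date

# On columns, & means AND and | means OR; each condition is wrapped in parentheses
airtraffic.filter((col("IsDepDelayed") == "YES") &
                  (col("Cancelled") == 0) &
                  ((date_format(to_date("FlightDate", "yyyyMMdd"), "EEEE") == "Saturday") |
                   (date_format(to_date("FlightDate", "yyyyMMdd"), "EEEE") == "Sunday"))
                 ).count()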
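The parentheses around each condition in the API-style filter are not optional. In Python, & and | bind more tightly than comparison operators such as ==, so an unparenthesized expression is grouped differently from what was intended. The following plain-Python sketch shows the effect with integers standing in for column values; the variable names are made up for illustration.

```python
dep_delayed = 1  # stands in for IsDepDelayed = 'YES'
cancelled = 0    # stands in for Cancelled = 0

# Intended logic: evaluate both comparisons first, then combine them.
with_parentheses = (dep_delayed == 1) & (cancelled == 0)

# Without parentheses Python reads this as the chained comparison
#   dep_delayed == (1 & cancelled) == 0
# which is (1 == 0) and (0 == 0).
without_parentheses = dep_delayed == 1 & cancelled == 0

print(with_parentheses)     # True
print(without_parentheses)  # False
```

With Spark columns the unparenthesized form typically ends in an error rather than a silently wrong result, because Python tries to turn a column expression into a plain boolean. Either way, wrapping every condition in parentheses avoids the problem.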

Hands-On Tasks

  1. Read the data for the month of January 2008.
  2. Implement the given tasks using both SQL and API Style approaches.
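As one possible starting point for the SQL-style half of the tasks, the weekend filter shown earlier in API style could be expressed as a single expression string. This is a sketch, not the only way to write it. It uses IN for the two day names, which avoids mixing AND and OR; in SQL, AND binds more tightly than OR, so an OR between the day names would need its own parentheses. The variable names are hypothetical.

```python
# Each entry is one condition, written as a Spark SQL expression fragment.
conditions = [
    "IsDepDelayed = 'YES'",
    "Cancelled = 0",
    "date_format(to_date(FlightDate, 'yyyyMMdd'), 'EEEE') IN ('Saturday', 'Sunday')",
]

# Join the fragments with AND to form one filter expression.
weekend_late_departures = " AND ".join(conditions)
print(weekend_late_departures)

# Hypothetical usage, assuming airtraffic is already loaded:
# airtraffic.filter(weekend_late_departures).count()
```

Comparing the count from this expression with the count from the API-style filter is a simple way to check that both versions describe the same condition.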

Conclusion

In this article, we explored how to use boolean operators in Spark Data Frames for filtering data based on multiple conditions. Practice these tasks to enhance your understanding, and remember to engage with the community for further learning.