Apache Spark Python - Basic Transformations - Filtering Example using dates

Let us understand how to filter the data using dates leveraging appropriate date manipulation functions. Let us start spark context for this Notebook so that we can execute the code provided.

Date Manipulation Functions in Spark

In this section, we will focus on manipulating dates in Spark to filter the data effectively. We will be using functions like to_date, date_format, concat, and lpad to work with dates in the dataset.

# Import necessary functions
from pyspark.sql.functions import col, concat, lpad, date_format, to_date

Filtering Data Based on Date Conditions

Here, we will discuss filtering the data based on specific conditions. We will apply filters such as checking for delayed flights on Sundays using date manipulation functions.

# Applying a filter for delayed flights on Sundays
airtraffic. \
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ). \
    filter("date_format(to_date(FlightDate, 'yyyyMMdd'), 'EEEE') = 'Sunday'"). \
    count()

Watch the video tutorial here

Hands-On Tasks

In this section, we will perform specific tasks to apply the concepts of filtering data based on dates.

  1. Read the data for the month of January in 2008.
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()
airtraffic.select('Year', 'Month', 'DayOfMonth').distinct().show(31)
airtraffic.select('Year', 'Month', 'DayOfMonth').distinct().count()
airtraffic.count()
  1. Get the number of flights departed late on Sundays using date functions.
# Example using API Style filtering
airtraffic. \
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ). \
    filter((col("IsDepDelayed") == "YES") &
           (date_format(to_date("FlightDate", "yyyyMMdd"), "EEEE") == "Sunday")
          ). \
    count()

Conclusion

In this article, we have covered the concepts of filtering data based on dates in Spark. We encourage readers to practice the provided tasks and engage with the community for further learning. Happy learning!