Apache Spark Python - Basic Transformations - Using BETWEEN Operator

Let us understand the usage of BETWEEN in conjunction with AND while filtering data from Data Frames. Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

Using BETWEEN in SQL

In this section, we will learn how to use the BETWEEN operator in SQL to filter data within a specific range. Here’s an example of using BETWEEN with AND to filter data between two dates:

SELECT *
FROM table_name
WHERE date_column BETWEEN 'start_date' AND 'end_date';

Using BETWEEN in PySpark API

We will also explore the API Style of using BETWEEN with PySpark to filter data. Here’s an example of using the between function in PySpark to filter data between two numerical values:

from pyspark.sql.functions import col

data_frame.filter(col("column_name").between(start_value, end_value))

Click here to watch the video tutorial for a visual guide on implementing the concepts discussed in this article. Start your practical applications and stay engaged with the community for further learning opportunities.

Hands-On Tasks

Task 1

Read the data for the month of January 2008 and print the schema.

airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtraffic_path)
airtraffic.printSchema()

Task 2

Get the count of flights departed late between January 1st to January 9th, 2008 using the FlightDate column.

SQL Style:

airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))
.filter("IsDepDelayed = 'YES' AND Cancelled = 0 AND FlightDate BETWEEN 20080101 AND 20080109").count()

API Style:

airtraffic.withColumn("FlightDate", concat(col("Year"), lpad(col("Month"), 2, "0"), lpad(col("DayOfMonth"), 2, "0"))
.filter((col("IsDepDelayed") == "YES") & (col("Cancelled") == 0) & 
(col("FlightDate").between(20080101, 20080109))).count()

Task 3

Get the count of flights arrived late between 15 minutes and 60 minutes.

SQL Style:

airtraffic.filter("ArrDelay BETWEEN 15 AND 60").count()

API Style:

airtraffic.filter(col("ArrDelay").between(15, 60)).count()

Conclusion

In this article, we have covered the usage of BETWEEN operator in filtering data in both SQL and API styles. By practicing the hands-on tasks provided, you can gain a better understanding of how to effectively filter your data using BETWEEN and AND conditions. Keep exploring and practicing to enhance your skills in working with PySpark data frames.