Apache Spark Python - Basic Transformations - Basic Filtering of Data

In this article, we will explore basic filtering techniques using the Spark DataFrame APIs. Filtering can be performed with the filter or where functions, which are synonyms, and conditions can be specified in either SQL style or DataFrame style.

Filtering Syntax

SQL Style:

airtraffic.filter("IsArrDelayed = 'YES'").show()

DataFrame Style:

airtraffic.filter(airtraffic["IsArrDelayed"] == 'YES').show()
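Since where is a synonym of filter, and column expressions can also be built with the col function, the following forms are equivalent to the two examples above. This is a minimal sketch assuming the same airtraffic DataFrame:

from pyspark.sql.functions import col

# where is an alias for filter; both accept SQL strings or column expressions
airtraffic.where("IsArrDelayed = 'YES'").show()
airtraffic.where(col("IsArrDelayed") == 'YES').show()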

Supported Operations

  • = (SQL style) or == (DataFrame style)
  • !=
  • >
  • <
  • >=
  • <=
  • LIKE
  • BETWEEN with AND
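As an illustration of LIKE and BETWEEN with AND, here is a sketch in both styles. It assumes the airtraffic DataFrame has Origin and DayOfMonth columns; adjust the column names to match your schema:

# SQL style: origins starting with 'S' (LIKE) and days in the first week (BETWEEN ... AND)
airtraffic.filter("Origin LIKE 'S%' AND DayOfMonth BETWEEN 1 AND 7").show()

# DataFrame style: the like and between Column methods, combined with &
airtraffic.filter(airtraffic["Origin"].like("S%") & airtraffic["DayOfMonth"].between(1, 7)).show()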

Tasks

  1. Read January 2008 Data
# Read the data for January 2008
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"
airtraffic = spark.read.parquet(airtraffic_path)
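Before filtering, it can help to confirm what was read. A quick sanity check, assuming the path above is accessible in your environment:

# Verify the schema and the number of records read
airtraffic.printSchema()
airtraffic.count()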
  2. Count of Cancelled Flights
# Get the count of cancelled flights
airtraffic.filter('Cancelled = 1').count()
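The same count can be expressed in DataFrame style; a sketch using the col function:

from pyspark.sql.functions import col

# Cancelled is 1 for cancelled flights
airtraffic.filter(col("Cancelled") == 1).count()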
  3. Departures from SFO
# Number of flights scheduled for departure from SFO airport
airtraffic.filter("Origin = 'SFO'").count()
  4. Flights Departed without Delay
# Number of flights that departed without any delay
airtraffic.filter("IsDepDelayed = 'NO'").count()


Conclusion

Filtering in Spark DataFrames is a powerful technique for extracting relevant information from large datasets. Using either SQL-style or DataFrame-style syntax, we can efficiently filter data based on specific conditions. Understanding these basic filtering operations is essential for data manipulation and analysis in Spark.