Data Engineering using Spark SQL - Basic Transformations - Filtering Data

Let us understand how we can filter the data in Spark SQL. Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

Explanation for the video

[Please replace this text with the embedded YouTube video link]

Key Concepts Explanation

Description of the key concepts. Provide key concepts as subheadings and explain them with inline code examples. Multi-line code examples should be highlighted using code highlighter.

WHERE Clause

The WHERE clause is used to filter the data based on specific conditions. It allows us to compare columns, expressions, or literals with other columns, expressions, or literals using operators such as =, !=, >, <, etc.

SELECT * FROM orders WHERE order_status = 'COMPLETE'

Comparison Operators

All comparison operators like =, !=, >, < can be used for filtering data based on specific criteria.

SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')

Hands-On Tasks

Description of the hands-on tasks. Provide a list of tasks that the reader can perform to apply the concepts discussed in the article.

  1. Execute a query to filter orders with order_status as ‘COMPLETE’ or ‘CLOSED’.
  2. Filter orders based on a specific month and order_status criteria.


In this article, we covered the basics of filtering data in Spark SQL using the WHERE clause and comparison operators. Practice these concepts and explore more filtering options to enhance your SQL skills.

Filtering Data with Null Values

Let’s prepare a table to showcase how to handle null values while filtering data.

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro
INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)

Handling Null Values

When dealing with null values, use IS NULL and IS NOT NULL operators for comparisons.

SELECT * FROM students WHERE student_phone_number IS NULL

Hands-On Tasks

  1. Create a table for students with relevant columns.
  2. Insert records with and without null values.
  3. Execute queries to filter data based on null values in certain columns.


Handling null values in data filtering is crucial for accurate results. By using IS NULL and IS NOT NULL, you can effectively manage null values while querying data in Spark SQL. Practice these concepts to master filtering with null values.

Watch the video tutorial here