Data Engineering Spark SQL - Windowing Functions - Introduction - Windowing Functions

Windowing Functions, also known as Analytic Functions in databases like Oracle, are powerful tools for performing complex calculations and analysis within a specified window of data. In this article, we will explore the following key concepts related to Windowing Functions in SQL:

Key Concepts Explanation

Prepare HR Database

Before we dive into Windowing Functions, we need to ensure that we have a relevant database, such as the HR Database, set up for our examples.

-- Code to prepare HR database
CREATE DATABASE IF NOT EXISTS hr_database;
USE hr_database;

Overview of Windowing Functions

Windowing Functions allow us to perform calculations across a set of rows related to the current row. They enable us to perform aggregations, rankings, and more within a specific window of data.

-- Example of Windowing Function
SELECT 
  employee_id,
  department_id,
  salary,
  AVG(salary) OVER (PARTITION BY department_id) as avg_salary_department
FROM employees;

Aggregations using Windowing Functions

Windowing Functions can be used to calculate aggregate functions like SUM, AVG, COUNT, etc., over a specified window of data, making it easier to analyze subsets of data.

-- Example of aggregation using Windowing Functions
SELECT
  department_id,
  SUM(salary) OVER (PARTITION BY department_id) as total_salary_department
FROM employees;

Getting LEAD and LAG values

LEAD and LAG functions in Windowing Functions allow us to access values from rows that come before or after the current row, providing valuable context for analysis.

-- Example of getting LEAD and LAG values
SELECT 
  employee_id,
  salary,
  LEAD(salary, 1) OVER (ORDER BY salary) as next_salary,
  LAG(salary, 1) OVER (ORDER BY salary) as previous_salary
FROM employees;

Getting first and last values

Windowing Functions can also be used to access the first and last values within a window, helping us identify the start and end points of a dataset.

-- Example of getting first and last values
SELECT
  employee_id,
  salary,
  FIRST_VALUE(salary) OVER (ORDER BY salary) as first_salary,
  LAST_VALUE(salary) OVER (ORDER BY salary) as last_salary
FROM employees;

Ranking using Windowing Functions

Ranking functions like ROW_NUMBER, RANK, and DENSE_RANK in Windowing Functions allow us to assign ranks to rows based on specified criteria, helping us identify top performers or outliers.

-- Example of ranking using Windowing Functions
SELECT
  employee_id,
  salary,
  ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank_within_department
FROM employees;

Understanding order of execution of SQL

The order of execution of SQL statements involving Windowing Functions is crucial to achieving the desired results. Understanding how data is processed within the window can impact the outcome of our analyses.

-- Example of order of execution in SQL
SELECT
  employee_id,
  department_id,
  salary,
  AVG(salary) OVER (PARTITION BY department_id ORDER BY salary) as avg_salary_department
FROM employees;

Overview of Nested Sub Queries

Nested subqueries within Windowing Functions allow us to further filter or manipulate data within the specified window, providing more flexibility in our analyses.

-- Example of nested subqueries within Windowing Functions
SELECT
  employee_id,
  salary,
  (SELECT MAX(salary) FROM employees WHERE department_id = e.department_id) as max_salary_department
FROM employees e;

Filtering - Window Function Results

Filtering results from Windowing Functions can be achieved using the PARTITION BY and ORDER BY clauses, allowing us to focus on specific subsets of data within the window.

-- Example of filtering Window Function results
SELECT
  employee_id,
  salary,
  AVG(salary) OVER (PARTITION BY department_id ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_avg_salary
FROM employees;

Hands-On Tasks

To practice and solidify your understanding of Windowing Functions, try the following tasks:

  1. Calculate the total salary for each department using Windowing Functions.
  2. Rank employees within each department based on their salaries.
  3. Find the highest and lowest salary in each department using Windowing Functions.

Conclusion

In this article, we have delved into the world of Windowing Functions in SQL, exploring their capabilities for data analysis and manipulation. By mastering these concepts and practicing hands-on tasks, you can enhance your analytical skills and make more informed decisions based on your data. Remember to engage with the community and continue learning to deepen your

Watch the video tutorial here