Data Engineering Spark SQL - Windowing Functions - Ranking using Windowing Functions

In this article, we will explore how to assign ranks using different rank functions in Spark SQL windowing functions.

Explanation for the video

[Insert Video Here] - The video provides a visual guide to understanding the concepts discussed in this article.

Key Concepts Explanation

Key Concept 1

Windowing functions in Spark SQL allow us to assign ranks to data based on specified criteria.

SELECT *,

  rank() OVER (

    PARTITION BY order_date

    ORDER BY revenue DESC

  ) AS rnk

FROM daily_product_revenue

ORDER BY order_date, revenue DESC

LIMIT 100

Key Concept 2

Three main functions to assign ranks are rank, dense_rank, and row_number. These functions have different behaviors when handling duplicate values.

Hands-On Tasks

Here are some tasks you can perform to apply windowing functions in Spark SQL:

  1. Assign ranks within a partition based on specific criteria.
  2. Use different rank functions like rank, dense_rank, and row_number to rank data.

Conclusion

In this article, we covered how to use windowing functions to assign ranks in Spark SQL. Practicing these techniques will help you gain a deeper understanding of ranking data in analytical SQL queries.

Ranking using Windowing Functions

Let us see how we can assign ranks using different rank functions.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labsto learn Spark SQL using our unique integrated LMS.

spark2-shell \

    --master yarn \

    --conf spark.ui.port=0 \

    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
pyspark2 \

    --master yarn \

    --conf spark.ui.port=0 \

    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • If we have to assign ranks globally, we just need to specify ORDER BY
  • If we have to assign ranks with in a key then we need to specify PARTITION BY and then ORDER BY.
  • By default ORDER BY will sort the data in ascending order. We can change the order by passing DESC after ORDER BY.
  • We have 3 main functions to assign ranks - rank, dense_rank and row_number. We will see the difference between the 3 in a moment.

Watch the video tutorial here