Data Engineering using Spark SQL - Basic Transformations - Sorting Data

Let us understand how to sort the data using Spark SQL.

To begin with, we need to start the Spark context for this Notebook so that we can execute the provided code.

Key Concept 1: Global Aggregations

Before we sort data, it helps to review aggregations, since sorted output is often produced from aggregated results. We can perform global aggregations as well as aggregations by key. Here are some examples of global aggregations:

  • Get total number of orders.
  • Get revenue for a given order id.
  • Get the number of records with order_status either COMPLETED or CLOSED.

For example, to get the total number of orders:

SELECT COUNT(order_id) AS total_orders
FROM orders;
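The other two aggregations in the list above can be sketched in the same way. Note that the order id value 2 below is only a placeholder, and the order_status values are taken from the list above:

SELECT ROUND(SUM(order_item_subtotal), 2) AS order_revenue
FROM order_items
WHERE order_item_order_id = 2;

SELECT COUNT(*) AS completed_or_closed_count
FROM orders
WHERE order_status IN ('COMPLETED', 'CLOSED');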

Key Concept 2: Aggregations by Key

Aggregations by key using GROUP BY:

  • Get the number of orders by date or status.
  • Get revenue for each order_id.
  • Get daily product revenue using order date and product id as keys.
SELECT order_date, order_item_product_id, ROUND(SUM(order_item_subtotal), 2) AS revenue
FROM orders JOIN order_items
ON orders.order_id = order_items.order_item_order_id
GROUP BY order_date, order_item_product_id;
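Since this article is about sorting, note that the query above can be combined with ORDER BY, for example to rank each day's products by revenue. This is a sketch using the same tables and columns as the query above:

SELECT order_date, order_item_product_id, ROUND(SUM(order_item_subtotal), 2) AS revenue
FROM orders JOIN order_items
ON orders.order_id = order_items.order_item_order_id
GROUP BY order_date, order_item_product_id
ORDER BY order_date ASC, revenue DESC;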

Hands-On Tasks

Here are some tasks you can perform to apply the concepts discussed:

  1. Sort the data by order_customer_id.
  2. Sort the data by order_customer_id and order_date.
  3. Sort the data by order_customer_id in ascending order and order_date in descending order.
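
The three tasks above can be sketched as follows. ASC is the default sort order and is shown in the last query only for clarity:

SELECT * FROM orders
ORDER BY order_customer_id;

SELECT * FROM orders
ORDER BY order_customer_id, order_date;

SELECT * FROM orders
ORDER BY order_customer_id ASC, order_date DESC;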

Conclusion

In this article, we covered how to sort data using Spark SQL. You can sort by a single column or by multiple columns, and the default ascending order can be reversed for any column with the DESC keyword. Understanding sorting is essential, as it plays a crucial role in data analysis and reporting.

Link to Video

Watch the video tutorial here