Data Engineering using Spark SQL - Basic Transformations - Projecting Data

Siva · May 22, 2024, 10:41am

In this article, we will delve into the fundamentals of projecting data in Spark SQL. The primary command used for projecting data is SELECT.

Key Concepts Explanation

Projecting Data

When using SELECT, we can either project all columns using * or specify certain columns by their names.

Providing Aliases

Aliases can be assigned to columns or expressions using the AS clause in the SELECT statement.

Distinct Records

The DISTINCT keyword is used to retrieve unique records from selected columns. Using DISTINCT * will return unique records based on all columns.

Omission of Columns

As of now, Spark SQL does not support projecting all columns except for one or a few. This feature is supported in Hive. In Hive, you can exclude specific columns from projection using the following syntax:

SET hive.support.quoted.identifiers=none;

SELECT `(order_id)?+.+` FROM orders;

Hands-On Tasks

Use SELECT * FROM orders LIMIT 10 to project all columns for the first 10 records.
Explore the schema of the orders table using DESCRIBE orders.
Project specific columns like order_customer_id, order_date, order_status using SELECT order_customer_id, order_date, order_status FROM orders.
Apply formatting functions and aliases in your projection.
Retrieve distinct values of a particular column using SELECT DISTINCT order_status.
Experiment with projecting only distinct records using SELECT DISTINCT * FROM orders LIMIT 10.

Conclusion

In this article, we have covered the basics of projecting data in Spark SQL, including syntax and practical examples. We encourage you to practice these concepts and join our community for further learning. Happy projecting!

Watch the video tutorial here