Data Engineering Spark SQL - Tables - DML & Partitioning - Exercise - Partitioned Tables

In this article, we will explore partitioned tables in Spark SQL. We will cover key concepts related to managing partitioned tables, including creating tables, inserting data, and validating the results. The accompanying video will provide a visual guide to complement the text.

Key Concepts

Partitioned Tables

Partitioned tables in Spark SQL divide data into separate directories based on the values of one or more partition columns. This keeps related records together and lets the engine prune partitions at query time, which improves performance when querying large datasets.

-- Create a partitioned table
CREATE TABLE nyse_eod_part (
    stockticker STRING,
    tradedate STRING,
    openprice FLOAT,
    highprice FLOAT,
    lowprice FLOAT,
    closeprice FLOAT,
    volume BIGINT
)
USING parquet
PARTITIONED BY (tradeyear INT)
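
After creating the table, it is worth confirming that tradeyear is registered as a partition column rather than a regular data column. A quick check, assuming the table was created in your current database:

-- Inspect the table schema and partitioning metadata
DESCRIBE EXTENDED nyse_eod_part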

Dynamic Partitioning

With dynamic partitioning, Spark determines the partition values from the data itself at insert time, so you do not have to add or specify each partition explicitly before loading it.

-- Insert data into partitioned table using dynamic partition mode
INSERT INTO TABLE nyse_eod_part
SELECT stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume, year(to_date(cast(tradedate AS STRING), 'yyyyMMdd')) AS tradeyear
FROM stage_table
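
Depending on how the table was created (Hive format versus a Spark datasource table), dynamic partition mode may need to be enabled before running the insert above. A minimal sketch using the standard Hive properties, which are safe to set either way:

-- Enable dynamic partitioning before running the INSERT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;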

Hands-On Tasks

  1. Review the files under /data/nyse_all/nyse_data to determine the data types of each field.
  2. Create the database YOUR_OS_USER_NAME_nyse if it does not already exist.
  3. Create a non-partitioned stage table and load the data into it (steps 2 through 4 are sketched after this list).
  4. Validate the data by running simple SELECT queries.
  5. Create a partitioned table nyse_eod_part with the specified schema and partition field.
  6. Set the required properties to enable dynamic partitioning (see the SET statements shown above).
  7. Insert data into the partitioned table using dynamic partition mode.
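
The sketch below covers steps 2 through 4. The stage table name stage_table matches the insert statement shown earlier; the comma delimiter and the use of a Hive text table with LOAD DATA are assumptions, so adjust them to match what you actually find under /data/nyse_all/nyse_data.

-- Create the database and switch to it
CREATE DATABASE IF NOT EXISTS YOUR_OS_USER_NAME_nyse;
USE YOUR_OS_USER_NAME_nyse;

-- Non-partitioned stage table (assumes comma-separated text files)
CREATE TABLE stage_table (
    stockticker STRING,
    tradedate STRING,
    openprice FLOAT,
    highprice FLOAT,
    lowprice FLOAT,
    closeprice FLOAT,
    volume BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load the files from the local path into the stage table
LOAD DATA LOCAL INPATH '/data/nyse_all/nyse_data' INTO TABLE stage_table;

-- Validate the load with simple queries
SELECT * FROM stage_table LIMIT 10;
SELECT count(1) FROM stage_table;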

Conclusion

In conclusion, partitioned tables are a powerful feature in Spark SQL for managing large datasets efficiently. By following the steps outlined in this article and practicing the hands-on tasks, you can gain a deeper understanding of partitioned tables and their benefits. Remember to validate your results to ensure accuracy in partitioning your data.
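
One quick way to validate the outcome, assuming the insert completed, is to list the partitions that were created and check the record counts per partition:

-- Confirm the partitions and record counts per trade year
SHOW PARTITIONS nyse_eod_part;

SELECT tradeyear, count(1) AS record_count
FROM nyse_eod_part
GROUP BY tradeyear
ORDER BY tradeyear;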


Exercise - Partitioned Tables

Use the provided exercise to self-evaluate your understanding of partitioned tables and practice working with them using Spark SQL. Remember to replace YOUR_OS_USER_NAME with your actual username.
