Data Engineering Spark SQL - Tables - DML & Partitioning - Adding Partitions to Tables

Let us understand how we can add static partitions to Partitioned tables in Spark Metastore.
Let us start the Spark context for this Notebook so that we can execute the code provided.

Key Concepts Explanation

  • Adding Partitions: We can add partitions using ALTER TABLE command with ADD PARTITION. For each partition created, a subdirectory will be created with the partition column name and corresponding value under the table directory.
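The directory-naming convention can be sketched in plain Scala. This is an illustrative sketch only: the helper `partitionPath` and the warehouse path used below are assumptions for demonstration, not part of any Spark API.

```scala
// Sketch of the partition directory convention: for each partition added,
// Spark Metastore creates <table dir>/<partition column>=<value>.
// partitionPath is a hypothetical helper; the table directory is assumed.
def partitionPath(tableDir: String, column: String, value: String): String =
  s"$tableDir/$column=$value"

val path = partitionPath(
  "/user/itversity/warehouse/itversity_retail.db/orders_part",
  "order_month",
  "2013-07"
)
println(path)
// /user/itversity/warehouse/itversity_retail.db/orders_part/order_month=2013-07
```

This is why `hdfs dfs -ls` on the table directory (used later in this article) is a quick way to confirm that a partition was created.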

Hands-On Tasks

  1. Execute the script to add static partitions to a Partitioned table.
  2. Run the provided code snippets to add partitions to the table.

Conclusion

In this article, we discussed how to add static partitions to partitioned tables in Spark Metastore. It is essential to understand the process of adding partitions for efficient data management.


Let us understand how we can add static partitions to partitioned tables in Spark Metastore. Let us start the Spark context for this Notebook so that we can execute the code provided.
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL with one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • We can add partitions using ALTER TABLE command with ADD PARTITION.
  • For each partition created, a subdirectory will be created under the table directory using the partition column name and corresponding value.
  • Let us understand how to add partitions to orders_part table under itversity_retail database.

Here is the script to add static partitions to a partitioned table where the partition column type is string.
%%sql
USE itversity_retail
%%sql
DROP TABLE IF EXISTS orders_part
%%sql
CREATE TABLE orders_part (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) PARTITIONED BY (order_month STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !
%%sql
ALTER TABLE orders_part ADD PARTITION (order_month='2013-07')

import sys.process._

s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !

Here is the script to add static partitions to a partitioned table where the partition column type is integer. We can add one or more partitions at a time. For further demos, we will be using this table.
%%sql
USE itversity_retail
%%sql
DROP TABLE IF EXISTS orders_part
%%sql
CREATE TABLE orders_part (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) PARTITIONED BY (order_month INT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
%%sql
DESCRIBE orders_part
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !
%%sql
ALTER TABLE orders_part ADD PARTITION (order_month=201307)
%%sql
ALTER TABLE orders_part ADD
PARTITION (order_month=201308)
PARTITION (order_month=201309)
PARTITION (order_month=201310)
import sys.process._

s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !
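After the ALTER TABLE statements above, the table directory should contain one subdirectory per partition value. The expected layout can be sketched in plain Scala; the warehouse path below is an assumption based on the session configuration earlier in this article.

```scala
// Expected subdirectories under the (assumed) table directory after
// adding partitions 201307 through 201310, one per partition value.
val tableDir = "/user/itversity/warehouse/itversity_retail.db/orders_part"
val months = Seq(201307, 201308, 201309, 201310)
val expectedDirs = months.map(m => s"$tableDir/order_month=$m")
expectedDirs.foreach(println)
```

The `hdfs dfs -ls` command above should list exactly these four directories once all the partitions have been added.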
