Data Engineering Spark SQL - Tables - DML & Partitioning - Inserting Data into Partitions

Let us understand how to use INSERT to get data into static partitions of a Spark Metastore table from an existing table called orders. Let us start the Spark context for this Notebook so that we can execute the code provided.

Key Concepts Explanation

Spark Context Setup

To set up the Spark context, we read the current username and import SparkSession so that we can build a session with Hive support enabled.

val username = System.getProperty("user.name")  // current OS user, used to namespace the warehouse and app name
import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .config("spark.ui.port", "0")  // bind the Spark UI to a random free port
    .config("spark.sql.warehouse.dir", s"/user/${username}/warehouse")  // user-specific warehouse directory
    .enableHiveSupport()  // required to work with Spark Metastore tables
    .appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning")
    .master("yarn")
    .getOrCreate()

Using Spark SQL, Scala, or Pyspark

If you are using the CLIs, you can start Spark SQL, the Scala shell, or Pyspark with the commands below.

Spark SQL

spark2-sql --master yarn --conf spark.ui.port=0 --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Scala

spark2-shell --master yarn --conf spark.ui.port=0 --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Pyspark

pyspark2 --master yarn --conf spark.ui.port=0 --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Hands-On Tasks

  1. Create a table named orders_part with order_month as the partition column.
  2. Add static partitions for the months 201307, 201308, 201309, and 201310 using the ALTER TABLE command.
  3. Insert the data from orders into the created partitions using INSERT with a SELECT filtered on the order month; with the LOAD command you would first have to pre-process the data into per-month files yourself (see the sketches after this list).
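
The sketches that follow are minimal Scala examples of the three tasks, issued through spark.sql. They assume the orders table follows the common retail_db layout (order_id, order_date, order_customer_id, order_status) and that the target database is already selected; adjust the column list to match your own table.

// Task 1: create orders_part with order_month as the partition column.
// The column list is an assumption based on the typical retail_db orders table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_part (
        order_id INT,
        order_date STRING,
        order_customer_id INT,
        order_status STRING
    ) PARTITIONED BY (order_month INT)
""")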
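
For the second task, each static partition is added with its own ALTER TABLE statement; a small loop over the four months keeps the code compact.

// Task 2: add one static partition per month.
for (month <- Seq(201307, 201308, 201309, 201310)) {
    spark.sql(s"ALTER TABLE orders_part ADD IF NOT EXISTS PARTITION (order_month = $month)")
}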
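
For the third task, one way to route rows into each static partition is an INSERT ... SELECT per month, filtering orders on the order date. The LIKE pattern below assumes order_date is stored as a string starting with yyyy-MM (as in retail_db); if your dates are typed differently, filter with date_format instead.

// Task 3: copy the matching rows from orders into each static partition.
for (month <- Seq(201307, 201308, 201309, 201310)) {
    val m = month.toString
    val pattern = s"${m.take(4)}-${m.drop(4)}%"  // e.g. 2013-07%
    spark.sql(s"""
        INSERT INTO TABLE orders_part PARTITION (order_month = $month)
        SELECT * FROM orders
        WHERE order_date LIKE '$pattern'
    """)
}

With dynamic partitioning enabled, all four months could be loaded in a single statement, but the static form above makes each target partition explicit.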

Conclusion

In conclusion, we have learned how to insert data into static partitions in a Spark Metastore table from an existing table. By following the provided steps and hands-on tasks, you can practice and enhance your skills in managing partitioned tables efficiently. Explore further by applying these concepts to real-world scenarios and engage with the community for continuous learning.
