Data Engineering Spark SQL - Managing Tables - DDL & DML - Loading Data Into Tables - Local

Let us understand how to load data into Spark Metastore tables from the local file system. We will go through the process step by step, from creating the table to validating the loaded data.

Key Concepts Explanation

Spark Context Initialization

When working in a notebook, the first step is to start the Spark session. Once the session is initialized, we can execute Spark SQL commands and manage tables in the Spark Metastore. Below is an example of how to initialize the Spark session in a notebook.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

// Hive support must be enabled to create and manage Spark Metastore tables
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate
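
Once the session is available, SQL statements can be executed from Scala through spark.sql, which returns a DataFrame. As a quick sanity check (a minimal sketch; the exact database list depends on your environment):

// List the databases visible to the session
spark.sql("SHOW DATABASES").show()

// Switch to the database used in the script below
spark.sql("USE itversity_retail")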

Table Creation and Data Loading

To load data into Spark Metastore tables, the table's column types and field delimiter must match the layout of the data files; the files under /data/retail_db/orders are comma-separated text, so the table below is declared with FIELDS TERMINATED BY ','. Here is a script that creates the table and loads data into it.

USE itversity_retail

DROP TABLE IF EXISTS orders

CREATE TABLE orders (
  order_id INT COMMENT 'Unique order id',
  order_date STRING COMMENT 'Date on which order is placed',
  order_customer_id INT COMMENT 'Customer id who placed the order',
  order_status STRING COMMENT 'Current status of the order'
) COMMENT 'Table to save order level details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders
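
LOAD DATA with the LOCAL keyword copies the files from the local file system of the gateway node into the table's warehouse directory; without LOCAL, the path is treated as an HDFS location and the files are moved instead. To confirm the table's location and storage format after the load, you can describe it, for example from the Scala shell:

// Show detailed metadata, including the warehouse location of the table
spark.sql("DESCRIBE FORMATTED orders").show(50, false)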

Data Preview

After loading the data, we can run a couple of queries to preview the rows and confirm the record count.

SELECT * FROM orders LIMIT 10

SELECT count(1) FROM orders
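
Beyond the raw preview and count, a small aggregation is a useful sanity check: if the delimiter in the DDL did not match the files, columns come back NULL, and a grouping like the one below makes that obvious. A sketch, run via spark.sql:

// Group by order_status to verify the columns were parsed correctly
spark.sql("""
  SELECT order_status, count(1) AS status_count
  FROM orders
  GROUP BY order_status
""").show()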

Hands-On Tasks

  1. Initialize the Spark session using the provided code snippet.
  2. Create the orders table in the itversity_retail database using the specified schema and delimiter.
  3. Load data from the local file system path /data/retail_db/orders into the orders table.
  4. Run queries to preview the loaded data and verify the results (a consolidated sketch covering all four steps follows this list).
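
Putting the four tasks together, here is a minimal end-to-end sketch in Scala. It assumes the spark session created earlier, an existing itversity_retail database, and the order files available at /data/retail_db/orders on the machine where the session runs; the DDL mirrors the SQL script above with the column comments omitted for brevity.

// 1. Switch to the target database
spark.sql("USE itversity_retail")

// 2. Recreate the orders table as comma-delimited text
spark.sql("DROP TABLE IF EXISTS orders")
spark.sql("""
  CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

// 3. Copy the local files into the table's warehouse directory
spark.sql("LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders")

// 4. Preview the data and verify the row count
spark.sql("SELECT * FROM orders LIMIT 10").show()
spark.sql("SELECT count(1) FROM orders").show()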

Conclusion

In this article, we explored the process of loading data into Spark Metastore tables from the local file system. By following the outlined steps and understanding the key concepts, readers can effectively manage tables and query data in a Spark environment. Practice these tasks to enhance your skills and feel free to engage with the community for further learning and support.
