Data Engineering Spark SQL - Managing Tables - DDL & DML - Creating External Tables

Let us understand how to create an external table in Spark Metastore using orders as an example. We will also see how to load data into the external table. Make sure to sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

Key Concepts Explanation

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

  • Add the EXTERNAL keyword to the CREATE clause and specify LOCATION after the STORED AS clause, or just use LOCATION as part of the CREATE TABLE statement (see the sketch after this list).
  • You can use the same LOAD command that we used for a managed table to load data from either the local file system or HDFS.
  • Run DESCRIBE FORMATTED orders to check the metadata of the table and confirm whether it is a managed table or an external table.
  • Make sure to specify the location while creating external tables.
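
Here is a minimal sketch of such a statement, assuming retail_db style orders columns (order_id, order_date, order_customer_id, order_status), comma-delimited text data, and a hypothetical HDFS location; adjust the columns, delimiter, and path to match your environment.

CREATE EXTERNAL TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/training/external/retail_db/orders';  -- hypothetical path

-- Table Type in the output should show EXTERNAL
DESCRIBE FORMATTED orders;

If the table was created correctly, DESCRIBE FORMATTED reports the table type as EXTERNAL rather than MANAGED.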

Hands-On Tasks

  1. Start the Spark context for this notebook so that the provided code can be executed.
  2. Create an external table named “orders” with the specified column details.
  3. Load data into the “orders” table using the provided data (see the sketch after this list).
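
Below is a sketch of the load step in Spark SQL, assuming the orders data file is available on the local file system at the hypothetical path /data/retail_db/orders; drop the LOCAL keyword if the file is already in HDFS.

LOAD DATA LOCAL INPATH '/data/retail_db/orders'
    INTO TABLE orders;

-- Quick validation of the loaded data
SELECT * FROM orders LIMIT 10;
SELECT count(1) FROM orders;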

Conclusion

In this article, we learned how to create an external table in Spark Metastore using the example of “orders.” By following the step-by-step instructions provided, you can practice creating external tables and loading data into them effectively. Join our community for further learning and support.
