Apache Spark Python - Spark Metastore - Inserting into Existing Tables

Let us understand how we can insert data into existing tables using insertInto. With insertInto we can either append to or overwrite the existing data; append is the default.

When we use insertInto, the following happens:

  • If the table does not exist, insertInto will throw an exception.
  • If the table exists, by default data will be appended.
  • We can alter this behavior with the keyword argument overwrite. It defaults to False; passing True replaces the existing data, as sketched below.
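
Here is a minimal sketch of the two behaviors. The SparkSession spark, the table name demo_table, and its two-column schema are assumptions for illustration only.

# Minimal sketch of append vs overwrite with insertInto.
# Assumes an active SparkSession `spark` and an existing Metastore table
# `demo_table` with the schema (id INT, name STRING).
df = spark.createDataFrame([(1, "Scott"), (2, "Henry")], schema="id INT, name STRING")

# Default behavior: the rows are appended to the data already in the table.
df.write.insertInto("demo_table")

# overwrite=True replaces the existing data with the Data Frame contents.
df.write.insertInto("demo_table", overwrite=True)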

Let us perform a few tasks to understand how to write a Data Frame into existing tables in the Metastore.

  1. Make sure the hr_db database and employees table in hr_db are created.
# `username` is assumed to be defined earlier (for example, username = getpass.getuser()).
spark.catalog.setCurrentDatabase(f"{username}_hr_db")
spark.catalog.currentDatabase()
spark.catalog.listTables()
spark.catalog.listColumns('employees')
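
If the database or the table is not in place yet, statements along these lines can create them. This is only a sketch; the exact DDL is an assumption, with column names and types taken from the schema used below.

# Sketch: create the database and table if they do not exist yet.
# Column names and types are assumptions matching the Data Frame below.
spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_hr_db")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {username}_hr_db.employees (
        employee_id INT,
        first_name STRING,
        last_name STRING,
        salary FLOAT,
        nationality STRING
    )
""")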
  2. Use the employees Data Frame and insert data into the employees table in the hr_db database. Ensure that existing data is overwritten.
employees = [(1, "Scott", "Tiger", 1000.0, "United States"),
             (2, "Henry", "Ford", 1250.0, "India"),
             (3, "Nick", "Junior", 750.0, "United Kingdom"),
             (4, "Bill", "Gomes", 1500.0, "Australia")]

# Check the schema of the target table so the Data Frame can be built to match it.
spark.read.table('employees').schema
employeesDF = spark.createDataFrame(employees,
                                    schema="""employee_id INT, first_name STRING,
                                    last_name STRING, salary FLOAT, nationality STRING""")

employeesDF.show()
employeesDF.schema
# insertInto resolves columns by position rather than by name, so the Data Frame
# columns must be in the same order as the table columns.
employeesDF.write.insertInto("employees", overwrite=True)
spark.read.table("employees").show()
spark.sql('SELECT * FROM employees').show()
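
As a quick follow-up, not part of the original exercise, the default append behavior can be observed by inserting the same Data Frame again without the overwrite flag.

# Sketch: without overwrite, insertInto appends, so the same four rows are added again.
employeesDF.write.insertInto("employees")
spark.read.table("employees").count()   # 8 rows now instead of 4

# Restore a single copy of the data.
employeesDF.write.insertInto("employees", overwrite=True)
spark.read.table("employees").count()   # back to 4 rows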

Inserting Data

insertInto writes the contents of a Data Frame into a table that already exists in the Metastore. By default the incoming rows are appended to the data the table already holds, and columns are matched by position rather than by name.

Overwriting Existing Data

Passing overwrite=True to insertInto replaces the data already in the table with the contents of the Data Frame instead of appending to it.
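
For reference, here is a sketch of the equivalent at the SQL level. It assumes the Data Frame has been registered as a temporary view; the view name employees_stage is chosen here for illustration only.

# Sketch: SQL-level equivalent of overwrite=True, via a temporary view.
employeesDF.createOrReplaceTempView("employees_stage")
spark.sql("INSERT OVERWRITE TABLE employees SELECT * FROM employees_stage")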

Hands-On Tasks

  1. Create hr_db database and employees table in hr_db.
  2. Insert data into the employees table in the hr_db database and ensure existing data is overwritten.

Conclusion

In this article, we learned how to insert data into existing tables using insertInto in Spark SQL. Practice these tasks to reinforce your understanding. Join the community for further learning opportunities.