Apache Spark Python - Spark Metastore - Read and Process data from Metastore Tables

In this article, we will explore how to read tables from a Metastore using functions like spark.read.table and process data using Data Frame APIs.

Using Data Frame APIs - spark.read.table("table_name")

To read a table from the Metastore, we can use the spark.read.table function, passing the table name as a parameter. This returns a Data Frame representing the table's data.
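
For example, here is a minimal sketch (the airport_codes table name is an assumption at this point; the table is created in the hands-on tasks below):

    # Read a Metastore table into a Data Frame and inspect it
    airport_codes = spark.read.table("airport_codes")
    airport_codes.printSchema()
    airport_codes.show(5)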

Prefixing Database Name

You can prefix the table name with the database name (for example, database_name.table_name) to read a table that belongs to a specific database.
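
For instance, assuming a user-specific database named {username}_airtraffic (created in the hands-on tasks below), the table can be read with a fully qualified name:

    import getpass
    username = getpass.getuser()

    # Qualify the table name with the database name
    airport_codes = spark.read.table(f"{username}_airtraffic.airport_codes")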

Processing Data

Once a Data Frame is created by reading the table, you can process the data using Data Frame functions such as filter, groupBy, sort, or orderBy.
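
As a sketch of such processing (the Country and State column names are assumptions based on the airport-codes dataset used in the hands-on tasks):

    from pyspark.sql.functions import count, lit, col

    # Filter to US airports, then count airports per state
    airport_codes. \
        filter(col("Country") == "USA"). \
        groupBy("State"). \
        agg(count(lit(1)).alias("airport_count")). \
        orderBy(col("airport_count").desc()). \
        show()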


Hands-On Tasks

  1. Create a database for airtraffic data.

    import getpass
    username = getpass.getuser()
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_airtraffic")
  2. Create a table named airport_codes for the file airport-codes.txt.

    airport_codes_path = f"/user/{username}/airtraffic_all/airport-codes"
    spark.sql(f'DROP TABLE {username}_airtraffic.airport_codes')
    airport_codes_df = spark.read.csv(airport_codes_path, sep="\t", header=True, inferSchema=True)
  3. Read data from the table and get the number of airports by state.

    airport_codes = spark.read.table(f"{username}_airtraffic.airport_codes")
    from pyspark.sql.functions import count, lit, col

    # Count airports per state (the State column name comes from the airport-codes file)
    airport_codes.groupBy("State").agg(count(lit(1)).alias("airport_count")).orderBy(col("airport_count").desc()).show()
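
If you prefer SQL, the same aggregation can be run directly against the Metastore table (a sketch; the State column name is assumed from the airport-codes file):

    spark.sql(f"""
        SELECT State, count(1) AS airport_count
        FROM {username}_airtraffic.airport_codes
        GROUP BY State
        ORDER BY airport_count DESC
    """).show()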


In this article, we have learned how to read tables from a Metastore using Spark functions and process the data using Data Frame APIs. We encourage you to practice these tasks and engage with the community for further learning.