Apache Spark Python - Spark Metastore - Creating Temporary Views

So far we have spoken about permanent Metastore tables. Now let us understand how to create temporary views using a Data Frame.

  • We can create a temporary view for a Data Frame using createTempView or createOrReplaceTempView.
  • createOrReplaceTempView will replace an existing view if it already exists, whereas createTempView fails if a view with the same name is already registered (see the sketch after this list).
  • While tables in the Metastore are permanent, views are temporary.
  • Once the application exits, temporary views are dropped.
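
Here is a minimal sketch of the difference between the two APIs. The Data Frame and view name are hypothetical; only an active SparkSession named spark is assumed.

# Minimal sketch; demo_df and demo_v are hypothetical names.
demo_df = spark.createDataFrame([(1, 'ORD'), (2, 'JFK')], ['id', 'code'])

demo_df.createTempView('demo_v')           # registers the view
# demo_df.createTempView('demo_v')         # would fail: the view already exists
demo_df.createOrReplaceTempView('demo_v')  # replaces the existing view without error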

Let us start Spark context for this notebook so that we can execute the code provided.

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

# Create (or reuse) a Spark session with Hive support so that
# Metastore tables and temporary views can be queried via Spark SQL.
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()
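
As a quick sanity check (not part of the original steps), we can confirm that the session is talking to the Metastore and using the expected warehouse directory:

# Sanity check: list Metastore databases and inspect the warehouse location.
spark.catalog.listDatabases()
spark.conf.get('spark.sql.warehouse.dir')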

If you are going to use CLIs, you can access Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Set the number of shuffle partitions to 2:

spark.conf.set('spark.sql.shuffle.partitions', '2')
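
Spark defaults to 200 shuffle partitions; for the small data sets used here, a low value avoids launching hundreds of tiny tasks. We can confirm the setting took effect:

# Wide operations such as groupBy and join will now produce 2 partitions.
spark.conf.get('spark.sql.shuffle.partitions')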


Tasks

Let us perform a few tasks to create a temporary view and process the data using the temporary view.

Task 1:

  • Create a temporary view named airport_codes_v for the file airport-codes.txt. The file contains a header, and each field in each row is delimited by a tab character.

import getpass

username = getpass.getuser()

# Switch to the user's airtraffic database and confirm the change.
spark.catalog.setCurrentDatabase(f"{username}_airtraffic")
spark.catalog.currentDatabase()
spark.catalog.listTables()

# Read the tab-delimited file with a header, inferring column types.
airport_codes_path = "/public/airtraffic_all/airport-codes"
airport_codes_df = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       )

airport_codes_df.printSchema()
airport_codes_df.show()

# Register the Data Frame as a temporary view and confirm it is listed.
airport_codes_df.createTempView("airport_codes_v")
spark.catalog.listTables()
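
Once registered, the temporary view can be queried with Spark SQL like any Metastore table; a quick check:

# The temporary view is now queryable via Spark SQL.
spark.sql('SELECT * FROM airport_codes_v LIMIT 10').show()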

Task 2:

  • Read data from the view and get the number of airports by state.

# Re-read the temporary view as a Data Frame, then aggregate.
airport_codes = spark.read.table("airport_codes_v")
airport_codes. \
    groupBy("state"). \
    count(). \
    show()

spark.sql('''

    SELECT state, count(1) AS airport_count
    FROM airport_codes_v 
    GROUP BY state
    ORDER BY airport_count DESC

''').show()
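
For comparison, here is a sketch of the same sorted aggregation using the Data Frame API, assuming the airport_codes Data Frame created above:

from pyspark.sql.functions import count, col

# Data Frame API equivalent of the SQL query above.
airport_codes. \
    groupBy('state'). \
    agg(count('*').alias('airport_count')). \
    orderBy(col('airport_count').desc()). \
    show()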

Conclusion

In conclusion, temporary views in Spark SQL provide a flexible and efficient way to work with Data Frames. They let us query and process data without creating permanent tables in the Metastore. Views created with createTempView or createOrReplaceTempView are scoped to the Spark session and are dropped automatically once the session ends, which makes them well suited for ad hoc queries, exploratory analysis, and intermediate transformations.