So far we have covered permanent metastore tables. Now let us understand how to create temporary views from a DataFrame.
- We can create a temporary view for a DataFrame using createTempView or createOrReplaceTempView. createOrReplaceTempView will replace the view if it already exists.
- While tables in the metastore are permanent, views are temporary.
- Once the application exits, temporary views are deleted or flushed out.
Let us start the Spark context for this notebook so that we can execute the code provided.
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Spark Metastore'). \
    master('yarn'). \
    getOrCreate()
If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Set the number of shuffle partitions to 2:
spark.conf.set('spark.sql.shuffle.partitions', '2')
Tasks
Let us perform a few tasks to create a temporary view and process the data using the temporary view.
Task 1:
- Create a temporary view named airport_codes_v for the file airport-codes.txt. The file contains a header, and each field in each row is delimited by a tab character.
import getpass
username = getpass.getuser()
spark.catalog.setCurrentDatabase(f"{username}_airtraffic")
spark.catalog.currentDatabase()
spark.catalog.listTables()
airport_codes_path = "/public/airtraffic_all/airport-codes"

airport_codes_df = spark. \
    read. \
    csv(airport_codes_path,
        sep="\t",
        header=True,
        inferSchema=True
       )
airport_codes_df.printSchema()
airport_codes_df.show()
airport_codes_df.createTempView("airport_codes_v")
spark.catalog.listTables()
Task 2:
- Read data from the view and get the number of airports by state.
airport_codes = spark.read.table("airport_codes_v")
airport_codes. \
    groupBy("state"). \
    count(). \
    show()
spark.sql('''
SELECT state, count(1) AS airport_count
FROM airport_codes_v
GROUP BY state
ORDER BY airport_count DESC
''').show()
Conclusion
Temporary views in Spark SQL provide a flexible and efficient way to work with DataFrames, letting us query and process data without creating permanent tables in the metastore. Views created with createTempView or createOrReplaceTempView are scoped to the Spark session and are removed automatically when the session ends, which makes them well suited to ad-hoc querying, exploratory data analysis, and intermediate transformations. Combined with the power of Spark SQL, temporary views are a convenient tool for data manipulation and analysis in Spark.