In this article, we will explore how to read tables from a Metastore using functions like spark.read.table and process the data using DataFrame APIs.
Using DataFrame APIs - spark.read.table("table_name")
To read a table from the Metastore, we can use the spark.read.table function, passing the table name as an argument. This returns a DataFrame representing the table.
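For example, a minimal read might look like the following, assuming a table named orders already exists in the current database (the table name is purely illustrative):

# Read an existing Metastore table into a DataFrame
# ("orders" is an assumed table name for illustration)
orders = spark.read.table("orders")

orders.printSchema()
orders.show(5)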
Prefixing Database Name
You can prefix the table name with a database name to read tables belonging to a specific database.
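As a quick sketch, assuming a database named retail_db that contains an orders table:

# Qualify the table name with its database
# (retail_db and orders are assumed names for illustration)
orders = spark.read.table("retail_db.orders")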
Processing Data
Once a DataFrame is created by reading the table, you can use DataFrame functions like filter, groupBy, sort, or orderBy to process the data within the DataFrame.
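Here is a minimal sketch of such a pipeline, assuming the orders DataFrame from above with hypothetical order_status and order_date columns:

from pyspark.sql.functions import col

# Keep completed orders, count them per date, and sort by the count
# (order_status and order_date are assumed column names)
orders. \
    filter(col("order_status") == "COMPLETE"). \
    groupBy("order_date"). \
    count(). \
    orderBy(col("count").desc()). \
    show()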
Hands-On Tasks
- Create a database for airtraffic data.

import getpass

username = getpass.getuser()

spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_airtraffic")
spark.catalog.setCurrentDatabase(f"{username}_airtraffic")
spark.catalog.currentDatabase()
- Create a table named airport_codes for the file airport-codes.txt.

airport_codes_path = f"/user/{username}/airtraffic_all/airport-codes"

# Drop the table if it already exists so this step can be rerun
spark.sql(f"DROP TABLE IF EXISTS {username}_airtraffic.airport_codes")

airport_codes_df = spark.read.csv(
    airport_codes_path,
    sep="\t",
    header=True,
    inferSchema=True
)
airport_codes_df.write.saveAsTable(f"{username}_airtraffic.airport_codes")
- Read data from the table and get the number of airports by state.

airport_codes = spark.read.table("airport_codes")
airport_codes.groupBy("state").count().show()

from pyspark.sql.functions import count, lit, col

airport_codes. \
    groupBy("state"). \
    agg(count(lit(1)).alias("airport_count")). \
    orderBy(col("airport_count").desc()). \
    show()
Conclusion
In this article, we have learned how to read tables from a Metastore using Spark functions and process the data using DataFrame APIs. We encourage you to practice these tasks and engage with the community for further learning.