Apache Spark Python - Basic Transformations - Data Frame for basic transformations

Let us understand how to build the Data Frame to explore basic transformations. We will be creating a Data Frame using air traffic data.

  • Our air traffic data is in parquet file format.
  • We can use spark.read.parquet to create a data frame by passing the appropriate path that contains air traffic data.
  • We will build the Data Frame using 2008 January data. We will also preview the schema as well as the data using basic Data Frame functions to begin with.

To start a Spark session for this notebook, you can execute the following code snippet:

from pyspark.sql import SparkSession

import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can launch Spark with the same configurations using one of the following approaches.

Using Spark SQL CLI

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala CLI

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark CLI

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Tasks

Let us perform some tasks to build the Data Frame and preview the schema and data using basic Data Frame functions.

  • Read the data for the month of 2008 January. We will be using only 2008 January data for the demos.

You can confirm the files exist using HDFS commands:

hdfs dfs -ls /public/airtraffic_all/airtraffic-part/flightmonth=200801

airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"

airtraffic = spark. \
    read. \
    parquet(airtraffic_path)

airtraffic.printSchema()

# show() displays 20 rows by default; pass 31 to see every day of January
airtraffic.select('Year', 'Month', 'DayOfMonth').distinct().show()
airtraffic.select('Year', 'Month', 'DayOfMonth').distinct().show(31)

# Number of distinct days, then the total number of flight records
airtraffic.select('Year', 'Month', 'DayOfMonth').distinct().count()
airtraffic.count()

airtraffic.show()


Conclusion

In this article, we learned how to create a Data Frame from a Parquet file containing air traffic data. We also previewed the schema and the data to understand the structure of the Data Frame. This is the first step in exploring and transforming data using Spark.