Apache Spark Python - Processing Column Data - Create Dummy Data Frame

In this article, we will demonstrate how to create a dummy DataFrame using PySpark to explore various Spark functions. This guide covers setting up the Spark session, creating the DataFrame, and performing basic operations to better understand Spark SQL.

Setting Up the Spark Session

Before creating a DataFrame, we need to start a Spark session. This can be done using the following code snippet:

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()
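
The configuration above assumes a YARN cluster with a Hive metastore available. If you only want to experiment on a single machine, a minimal local session is enough; the snippet below is a sketch under that assumption and is not required for the rest of this article:

from pyspark.sql import SparkSession

# Minimal local session for experimentation (assumes no YARN or Hive setup)
spark = SparkSession. \
    builder. \
    master('local[*]'). \
    appName('Python - Processing Column Data'). \
    getOrCreate()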

If you prefer using command-line interfaces (CLIs), you can launch Spark with the same configuration using one of the following approaches:

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
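
Once the pyspark2 shell starts, you can run a quick sanity check to confirm the session is up. This is just a minimal check, assuming the shell exposes the usual spark session object:

# Quick sanity check inside the pyspark2 shell
spark.range(1).show()   # should print a single row with id = 0
print(spark.version)    # prints the Spark version in use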

Creating a Dummy DataFrame

Let us create a simple DataFrame with a single column and a single record. This example mimics the Oracle dual table:

l = [('X', )]
df = spark.createDataFrame(l, "dummy STRING")
df.printSchema()
df.show()
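
For reference, the same single-record DataFrame can also be built from a Row object, letting Spark infer the schema. This is only an alternative sketch; the variable name df_alt is illustrative and the rest of the examples do not depend on it:

from pyspark.sql import Row

# Equivalent dummy DataFrame built from a Row; the dummy column is inferred as string
df_alt = spark.createDataFrame([Row(dummy='X')])
df_alt.printSchema()
df_alt.show()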

Using Spark Functions

Once the DataFrame is created, we can explore various Spark functions. For example, to get the current date, we can use the current_date() function:

from pyspark.sql.functions import current_date

df.select(current_date()).show()

We can also alias the column name for better readability:

df.select(current_date().alias("current_date")).show()
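
The same one-record DataFrame is handy for previewing how other functions behave before applying them to real data. As a small sketch, the snippet below tries a couple of additional built-in functions, current_timestamp and date_format, both from pyspark.sql.functions:

from pyspark.sql.functions import current_date, current_timestamp, date_format

# Preview other functions against the single-record DataFrame
df.select(
    current_timestamp().alias("current_timestamp"),
    date_format(current_date(), "yyyyMMdd").alias("current_date_formatted")
).show(truncate=False)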

Creating a DataFrame with Employee Data

For a more detailed exploration, let’s create a DataFrame using a collection of employee records. This DataFrame will be used to demonstrate various Spark functions for processing column data.

employees = [
    (1, "Scott", "Tiger", 1000.0, "united states", "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0, "united KINGDOM", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0, "AUSTRALIA", "+61 987 654 3210", "789 12 6118")
]

employeesDF = spark.createDataFrame(
    employees,
    schema="""employee_id INT, first_name STRING, last_name STRING,
              salary FLOAT, nationality STRING, phone_number STRING,
              ssn STRING"""
)
employeesDF.printSchema()
employeesDF.show(truncate=False)
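
As a preview of the kind of column processing this DataFrame will be used for, the sketch below applies a couple of common functions to it; the result column names (full_name, nationality_upper) are just illustrative:

from pyspark.sql.functions import concat, lit, upper

# Example column processing on the employee data
employeesDF.select(
    "employee_id",
    concat("first_name", lit(" "), "last_name").alias("full_name"),
    upper("nationality").alias("nationality_upper")
).show(truncate=False)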

Conclusion

Creating a DataFrame with dummy data in Spark is a fundamental step for exploring various Spark SQL functions. By following this guide, you can set up your Spark context, create simple and complex DataFrames, and perform basic operations to understand how Spark SQL works.