Apache Spark Python - Data Processing - Overview of Data Frame APIs

In this article, we will provide an overview of the Data Frame APIs used to process data efficiently in PySpark. Understanding these APIs is crucial for performing data transformations and aggregations.

Row Level Transformations

Row-level transformations or projections of data in Data Frames can be performed using select, selectExpr, withColumn, and drop.

Selecting Specific Columns

You can select specific columns from a Data Frame using the select method:

# Selecting specific columns
employeesDF.select("first_name", "last_name").show()

Dropping a Column

To drop a column from a Data Frame, use the drop method:

# Dropping a column
employeesDF.drop("nationality").show()

Adding a New Column

You can add a new column to a Data Frame using the withColumn method:

# Adding a new column (concat and lit come from pyspark.sql.functions)
from pyspark.sql.functions import concat, lit

employeesDF.withColumn('full_name', concat('first_name', lit(' '), 'last_name')).show()

Using selectExpr to Concatenate Columns

The selectExpr method allows you to use SQL expressions for transformations:

# Using selectExpr to concatenate columns
employeesDF.selectExpr('*', 'concat(first_name, " ", last_name) AS full_name').show()

Filtering and Aggregations

Filtering and aggregating data in Data Frames can be done using filter (or its alias where), groupBy, agg, and other functions.

Filtering Data

You can filter data based on a condition using the filter method:

# Filtering data based on a condition (col comes from pyspark.sql.functions)
from pyspark.sql.functions import col

employeesDF.filter(col("salary") > 1000).show()

Grouping and Aggregating Data

To group and aggregate data, use the groupBy and agg methods:

# Grouping and aggregating data (avg comes from pyspark.sql.functions)
from pyspark.sql.functions import avg

employeesDF.groupBy("nationality").agg(avg("salary")).show()

Sorting Data

You can sort data using the orderBy method:

# Sorting data
employeesDF.orderBy("salary").show()


Hands-On Tasks

  1. Project only the first name and last name of employees.
  2. Drop the ‘nationality’ column from the Data Frame.

Conclusion

In this article, we covered the basic Data Frame APIs for processing data efficiently in PySpark. It is essential to practice these concepts and explore more advanced functionalities to become proficient in handling Data Frames effectively.

If you want to explore more, don’t hesitate to join our community and engage with other data enthusiasts.

Let’s dive into the world of Data Frames and master PySpark!