Apache Spark Python - Basic Transformations - Total Aggregations

Let us go through the details related to total aggregations using Spark.

  • We can perform total aggregations directly on a DataFrame, or we can perform aggregations after grouping by one or more keys.

  • Here are the functions which we typically use to perform aggregations.

    • count

    • sum, avg

    • min, max
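As a rough illustration of what these five functions return, here is a plain-Python sketch on a small made-up column. The delays list and its values are hypothetical and are not taken from the airtraffic data. None stands in for a SQL NULL, and the sketch assumes Spark's usual behavior of skipping NULLs in sum, avg, min and max, and in count when it is applied to a column.

```python
from statistics import mean

# Hypothetical column values; None stands in for a SQL NULL.
delays = [10, None, 25, 5, None]

# Spark's aggregate functions skip NULLs, so drop None before aggregating.
non_null = [d for d in delays if d is not None]

total_rows = len(delays)        # models DataFrame.count(): every row is counted
non_null_count = len(non_null)  # models count on a column: non-NULL values only
total = sum(non_null)           # models sum
average = mean(non_null)        # models avg (computed over non-NULL values only)
smallest = min(non_null)        # models min
largest = max(non_null)         # models max

print(total_rows, non_null_count, total, average, smallest, largest)
```

The point of the sketch is the difference between counting rows and counting values: the two counts differ as soon as the column contains NULLs.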

The examples below cover counting rows, counting distinct values, and using sum on a derived column and on a filtered DataFrame.

Aggregation Functions

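Aggregation functions compute a single result from many rows. In a total aggregation they are applied to the whole DataFrame rather than to groups of rows. The simplest case is the row count, which we get by calling count on the DataFrame: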

# Counting total number of rows
airtraffic.count()

Distinct Values

To count distinct values, select the columns of interest, drop duplicate rows with distinct, and then call count. The following returns the number of distinct dates in airtraffic:

# Counting distinct values
airtraffic. \
    select('Year', 'Month', 'DayOfMonth'). \
    distinct(). \
    count()
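The select, distinct, count chain above can be pictured with a plain-Python sketch. The rows below are made up for illustration; each tuple plays the role of one (Year, Month, DayOfMonth) row after the select.

```python
# Hypothetical (Year, Month, DayOfMonth) rows; several flights share a date.
flights = [
    (2008, 1, 1),
    (2008, 1, 1),
    (2008, 1, 2),
    (2008, 2, 1),
    (2008, 1, 2),
]

# distinct() keeps one copy of each row; a set of tuples does the same here.
distinct_dates = set(flights)

# count() then returns the number of remaining rows.
distinct_date_count = len(distinct_dates)

print(distinct_date_count)
```

Rows are compared on all selected columns together, which is why the select comes before distinct: without it, every other column of airtraffic would also take part in the comparison.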

Total Bonus Amount

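The total bonus amount can be calculated with the sum function. In the expression below, bonus is treated as a percentage of salary, and a missing bonus is replaced with 0 using coalesce before the per-employee amounts are summed: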

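# Calculating total bonus amount
# Note: this sum is Spark's aggregate function and shadows Python's built-in sum
from pyspark.sql.functions import coalesce, col, lit, sum

employeesDF. \
    select((sum(coalesce(col('bonus').cast('int'), lit(0)) * col('salary')) / lit(100)).alias('total_bonus')). \
    show()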
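To make the arithmetic concrete, here is a plain-Python sketch of the same calculation on three made-up employees. The records and the bonus_percent helper are hypothetical. The helper models the cast to int followed by coalesce with 0, under the simplifying assumption that every non-missing bonus is a well-formed integer string.

```python
# Hypothetical employee records; bonus is a percentage stored as text, or None.
employees = [
    {"salary": 1000.0, "bonus": "10"},
    {"salary": 1250.0, "bonus": None},
    {"salary": 750.0, "bonus": "20"},
]


def bonus_percent(raw):
    """Model cast('int') followed by coalesce(..., 0) for a single value."""
    return int(raw) if raw is not None else 0


# Sum bonus% * salary over all rows, then divide once by 100.
total_bonus = sum(bonus_percent(e["bonus"]) * e["salary"] for e in employees) / 100

print(total_bonus)
```

With these values the per-employee products are 10000, 0 and 15000, so the total is 25000 / 100. Dividing once after the sum gives the same result as dividing each row by 100 first.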

Revenue Calculation

To determine the revenue generated by a given order, filter order_items down to that order and apply sum to order_item_subtotal:

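# Calculating order revenue
# order_id is assumed to be defined beforehand; int() allows it to be passed as a string
from pyspark.sql.functions import col, lit, sum

order_items. \
    filter(col('order_item_order_id') == lit(int(order_id))). \
    select(sum('order_item_subtotal').alias('order_revenue')). \
    show()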
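The filter-then-sum pattern can be sketched in plain Python as follows. The order items and the order id are made up for illustration, and each tuple stands for one (order_item_order_id, order_item_subtotal) pair.

```python
# Hypothetical (order_item_order_id, order_item_subtotal) pairs.
order_items = [
    (1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
]

# The order id is assumed to arrive as text, hence the int() conversion.
order_id = "2"

# Keep only the items of the requested order, then add up their subtotals.
matching = [subtotal for oid, subtotal in order_items if oid == int(order_id)]
order_revenue = sum(matching)

print(round(order_revenue, 2))
```

One difference is worth keeping in mind: Python's sum returns 0 when no item matches, whereas Spark's sum returns NULL for an empty input.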

Hands-On Tasks

Here are some hands-on tasks for you to apply the concepts discussed above:

  1. Calculate the total number of rows in the airtraffic DataFrame.
  2. Find the distinct count of dates from the airtraffic DataFrame.
  3. Calculate the total bonus amount from the employeesDF DataFrame.
  4. Determine the revenue generated for a specific order from the order_items dataset.

Conclusion

In this article, we covered total aggregations using Spark: counting rows, counting distinct values, and using sum to calculate a total bonus amount and the revenue of a single order. Work through the hands-on tasks above to gain a better understanding of these concepts, and feel free to engage with the community for further learning.