Apache Spark Python - Basic Transformations - Aggregate data using cube

Let us go through the details related to advanced aggregations using cube in Spark. Let us start the Spark session for this article so that we can execute the code provided.

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can launch Spark using one of the following three approaches.

Using Spark SQL

spark-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

orders = spark.read.json('/public/retail_db_json/orders')
orders.show()
orders.printSchema()
orders.count()

Example Aggregations Using cube

  • Get count of orders by date using cube. Compare the regular groupBy output with the cube output that follows it.
from pyspark.sql.functions import count, lit
orders. \
    groupBy('order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    show()

orders. \
    cube('order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_date'). \
    show()
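
In the cube output above, the extra row with a null order_date is the grand total across all dates. To tell such subtotal rows apart from genuine nulls in the data, Spark provides the grouping function; the sketch below is a minimal illustration using the same orders DataFrame.

from pyspark.sql.functions import grouping

# grouping('order_date') returns 1 for rows where order_date was
# aggregated away by cube (subtotal rows) and 0 for regular rows.
orders. \
    cube('order_date'). \
    agg(
        count(lit(1)).alias('order_count'),
        grouping('order_date').alias('is_subtotal')
    ). \
    orderBy('order_date'). \
    show()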
  • Get count of orders by month as well as date using cube. You will see additional subtotal records per month and per date, plus a grand total (a Spark SQL equivalent follows the example).
from pyspark.sql.functions import date_format
orders. \
    groupBy(date_format('order_date', 'yyyyMM').alias('order_month'), 'order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_month', 'order_date'). \
    show()

orders. \
    cube(date_format('order_date', 'yyyyMM').alias('order_month'), 'order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_month', 'order_date'). \
    show()
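
The same cube aggregation can be expressed in Spark SQL using GROUP BY CUBE. This is a minimal sketch assuming the orders DataFrame is registered as a temporary view; the view name orders_v is introduced here just for illustration.

# Register a temporary view so the DataFrame can be queried with SQL.
orders.createOrReplaceTempView('orders_v')

spark.sql("""
    SELECT date_format(order_date, 'yyyyMM') AS order_month,
           order_date,
           count(1) AS order_count
    FROM orders_v
    GROUP BY CUBE(date_format(order_date, 'yyyyMM'), order_date)
    ORDER BY order_month, order_date
""").show()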
  • Get count of orders by year, month as well as date using cube. You will see additional subtotal records for every combination of year, month, and date (a comparison of cube and rollup follows the example).
from pyspark.sql.functions import year
orders. \
    groupBy(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'), 
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_year', 'order_month', 'order_date'). \
    show()

orders. \
    cube(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'),
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_year', 'order_month', 'order_date'). \
    show()
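
To see how cube differs from rollup on the same three columns, compare the number of rows each produces: cube aggregates over all 2^3 = 8 combinations of (year, month, date), while rollup only walks the year > month > date hierarchy. A minimal sketch reusing the expressions above:

# cube: all 8 combinations of the three grouping columns.
cube_rows = orders. \
    cube(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'),
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    count()

# rollup: only the hierarchical levels (all, year, year+month, year+month+date).
rollup_rows = orders. \
    rollup(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'),
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    count()

print(f'cube rows: {cube_rows}, rollup rows: {rollup_rows}')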


Hands-On Tasks

  1. Get count of orders by date using cube.
  2. Get count of orders by month as well as date using cube.
  3. Get count of orders by year, month as well as date using cube.

Conclusion

In this article, we explored how to aggregate data using cube in Spark. The cube function generates aggregates for every combination of the specified grouping columns, making it a powerful tool for multidimensional analytics. We encourage you to practice these concepts and engage with the community for further learning.