Let us go through the details of advanced aggregations using cube in Spark.
Let us start the Spark session for this article so that we can execute the code provided.
from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()
If you are going to use CLIs, you can launch Spark SQL, Scala, or Pyspark using one of these three approaches.

Using Spark SQL

spark-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
orders = spark.read.json('/public/retail_db_json/orders')
orders.show()
orders.printSchema()
orders.count()
Example Aggregations Using cube
- Get count of orders by date using cube.
from pyspark.sql.functions import count, lit
orders. \
    groupBy('order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    show()

orders. \
    cube('order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_date'). \
    show()
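To see what cube adds on top of a plain groupBy, here is a minimal pure-Python sketch (no Spark required; the sample dates are hypothetical stand-ins for the orders data) that mimics cube on a single column: per-value counts plus one grand-total row where the grouping column is None, which is how Spark represents the "all values" bucket.

```python
from collections import Counter

def cube_one_column(rows, key):
    """Mimic Spark's cube() on a single column: per-value counts
    plus one grand-total row where the grouping column is None."""
    counts = Counter(r[key] for r in rows)
    result = [{key: k, 'order_count': v} for k, v in sorted(counts.items())]
    # cube additionally emits the aggregate over all rows (column = None)
    result.append({key: None, 'order_count': len(rows)})
    return result

# Hypothetical sample rows standing in for the orders dataset
orders_sample = [
    {'order_date': '2013-07-25'},
    {'order_date': '2013-07-25'},
    {'order_date': '2013-07-26'},
]

for row in cube_one_column(orders_sample, 'order_date'):
    print(row)
```

The None row is why the cube output above has one more record than the groupBy output.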
- Get count of orders by month as well as date using cube. You will see additional subtotal records, such as one per month and a grand total.
from pyspark.sql.functions import date_format
orders. \
    groupBy(date_format('order_date', 'yyyyMM').alias('order_month'), 'order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_month', 'order_date'). \
    show()

orders. \
    cube(date_format('order_date', 'yyyyMM').alias('order_month'), 'order_date'). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_month', 'order_date'). \
    show()
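With two grouping columns, cube aggregates every combination of them, substituting None for a column that is "rolled away". The following pure-Python sketch (the month/date rows are hypothetical sample data, not the real dataset) shows where the extra per-month subtotal and grand-total rows come from.

```python
from collections import Counter

# Hypothetical sample rows standing in for orders (month derived from date)
rows = [('2013-07', '2013-07-25'),
        ('2013-07', '2013-07-26'),
        ('2013-08', '2013-08-01')]

def cube_counts(rows):
    """Mimic cube(order_month, order_date): count every combination,
    using None for a grouping column that is aggregated over."""
    counts = Counter()
    for month, date in rows:
        for key in [(month, date), (month, None), (None, date), (None, None)]:
            counts[key] += 1
    return counts

c = cube_counts(rows)
print(c[('2013-07', None)])  # 2 -> the extra per-month subtotal row
print(c[(None, None)])       # 3 -> the grand-total row
```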
- Get count of orders by year, month as well as date using cube. You will see additional subtotal records per year and per month, along with a grand total.
from pyspark.sql.functions import year
orders. \
    groupBy(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'),
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_year', 'order_month', 'order_date'). \
    show()

orders. \
    cube(
        year('order_date').alias('order_year'),
        date_format('order_date', 'yyyyMM').alias('order_month'),
        'order_date'
    ). \
    agg(count(lit(1)).alias('order_count')). \
    orderBy('order_year', 'order_month', 'order_date'). \
    show()
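The number of subtotal combinations grows quickly with cube: over n grouping columns it aggregates every subset, i.e. 2^n grouping sets, whereas rollup only aggregates the n + 1 hierarchical prefixes. A small sketch (pure Python, just enumerating the sets) makes the difference concrete for the three columns used above.

```python
from itertools import combinations

cols = ['order_year', 'order_month', 'order_date']

# cube: every subset of the grouping columns (2^n grouping sets)
cube_sets = [c for r in range(len(cols) + 1) for c in combinations(cols, r)]

# rollup: only hierarchical prefixes (n + 1 grouping sets)
rollup_sets = [tuple(cols[:i]) for i in range(len(cols) + 1)]

print(len(cube_sets))    # 8
print(len(rollup_sets))  # 4
```

For example, cube aggregates by order_date alone (year and month set to None), a combination rollup never produces.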
Hands-On Tasks
- Get count of orders by date using cube.
- Get count of orders by month as well as date using cube.
- Get count of orders by year, month as well as date using cube.
Conclusion
In this article, we explored how to aggregate data using cube in Spark. The cube function generates aggregated results across every combination of the grouping columns, making it a powerful tool for advanced analytics. We encourage you to practice these concepts and engage with the community for further learning.