Apache Spark Python - Processing Column Data - Using CASE and WHEN

Let us understand how to perform conditional operations using CASE and WHEN in Spark.

  • CASE and WHEN are typically used to apply transformations based on conditions. We can use CASE and WHEN in SQL-style expressions via expr or selectExpr.

  • If we want to use the Dataframe APIs instead, Spark provides the when and otherwise functions. when is available as part of pyspark.sql.functions, and on the Column object it returns we can invoke otherwise.
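
The examples that follow operate on a Dataframe named employeesDF whose bonus column may contain nulls or empty strings. The original setup is not shown, so here is a minimal sketch of one way to create such a Dataframe (the sample values are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('case-when-demo').getOrCreate()

# Hypothetical sample data; bonus is a string column so that both
# NULL and empty-string values can occur
employees = [(1, 'Scott', 1000.0, '10'),
             (2, 'Henry', 1250.0, ''),
             (3, 'Nicole', 750.0, None)]

employeesDF = spark.createDataFrame(
    employees,
    schema='employee_id INT, employee_name STRING, salary DOUBLE, bonus STRING'
)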

Using expr for Conditional Transformations

CASE and WHEN can be used to define conditional transformations in Spark. The following example uses expr to replace null or empty bonus values with 0:

from pyspark.sql.functions import expr

# Replace null or empty bonus values with 0 using a SQL-style CASE expression
employeesDF. \
    withColumn(
        'bonus',
        expr("""
            CASE WHEN bonus IS NULL OR bonus = '' THEN 0
            ELSE bonus
            END
        """)
    ). \
    show()
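
The same CASE expression also works with selectExpr, which projects columns using SQL snippets. A quick sketch, assuming the column names from the setup above:

employeesDF. \
    selectExpr(
        'employee_id',
        'employee_name',
        'salary',
        """CASE WHEN bonus IS NULL OR bonus = '' THEN 0
           ELSE bonus
           END AS bonus"""
    ). \
    show()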

Using the when Function for Conditional Transformations

Spark also provides the when function for conditional transformations. The following example produces the same result using when and otherwise:

from pyspark.sql.functions import when, col, lit

# Replace null or empty bonus values with 0 using the when/otherwise API
employeesDF. \
    withColumn(
        'bonus',
        when((col('bonus').isNull()) | (col('bonus') == lit('')), 0).otherwise(col('bonus'))
    ). \
    show()
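
Note that when returns a Column, so multiple when calls can be chained before otherwise to express several branches. If otherwise is omitted, rows that match none of the conditions end up with null.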

Hands-On Tasks

  1. Transform the ‘bonus’ column to 0 if it is null or empty, otherwise keep the bonus amount.
  2. Create a dataframe from a list called ‘persons’ and categorize each person by age range (a starter sketch follows the tasks).
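
For the second task, here is a starter sketch that chains multiple when calls; the list contents, category labels, and age boundaries are illustrative assumptions:

from pyspark.sql.functions import when, col

# Hypothetical sample data for the 'persons' task
persons = [('Alice', 9), ('Bob', 16), ('Carol', 35), ('Dave', 70)]
personsDF = spark.createDataFrame(persons, schema='name STRING, age INT')

# Chain when calls to bucket each age into a range
personsDF. \
    withColumn(
        'category',
        when(col('age') < 13, 'child').
        when(col('age') < 20, 'teenager').
        when(col('age') < 65, 'adult').
        otherwise('senior')
    ). \
    show()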

Conclusion

In this article, we explored how to use CASE and WHEN in Spark for conditional transformations. It is a powerful feature that allows us to apply transformations based on specific conditions. Try out the hands-on tasks to practice what you’ve learned and feel free to engage with the community for further learning.