Apache Spark Python - Processing Column Data - Dealing with Nulls

Let us understand how to deal with nulls using functions that are available in Spark.

  • We can use coalesce to return the first non-null value.
  • Traditional SQL style functions such as nvl can be used with expr or selectExpr.

Using coalesce Function

coalesce: Returns the first non-null value among its arguments, checked from left to right.

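from pyspark.sql.functions import coalesce, lit

# coalesce accepts only columns or column names, so the literal default is wrapped in lit()
employeesDF.withColumn('bonus', coalesce('bonus', lit(0))).show()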

Using nvl Function with expr

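nvl: A traditional SQL function that returns its second argument when the first is null. It can be called from a SQL expression string passed to expr or selectExpr.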

from pyspark.sql.functions import expr

employeesDF.withColumn('bonus', expr("nvl(bonus, 0)")).show()

Hands-On Tasks

  1. Use the coalesce function to replace null values in the 'bonus' column with 0.
  2. Use the nvl function inside expr or selectExpr to replace null values in the 'bonus' column with 0.

Conclusion

In this article, we discussed how to deal with null values in Spark using coalesce and traditional SQL functions such as nvl. Nulls propagate through arithmetic and comparisons, so replacing them with an explicit default keeps downstream calculations accurate. Practice these concepts with the hands-on tasks above and engage with the community for further learning opportunities.