Apache Spark Python - Processing Column Data - Trimming Characters from Strings

In this article, we will learn how to trim unwanted characters from strings using Spark functions. Trimming is commonly used to strip padding characters from fixed-length records, such as those produced by mainframe systems.

Removing Leading, Trailing, or Both Spaces

In Spark functions, we can use ltrim, rtrim, and trim to remove leading, trailing, or both leading and trailing spaces from strings.

from pyspark.sql.functions import col, ltrim, rtrim, trim

# Sample DataFrame with one padded value (assumes an active SparkSession named spark)
df = spark.createDataFrame([("  Hello.  ",)], ["dummy"])

df.withColumn("ltrim", ltrim(col("dummy"))). \
   withColumn("rtrim", rtrim(col("dummy"))). \
   withColumn("trim", trim(col("dummy"))). \
   show()
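For intuition, the three functions behave like Python's built-in string methods restricted to the space character. The sketch below is a plain-Python analogue (the variable names and sample string are illustrative, not from the original):

```python
# Plain-Python analogue of Spark's ltrim, rtrim, and trim, which remove
# space characters from the left, right, or both ends of a string.
s = "  Hello.  "

ltrimmed = s.lstrip(" ")  # like ltrim(col("dummy")) -> "Hello.  "
rtrimmed = s.rstrip(" ")  # like rtrim(col("dummy")) -> "  Hello."
trimmed = s.strip(" ")    # like trim(col("dummy"))  -> "Hello."

print(repr(ltrimmed), repr(rtrimmed), repr(trimmed))
```

Note that Spark's trim functions remove spaces, whereas Python's `strip()` with no argument removes all whitespace, including tabs and newlines.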

Customizing Trimming with Expressions

We can use the expr function to run Spark SQL trim expressions, which accept an optional trim string so we can remove characters other than spaces, such as trailing dots.

from pyspark.sql.functions import col, expr, trim

# rtrim('.', rtrim(dummy)) first strips trailing spaces, then trailing dots
df.withColumn("ltrim", expr("ltrim(dummy)")). \
   withColumn("rtrim", expr("rtrim('.', rtrim(dummy))")). \
   withColumn("trim", trim(col("dummy"))). \
   show()
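The nested expression above reads inside out: strip trailing spaces first, then strip trailing dots. A plain-Python sketch of the same two-step logic (sample string is illustrative):

```python
# Two-step trailing trim: spaces first, then dots, mirroring
# rtrim('.', rtrim(dummy)) from the Spark SQL expression above.
s = "  Hello..  "

no_spaces = s.rstrip(" ")        # rtrim(dummy): drop trailing spaces
no_dots = no_spaces.rstrip(".")  # rtrim('.', ...): drop trailing dots

print(repr(no_dots))
```

The order matters: stripping dots before spaces would leave the dots in place, because they are not at the end of the string until the spaces are gone.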


Hands-On Tasks

  1. Create a DataFrame with one column and one record.
  2. Apply ltrim, rtrim, and trim to remove the padding spaces.

Conclusion

In this article, we discussed how to trim unwanted characters from strings using Spark functions. Understanding the available trim functions, and when to supply a custom trim string, makes it easier to clean string data effectively.

Now you can practice trimming strings in your Spark environment and explore more about string manipulation functions in Spark. Remember, hands-on practice is crucial for gaining a deeper understanding of the concepts discussed.