Apache Spark Python - Processing Column Data - Extracting Strings using split

Description: Learn how to extract substrings from string columns using the split function in PySpark. This article provides detailed explanations, code examples, and hands-on tasks to help you understand and apply the concept.

Splitting Strings into Arrays

The split function extracts information from variable-length string columns that use a delimiter. It converts each string into an array, whose elements can then be accessed by index.

Example:

from pyspark.sql.functions import split, lit
df.select(split(lit("Hello World, how are you"), " ")).show(truncate=False)
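
Since split returns an array column, individual elements can be pulled out by position. Below is a minimal sketch; the single-row dummy DataFrame and its column name are assumptions used only so the literal expression has a row to evaluate against:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("X",)], ["dummy"])  # single-row dummy DataFrame (assumed setup)

# [1] (equivalent to .getItem(1)) picks the second element of the array, "World,"
df.select(split(lit("Hello World, how are you"), " ")[1].alias("second_word")).show(truncate=False)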

Exploding Arrays into Rows

You can use explode in conjunction with split to turn each element of the resulting array into its own row in the DataFrame. This is useful for tasks such as word counts or phone number analysis.

Example:

from pyspark.sql.functions import explode, split, lit
df.select(explode(split(lit("Hello World, how are you"), " ")).alias('word')).show(truncate=False)
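
Because each word lands on its own row after explode, a simple word count follows naturally from a groupBy. A minimal sketch, reusing the dummy df from the earlier example:

from pyspark.sql.functions import explode, split, lit

# one row per word, then count occurrences of each word
words = df.select(explode(split(lit("Hello World, how are you"), " ")).alias("word"))
words.groupBy("word").count().show(truncate=False)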


Hands-On Tasks

  1. Create a list of employees with name, SSN, and phone numbers.
  2. Create a DataFrame with columns for name, SSN, and phone numbers.
  3. Extract area code and last 4 digits from the phone numbers.
  4. Extract last 4 digits from the SSN.
  5. Perform a phone count analysis on the DataFrame (one possible approach is sketched below).
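
The following is a minimal sketch of one way to work through these tasks. The employee records, column names, and the space-delimited phone and SSN formats are assumptions for illustration, not data from the original exercise:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, size, col

spark = SparkSession.builder.getOrCreate()

# Tasks 1 and 2: employee records with comma-separated, space-delimited phone numbers (assumed format)
employees = [
    ("Alice", "123 45 6789", "+1 234 567 8901,+1 345 678 9012"),
    ("Bob", "987 65 4321", "+1 456 789 0123"),
]
df = spark.createDataFrame(employees, ["name", "ssn", "phone_numbers"])

# Task 3: one row per phone number, then area code and last 4 digits by index
phones = df.withColumn("phone", explode(split(col("phone_numbers"), ","))) \
    .withColumn("area_code", split(col("phone"), " ")[1]) \
    .withColumn("phone_last4", split(col("phone"), " ")[3])

# Task 4: last 4 digits of the SSN
phones = phones.withColumn("ssn_last4", split(col("ssn"), " ")[2])

# Task 5: phone count per employee, using the size of the split array
phone_counts = df.withColumn("phone_count", size(split(col("phone_numbers"), ",")))

phones.show(truncate=False)
phone_counts.show(truncate=False)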

Conclusion

In this article, you learned how to use the split function in PySpark to extract substrings from strings. By following the hands-on tasks and examples provided, you can apply this concept to real-world data processing scenarios. Practice these tasks and engage with the community for further learning.