Apache Spark Python - Processing Column Data - Using to_date and to_timestamp

Let us understand how to convert non-standard dates and timestamps to standard ones.

  • yyyy-MM-dd is the standard date format
  • yyyy-MM-dd HH:mm:ss.SSS is the standard timestamp format
  • Most date manipulation functions expect dates and timestamps in these standard formats. However, our data might not arrive in the expected format.
  • In those scenarios we can use to_date and to_timestamp to convert non-standard dates and timestamps to standard ones, respectively.

Using to_date for Dates

The to_date function takes a string column and a format pattern, parses the string using that pattern, and returns a standard date.

from pyspark.sql.functions import lit, to_date
# df is a dummy single-row DataFrame; spark is the active SparkSession
df = spark.createDataFrame([("X",)], ["dummy"])
df.select(to_date(lit('02-03-2021'), 'dd-MM-yyyy').alias('to_date')).show()

Using to_timestamp for Timestamps

Use the to_timestamp function to convert non-standard timestamps to standard timestamps.

from pyspark.sql.functions import lit, to_timestamp
df.select(to_timestamp(lit('02-Mar-2021 17:30:15'), 'dd-MMM-yyyy HH:mm:ss').alias('to_timestamp')).show()


Hands-On Tasks

Perform the following tasks to practice using to_date and to_timestamp functions:

  1. Create a DataFrame with date and time columns.
  2. Convert the non-standard dates and timestamps in the DataFrame to standard dates and timestamps.

Conclusion

In this article, we explored how to convert non-standard date and timestamp formats to standard ones using the to_date and to_timestamp functions in PySpark. With these functions and the format patterns shown above, you can reliably normalize date and timestamp data in your PySpark applications.