Apache Spark Python - Processing Column Data - Using Date and Time Trunc Functions

In data warehousing, we often run date-based reports such as week-to-date, month-to-date, and year-to-date. Let us understand how we can handle such requirements using appropriate functions over Spark DataFrames.

We can use trunc or date_trunc to get the beginning of the week, month, year, etc. by passing a date or timestamp to them.

  • We can use trunc to get the beginning date of the month or year by passing a date or timestamp to it. For example, trunc(current_date(), "MM") returns the first day of the current month.

  • We can use date_trunc to get the beginning date of the month or year as well as the beginning time of the day or hour by passing a timestamp to it.

    • Get the beginning date based on month: date_trunc("MM", current_timestamp())

    • Get the beginning time based on the day: date_trunc("DAY", current_timestamp())

Let us start the Spark context for this Notebook so that we can execute the code provided.
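If you are running this outside the provided notebook, a SparkSession can be created along these lines (the appName and master values are illustrative assumptions, not from the original):

from pyspark.sql import SparkSession

# Create or reuse a SparkSession; the appName and master values here are placeholders
spark = SparkSession.builder.appName("Processing Column Data - trunc and date_trunc").master("local[*]").getOrCreate()

With the session available, here is a minimal sketch contrasting the two functions: trunc returns a DATE, while date_trunc returns a TIMESTAMP.

from pyspark.sql.functions import trunc, date_trunc, current_date, current_timestamp

# Evaluate both expressions against a one-row DataFrame to compare the results
spark.range(1).select(trunc(current_date(), "MM").alias("month_start_date"), date_trunc("MM", current_timestamp()).alias("month_start_time")).show(truncate=False)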

Using trunc for Date Truncation

To use these functions in Spark, pass a date or timestamp column along with the truncation unit. Note that the argument order differs: trunc takes the column first and the format second, as in trunc(column, format), while date_trunc takes the format first and the column second, as in date_trunc(format, column).

from pyspark.sql.functions import trunc, current_date

# Example of using trunc to get the beginning date of the current month
month_start = trunc(current_date(), "MM")
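Here, month_start is only an unevaluated Column expression. To see its value, select it against a DataFrame, for example a one-row DataFrame from spark.range (a minimal sketch):

# Evaluate the expression; the result is the first day of the current month as a DATE
spark.range(1).select(month_start.alias("month_beginning")).show()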

Using date_trunc for Date and Time Truncation

To understand trunc and date_trunc better, create a DataFrame with date and time columns and apply these functions to those columns. A sample datetimes list is sketched below before the DataFrame is created.
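The datetimes variable referenced below holds the raw rows. Here is an illustrative sample (these specific values are assumptions, not taken from the original notebook):

# Illustrative sample rows (assumed values): each tuple holds a date string and a timestamp string
datetimes = [
    ("2014-02-28", "2014-02-28 10:00:00.123"),
    ("2016-02-29", "2016-02-29 08:08:08.999"),
    ("2017-10-31", "2017-12-31 11:59:59.123"),
    ("2019-11-30", "2019-08-31 00:00:00.000")
]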

# Create a DataFrame with date and time columns
datetimesDF = spark.createDataFrame(datetimes, schema="date STRING, time STRING")

# Using trunc to get the beginning month date from the date field and the beginning year date from the time field
datetimesDF.withColumn("date_trunc", trunc("date", "MM")).withColumn("time_trunc", trunc("time", "yy")).show(truncate=False)

# Using date_trunc to get the beginning month timestamp from the date field and the beginning year timestamp from the time field
datetimesDF.withColumn("date_trunc", date_trunc("MM", "date")).withColumn("time_trunc", date_trunc("yy", "time")).show(truncate=False)

# Using date_trunc to get the beginning hour from the date and time fields and the beginning day from the time field
datetimesDF.withColumn("date_dt", date_trunc("HOUR", "date")).withColumn("time_dt", date_trunc("HOUR", "time")).withColumn("time_dt1", date_trunc("dd", "time")).show(truncate=False)


Hands-On Tasks

Perform the following tasks to understand trunc and date_trunc functions in Spark:

  1. Create a DataFrame named datetimesDF with columns date and time.
  2. Get the beginning month date using the date field and the beginning year date using the time field.
  3. Get the beginning hour time using the date and time fields.

Conclusion

In this article, we learned how to use trunc and date_trunc functions in Spark to manipulate date and time values effectively. By practicing the hands-on tasks, you can gain a better understanding of these functions and apply them to your data processing tasks. Join the community and keep learning!