Apache Spark Python - Processing Column Data - Dealing with Unix Timestamp

Let us understand how to deal with Unix Timestamp in Spark.

  • Unix Timestamp is an integer that represents the number of seconds elapsed since January 1st, 1970 Midnight UTC.
  • This beginning time is also known as the epoch, and the value is incremented by 1 every second.
  • We can convert Unix Timestamp to regular date or timestamp and vice versa.
  • We can use unix_timestamp to convert a regular date or timestamp to a unix timestamp value. For example, unix_timestamp(lit("2019-11-19 00:00:00")).
  • We can use from_unixtime to convert a unix timestamp to a regular date or timestamp. For example, from_unixtime(lit(1574101800)).
  • We can also pass a format to both functions (see the example after the Spark session is created below).

Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()
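
To illustrate the format argument mentioned earlier, here is a minimal sketch using the spark session created above. The custom format strings ("dd-MM-yyyy" and "yyyyMMdd") are only illustrative; both functions fall back to "yyyy-MM-dd HH:mm:ss" when no format is passed.

from pyspark.sql.functions import lit, unix_timestamp, from_unixtime

# Default parsing vs explicit formats for unix_timestamp and from_unixtime
spark.range(1). \
    select(
        unix_timestamp(lit("2019-11-19 00:00:00")).alias("epoch_default"),
        unix_timestamp(lit("19-11-2019"), "dd-MM-yyyy").alias("epoch_with_format"),
        from_unixtime(lit(1574101800)).alias("ts_default"),
        from_unixtime(lit(1574101800), "yyyyMMdd").alias("ts_with_format")
    ). \
    show(truncate=False)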

If you are going to use CLIs, you can use Spark SQL through one of the following 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse


Tasks

Let us perform a few tasks to understand how to deal with Unix Timestamp.

  1. Create a DataFrame named datetimesDF with the columns dateid, date and time.
    datetimes = [(20140228, "2014-02-28", "2014-02-28 10:00:00.123"),
                 (20160229, "2016-02-29", "2016-02-29 08:08:08.999"),
                 (20171031, "2017-10-31", "2017-12-31 11:59:59.123"),
                 (20191130, "2019-11-30", "2019-08-31 00:00:00.000")
                ]
    datetimesDF = spark.createDataFrame(datetimes).toDF("dateid", "date", "time")
    datetimesDF.show(truncate=False)
    
  2. Get the unix timestamp for dateid, date and time (a note on getting the current unix timestamp follows these tasks).
    from pyspark.sql.functions import unix_timestamp, col
    datetimesDF. \
        withColumn("unix_date_id", unix_timestamp(col("dateid").cast("string"), "yyyyMMdd")). \
        withColumn("unix_date", unix_timestamp("date", "yyyy-MM-dd")). \
        withColumn("unix_time", unix_timestamp("time")). \
        show()
    
  3. Create a DataFrame named unixtimesDF with one column unixtime using 4 values. You can use the unix timestamps generated for the time column in the previous task.
    unixtimes = [(1393561800, ),
                 (1456713488, ),
                 (1514701799, ),
                 (1567189800, )
                ]
    unixtimesDF = spark.createDataFrame(unixtimes).toDF("unixtime")
    unixtimesDF.show()
    unixtimesDF.printSchema()
    
  4. Get the date in yyyyMMdd format and also the complete timestamp (see the note on from_unixtime's return type after these tasks).
    from pyspark.sql.functions import from_unixtime
    unixtimesDF. \
        withColumn("date", from_unixtime("unixtime", "yyyyMMdd")). \
        withColumn("time", from_unixtime("unixtime")). \
        show()
    
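As a follow-up to task 2, unix_timestamp can also be called with no arguments, in which case it returns the unix timestamp for the current time. A minimal sketch, reusing unixtimesDF from task 3 just to have a DataFrame to select against:

from pyspark.sql.functions import unix_timestamp

# unix_timestamp() with no arguments returns the current epoch in seconds
unixtimesDF. \
    select(unix_timestamp().alias("current_unixtime")). \
    show()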

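Also note, as a caveat to task 4, that from_unixtime returns a string column rather than an actual timestamp type. If a TimestampType column is needed for further date arithmetic, one option is to cast the unix time directly, roughly as sketched below.

from pyspark.sql.functions import col

# Casting a unix timestamp in seconds to timestamp yields a TimestampType column
unixtimesDF. \
    withColumn("ts", col("unixtime").cast("timestamp")). \
    printSchema()
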
Conclusion

In conclusion, you have learned how to deal with Unix Timestamps in Spark by converting them to regular date or timestamp values and vice versa. It is essential to understand Unix Timestamps for precise time calculations in Spark applications. Practice the tasks provided and continue exploring Spark SQL to enhance your skills further. Join the community to engage with other learners and experts for valuable insights and guidance.