Handling timestamp formats in Spark


Hello everyone, I am trying to work out the best way to deal with timestamps. Here is the approach I plan to use to convert any timestamp format into a native Hive/Spark SQL compatible timestamp. Please share your thoughts if you know an easier way to handle timestamp formats. Thanks in advance!!

In Spark, we need the timestamp to be in the following format:

yyyy-MM-dd HH:mm:ss

If the input data is already in this format, there is no problem: we can read the data as a string, and when we apply the timestamp() function the string is converted to a timestamp column in SQL, so we can perform time-based operations. But if the timestamp is in some other format (like MM-dd-yyyy HH:mm:ss.S), we can handle it as shown below.
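As a sanity check outside Spark, the same pattern logic can be tried in plain Python: the Java SimpleDateFormat pattern "MM-dd-yyyy HH:mm:ss.S" used by unix_timestamp() corresponds roughly to Python's "%m-%d-%Y %H:%M:%S.%f" (this mapping is an illustration, not how Spark parses internally):

```python
from datetime import datetime

# "MM-dd-yyyy HH:mm:ss.S" (Java) roughly maps to the strptime
# pattern below; strptime also accepts non-zero-padded parts
# like "1-2-2013".
fmt = "%m-%d-%Y %H:%M:%S.%f"

for raw in ["01-31-2013 13:01:01.0", "1-2-2013 13:01:01.0"]:
    dt = datetime.strptime(raw, fmt)
    print(dt.strftime("%Y-%m-%d %H:%M:%S"))
# prints:
# 2013-01-31 13:01:01
# 2013-01-02 13:01:01
```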

Assume that you have the following data:
1,01-31-2013 13:01:01.0
2,01-31-2013 13:01:01.0
3,1-2-2013 13:01:01.0

We need to create a DataFrame with the columns ID and DATE_TIME, so that we can perform time-related operations:

#Create an RDD
rdd1 = sc.parallelize(["1,01-31-2013 13:01:01.0","2,01-31-2013 13:01:01.0","3,1-2-2013 13:01:01.0"])

#Split each line into [ID, DATE_TIME]
rdd2 = rdd1.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1]])

#Create a data frame
df = sqlContext.createDataFrame(rdd2, schema=["ID", "DATE_TIME"])

#Register the data frame as a table
df.registerTempTable("df")

#unix_timestamp() parses a string with the given pattern into seconds since the epoch, which lets us deal with any timestamp format
df_1 = sqlContext.sql("""SELECT ID, unix_timestamp(DATE_TIME, "MM-dd-yyyy HH:mm:ss.S") as DATE_TIME_UNIX from df""")

#Register the result as a table and print its contents
df_1.registerTempTable("df_1")
df_1.show()

#will print:
| ID|DATE_TIME_UNIX|
|  1|    1359655261|
|  2|    1359655261|
|  3|    1357149661|

#Then we can convert this back to the standard timestamp format:
sqlContext.sql("""select ID, from_unixtime(DATE_TIME_UNIX, 'yyyy-MM-dd HH:mm:ss') as DATE_TIME from df_1""").show()

#Will show:
| ID|          DATE_TIME|
|  1|2013-01-31 13:01:01|
|  2|2013-01-31 13:01:01|
|  3|2013-01-02 13:01:01|
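One caveat worth noting: unix_timestamp() and from_unixtime() interpret times in the session time zone, so the epoch numbers above reflect the cluster's zone. The round trip itself can be illustrated deterministically in plain Python using UTC (this is an illustration of the epoch round trip, not Spark's internal implementation):

```python
import calendar
from datetime import datetime, timezone

# Parse the raw string, convert to epoch seconds (UTC), then back.
# Spark applies the session time zone instead, so its epoch values
# differ from these by the zone offset.
dt = datetime.strptime("01-31-2013 13:01:01.0", "%m-%d-%Y %H:%M:%S.%f")
epoch = calendar.timegm(dt.timetuple())  # seconds since 1970-01-01 UTC
back = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(epoch, back)   # 1359637261 2013-01-31 13:01:01
```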




Thanks for sharing. I think there should be some simpler technique to do it, like a UDF or something. If I find anything, I will share it.
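For instance, a Python function wrapping the conversion could be registered as a SQL UDF. A rough sketch (the registerFunction call is a Spark 1.x API and assumes the same sqlContext and table as above; the function itself is plain Python):

```python
from datetime import datetime

def to_std_ts(s):
    # Parse "MM-dd-yyyy HH:mm:ss.S" and re-emit "yyyy-MM-dd HH:mm:ss".
    return datetime.strptime(s, "%m-%d-%Y %H:%M:%S.%f").strftime("%Y-%m-%d %H:%M:%S")

print(to_std_ts("01-31-2013 13:01:01.0"))   # 2013-01-31 13:01:01

# In Spark 1.x it could then be used from SQL, e.g.:
# sqlContext.registerFunction("to_std_ts", to_std_ts)
# sqlContext.sql("SELECT ID, timestamp(to_std_ts(DATE_TIME)) FROM df")
```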

Keep sharing.