Why does Spark SQL hash() return the same hash value for different keys in some cases?

apache-spark
spark-sql

#1

Hello All,

I am calculating the hash value of a few columns to determine whether a record is an insert, delete, or update. I found a strange scenario: some records return the same hash value even though their keys are completely different.

For instance:

scala> spark.sql("select hash('40514XXXXX'),hash('41751XXXX')").show()
+---------------+---------------+
|hash(40514XXXX)|hash(41751XXXX)|
+---------------+---------------+
|      976573657|      976573657|
+---------------+---------------+

scala> spark.sql("select hash('14589'),hash('40004XXXX')").show()
+-----------+---------------+
|hash(14589)|hash(40004XXXX)|
+-----------+---------------+
|  777096871|      777096871|
+-----------+---------------+

I understand that hash() returns an integer. Have these values reached the maximum integer value, or is something else causing the duplicates?
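
For context on the integer question, here is a rough sketch in plain Scala (no Spark needed) of how likely at least one collision is when n distinct keys are hashed into a fixed number of bits. It assumes the hash behaves like a uniform random function, which is only an approximation, and the object and method names are made up for illustration. It shows that a 32-bit result can repeat long before any "max value" is reached. It does not explain why these particular keys collide.

```scala
// Rough birthday-bound estimate: chance that at least two of n distinct keys
// share a hash value when hashed uniformly into 2^bits possible values.
// Assumption: the hash behaves like a uniform random function.
object CollisionEstimate {
  def collisionProbability(n: Long, bits: Int): Double = {
    val buckets = math.pow(2.0, bits)
    val pairs = n.toDouble * (n - 1).toDouble / 2.0
    // 1 - exp(-pairs / buckets), written with expm1 so tiny values stay accurate
    -math.expm1(-pairs / buckets)
  }

  def main(args: Array[String]): Unit = {
    for (n <- Seq(1000L, 100000L, 1000000L)) {
      val p32 = collisionProbability(n, 32)
      val p64 = collisionProbability(n, 64)
      println(f"n=$n%d  32-bit: $p32%.6f  64-bit: $p64%.3e")
    }
  }
}
```

Under that assumption, the chance of at least one collision among 100,000 keys is better than even for a 32-bit hash and negligible for a 64-bit one. So for change detection, a wider digest (for example sha2 over the concatenated columns) may be a safer comparison key than hash().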

