How to join two large tables in Spark?

hive
apache-spark
scala

#1

Hi, can anyone explain how to join two large tables in Spark, where one table is about 2 TB and the other about 5 TB? From what I found on the internet, a "broadcast join" is good when one table is much smaller than the other, but I couldn't find any guidance on how small the smaller table's data needs to be. This is a common interview question that comes up often. Any pointers would be appreciated.
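For context, here is a minimal sketch of the two join strategies involved, assuming a `SparkSession` named `spark` and hypothetical table names `orders` (2 TB) and `events` (5 TB) joined on a column `key`. A broadcast join is not suitable when both sides are this large; Spark's default broadcast cutoff is `spark.sql.autoBroadcastJoinThreshold` (10 MB unless changed), and broadcasting only makes sense when one side fits comfortably in each executor's memory:

```scala
import org.apache.spark.sql.functions.broadcast

// Hypothetical large tables; names are placeholders, not from the original post.
val orders = spark.table("orders") // ~2 TB
val events = spark.table("events") // ~5 TB

// For two large relations, a plain equi-join lets Spark choose a
// shuffle-based sort-merge join, which is the usual strategy here:
val joined = orders.join(events, Seq("key"))

// broadcast() is only appropriate when one side is small, e.g. a
// dimension table under the autoBroadcastJoinThreshold (10 MB default):
// val joined = orders.join(broadcast(smallDim), Seq("key"))
```

Pre-bucketing and sorting both tables on the join key (e.g. via `bucketBy` when writing) can avoid the shuffle entirely, which matters at these sizes. This is a sketch, not a tuned solution; it requires a running Spark cluster.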