I am trying to join two RDDs.
The first table contains employees and their respective departments, along with the time period they spent in each department.
From this data I made an RDD which contains emp_no, dept_name.
The second table contains employees and their titles during each time period.
10001,senior staff,to_time, from_time
From this I made an RDD which has emp_no, title.
As you can see, both RDDs have multiple entries for the same key.
So when I join these RDDs and try to print the result, Spark freezes partway through.
I waited for more than 15 minutes and still got no result, even though these datasets are not that large.
When I remove the duplicate keys from both RDDs, I get the result within a minute.
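To make the setup concrete, here is a minimal sketch in plain Python (not Spark, and not my real code) of what I understand an inner join on pair RDDs to do: every value for a key on one side is paired with every value for the same key on the other side. The sample rows and the helper names inner_join and joined_row_count are made up for illustration.

```python
from collections import Counter

# Hypothetical sample pairs, not my real data:
# (emp_no, dept_name) and (emp_no, title)
dept_pairs = [(10001, "d005"), (10001, "d007"), (10002, "d005")]
title_pairs = [
    (10001, "Senior Staff"),
    (10001, "Staff"),
    (10001, "Engineer"),
    (10002, "Staff"),
]


def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


def joined_row_count(left, right):
    """Predict the join size without building it: per key, count_left * count_right."""
    left_counts = Counter(key for key, _ in left)
    right_counts = Counter(key for key, _ in right)
    return sum(
        left_counts[key] * right_counts[key]
        for key in left_counts
        if key in right_counts
    )


joined = inner_join(dept_pairs, title_pairs)
print(len(joined))
print(joined_row_count(dept_pairs, title_pairs))
```

Here emp_no 10001 has 2 department rows and 3 title rows, so it alone contributes 6 joined rows. If that is also how the Spark join behaves, the output could be much larger than either input when keys repeat many times.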
Can someone explain what the issue is when I try to join two RDDs with duplicate keys?