Master Data Management problem scenario using Spark Scala


#1

Problem:
I want to compare attributes of each record of Input table (A) with corresponding attribute of Master Table(B) in its all records and based on weightage
associated with column, generate score.

In below example, record 1 of A will be matched with all records (1,2,3,4) of B. Compare score with threshold. If it is a match, get masterId from B and attach to input record.

Please suggest optimised solution. I’m having an unoptimized solution though which is taking lot of time.

Unoptimized Solution. I am doing cross join (AB) and the comparing attribute by its corresponding attribute. If Match, (1 * Score for that attribute) will be added to final score
If not match, (0
Score for that attribute), i.e 0 will be added to final score. Lastly, I’m getting the maximum final score by grouping by (DOB+GENDER+SSN+FN+LN+Patient_id+Client_Id).
Get master Id and attach it with input A.

Ex:
25-01-1991 M
1234 Abhinav
Kashyap P123
C123
is matched with:
25-01-1991 M
2415 Abhinav
Kashyap P123
C123 M123

score is (125) + (10) + (025) + (10) + (15) + (115) + (1*15) = 60

Input Table (A):

DOB GENDER
SSN FN LN Patient_ID Client_ID
25-01-1991 M
1234 Abhinav
Kashyap P123
C123
26-01-1991 M
2345 Abhi
Kashyap P123
C123
25-01-1991 M
7872 Yuva
Kosuru P124
C122

Master_Table (B):

DOB GENDER
SSN FN LN Patient_ID Client_ID Master_Id
25-01-1991 M
2415 Abhinav
Kashyap P123
C123 M123
26-01-1991 M
2345 Abhi
Kashyap P123
C123 M234
25-01-1991 M
7872 Yuva
Kosuru P124
C122 M345
25-01-1991 M
1234 Abhinav
Kash P123
C123 M456

Weightage ©:

Attribute_Name Weightage
DOB 25
GENDER 0
SSN 25
FN 0
LN 5
Patient_ID 15
Client_ID 15

Threshold (D):

Low_Score High_Score
Action
100 100000
Match
85 99 Exception
0 84 New Record

Final Output:

DOB GENDER
SSN FN LN Patient_ID Client_ID Master_Id
25-01-1991 M
1234 Abhinav
Kashyap P123
C123 M456
26-01-1991 M
2345 Abhi
Kashyap P123
C123 M234
25-01-1991 M
7872 Yuva
Kosuru P124
C122 M345


#2

Only way to optimise these type of problems is by narrowing down your Cartesian products. For example if you want to check with in a state or region, amount of data that needs to be generated with additional criteria with state or region will be much less than with out that criteria.

@kmln, do you have any thoughts around this?