Hi I am looking for inputs on the approach to solve this with big data, maybe Hive and/or HBase and Spark.
I’ll try not to disclose any info that I am not supposed to and focus on technical aspects. We want to use big data as we want scalability. Number of csv files/companies could increase in the future.
Say an organisation receive 3 customer information csv file from 3 companies,(see them as tables below)
companyA provides csvA(custA_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
companyB provides csvB(custB_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
companyC provides csvC(custC_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
The organisation wants to find out all the customers that have registered with more than one company above - meaning, a person could have different customer name in 2 or 3 above companies, but some of those customer attributes could match. Say Jane Eve in file csvA and Jane Jackson in csvB although they have different names, but they have the same address and DOB so they should be considered the same person and we want to record them in another result table.
So here’s the criteria:
- if name and address the same they are the same person.
- if address and DOB and tel_no the same they are the same person.
- if tel_no and (passport_no or driver_license_no) the same they are the same person.
this criteria list could get bigger as business side submits more requirements.
(note that between csvA and csvB a person could be matched by name/address, and then between csvB and csvC the same person could be match by tel_no and passport_no.
The output we want is like this, a row in the below table means customer with cust_no Axxx/Bxxxx/Cxxxx from companyA/B/C are the same person.(sorry could’n line them up)
match_id CustA_no CustB_no CustC_no
m12345 A00001 B00003
m12226 A00008 B00004 C00005
m33332 A98777 C34533
m23457 A78932 B87677 C34556
The tables are huge, around 5 millions of records each. Each week there will be 3 new complete csv files comes in and we’ll need to do a fresh re-match for the files. Please assume data are already cleansed, formatted(like name, address etc), so they can be compared easily.
It looks like I should use Hive to do this ? It seems table bucketing should be used ? What would be your approach/design ? Please let me know if I need to provide more info.