Real world scenario to be solved with big data, find possible matches in 3 tables

Hi I am looking for inputs on the approach to solve this with big data, maybe Hive and/or HBase and Spark.
I’ll try not to disclose any info that I am not supposed to and focus on technical aspects. We want to use big data as we want scalability. Number of csv files/companies could increase in the future.
Say an organisation receive 3 customer information csv file from 3 companies,(see them as tables below)
companyA provides csvA(custA_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
companyB provides csvB(custB_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
companyC provides csvC(custC_id, cust_name, ssn, tel_no, address, DOB, driver_licence_no, passport_no)
The organisation wants to find out all the customers that have registered with more than one company above - meaning, a person could have different customer name in 2 or 3 above companies, but some of those customer attributes could match. Say Jane Eve in file csvA and Jane Jackson in csvB although they have different names, but they have the same address and DOB so they should be considered the same person and we want to record them in another result table.
So here’s the criteria:

  1. if name and address the same they are the same person.
  2. if address and DOB and tel_no the same they are the same person.
  3. if tel_no and (passport_no or driver_license_no) the same they are the same person.
    this criteria list could get bigger as business side submits more requirements.
    (note that between csvA and csvB a person could be matched by name/address, and then between csvB and csvC the same person could be match by tel_no and passport_no.
    The output we want is like this, a row in the below table means customer with cust_no Axxx/Bxxxx/Cxxxx from companyA/B/C are the same person.(sorry could’n line them up)
    match_id CustA_no CustB_no CustC_no
    m12345 A00001 B00003
    m12226 A00008 B00004 C00005
    m33332 A98777 C34533
    m23457 A78932 B87677 C34556

The tables are huge, around 5 millions of records each. Each week there will be 3 new complete csv files comes in and we’ll need to do a fresh re-match for the files. Please assume data are already cleansed, formatted(like name, address etc), so they can be compared easily.

It looks like I should use Hive to do this ? It seems table bucketing should be used ? What would be your approach/design ? Please let me know if I need to provide more info.

Quick thought - for each matching criteria, if I create a HBase table, say table1 use (name, address) as row key, scan csv A/B/C file, for those with the same (name/address), the recA, recB, recC from the above 3 files have the same (name/address), they’ll be saved as cf:A, cf:B, cf:C. Then we retrieve by rowKey:cf, if there are more than 1 cell return then it means there are matches. Similarly we create another H table using another criteria (tel-no/psp#) as rowKey for matches on tel-no and passport number. Maynot be a good approach but this will work ?

Hi Jerry,

Thank you for the post.
If possible Can you provide 2 or 3 records for each file???


I am pretty sure, we should be able to do this using Spark DataFrame’s, provided your company is using Apache Spark. It would be better than using Hive because of the in-memory computation and Catalyst optimizer run by Spark on DataFrame’s. Also, 5 million records will not be so huge that it would not fit into the memory of a production cluster(@itversity @gnanaprakasam @venkatreddy-amalla please correct me if wrong).

If you can provide us with some sample data and the output required, it would be an interesting use case to solve,