Automated Patient Record Deduplication for a US Healthcare Platform
AI
Healthcare
AI consulting
Engineering
34,860
Patient records unified into one master database
97.5%
Of all merges completed with zero human input
31 min
Total execution time vs. 2,905 manual staff hours
$0
Data loss — zero records dropped or falsely merged
About the Client
A US-based multi-specialty healthcare platform had accumulated patient records across two separate legacy MS SQL Server systems over years of parallel operation — one serving as the primary clinical database, the other a legacy system from an earlier infrastructure era.
Together they held 34,860 patient records, spread across 90+ tables with cross-dependencies, and backed by 3 distinct source databases covering clinical data, pharmaceutical records, and pharmacy inventory.
As the client prepared to consolidate onto a modern PostgreSQL infrastructure, they hit a fundamental problem: the two patient databases contained overlapping entries with no reliable shared identifier. Customer IDs were locally unique within each system but not globally unique across both. SSNs — the natural deduplication key — were encrypted with different keys in each database, making direct comparison impossible.
On top of that, 32 columns were absent in one of the legacy systems, and an initial scan revealed 1,522 data conflicts across the records — 1,406 middle name discrepancies and 116 gender value mismatches.
A manual record-by-record review was estimated at 2,905 staff hours and 6–12 months of analyst time. Neither the timeline nor the cost was acceptable. Before any infrastructure consolidation could proceed, the client needed an automated, auditable, HIPAA-compliant method to merge both datasets into a single master database — without accidentally combining records that belonged to two different people.
Together they held 34,860 patient records, spread across 90+ tables with cross-dependencies, and backed by 3 distinct source databases covering clinical data, pharmaceutical records, and pharmacy inventory.
As the client prepared to consolidate onto a modern PostgreSQL infrastructure, they hit a fundamental problem: the two patient databases contained overlapping entries with no reliable shared identifier. Customer IDs were locally unique within each system but not globally unique across both. SSNs — the natural deduplication key — were encrypted with different keys in each database, making direct comparison impossible.
On top of that, 32 columns were absent in one of the legacy systems, and an initial scan revealed 1,522 data conflicts across the records — 1,406 middle name discrepancies and 116 gender value mismatches.
A manual record-by-record review was estimated at 2,905 staff hours and 6–12 months of analyst time. Neither the timeline nor the cost was acceptable. Before any infrastructure consolidation could proceed, the client needed an automated, auditable, HIPAA-compliant method to merge both datasets into a single master database — without accidentally combining records that belonged to two different people.
Location
United States
Services
Data Engineering, ETL, Database Migration, Deduplication
Tech Stack
Python, Apache Airflow, MS SQL Server, PostgreSQL, Supabase, Docker
2,905 hrs
Of manual analyst work eliminated by automation
1,522
data conflicts across the records
The Problems
1
No Reliable Shared Identifier
Customer IDs were locally unique within each database but not globally unique across both. SSNs — the natural deduplication key — were encrypted with different keys in each system, making direct cryptographic comparison impossible.
2
High-Frequency Name Collisions
Common names (e.g., "Rodriguez, Maria") appeared across 15+ distinct patients in the dataset. Name-alone matching would generate unacceptably high false-positive merge rates — combining different people into one record.
3
Missing Critical Fields
Some patient records lacked date of birth entirely. The system had to handle absent data gracefully — neither assuming identity nor erroneously discarding valid records.
4
1,522 Data Conflicts Between Sources
Field-level conflicts were detected across both systems: 1,406 middle name discrepancies (e.g., "Robert" vs "Bob", "Catherine" vs "Cathy") and 116 gender value conflicts. These needed to be logged and surfaced without blocking the merge.
5
Conservative Healthcare Data Standard
In a clinical context, merging two records that belong to different people is far more dangerous than keeping a duplicate. The algorithm had to be calibrated toward safety — defaulting to "keep separate" whenever confidence was anything less than high.
6
HIPAA Audit Trail for Every Decision
Every merge decision — automated or manually reviewed — required a complete, attributable record: source database, original IDs, reviewer identity, timestamp, and rationale. No merge could be anonymous.
The Solutions
3-Level Fuzzy Matching Algorithm
Since SSN and CustomerID were unusable, we built a composite identity-matching engine based on First Name + Last Name + Middle Name + Date of Birth. Each candidate pair received a confidence score:
- Score 80 (High confidence): First, last, and middle names all match → automated merge
- Score 50 (Uncertain): First and last match, but middle name is missing in one record → human review
- Score 20 (Low confidence): First and last match, but middle names differ → treated as different patients
Only Score 80 matches were merged automatically. All others were routed to human review.
- Score 80 (High confidence): First, last, and middle names all match → automated merge
- Score 50 (Uncertain): First and last match, but middle name is missing in one record → human review
- Score 20 (Low confidence): First and last match, but middle names differ → treated as different patients
Only Score 80 matches were merged automatically. All others were routed to human review.
Human-in-the-Loop Review Workflow
The 2.5% of ambiguous cases (1,122 patient pairs) were surfaced in a structured review table, displaying both records side by side with full field detail, conflict type, and match score. A clinical expert could approve or reject each pairing — or, once a business rule was established, apply it to an entire conflict class in a single operation.
Conflict Detection & Classification Engine
The system automatically categorized every conflict by type (DifferentDOB, MissingDOB_A, MissingDOB_B, MiddleNameMismatch, GenderConflict) and logged the count and severity of each. This enabled the client to analyze conflict patterns and define precise business rules before approving any batch.
Business Rule Encoding & Bulk Approval
After reviewing the ambiguous case breakdown, the client established a single governing rule: "Different date of birth = different patient." With that rule codified, all 1,122 ambiguous cases were resolved with a single SQL statement — eliminating weeks of individual review and reducing Phase 2 from months to 30 minutes.
3-Phase Migration Architecture
- Phase 1 — Automated Merge (1.2 seconds): 33,997 high-confidence records merged without human input
- Phase 2 — Human Review (30 minutes): 1,122 ambiguous pairs reviewed and approved via business rule
- Phase 3 — Final Load (<1 second): Approved records loaded with full FK integrity validation
- Phase 2 — Human Review (30 minutes): 1,122 ambiguous pairs reviewed and approved via business rule
- Phase 3 — Final Load (<1 second): Approved records loaded with full FK integrity validation
HIPAA-Compliant Audit Trail on Every Record
Every entry in the unified database carries: source table, original CustomerID from Database A, original CustomerID from Database B, merge timestamp, reviewer name, review date, and review notes. Full chain of custody — ready for compliance audit at any time.
Achieved results
The results
34,860s
unique patient records unified into a single master database
97.5%
of records merged automatically — zero human input required for 33,997 patients
100%
merge accuracy — zero false matches, zero records from different patients combined
31 min
total execution time vs. an estimated 2,905 manual staff hours
~$130,225
saved compared to manual deduplication at $30–60/hr analyst rates
Zero
data loss — full referential integrity validated post-merge
Book a live demo to see it in action
RELEVANT CASES
Check our more healthcare stories
AI-powered athletic footwear solution for professional athletes
6
Months to MVP release
Strava & Nike
EU competitor
Growing patient acquisition and revenue in virtual OB-GYN care
462
New patients acquired
$7M funding
Raised to scale nationwide
Cutting care delivery costs by 40% with AI-driven virtual-first healthcare
96% reduction
In cardiovascular disease amputations
Contact Us
Please provide your info and inquiry details below.