Automated Patient Record Deduplication for a US Healthcare Platform

AI Healthcare AI consulting Engineering

34,860

Patient records unified into one master database

97.5%

Of all merges completed with zero human input

31 min

Total execution time vs. 2,905 manual staff hours

Data loss — zero records dropped or falsely merged

About the Client

A US-based multi-specialty healthcare platform had accumulated patient records across two separate legacy MS SQL Server systems over years of parallel operation — one serving as the primary clinical database, the other a legacy system from an earlier infrastructure era.

Together they held 34,860 patient records, spread across 90+ tables with cross-dependencies, and backed by 3 distinct source databases covering clinical data, pharmaceutical records, and pharmacy inventory.

As the client prepared to consolidate onto a modern PostgreSQL infrastructure, they hit a fundamental problem: the two patient databases contained overlapping entries with no reliable shared identifier. Customer IDs were locally unique within each system but not globally unique across both. SSNs — the natural deduplication key — were encrypted with different keys in each database, making direct comparison impossible.

On top of that, 32 columns were absent in one of the legacy systems, and an initial scan revealed 1,522 data conflicts across the records — 1,406 middle name discrepancies and 116 gender value mismatches.

A manual record-by-record review was estimated at 2,905 staff hours and 6–12 months of analyst time. Neither the timeline nor the cost was acceptable. Before any infrastructure consolidation could proceed, the client needed an automated, auditable, HIPAA-compliant method to merge both datasets into a single master database — without accidentally combining records that belonged to two different people.

Location

United States

Services

Data Engineering, ETL, Database Migration, Deduplication

Tech Stack

Python, Apache Airflow, MS SQL Server, PostgreSQL, Supabase, Docker

2,905 hrs

Of manual analyst work eliminated by automation

1,522

data conflicts across the records

The Problems

No Reliable Shared Identifier

Customer IDs were locally unique within each database but not globally unique across both. SSNs — the natural deduplication key — were encrypted with different keys in each system, making direct cryptographic comparison impossible.

High-Frequency Name Collisions

Common names (e.g., "Rodriguez, Maria") appeared across 15+ distinct patients in the dataset. Name-alone matching would generate unacceptably high false-positive merge rates — combining different people into one record.

Missing Critical Fields

Some patient records lacked date of birth entirely. The system had to handle absent data gracefully — neither assuming identity nor erroneously discarding valid records.

1,522 Data Conflicts Between Sources

Field-level conflicts were detected across both systems: 1,406 middle name discrepancies (e.g., "Robert" vs "Bob", "Catherine" vs "Cathy") and 116 gender value conflicts. These needed to be logged and surfaced without blocking the merge.

Conservative Healthcare Data Standard

In a clinical context, merging two records that belong to different people is far more dangerous than keeping a duplicate. The algorithm had to be calibrated toward safety — defaulting to "keep separate" whenever confidence was anything less than high.

HIPAA Audit Trail for Every Decision

Every merge decision — automated or manually reviewed — required a complete, attributable record: source database, original IDs, reviewer identity, timestamp, and rationale. No merge could be anonymous.

The Solutions

3-Level Fuzzy Matching Algorithm

Since SSN and CustomerID were unusable, we built a composite identity-matching engine based on First Name + Last Name + Middle Name + Date of Birth. Each candidate pair received a confidence score:

- Score 80 (High confidence): First, last, and middle names all match → automated merge

- Score 50 (Uncertain): First and last match, but middle name is missing in one record → human review

- Score 20 (Low confidence): First and last match, but middle names differ → treated as different patients

Only Score 80 matches were merged automatically. All others were routed to human review.

Human-in-the-Loop Review Workflow

The 2.5% of ambiguous cases (1,122 patient pairs) were surfaced in a structured review table, displaying both records side by side with full field detail, conflict type, and match score. A clinical expert could approve or reject each pairing — or, once a business rule was established, apply it to an entire conflict class in a single operation.

Conflict Detection & Classification Engine

The system automatically categorized every conflict by type (DifferentDOB, MissingDOB_A, MissingDOB_B, MiddleNameMismatch, GenderConflict) and logged the count and severity of each. This enabled the client to analyze conflict patterns and define precise business rules before approving any batch.

Business Rule Encoding & Bulk Approval

After reviewing the ambiguous case breakdown, the client established a single governing rule: "Different date of birth = different patient." With that rule codified, all 1,122 ambiguous cases were resolved with a single SQL statement — eliminating weeks of individual review and reducing Phase 2 from months to 30 minutes.

3-Phase Migration Architecture

- Phase 1 — Automated Merge (1.2 seconds): 33,997 high-confidence records merged without human input

- Phase 2 — Human Review (30 minutes): 1,122 ambiguous pairs reviewed and approved via business rule

- Phase 3 — Final Load (<1 second): Approved records loaded with full FK integrity validation

HIPAA-Compliant Audit Trail on Every Record

Every entry in the unified database carries: source table, original CustomerID from Database A, original CustomerID from Database B, merge timestamp, reviewer name, review date, and review notes. Full chain of custody — ready for compliance audit at any time.