Migrating Legacy Databases for a Multi-Specialty Healthcare Platform

AI Healthcare AI consulting Engineering

Specialized Airflow pipelines, each handling a different data domain

Additional cross-database tables with complex foreign key dependencies

Missing columns between source systems due to schema drift

100%

Validation pass rate, 5-level checks run after every single table

About the Client

A US-based multi-specialty healthcare platform had grown over the years by operating across three separate legacy databases on MS SQL Server — one for clinical records, one for pharmaceutical data, and one for pharmacy inventory.

As the platform scaled, maintaining these siloed systems became a critical operational and compliance liability.

Data inconsistencies between sources, schema drift, and the inability to run unified reporting across all three systems put both care quality and audit readiness at risk.

Why they reached out to us

- Three isolated MS SQL Server databases were creating data inconsistencies across clinical, pharmaceutical, and inventory workflows — there was no single source of truth

- Schema differences between sources (including up to 32 missing columns per table), integer-based legacy IDs, and cross-database foreign key dependencies made off-the-shelf migration tools inadequate

- HIPAA compliance required a tamper-proof audit trail of every data transformation, with rollback capability at every stage

- Any production downtime would directly disrupt patient care and clinical operations — a hard constraint.

Legacy MS SQL Server databases with incompatible schemas consolidated into one

90+

Tables with cross-database dependencies requiring orchestrated migration

Missing columns per table due to schema drift between source systems

Zero

Production downtime — legacy systems stayed live throughout the entire migration

100% HIPAA

Full compliance with audit trail on every single data transformation

10x

Faster data loading via PostgreSQL COPY vs. standard INSERT statements

The Problems

Heterogeneous Legacy Sources

Three databases with different schemas, validation rules, and up to 32 missing columns between sources made direct table-to-table migration impossible. Each source required custom transformation logic before a single row could land in the target system.

Cross-Database Foreign Key Dependencies

Tables from Database A referenced tables in Database B. Without a controlled migration order and a shared ID-resolution layer, loading any table out of sequence would result in orphaned records and broken relationships.

Integer-to-UUID Primary Key Transformation

All legacy primary keys used integer IDs. The target system required globally unique UUIDs — meaning every ID needed to be converted, every foreign key resolved to the new UUID, and every cross-table relationship rebuilt without losing a single link.

Zero Downtime Requirement

The platform couldn't afford any interruption to live operations. The legacy systems had to stay fully operational and serving production traffic throughout the entire migration window — ruling out the standard "stop, dump, restore" approach.

The Solutions

Apache Airflow Orchestration - 5 Specialized Pipelines

We built the entire ETL backbone on Apache Airflow, with 5 dedicated DAGs — one per data domain (clinical normalization, core clinical tables, pharmaceutical directories, prescriptions, pharmacy inventory). Each DAG visualized task dependencies in real time, auto-retried failed steps, and logged every operation for audit. The client had live access to the Airflow UI throughout the migration.

4-Stage ETL Architecture

Every table passed through the same four-stage pipeline:

- Extract: Chunked reads from MS SQL Server (10–50K rows per chunk) with cursor-based pagination and auto-reconnect on failure

- Transform: Type conversion (e.g. TINYINT → BOOLEAN), field renaming (PascalCase → camelCase), UUID generation for primary keys, and FK resolution via UUID mapping

- Load: PostgreSQL COPY command (10x faster than INSERT), with adaptive chunk sizing and automatic transaction management

- Validate: 5-level post-load checks — row count match, NULL primary key detection, zero orphaned FK records, type correctness, and legacy ID preservation

Configuration-Driven Migration Engine

Adding a new table to the migration required zero code changes — only a config entry describing the table name, field mappings, type conversions, chunk size, and load strategy. This reduced onboarding time for each new table from hours to minutes and made the system fully reproducible across dev, test, and production environments.

Cross-Database UUID Mapping via Airflow XCom

To resolve foreign keys across three separate databases, we built a shared UUID mapping store passed between tasks using Airflow's XCom mechanism. This allowed Task B to look up UUIDs generated by Task A — without any additional database calls — enabling accurate cross-DB FK resolution even when tasks ran in separate processes.

Many-to-Many Relationship Reconstruction

When legacy data stored relationships implicitly (e.g., a location record with num_chairs: 5 rather than 5 chair records), the pipelines automatically expanded and denormalized these into the correct target schema — creating normalized relationship tables from aggregated source data.

Zero-Downtime Migration with Full Rollback

Legacy systems continued serving production traffic throughout the entire migration. Every DAG was designed to be re-runnable: truncate strategy preserved indexes for fast reruns, and the full XCom mapping state made rollback possible at any pipeline stage without data loss.