1. Requirements & Scope (5 min)

Functional Requirements

  1. Migrate data from a legacy datastore (e.g., MySQL) to a new datastore (e.g., PostgreSQL, DynamoDB, or a new schema in the same engine) with zero downtime
  2. Maintain full data consistency between source and destination throughout the migration — no data loss, no corruption
  3. Provide a validation framework that continuously compares source and destination data and reports discrepancies
  4. Support rollback at any stage — if the new system has issues, revert to the old system without data loss
  5. Handle schema transformations during migration (column renames, type changes, data enrichment, denormalization)

Non-Functional Requirements

  • Availability: Zero downtime. The system must remain fully operational throughout the entire migration, which may take days to weeks.
  • Latency: No user-facing latency increase during migration. Reads and writes continue at normal speed.
  • Consistency: Strong consistency. After migration completes, the new datastore must contain 100% of the data, not 99.99%.
  • Scale: 50 TB of data, 500M rows across 200 tables. 10K writes/sec, 100K reads/sec. Migration must complete within 2 weeks.
  • Durability: The migration process itself must be resumable. If the migration pipeline crashes, it resumes from where it left off, not from scratch.

2. Estimation (3 min)

Backfill Phase

  • 50 TB of data to migrate
  • Network throughput (source → destination): 1 Gbps sustained = 125 MB/sec
  • Time to backfill: 50 TB / 125 MB/sec = 400,000 sec ≈ 4.6 days at full throughput
  • With throttling to avoid impacting production (50% utilization): ~9 days
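A quick sanity check of these figures in Python (the throughput and utilization numbers are the assumptions stated above):

```python
TB = 10**12
MB = 10**6

data_bytes = 50 * TB             # total data to migrate
throughput = 125 * MB            # 1 Gbps sustained, in bytes/sec

seconds = data_bytes / throughput
days = seconds / 86_400          # 86,400 seconds per day
throttled_days = days / 0.50     # cap at 50% utilization to protect production

print(round(seconds), round(days, 1), round(throttled_days, 1))
# 400000 4.6 9.3
```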

Change Data Capture (CDC) During Backfill

  • 10K writes/sec to source DB during backfill
  • CDC event size: ~500 bytes average
  • CDC throughput: 10K × 500 bytes = 5 MB/sec (easily handled by Kafka)
  • Events generated during 9-day backfill: 10K/sec × 86400 × 9 = 7.8 billion events
  • At 500 bytes each: 3.9 TB of CDC data (Kafka with 14-day retention can handle this)

Validation Phase

  • Compare 500M rows between source and destination
  • Comparison rate: 50K rows/sec (read from both, hash, compare)
  • Time: 500M / 50K = 10,000 sec ≈ 2.8 hours per full validation pass
  • Run 3 passes for confidence: ~8 hours

Total Timeline

  • Backfill: 9 days
  • CDC catch-up: 1-2 hours (the backlog of changes accumulated during backfill compacts to at most one apply per changed row when deduplicated by primary key)
  • Validation: 1 day (3 passes + fixing discrepancies)
  • Shadow reads: 2-3 days (comparison in production)
  • Cutover: minutes
  • Total: ~2 weeks

3. API Design (3 min)

This is an internal infrastructure system, not a user-facing API. But it needs operational APIs for engineers to manage the migration.

Migration Control API

// Create a migration plan
POST /v1/migrations
  Body: {
    "name": "mysql_to_postgres_users",
    "source": {
      "type": "mysql",
      "connection": "mysql://source-host:3306/users_db",
      "tables": ["users", "profiles", "addresses"]
    },
    "destination": {
      "type": "postgresql",
      "connection": "postgresql://dest-host:5432/users_db"
    },
    "transform_config": "s3://migrations/transforms/users_v2.yaml",
    "parallelism": 16,
    "throttle_pct": 50
  }
  Response 201: { "migration_id": "mig_abc123", "status": "created" }

// Start/pause/resume/abort a migration
POST /v1/migrations/{migration_id}/start
POST /v1/migrations/{migration_id}/pause
POST /v1/migrations/{migration_id}/resume
POST /v1/migrations/{migration_id}/abort

// Get migration status
GET /v1/migrations/{migration_id}/status
  Response 200: {
    "migration_id": "mig_abc123",
    "phase": "backfill",                // backfill | cdc_catchup | validation | shadow | cutover | done
    "progress": {
      "tables_completed": 147,
      "tables_total": 200,
      "rows_migrated": 320000000,
      "rows_total": 500000000,
      "bytes_migrated": "32TB",
      "current_lag": "2.3s",            // CDC lag behind source
      "errors": 0
    }
  }

// Trigger validation
POST /v1/migrations/{migration_id}/validate
  Body: { "mode": "full", "sample_rate": 1.0 }

// Execute cutover
POST /v1/migrations/{migration_id}/cutover
  Body: { "strategy": "instant" }       // instant | gradual

4. Data Model (3 min)

Migration State (PostgreSQL — the migration controller’s own DB)

Table: migrations
  migration_id     (PK) | varchar(20)
  name                   | varchar(100)
  source_config          | jsonb
  destination_config     | jsonb
  transform_config_url   | varchar(500)
  phase                  | enum('created','backfill','cdc_catchup','validation','shadow','cutover','done','aborted')
  created_at             | timestamp
  updated_at             | timestamp

Table: migration_table_progress
  migration_id (PK, FK)  | varchar(20)    -- composite key with table_name
  table_name   (PK)      | varchar(100)
  status                 | enum('pending','in_progress','completed','failed')
  rows_migrated          | bigint
  rows_total             | bigint
  last_checkpoint        | varchar(200)   -- cursor/offset for resumability
  last_error             | text
  updated_at             | timestamp

Table: validation_results
  validation_id    (PK) | varchar(20)
  migration_id     (FK) | varchar(20)
  table_name             | varchar(100)
  total_rows_compared    | bigint
  mismatches_found       | bigint
  missing_in_dest        | bigint
  missing_in_source      | bigint
  sample_mismatches      | jsonb          -- first 100 mismatched row IDs
  completed_at           | timestamp

CDC Offset Tracking (Kafka Consumer Offsets + PostgreSQL)

Table: cdc_checkpoints
  migration_id (PK, FK)  | varchar(20)    -- composite key with source_table
  source_table (PK)      | varchar(100)
  binlog_file            | varchar(100)   -- MySQL binlog position
  binlog_position        | bigint
  kafka_topic            | varchar(200)
  kafka_offset           | bigint
  updated_at             | timestamp

Why PostgreSQL for Migration State?

  • The migration controller itself needs ACID guarantees for checkpoint management
  • Resumability depends on accurate checkpoint tracking
  • Low write volume (metadata only, not the actual data being migrated)

5. High-Level Design (12 min)

Migration Phases

Phase 1: Backfill (Historical Data)

Source DB (MySQL)
  → Backfill Workers (16 parallel):
    For each table:
      1. SELECT * FROM table WHERE id > {last_checkpoint} ORDER BY id LIMIT 10000
      2. Transform rows (schema mapping, type conversion, enrichment)
      3. Batch INSERT into destination DB
      4. Update checkpoint in migration_table_progress
      5. Throttle: sleep if approaching throughput limit
    Repeat until all rows processed

  → Simultaneously, start CDC:
    MySQL binlog → Debezium (CDC connector) → Kafka topics
    (events accumulate in Kafka; not consumed yet until backfill completes)
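The backfill loop above can be sketched in Python. Everything here is illustrative: the source and destination are plain dicts standing in for database handles, `transform` is a no-op by default, and the throttle hook is omitted.

```python
def backfill_table(source_rows, dest, checkpoints, table, batch_size=1000,
                   transform=lambda r: r):
    """Resumable backfill sketch: copy rows in primary-key order, batch by
    batch, persisting a checkpoint after every batch so a crash resumes
    from the last checkpoint rather than from scratch.
    source_rows: {id: row}; dest and checkpoints are dicts standing in for
    the destination DB and the migration_table_progress table."""
    last_id = checkpoints.get(table, 0)
    while True:
        # 1. Read the next batch past the checkpoint, in id order
        #    (a real worker would issue the SELECT ... WHERE id > ? query)
        batch = sorted((i, r) for i, r in source_rows.items() if i > last_id)[:batch_size]
        if not batch:
            break
        # 2. Transform and 3. batch-write to the destination
        for row_id, row in batch:
            dest[row_id] = transform(row)
        # 4. Persist the checkpoint only after the batch is durably written
        last_id = batch[-1][0]
        checkpoints[table] = last_id
        # 5. Throttle hook would go here (sleep when nearing throughput cap)
    return last_id
```

Resuming is just calling the function again: it picks up after the stored checkpoint.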

Phase 2: CDC Catch-Up

Kafka (CDC events accumulated during backfill)
  → CDC Consumer Workers:
    1. Read events from Kafka (in order, per partition)
    2. For each event (INSERT/UPDATE/DELETE):
       - Transform to destination schema
       - Apply to destination DB
       - Handle conflicts via last-writer-wins: compare timestamps and apply
         the CDC event only if it is newer than the backfilled row
         (see Deep Dive 1)
    3. Track consumer offset in cdc_checkpoints
    4. Continue until lag < 100ms (real-time)
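A minimal sketch of this apply loop, assuming events are already ordered per key (as a Kafka partition keyed by primary key would guarantee); `dest` and `offsets` are dicts standing in for the destination DB and the cdc_checkpoints table:

```python
def apply_cdc_events(events, dest, offsets, topic="cdc.users"):
    """Apply an ordered stream of CDC events to the destination, committing
    the consumer offset after each apply so a crash resumes from the last
    committed offset. Upserts keyed by primary key make replays harmless."""
    start = offsets.get(topic, 0)
    for offset, event in enumerate(events):
        if offset < start:
            continue                      # already applied before a crash
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            dest[key] = row               # idempotent upsert
        elif op == "delete":
            dest.pop(key, None)
        offsets[topic] = offset + 1       # commit offset after the apply
```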

Phase 3: Validation

Validation Workers:
  For each table:
    1. Read batch of rows from source (by primary key range)
    2. Read same batch from destination
    3. Hash each row (MD5 of serialized field values)
    4. Compare hashes:
       - Match → continue
       - Mismatch → log to validation_results, attempt repair
    5. Check for orphans:
       - Rows in source but not in destination → re-migrate
       - Rows in destination but not in source → investigate (possible delete missed by CDC)
  Run until a full table scan finds 0 mismatches
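The per-batch comparison (steps 1-5) can be sketched in Python, assuming both sides have been fetched into dicts keyed by primary key; the MD5-of-serialized-fields hash mirrors step 3.

```python
import hashlib
import json

def row_hash(row):
    # Serialize fields in a stable order so equal rows hash identically
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

def validate_batch(source, dest):
    """Compare two {pk: row} dicts; return mismatched keys and orphans
    on either side (step 4 and step 5 of the validation loop)."""
    mismatched = sorted(k for k in source.keys() & dest.keys()
                        if row_hash(source[k]) != row_hash(dest[k]))
    missing_in_dest = sorted(source.keys() - dest.keys())    # re-migrate these
    missing_in_source = sorted(dest.keys() - source.keys())  # possible missed delete
    return mismatched, missing_in_dest, missing_in_source
```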

Phase 4: Shadow Reads

Application code (using migration library):
  On read:
    1. Read from source DB (primary, serves the response)
    2. Async: read same query from destination DB
    3. Compare results
    4. Log any mismatches (but serve source result to user)
    5. Metrics: match_rate should be > 99.99%

Duration: 2-3 days. If match rate is consistently 100%, proceed to cutover.

Phase 5: Cutover

Option A: Instant Cutover
  1. Pause writes briefly (< 1 second — use a feature flag)
  2. Drain remaining CDC events (lag should be < 100ms)
  3. Final validation pass on recently changed rows
  4. Flip the feature flag: reads and writes now go to destination
  5. Resume writes
  6. Keep CDC running in reverse (destination → source) for rollback capability

Option B: Gradual Cutover (lower risk)
  1. Route 1% of reads to destination, 99% to source
  2. Monitor error rates, latency
  3. Gradually increase: 5% → 10% → 25% → 50% → 100%
  4. Once 100% reads are on destination:
     - Switch writes to destination
     - Reverse CDC: destination → source (for rollback)
  5. Observation period: 48 hours
  6. Decommission source
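The percentage routing in Option B is typically done with stable hashing, so a given user sticks to one side as the percentage ramps up. A sketch (the bucketing scheme is illustrative):

```python
import hashlib

def route_to_destination(user_id, dest_read_pct):
    """Deterministically bucket a user into [0, 100); users below the
    threshold read from the destination. Stable: raising the percentage
    only moves new users over, it never flip-flops existing ones."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < dest_read_pct
```

At 1% a fixed ~1% of users read from the destination; the same users remain routed there as the threshold climbs to 100%.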

Components

  1. Migration Controller: Orchestrates the entire migration lifecycle. Manages state transitions, schedules workers, handles errors. Single leader process (with standby).
  2. Backfill Workers: Parallel readers/writers that bulk-copy historical data. Resumable via checkpoints. Configurable parallelism and throttling.
  3. CDC Pipeline (Debezium + Kafka): Captures real-time changes from source DB’s transaction log. Debezium for MySQL/PostgreSQL binlog reading. Kafka for durable, ordered event storage.
  4. CDC Consumer Workers: Apply CDC events to the destination. Handle ordering, deduplication, and conflict resolution.
  5. Validation Workers: Continuously compare source and destination data. Report discrepancies. Auto-repair when possible.
  6. Migration Library (application-side): Thin library integrated into the application. Handles shadow reads, traffic routing (source vs. destination), and feature flag integration.
  7. Monitoring Dashboard: Real-time metrics: backfill progress, CDC lag, validation results, shadow read match rate, error counts.

6. Deep Dives (15 min)

Deep Dive 1: CDC vs. Dual-Write — Why CDC Wins

Dual-Write approach (rejected):

Application writes to both source and destination simultaneously:
  1. Write to source DB
  2. Write to destination DB
  3. If either fails → ???

Problems:
  1. Consistency: What if write #1 succeeds but #2 fails?
     - Retry #2? What if it fails again? Source and dest are now diverged.
     - Roll back #1? Now you've lost the user's write.
     - Queue #2 for later? Now you have eventual consistency with unbounded lag.

  2. Ordering: Two concurrent requests (A, B) may arrive at source as A→B
     but at destination as B→A. Now data is inconsistent.
     - Solution: serialize all writes through a single coordinator. But that's
       a single point of failure and a massive bottleneck.

  3. Schema transformation: If the destination has a different schema,
     the application must know both schemas and do the mapping.
     This couples the migration to the application code.

  4. Transactions: A source transaction that modifies 5 tables must also
     be atomic in the destination. Distributed transactions across two
     different databases are fragile and slow (2PC).

CDC approach (chosen):

Application writes ONLY to source DB (no code changes needed):
  Source DB → Transaction log (binlog) → Debezium → Kafka → CDC Consumer → Destination DB

Why this works:
  1. Consistency: CDC reads from the transaction log, which is the source of truth.
     Every committed transaction is captured exactly once.

  2. Ordering: Binlog preserves transaction order. CDC events in Kafka maintain this order.
     Destination applies changes in the same order as source.

  3. Schema transformation: The CDC consumer handles transformation.
     Application code is completely unaware of the migration.

  4. Transactions: CDC captures transaction boundaries. The consumer can apply
     multi-table changes atomically in the destination.

  5. Resumability: If the CDC consumer crashes, it resumes from the last committed
     Kafka offset. No data loss; replayed events are harmless because applies are
     idempotent upserts keyed by primary key.

Trade-off: CDC has inherent lag (typically 100ms-2sec). This is acceptable because
during migration, the source is still serving reads. Lag only matters at cutover.

Handling the backfill + CDC overlap:

Problem: Backfill writes and CDC events can reach the destination out of order.
A row backfilled at T=100 might receive a CDC event captured at T=50 (from before
backfill read the row); applying that event blindly would overwrite newer data with stale data.

Solution: Last-writer-wins with timestamp comparison:
  1. Every row in the destination has a _migration_ts field (hidden from application)
  2. Backfill writes set _migration_ts = source row's updated_at
  3. CDC writes set _migration_ts = CDC event timestamp
  4. On conflict (same primary key):
     IF new._migration_ts > existing._migration_ts → overwrite
     ELSE → skip (existing data is newer)
  5. This ensures the most recent version of every row wins,
     regardless of whether it came from backfill or CDC.
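The conflict rule reduces to a single upsert helper; `_migration_ts` and the dict-based destination are stand-ins for the real column and database.

```python
def lww_upsert(dest, key, row, ts):
    """Last-writer-wins: write the row only if its timestamp is newer than
    what the destination already holds, whether the existing version came
    from backfill or from CDC. Returns True if the write was applied."""
    existing = dest.get(key)
    if existing is not None and existing["_migration_ts"] >= ts:
        return False                            # existing data is newer: skip
    dest[key] = {**row, "_migration_ts": ts}    # overwrite with newer version
    return True
```

In SQL terms this maps to a guarded upsert, e.g. PostgreSQL's INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE excluded._migration_ts > users._migration_ts.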

Deep Dive 2: Validation Framework

Why validation is critical:

  • Bugs in the transformation logic could silently corrupt data
  • CDC might miss events due to binlog rotation or connector bugs
  • Network issues could cause partial writes
  • Schema differences might cause truncation (e.g., VARCHAR(100) → VARCHAR(50))

Multi-layer validation:

Layer 1: Row Count Validation
  SELECT COUNT(*) FROM source.users  → 50,000,000
  SELECT COUNT(*) FROM dest.users    → 50,000,000
  If mismatch → investigate. Quick check, but doesn't catch data corruption.

Layer 2: Checksum Validation (table-level)
  Source: SELECT BIT_XOR(CRC32(CONCAT_WS('#', id, name, email, ...)))
          FROM users WHERE id BETWEEN 1 AND 100000
  Destination: Equivalent aggregate (checksum expressions must render types and
  NULLs identically on both engines, or be computed in the validation worker)
  Compare checksums per batch (100K rows). Binary search on mismatch to find
  the exact rows that differ.

  Note: a GROUP_CONCAT of per-row MD5s silently truncates at group_concat_max_len;
  an order-independent aggregate like BIT_XOR(CRC32(...)) — the pt-table-checksum
  approach — avoids this.

  Cost: Full table scan on both databases. Run during low-traffic hours.
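The binary search over mismatched batches can be sketched over in-memory hash maps; in production, each range check would be one aggregate checksum query against both databases rather than a per-row loop.

```python
def find_mismatches(source, dest, lo, hi):
    """Recursively split the id range [lo, hi] until the individual
    differing rows remain. source/dest map id -> row hash; a missing id
    hashes as None, so orphans are caught too."""
    if all(source.get(i) == dest.get(i) for i in range(lo, hi + 1)):
        return []                      # checksums match for this range: done
    if lo == hi:
        return [lo]                    # narrowed down to one differing row
    mid = (lo + hi) // 2
    return (find_mismatches(source, dest, lo, mid) +
            find_mismatches(source, dest, mid + 1, hi))
```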

Layer 3: Row-Level Validation (for mismatched batches)
  For each mismatched batch from Layer 2:
    Read individual rows from both source and destination
    Compare field by field
    Log: { row_id, field, source_value, dest_value }
    Determine root cause: transformation bug? Missed CDC event? Truncation?

Layer 4: Application-Level Validation (Shadow Reads)
  Compare actual query results, not just raw data:
    Application issues: SELECT * FROM users WHERE email = '[email protected]'
    Run on both source and destination
    Compare result sets
  This catches issues that raw data comparison misses:
    - Index differences causing different query plans
    - Collation differences (case sensitivity)
    - Timezone handling differences

Automated repair:

For rows missing in destination:
  1. Re-read from source
  2. Transform and write to destination
  3. Verify write succeeded

For rows with mismatched values:
  1. If CDC event exists for this row → re-apply CDC event
  2. If no CDC event → re-read from source and overwrite destination
  3. Log repair action for audit

For rows missing in source but present in destination:
  1. Check if row was deleted in source after backfill
  2. If yes → apply delete to destination
  3. If no → orphan row, investigate manually
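The three repair paths reduce to a small decision function; the returned labels are illustrative, and `deleted_in_source` stands in for a binlog lookup confirming a post-backfill delete.

```python
def repair_action(in_source, in_dest, values_match=True, deleted_in_source=False):
    """Map a validation discrepancy to the repair described above."""
    if in_source and not in_dest:
        return "re-migrate"                 # missing in destination
    if in_source and in_dest and not values_match:
        return "overwrite-from-source"      # mismatched values, no pending CDC event
    if not in_source and in_dest:
        # missing in source: a legitimate delete, or an orphan to investigate
        return "apply-delete" if deleted_in_source else "investigate-orphan"
    return "ok"
```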

Deep Dive 3: Rollback Strategy

The nightmare scenario: After cutover, a critical bug is discovered in the new system. Need to switch back to the old system immediately.

Rollback design:

During cutover, we started reverse CDC:
  Destination DB → Debezium → Kafka → Reverse CDC Consumer → Source DB

This keeps the source DB up-to-date with all changes made to the destination
after cutover.

Rollback procedure:
  1. Flip feature flag: route all traffic back to source
  2. Wait for reverse CDC to drain (lag → 0)
  3. Destination is now idle; source has all the latest data
  4. Investigate the issue with the destination
  5. Fix the issue
  6. Re-run validation
  7. Re-attempt cutover when ready

Rollback time: < 1 minute (just a feature flag flip)
Data loss: 0 (reverse CDC keeps source in sync)

Rollback window:

We maintain the reverse CDC pipeline for a "rollback window" (e.g., 7 days after cutover).
During this window:
  - Source DB continues to receive reverse CDC events
  - Source DB can serve production traffic at any time
  - Both databases are operational

After the rollback window:
  - Confidence is high → decommission source
  - Stop reverse CDC
  - Archive source DB (cold storage, 90-day retention for compliance)
  - Migration is officially complete

Feature flag design for zero-downtime cutover:

Migration Library (in application code):
  # Sketch (Python): flag values are "source", "destination", or "shadow"
  read_target = feature_flag("migration_users_read")
  write_target = feature_flag("migration_users_write")  # "source" or "destination"

  def get_user(user_id):
    if read_target == "source":
      return source_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "destination":
      return dest_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "shadow":
      result = source_db.query("SELECT * FROM users WHERE id = ?", user_id)  # serve this
      executor.submit(compare_with_dest, user_id, result)  # async compare; log discrepancies
      return result

  This allows:
    1. Gradual read migration (1% → 100%)
    2. Instant rollback (flip flag to "source")
    3. Shadow mode for validation
    4. Per-table control (users might migrate before orders)

7. Extensions (2 min)

  • Cross-region migration: Migrate data from one AWS region to another. Same approach but CDC goes over cross-region replication link. Higher latency (50-100ms) means cutover requires a longer pause or accepting a brief inconsistency window. Use DynamoDB Global Tables or Aurora Global Database for managed cross-region sync.
  • Multi-tenancy aware migration: Migrate tenants one at a time (shard-level migration). Each tenant is independent — if one tenant’s migration fails, others are unaffected. Supports canary deployments: migrate 1 tenant, validate, then batch-migrate the rest.
  • Schema evolution during migration: Support not just a single schema change but ongoing schema evolution. Use a schema registry (Avro/Protobuf) to version transformation logic. If the destination schema changes mid-migration, update the transformer without restarting the backfill.
  • Data masking and compliance: During migration, apply PII masking or encryption transformations. Useful when migrating from a legacy system without encryption to a new system with column-level encryption. Integrate with a secrets manager for key management.
  • Automated migration testing: Before running on production, replay the migration against a staging copy of the source DB. Validate outputs, measure performance, and identify edge cases (null values, Unicode, maximum field lengths) in a safe environment.