1. Requirements & Scope (5 min)

Functional Requirements

  1. Migrate data from a legacy datastore (e.g., MySQL) to a new datastore (e.g., PostgreSQL, DynamoDB, or a new schema in the same engine) with zero downtime
  2. Maintain full data consistency between source and destination throughout the migration — no data loss, no corruption
  3. Provide a validation framework that continuously compares source and destination data and reports discrepancies
  4. Support rollback at any stage — if the new system has issues, revert to the old system without data loss
  5. Handle schema transformations during migration (column renames, type changes, data enrichment, denormalization)

Non-Functional Requirements

  • Availability: Zero downtime. The system must remain fully operational throughout the entire migration, which may take days to weeks.
  • Latency: No user-facing latency increase during migration. Reads and writes continue at normal speed.
  • Consistency: Strong consistency. After migration completes, the new datastore must contain 100% of the data, not 99.99%.
  • Scale: 50 TB of data, 500M rows across 200 tables. 10K writes/sec, 100K reads/sec. Migration must complete within 2 weeks.
  • Durability: The migration process itself must be resumable. If the migration pipeline crashes, it resumes from where it left off, not from scratch.

2. Estimation (3 min)

Backfill Phase

  • 50 TB of data to migrate
  • Network throughput (source → destination): 1 Gbps sustained = 125 MB/sec
  • Time to backfill: 50 TB / 125 MB/sec = 400,000 sec ≈ 4.6 days at full throughput
  • With throttling to avoid impacting production (50% utilization): ~9 days
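A quick sanity check of these figures in Python (the throughput and utilization numbers are the assumptions stated above):

```python
TB = 10**12
MB = 10**6

data_bytes = 50 * TB             # total data to migrate
throughput = 125 * MB            # 1 Gbps sustained, in bytes/sec

seconds = data_bytes / throughput
days = seconds / 86_400          # 86,400 seconds per day
throttled_days = days / 0.50     # cap at 50% utilization to protect production

print(round(seconds), round(days, 1), round(throttled_days, 1))
# 400000 4.6 9.3
```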

Change Data Capture (CDC) During Backfill

  • 10K writes/sec to source DB during backfill
  • CDC event size: ~500 bytes average
  • CDC throughput: 10K × 500 bytes = 5 MB/sec (easily handled by Kafka)
  • Events generated during 9-day backfill: 10K/sec × 86400 × 9 = 7.8 billion events
  • At 500 bytes each: 3.9 TB of CDC data (Kafka with 14-day retention can handle this)

Validation Phase

  • Compare 500M rows between source and destination
  • Comparison rate: 50K rows/sec (read from both, hash, compare)
  • Time: 500M / 50K = 10,000 sec ≈ 2.8 hours per full validation pass
  • Run 3 passes for confidence: ~8 hours

Total Timeline

  • Backfill: 9 days
  • CDC catch-up: 1-2 hours (the backlog of changes accumulated during backfill compacts to at most one apply per changed row when deduplicated by primary key)
  • Validation: 1 day (3 passes + fixing discrepancies)
  • Shadow reads: 2-3 days (comparison in production)
  • Cutover: minutes
  • Total: ~2 weeks

3. API Design (3 min)

This is an internal infrastructure system, not a user-facing API. But it needs operational APIs for engineers to manage the migration.

Migration Control API

// Create a migration plan
POST /v1/migrations
  Body: {
    "name": "mysql_to_postgres_users",
    "source": {
      "type": "mysql",
      "connection": "mysql://source-host:3306/users_db",
      "tables": ["users", "profiles", "addresses"]
    },
    "destination": {
      "type": "postgresql",
      "connection": "postgresql://dest-host:5432/users_db"
    },
    "transform_config": "s3://migrations/transforms/users_v2.yaml",
    "parallelism": 16,
    "throttle_pct": 50
  }
  Response 201: { "migration_id": "mig_abc123", "status": "created" }

// Start/pause/resume/abort a migration
POST /v1/migrations/{migration_id}/start
POST /v1/migrations/{migration_id}/pause
POST /v1/migrations/{migration_id}/resume
POST /v1/migrations/{migration_id}/abort

// Get migration status
GET /v1/migrations/{migration_id}/status
  Response 200: {
    "migration_id": "mig_abc123",
    "phase": "backfill",                // backfill | cdc_catchup | validation | shadow | cutover | done
    "progress": {
      "tables_completed": 147,
      "tables_total": 200,
      "rows_migrated": 320000000,
      "rows_total": 500000000,
      "bytes_migrated": "32TB",
      "current_lag": "2.3s",            // CDC lag behind source
      "errors": 0
    }
  }

// Trigger validation
POST /v1/migrations/{migration_id}/validate
  Body: { "mode": "full", "sample_rate": 1.0 }

// Execute cutover
POST /v1/migrations/{migration_id}/cutover
  Body: { "strategy": "instant" }       // instant | gradual

4. Data Model (3 min)

Migration State (PostgreSQL — the migration controller’s own DB)

Table: migrations
  migration_id     (PK) | varchar(20)
  name                   | varchar(100)
  source_config          | jsonb
  destination_config     | jsonb
  transform_config_url   | varchar(500)
  phase                  | enum('created','backfill','cdc_catchup','validation','shadow','cutover','done','aborted')
  created_at             | timestamp
  updated_at             | timestamp

Table: migration_table_progress
  migration_id (PK, FK)  | varchar(20)    -- composite key with table_name
  table_name   (PK)      | varchar(100)
  status                 | enum('pending','in_progress','completed','failed')
  rows_migrated          | bigint
  rows_total             | bigint
  last_checkpoint        | varchar(200)   -- cursor/offset for resumability
  last_error             | text
  updated_at             | timestamp

Table: validation_results
  validation_id    (PK) | varchar(20)
  migration_id     (FK) | varchar(20)
  table_name             | varchar(100)
  total_rows_compared    | bigint
  mismatches_found       | bigint
  missing_in_dest        | bigint
  missing_in_source      | bigint
  sample_mismatches      | jsonb          -- first 100 mismatched row IDs
  completed_at           | timestamp

CDC Offset Tracking (Kafka Consumer Offsets + PostgreSQL)

Table: cdc_checkpoints
  migration_id (PK, FK)  | varchar(20)    -- composite key with source_table
  source_table (PK)      | varchar(100)
  binlog_file            | varchar(100)   -- MySQL binlog position
  binlog_position        | bigint
  kafka_topic            | varchar(200)
  kafka_offset           | bigint
  updated_at             | timestamp

Why PostgreSQL for Migration State?

  • The migration controller itself needs ACID guarantees for checkpoint management
  • Resumability depends on accurate checkpoint tracking
  • Low write volume (metadata only, not the actual data being migrated)

5. High-Level Design (12 min)

Migration Phases

Phase 1: Backfill (Historical Data)

Source DB (MySQL)
  → Backfill Workers (16 parallel):
    For each table:
      1. SELECT * FROM table WHERE id > {last_checkpoint} ORDER BY id LIMIT 10000
      2. Transform rows (schema mapping, type conversion, enrichment)
      3. Batch INSERT into destination DB
      4. Update checkpoint in migration_table_progress
      5. Throttle: sleep if approaching throughput limit
    Repeat until all rows processed

  → Simultaneously, start CDC:
    MySQL binlog → Debezium (CDC connector) → Kafka topics
    (events accumulate in Kafka; not consumed yet until backfill completes)
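The backfill loop above can be sketched in Python. Everything here is illustrative: the source and destination are plain dicts standing in for database handles, `transform` is a no-op by default, and the throttle hook is omitted.

```python
def backfill_table(source_rows, dest, checkpoints, table, batch_size=1000,
                   transform=lambda r: r):
    """Resumable backfill sketch: copy rows in primary-key order, batch by
    batch, persisting a checkpoint after every batch so a crash resumes
    from the last checkpoint rather than from scratch.
    source_rows: {id: row}; dest and checkpoints are dicts standing in for
    the destination DB and the migration_table_progress table."""
    last_id = checkpoints.get(table, 0)
    while True:
        # 1. Read the next batch past the checkpoint, in id order
        #    (a real worker would issue the SELECT ... WHERE id > ? query)
        batch = sorted((i, r) for i, r in source_rows.items() if i > last_id)[:batch_size]
        if not batch:
            break
        # 2. Transform and 3. batch-write to the destination
        for row_id, row in batch:
            dest[row_id] = transform(row)
        # 4. Persist the checkpoint only after the batch is durably written
        last_id = batch[-1][0]
        checkpoints[table] = last_id
        # 5. Throttle hook would go here (sleep when nearing throughput cap)
    return last_id
```

Resuming is just calling the function again: it picks up after the stored checkpoint.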

Phase 2: CDC Catch-Up

Kafka (CDC events accumulated during backfill)
  → CDC Consumer Workers:
    1. Read events from Kafka (in order, per partition)
    2. For each event (INSERT/UPDATE/DELETE):
       - Transform to destination schema
       - Apply to destination DB
       - Handle conflicts via last-writer-wins: compare timestamps and apply
         the CDC event only if it is newer than the backfilled row
         (see Deep Dive 1)
    3. Track consumer offset in cdc_checkpoints
    4. Continue until lag < 100ms (real-time)
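A minimal sketch of this apply loop, assuming events are already ordered per key (as a Kafka partition keyed by primary key would guarantee); `dest` and `offsets` are dicts standing in for the destination DB and the cdc_checkpoints table:

```python
def apply_cdc_events(events, dest, offsets, topic="cdc.users"):
    """Apply an ordered stream of CDC events to the destination, committing
    the consumer offset after each apply so a crash resumes from the last
    committed offset. Upserts keyed by primary key make replays harmless."""
    start = offsets.get(topic, 0)
    for offset, event in enumerate(events):
        if offset < start:
            continue                      # already applied before a crash
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            dest[key] = row               # idempotent upsert
        elif op == "delete":
            dest.pop(key, None)
        offsets[topic] = offset + 1       # commit offset after the apply
```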

Phase 3: Validation

Validation Workers:
  For each table:
    1. Read batch of rows from source (by primary key range)
    2. Read same batch from destination
    3. Hash each row (MD5 of serialized field values)
    4. Compare hashes:
       - Match → continue
       - Mismatch → log to validation_results, attempt repair
    5. Check for orphans:
       - Rows in source but not in destination → re-migrate
       - Rows in destination but not in source → investigate (possible delete missed by CDC)
  Run until a full table scan finds 0 mismatches
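The per-batch comparison (steps 1-5) can be sketched in Python, assuming both sides have been fetched into dicts keyed by primary key; the MD5-of-serialized-fields hash mirrors step 3.

```python
import hashlib
import json

def row_hash(row):
    # Serialize fields in a stable order so equal rows hash identically
    return hashlib.md5(json.dumps(row, sort_keys=True).encode()).hexdigest()

def validate_batch(source, dest):
    """Compare two {pk: row} dicts; return mismatched keys and orphans
    on either side (step 4 and step 5 of the validation loop)."""
    mismatched = sorted(k for k in source.keys() & dest.keys()
                        if row_hash(source[k]) != row_hash(dest[k]))
    missing_in_dest = sorted(source.keys() - dest.keys())    # re-migrate these
    missing_in_source = sorted(dest.keys() - source.keys())  # possible missed delete
    return mismatched, missing_in_dest, missing_in_source
```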

Phase 4: Shadow Reads

Application code (using migration library):
  On read:
    1. Read from source DB (primary, serves the response)
    2. Async: read same query from destination DB
    3. Compare results
    4. Log any mismatches (but serve source result to user)
    5. Metrics: match_rate should be > 99.99%

Duration: 2-3 days. If match rate is consistently 100%, proceed to cutover.

Phase 5: Cutover

Option A: Instant Cutover
  1. Pause writes briefly (< 1 second — use a feature flag)
  2. Drain remaining CDC events (lag should be < 100ms)
  3. Final validation pass on recently changed rows
  4. Flip the feature flag: reads and writes now go to destination
  5. Resume writes
  6. Keep CDC running in reverse (destination → source) for rollback capability

Option B: Gradual Cutover (lower risk)
  1. Route 1% of reads to destination, 99% to source
  2. Monitor error rates, latency
  3. Gradually increase: 5% → 10% → 25% → 50% → 100%
  4. Once 100% reads are on destination:
     - Switch writes to destination
     - Reverse CDC: destination → source (for rollback)
  5. Observation period: 48 hours
  6. Decommission source
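The percentage routing in Option B is typically done with stable hashing, so a given user sticks to one side as the percentage ramps up. A sketch (the bucketing scheme is illustrative):

```python
import hashlib

def route_to_destination(user_id, dest_read_pct):
    """Deterministically bucket a user into [0, 100); users below the
    threshold read from the destination. Stable: raising the percentage
    only moves new users over, it never flip-flops existing ones."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < dest_read_pct
```

At 1% a fixed ~1% of users read from the destination; the same users remain routed there as the threshold climbs to 100%.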

Components

  1. Migration Controller: Orchestrates the entire migration lifecycle. Manages state transitions, schedules workers, handles errors. Single leader process (with standby).
  2. Backfill Workers: Parallel readers/writers that bulk-copy historical data. Resumable via checkpoints. Configurable parallelism and throttling.
  3. CDC Pipeline (Debezium + Kafka): Captures real-time changes from source DB’s transaction log. Debezium for MySQL/PostgreSQL binlog reading. Kafka for durable, ordered event storage.
  4. CDC Consumer Workers: Apply CDC events to the destination. Handle ordering, deduplication, and conflict resolution.
  5. Validation Workers: Continuously compare source and destination data. Report discrepancies. Auto-repair when possible.
  6. Migration Library (application-side): Thin library integrated into the application. Handles shadow reads, traffic routing (source vs. destination), and feature flag integration.
  7. Monitoring Dashboard: Real-time metrics: backfill progress, CDC lag, validation results, shadow read match rate, error counts.

6. Deep Dives (15 min)

Deep Dive 1: CDC vs. Dual-Write — Why CDC Wins

Dual-Write approach (rejected):

Application writes to both source and destination simultaneously:
  1. Write to source DB
  2. Write to destination DB
  3. If either fails → ???

Problems:
  1. Consistency: What if write #1 succeeds but #2 fails?
     - Retry #2? What if it fails again? Source and dest are now diverged.
     - Roll back #1? Now you've lost the user's write.
     - Queue #2 for later? Now you have eventual consistency with unbounded lag.

  2. Ordering: Two concurrent requests (A, B) may arrive at source as A→B
     but at destination as B→A. Now data is inconsistent.
     - Solution: serialize all writes through a single coordinator. But that's
       a single point of failure and a massive bottleneck.

  3. Schema transformation: If the destination has a different schema,
     the application must know both schemas and do the mapping.
     This couples the migration to the application code.

  4. Transactions: A source transaction that modifies 5 tables must also
     be atomic in the destination. Distributed transactions across two
     different databases are fragile and slow (2PC).

CDC approach (chosen):

Application writes ONLY to source DB (no code changes needed):
  Source DB → Transaction log (binlog) → Debezium → Kafka → CDC Consumer → Destination DB

Why this works:
  1. Consistency: CDC reads from the transaction log, which is the source of truth.
     Every committed transaction is captured exactly once.

  2. Ordering: Binlog preserves transaction order. CDC events in Kafka maintain this order.
     Destination applies changes in the same order as source.

  3. Schema transformation: The CDC consumer handles transformation.
     Application code is completely unaware of the migration.

  4. Transactions: CDC captures transaction boundaries. The consumer can apply
     multi-table changes atomically in the destination.

  5. Resumability: If the CDC consumer crashes, it resumes from the last committed
     Kafka offset. No data loss; replayed events are harmless because applies are
     idempotent upserts keyed by primary key.

Trade-off: CDC has inherent lag (typically 100ms-2sec). This is acceptable because
during migration, the source is still serving reads. Lag only matters at cutover.

Handling the backfill + CDC overlap:

Problem: Backfill writes and CDC events can reach the destination out of order.
A row backfilled at T=100 might receive a CDC event captured at T=50 (from before
backfill read the row); applying that event blindly would overwrite newer data with stale data.

Solution: Last-writer-wins with timestamp comparison:
  1. Every row in the destination has a _migration_ts field (hidden from application)
  2. Backfill writes set _migration_ts = source row's updated_at
  3. CDC writes set _migration_ts = CDC event timestamp
  4. On conflict (same primary key):
     IF new._migration_ts > existing._migration_ts → overwrite
     ELSE → skip (existing data is newer)
  5. This ensures the most recent version of every row wins,
     regardless of whether it came from backfill or CDC.
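The conflict rule reduces to a single upsert helper; `_migration_ts` and the dict-based destination are stand-ins for the real column and database.

```python
def lww_upsert(dest, key, row, ts):
    """Last-writer-wins: write the row only if its timestamp is newer than
    what the destination already holds, whether the existing version came
    from backfill or from CDC. Returns True if the write was applied."""
    existing = dest.get(key)
    if existing is not None and existing["_migration_ts"] >= ts:
        return False                            # existing data is newer: skip
    dest[key] = {**row, "_migration_ts": ts}    # overwrite with newer version
    return True
```

In SQL terms this maps to a guarded upsert, e.g. PostgreSQL's INSERT ... ON CONFLICT (id) DO UPDATE ... WHERE excluded._migration_ts > users._migration_ts.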

Deep Dive 2: Validation Framework

Why validation is critical:

  • Bugs in the transformation logic could silently corrupt data
  • CDC might miss events due to binlog rotation or connector bugs
  • Network issues could cause partial writes
  • Schema differences might cause truncation (e.g., VARCHAR(100) → VARCHAR(50))

Multi-layer validation:

Layer 1: Row Count Validation
  SELECT COUNT(*) FROM source.users  → 50,000,000
  SELECT COUNT(*) FROM dest.users    → 50,000,000
  If mismatch → investigate. Quick check, but doesn't catch data corruption.

Layer 2: Checksum Validation (table-level)
  Source: SELECT BIT_XOR(CRC32(CONCAT_WS('#', id, name, email, ...)))
          FROM users WHERE id BETWEEN 1 AND 100000
  Destination: Equivalent aggregate (checksum expressions must render types and
  NULLs identically on both engines, or be computed in the validation worker)
  Compare checksums per batch (100K rows). Binary search on mismatch to find
  the exact rows that differ.

  Note: a GROUP_CONCAT of per-row MD5s silently truncates at group_concat_max_len;
  an order-independent aggregate like BIT_XOR(CRC32(...)) — the pt-table-checksum
  approach — avoids this.

  Cost: Full table scan on both databases. Run during low-traffic hours.
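The binary search over mismatched batches can be sketched over in-memory hash maps; in production, each range check would be one aggregate checksum query against both databases rather than a per-row loop.

```python
def find_mismatches(source, dest, lo, hi):
    """Recursively split the id range [lo, hi] until the individual
    differing rows remain. source/dest map id -> row hash; a missing id
    hashes as None, so orphans are caught too."""
    if all(source.get(i) == dest.get(i) for i in range(lo, hi + 1)):
        return []                      # checksums match for this range: done
    if lo == hi:
        return [lo]                    # narrowed down to one differing row
    mid = (lo + hi) // 2
    return (find_mismatches(source, dest, lo, mid) +
            find_mismatches(source, dest, mid + 1, hi))
```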

Layer 3: Row-Level Validation (for mismatched batches)
  For each mismatched batch from Layer 2:
    Read individual rows from both source and destination
    Compare field by field
    Log: { row_id, field, source_value, dest_value }
    Determine root cause: transformation bug? Missed CDC event? Truncation?

Layer 4: Application-Level Validation (Shadow Reads)
  Compare actual query results, not just raw data:
    Application issues: SELECT * FROM users WHERE email = '[email protected]'
    Run on both source and destination
    Compare result sets
  This catches issues that raw data comparison misses:
    - Index differences causing different query plans
    - Collation differences (case sensitivity)
    - Timezone handling differences

Automated repair:

For rows missing in destination:
  1. Re-read from source
  2. Transform and write to destination
  3. Verify write succeeded

For rows with mismatched values:
  1. If CDC event exists for this row → re-apply CDC event
  2. If no CDC event → re-read from source and overwrite destination
  3. Log repair action for audit

For rows missing in source but present in destination:
  1. Check if row was deleted in source after backfill
  2. If yes → apply delete to destination
  3. If no → orphan row, investigate manually
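The three repair paths reduce to a small decision function; the returned labels are illustrative, and `deleted_in_source` stands in for a binlog lookup confirming a post-backfill delete.

```python
def repair_action(in_source, in_dest, values_match=True, deleted_in_source=False):
    """Map a validation discrepancy to the repair described above."""
    if in_source and not in_dest:
        return "re-migrate"                 # missing in destination
    if in_source and in_dest and not values_match:
        return "overwrite-from-source"      # mismatched values, no pending CDC event
    if not in_source and in_dest:
        # missing in source: a legitimate delete, or an orphan to investigate
        return "apply-delete" if deleted_in_source else "investigate-orphan"
    return "ok"
```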

Deep Dive 3: Rollback Strategy

The nightmare scenario: After cutover, a critical bug is discovered in the new system. Need to switch back to the old system immediately.

Rollback design:

During cutover, we started reverse CDC:
  Destination DB → Debezium → Kafka → Reverse CDC Consumer → Source DB

This keeps the source DB up-to-date with all changes made to the destination
after cutover.

Rollback procedure:
  1. Flip feature flag: route all traffic back to source
  2. Wait for reverse CDC to drain (lag → 0)
  3. Destination is now idle; source has all the latest data
  4. Investigate the issue with the destination
  5. Fix the issue
  6. Re-run validation
  7. Re-attempt cutover when ready

Rollback time: < 1 minute (just a feature flag flip)
Data loss: 0 (reverse CDC keeps source in sync)

Rollback window:

We maintain the reverse CDC pipeline for a "rollback window" (e.g., 7 days after cutover).
During this window:
  - Source DB continues to receive reverse CDC events
  - Source DB can serve production traffic at any time
  - Both databases are operational

After the rollback window:
  - Confidence is high → decommission source
  - Stop reverse CDC
  - Archive source DB (cold storage, 90-day retention for compliance)
  - Migration is officially complete

Feature flag design for zero-downtime cutover:

Migration Library (in application code):
  # Sketch (Python): flag values are "source", "destination", or "shadow"
  read_target = feature_flag("migration_users_read")
  write_target = feature_flag("migration_users_write")  # "source" or "destination"

  def get_user(user_id):
    if read_target == "source":
      return source_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "destination":
      return dest_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "shadow":
      result = source_db.query("SELECT * FROM users WHERE id = ?", user_id)  # serve this
      executor.submit(compare_with_dest, user_id, result)  # async compare; log discrepancies
      return result

  This allows:
    1. Gradual read migration (1% → 100%)
    2. Instant rollback (flip flag to "source")
    3. Shadow mode for validation
    4. Per-table control (users might migrate before orders)

7. Extensions (2 min)

  • Cross-region migration: Migrate data from one AWS region to another. Same approach but CDC goes over cross-region replication link. Higher latency (50-100ms) means cutover requires a longer pause or accepting a brief inconsistency window. Use DynamoDB Global Tables or Aurora Global Database for managed cross-region sync.
  • Multi-tenancy aware migration: Migrate tenants one at a time (shard-level migration). Each tenant is independent — if one tenant’s migration fails, others are unaffected. Supports canary deployments: migrate 1 tenant, validate, then batch-migrate the rest.
  • Schema evolution during migration: Support not just a single schema change but ongoing schema evolution. Use a schema registry (Avro/Protobuf) to version transformation logic. If the destination schema changes mid-migration, update the transformer without restarting the backfill.
  • Data masking and compliance: During migration, apply PII masking or encryption transformations. Useful when migrating from a legacy system without encryption to a new system with column-level encryption. Integrate with a secrets manager for key management.
  • Automated migration testing: Before running on production, replay the migration against a staging copy of the source DB. Validate outputs, measure performance, and identify edge cases (null values, Unicode, maximum field lengths) in a safe environment.