1. Requirements & Scope (5 min)
Functional Requirements
- Migrate data from a legacy datastore (e.g., MySQL) to a new datastore (e.g., PostgreSQL, DynamoDB, or a new schema in the same engine) with zero downtime
- Maintain full data consistency between source and destination throughout the migration — no data loss, no corruption
- Provide a validation framework that continuously compares source and destination data and reports discrepancies
- Support rollback at any stage — if the new system has issues, revert to the old system without data loss
- Handle schema transformations during migration (column renames, type changes, data enrichment, denormalization)
Non-Functional Requirements
- Availability: Zero downtime. The system must remain fully operational throughout the entire migration, which may take days to weeks.
- Latency: No user-facing latency increase during migration. Reads and writes continue at normal speed.
- Consistency: Strong consistency. After migration completes, the new datastore must contain exactly 100% of the data, not 99.99%.
- Scale: 50 TB of data, 500M rows across 200 tables. 10K writes/sec, 100K reads/sec. Migration must complete within 2 weeks.
- Durability: The migration process itself must be resumable. If the migration pipeline crashes, it resumes from where it left off, not from scratch.
2. Estimation (3 min)
Backfill Phase
- 50 TB of data to migrate
- Network throughput (source → destination): 1 Gbps sustained = 125 MB/sec
- Time to backfill: 50 TB / 125 MB/sec = 400,000 sec ≈ 4.6 days at full throughput
- With throttling to avoid impacting production (50% utilization): ~9 days
Change Data Capture (CDC) During Backfill
- 10K writes/sec to source DB during backfill
- CDC event size: ~500 bytes average
- CDC throughput: 10K × 500 bytes = 5 MB/sec (easily handled by Kafka)
- Events generated during 9-day backfill: 10K/sec × 86400 × 9 = 7.8 billion events
- At 500 bytes each: 3.9 TB of CDC data (Kafka with 14-day retention can handle this)
Validation Phase
- Compare 500M rows between source and destination
- Comparison rate: 50K rows/sec (read from both, hash, compare)
- Time: 500M / 50K = 10,000 sec ≈ 2.8 hours per full validation pass
- Run 3 passes for confidence: ~8 hours
Total Timeline
- Backfill: 9 days
- CDC catch-up: 1-2 hours (processing backlog of changes accumulated during backfill)
- Validation: 1 day (3 passes + fixing discrepancies)
- Shadow reads: 2-3 days (comparison in production)
- Cutover: minutes
- Total: ~2 weeks
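The estimates above are easy to double-check with a few lines of Python (pure back-of-envelope arithmetic, no external dependencies):

```python
# Back-of-envelope checks for the numbers in the estimation section.
TB = 10**12

backfill_secs = 50 * TB / (125 * 10**6)        # 50 TB at 125 MB/sec = 400,000 s
backfill_days = backfill_secs / 86400          # ~4.6 days at full throughput
throttled_days = backfill_days / 0.5           # ~9.3 days at 50% utilization

cdc_events = 10_000 * 86400 * 9                # ~7.8 billion events over 9 days
cdc_bytes_tb = cdc_events * 500 / TB           # ~3.9 TB of CDC data in Kafka

validation_hours = 500_000_000 / 50_000 / 3600 # ~2.8 hours per full pass
```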
3. API Design (3 min)
This is an internal infrastructure system, not a user-facing API. But it needs operational APIs for engineers to manage the migration.
Migration Control API
// Create a migration plan
POST /v1/migrations
Body: {
  "name": "mysql_to_postgres_users",
  "source": {
    "type": "mysql",
    "connection": "mysql://source-host:3306/users_db",
    "tables": ["users", "profiles", "addresses"]
  },
  "destination": {
    "type": "postgresql",
    "connection": "postgresql://dest-host:5432/users_db"
  },
  "transform_config": "s3://migrations/transforms/users_v2.yaml",
  "parallelism": 16,
  "throttle_pct": 50
}
Response 201: { "migration_id": "mig_abc123", "status": "created" }
// Start/pause/resume/abort a migration
POST /v1/migrations/{migration_id}/start
POST /v1/migrations/{migration_id}/pause
POST /v1/migrations/{migration_id}/resume
POST /v1/migrations/{migration_id}/abort
// Get migration status
GET /v1/migrations/{migration_id}/status
Response 200: {
  "migration_id": "mig_abc123",
  "phase": "backfill", // backfill | cdc_catchup | validation | shadow | cutover | done
  "progress": {
    "tables_completed": 147,
    "tables_total": 200,
    "rows_migrated": 320000000,
    "rows_total": 500000000,
    "bytes_migrated": "32TB",
    "current_lag": "2.3s", // CDC lag behind source
    "errors": 0
  }
}
// Trigger validation
POST /v1/migrations/{migration_id}/validate
Body: { "mode": "full", "sample_rate": 1.0 }
// Execute cutover
POST /v1/migrations/{migration_id}/cutover
Body: { "strategy": "instant" } // instant | gradual
4. Data Model (3 min)
Migration State (PostgreSQL — the migration controller’s own DB)
Table: migrations
migration_id (PK) | varchar(20)
name | varchar(100)
source_config | jsonb
destination_config | jsonb
transform_config_url | varchar(500)
phase | enum('created','backfill','cdc_catchup','validation','shadow','cutover','done','aborted')
created_at | timestamp
updated_at | timestamp
Table: migration_table_progress
migration_id (PK, FK) | varchar(20)
table_name (PK) | varchar(100) -- composite primary key (migration_id, table_name)
status | enum('pending','in_progress','completed','failed')
rows_migrated | bigint
rows_total | bigint
last_checkpoint | varchar(200) -- cursor/offset for resumability
last_error | text
updated_at | timestamp
Table: validation_results
validation_id (PK) | varchar(20)
migration_id (FK) | varchar(20)
table_name | varchar(100)
total_rows_compared | bigint
mismatches_found | bigint
missing_in_dest | bigint
missing_in_source | bigint
sample_mismatches | jsonb -- first 100 mismatched row IDs
completed_at | timestamp
CDC Offset Tracking (Kafka Consumer Offsets + PostgreSQL)
Table: cdc_checkpoints
migration_id (FK) | varchar(20)
source_table | varchar(100)
binlog_file | varchar(100) -- MySQL binlog position
binlog_position | bigint
kafka_topic | varchar(200)
kafka_offset | bigint
updated_at | timestamp
Why PostgreSQL for Migration State?
- The migration controller itself needs ACID guarantees for checkpoint management
- Resumability depends on accurate checkpoint tracking
- Low write volume (metadata only, not the actual data being migrated)
5. High-Level Design (12 min)
Migration Phases
Phase 1: Backfill (Historical Data)
Source DB (MySQL)
→ Backfill Workers (16 parallel):
For each table:
1. SELECT * FROM table WHERE id > {last_checkpoint} ORDER BY id LIMIT 10000
2. Transform rows (schema mapping, type conversion, enrichment)
3. Batch INSERT into destination DB
4. Update checkpoint in migration_table_progress
5. Throttle: sleep if approaching throughput limit
Repeat until all rows processed
→ Simultaneously, start CDC:
MySQL binlog → Debezium (CDC connector) → Kafka topics
(events accumulate in Kafka; not consumed yet until backfill completes)
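The checkpointed copy loop in step 1-4 can be sketched in Python. This is a minimal illustration over in-memory stand-ins (a sorted list of row dicts for the source, a dict for the destination); `backfill_table` and its arguments are illustrative names, not a real API, and a production worker would issue the batched SELECT/INSERT statements shown above and persist the checkpoint to migration_table_progress:

```python
def backfill_table(source_rows, dest, checkpoint=0, batch_size=10_000):
    """Copy rows with id > checkpoint in batches; return the final checkpoint.

    source_rows: list of row dicts sorted by "id" (stand-in for the source table).
    dest: dict keyed by id (stand-in for the destination table).
    """
    while True:
        # Equivalent of: SELECT * FROM t WHERE id > checkpoint ORDER BY id LIMIT batch_size
        batch = [r for r in source_rows if r["id"] > checkpoint][:batch_size]
        if not batch:
            return checkpoint          # all rows processed
        for row in batch:
            dest[row["id"]] = dict(row)  # transform + batch INSERT in a real system
        checkpoint = batch[-1]["id"]     # persisted checkpoint makes the loop resumable
```

Because the checkpoint is the highest id copied so far, a crashed worker restarted with the stored checkpoint re-copies nothing and resumes exactly where it left off.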
Phase 2: CDC Catch-Up
Kafka (CDC events accumulated during backfill)
→ CDC Consumer Workers:
1. Read events from Kafka (in order, per partition)
2. For each event (INSERT/UPDATE/DELETE):
- Transform to destination schema
- Apply to destination DB
- Handle conflicts: if row was already backfilled, CDC UPDATE takes precedence
(CDC event has a later timestamp)
3. Track consumer offset in cdc_checkpoints
4. Continue until lag < 100ms (real-time)
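The apply step in item 2 can be sketched as a small dispatcher over the event's operation type, again using a dict as a stand-in for the destination table (a real consumer would read these events from Kafka partitions in order and commit offsets to cdc_checkpoints):

```python
def apply_cdc_event(dest, event):
    """Apply one CDC event to the destination stand-in."""
    op, row = event["op"], event["row"]
    if op == "DELETE":
        dest.pop(row["id"], None)       # idempotent: deleting a missing row is a no-op
    else:                               # INSERT or UPDATE both upsert the row image
        dest[row["id"]] = dict(row)

# Replaying an insert, an update, and a delete for the same key in order:
events = [
    {"op": "INSERT", "row": {"id": 1, "email": "[email protected]"}},
    {"op": "UPDATE", "row": {"id": 1, "email": "[email protected]"}},
    {"op": "DELETE", "row": {"id": 1}},
]
dest = {}
for e in events:
    apply_cdc_event(dest, e)
```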
Phase 3: Validation
Validation Workers:
For each table:
1. Read batch of rows from source (by primary key range)
2. Read same batch from destination
3. Hash each row (MD5 of serialized field values)
4. Compare hashes:
- Match → continue
- Mismatch → log to validation_results, attempt repair
5. Check for orphans:
- Rows in source but not in destination → re-migrate
- Rows in destination but not in source → investigate (possible delete missed by CDC)
Run until 0 mismatches on full table scan
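The hash-and-compare steps above can be sketched as follows, with dicts standing in for both tables. One detail the sketch makes explicit: the serialization must fix the field order, or identical rows hash differently on the two sides.

```python
import hashlib

def row_hash(row):
    """MD5 of a canonical serialization: sorted field order keeps hashes stable."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(payload.encode()).hexdigest()

def validate(source, dest):
    """Compare two keyed row sets; report mismatches and orphans on each side."""
    mismatched = [k for k in source.keys() & dest.keys()
                  if row_hash(source[k]) != row_hash(dest[k])]
    return {
        "missing_in_dest": sorted(source.keys() - dest.keys()),    # re-migrate these
        "missing_in_source": sorted(dest.keys() - source.keys()),  # investigate these
        "mismatched": sorted(mismatched),
    }
```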
Phase 4: Shadow Reads
Application code (using migration library):
On read:
1. Read from source DB (primary, serves the response)
2. Async: read same query from destination DB
3. Compare results
4. Log any mismatches (but serve source result to user)
5. Metrics: match_rate should be > 99.99%
Duration: 2-3 days. If match rate is consistently 100%, proceed to cutover.
Phase 5: Cutover
Option A: Instant Cutover
1. Pause writes briefly (< 1 second — use a feature flag)
2. Drain remaining CDC events (lag should be < 100ms)
3. Final validation pass on recently changed rows
4. Flip the feature flag: reads and writes now go to destination
5. Resume writes
6. Keep CDC running in reverse (destination → source) for rollback capability
Option B: Gradual Cutover (lower risk)
1. Route 1% of reads to destination, 99% to source
2. Monitor error rates, latency
3. Gradually increase: 5% → 10% → 25% → 50% → 100%
4. Once 100% reads are on destination:
- Switch writes to destination
- Reverse CDC: destination → source (for rollback)
5. Observation period: 48 hours
6. Decommission source
Components
- Migration Controller: Orchestrates the entire migration lifecycle. Manages state transitions, schedules workers, handles errors. Single leader process (with standby).
- Backfill Workers: Parallel readers/writers that bulk-copy historical data. Resumable via checkpoints. Configurable parallelism and throttling.
- CDC Pipeline (Debezium + Kafka): Captures real-time changes from source DB’s transaction log. Debezium for MySQL/PostgreSQL binlog reading. Kafka for durable, ordered event storage.
- CDC Consumer Workers: Apply CDC events to the destination. Handle ordering, deduplication, and conflict resolution.
- Validation Workers: Continuously compare source and destination data. Report discrepancies. Auto-repair when possible.
- Migration Library (application-side): Thin library integrated into the application. Handles shadow reads, traffic routing (source vs. destination), and feature flag integration.
- Monitoring Dashboard: Real-time metrics: backfill progress, CDC lag, validation results, shadow read match rate, error counts.
6. Deep Dives (15 min)
Deep Dive 1: CDC vs. Dual-Write — Why CDC Wins
Dual-Write approach (rejected):
Application writes to both source and destination simultaneously:
1. Write to source DB
2. Write to destination DB
3. If either fails → ???
Problems:
1. Consistency: What if write #1 succeeds but #2 fails?
- Retry #2? What if it fails again? Source and dest are now diverged.
- Roll back #1? Now you've lost the user's write.
- Queue #2 for later? Now you have eventual consistency with unbounded lag.
2. Ordering: Two concurrent requests (A, B) may arrive at source as A→B
but at destination as B→A. Now data is inconsistent.
- Solution: serialize all writes through a single coordinator. But that's
a single point of failure and a massive bottleneck.
3. Schema transformation: If the destination has a different schema,
the application must know both schemas and do the mapping.
This couples the migration to the application code.
4. Transactions: A source transaction that modifies 5 tables must also
be atomic in the destination. Distributed transactions across two
different databases are fragile and slow (2PC).
CDC approach (chosen):
Application writes ONLY to source DB (no code changes needed):
Source DB → Transaction log (binlog) → Debezium → Kafka → CDC Consumer → Destination DB
Why this works:
1. Consistency: CDC reads from the transaction log, which is the source of truth.
Every committed transaction is captured exactly once.
2. Ordering: Binlog preserves transaction order. CDC events in Kafka maintain this order.
Destination applies changes in the same order as source.
3. Schema transformation: The CDC consumer handles transformation.
Application code is completely unaware of the migration.
4. Transactions: CDC captures transaction boundaries. The consumer can apply
multi-table changes atomically in the destination.
5. Resumability: If CDC consumer crashes, it resumes from the last committed
Kafka offset. No data loss, no duplication (with exactly-once semantics).
Trade-off: CDC has inherent lag (typically 100ms-2sec). This is acceptable because
during migration, the source is still serving reads. Lag only matters at cutover.
Handling the backfill + CDC overlap:
Problem: During backfill, both backfill workers and CDC are writing to the destination.
A CDC event stamped T=50 may be applied after backfill has already copied a newer version of the same row (read at T=100); applying it blindly would overwrite new data with old.
Solution: Last-writer-wins with timestamp comparison:
1. Every row in the destination has a _migration_ts field (hidden from application)
2. Backfill writes set _migration_ts = source row's updated_at
3. CDC writes set _migration_ts = CDC event timestamp
4. On conflict (same primary key):
IF new._migration_ts > existing._migration_ts → overwrite
ELSE → skip (existing data is newer)
5. This ensures the most recent version of every row wins,
regardless of whether it came from backfill or CDC.
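The conflict rule in step 4 reduces to a small upsert function. A sketch over a dict stand-in for the destination (in PostgreSQL this would be an INSERT ... ON CONFLICT with a WHERE clause comparing _migration_ts):

```python
def upsert_lww(dest, key, row, migration_ts):
    """Last-writer-wins upsert: the newer _migration_ts wins, whichever path wrote it."""
    existing = dest.get(key)
    if existing is None or migration_ts > existing["_migration_ts"]:
        dest[key] = {**row, "_migration_ts": migration_ts}
    # else: skip, the existing copy is newer

dest = {}
upsert_lww(dest, 1, {"name": "fresh"}, migration_ts=100)  # backfill copy, read at T=100
upsert_lww(dest, 1, {"name": "stale"}, migration_ts=50)   # older CDC event: skipped
```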
Deep Dive 2: Validation Framework
Why validation is critical:
- Bugs in the transformation logic could silently corrupt data
- CDC might miss events due to binlog rotation or connector bugs
- Network issues could cause partial writes
- Schema differences might cause truncation (e.g., VARCHAR(100) → VARCHAR(50))
Multi-layer validation:
Layer 1: Row Count Validation
SELECT COUNT(*) FROM source.users → 50,000,000
SELECT COUNT(*) FROM dest.users → 50,000,000
If mismatch → investigate. Quick check, but doesn't catch data corruption.
Layer 2: Checksum Validation (table-level)
Source: SELECT MD5(GROUP_CONCAT(MD5(CONCAT(id, name, email, ...)) ORDER BY id))
        FROM users WHERE id BETWEEN 1 AND 100000
Destination: Same query
Compare checksums per batch (100K rows). Binary search on mismatch to find
the exact rows that differ. (In MySQL, raise group_concat_max_len first; the
default 1024-byte limit silently truncates the concatenation.)
Cost: Full table scan on both databases. Run during low-traffic hours.
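The "binary search on mismatch" step can be sketched as a recursive range split: keep halving any key range whose checksums disagree until the differing rows are isolated. This assumes integer primary keys over a contiguous range, and uses dicts as stand-ins for the two tables:

```python
import hashlib

def checksum(db, lo, hi):
    """Checksum of rows with keys in [lo, hi] (stand-in for the SQL batch checksum)."""
    payload = "".join(str(db.get(k, "")) for k in range(lo, hi + 1))
    return hashlib.md5(payload.encode()).hexdigest()

def find_mismatches(src, dst, lo, hi):
    """Return the keys in [lo, hi] whose rows differ between src and dst."""
    if checksum(src, lo, hi) == checksum(dst, lo, hi):
        return []                      # whole range matches: prune it
    if lo == hi:
        return [lo]                    # narrowed to a single differing row
    mid = (lo + hi) // 2
    return (find_mismatches(src, dst, lo, mid) +
            find_mismatches(src, dst, mid + 1, hi))
```

A matching range costs two checksum queries and is pruned immediately, so a batch with a handful of bad rows needs only O(bad_rows x log(batch_size)) row comparisons instead of a full row-by-row scan.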
Layer 3: Row-Level Validation (for mismatched batches)
For each mismatched batch from Layer 2:
Read individual rows from both source and destination
Compare field by field
Log: { row_id, field, source_value, dest_value }
Determine root cause: transformation bug? Missed CDC event? Truncation?
Layer 4: Application-Level Validation (Shadow Reads)
Compare actual query results, not just raw data:
Application issues: SELECT * FROM users WHERE email = '[email protected]'
Run on both source and destination
Compare result sets
This catches issues that raw data comparison misses:
- Index differences causing different query plans
- Collation differences (case sensitivity)
- Timezone handling differences
Automated repair:
For rows missing in destination:
1. Re-read from source
2. Transform and write to destination
3. Verify write succeeded
For rows with mismatched values:
1. If CDC event exists for this row → re-apply CDC event
2. If no CDC event → re-read from source and overwrite destination
3. Log repair action for audit
For rows missing in source but present in destination:
1. Check if row was deleted in source after backfill
2. If yes → apply delete to destination
3. If no → orphan row, investigate manually
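The repair decision tree above can be condensed into one dispatcher. This is a simplified sketch over dict stand-ins (it always re-reads from source rather than re-applying CDC events, and takes `deleted_in_source` as a precomputed flag); returning the action taken gives the audit log described in step 3:

```python
def repair_row(key, source, dest, deleted_in_source):
    """Repair one discrepancy between source and dest; return the action for auditing."""
    if key in source and key not in dest:
        dest[key] = dict(source[key])      # missing in destination: re-migrate
        return "remigrated"
    if key in source and source[key] != dest[key]:
        dest[key] = dict(source[key])      # mismatched values: overwrite from source
        return "overwritten"
    if key not in source and key in dest:
        if deleted_in_source:
            del dest[key]                  # apply the delete that CDC missed
            return "deleted"
        return "orphan_investigate"        # orphan row: leave for manual review
    return "ok"
```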
Deep Dive 3: Rollback Strategy
The nightmare scenario: After cutover, a critical bug is discovered in the new system. Need to switch back to the old system immediately.
Rollback design:
During cutover, we started reverse CDC:
Destination DB → Debezium → Kafka → Reverse CDC Consumer → Source DB
This keeps the source DB up-to-date with all changes made to the destination
after cutover.
Rollback procedure:
1. Flip feature flag: route all traffic back to source
2. Wait for reverse CDC to drain (lag → 0)
3. Destination is now idle; source has all the latest data
4. Investigate the issue with the destination
5. Fix the issue
6. Re-run validation
7. Re-attempt cutover when ready
Rollback time: < 1 minute (just a feature flag flip)
Data loss: 0 (reverse CDC keeps source in sync)
Rollback window:
We maintain the reverse CDC pipeline for a "rollback window" (e.g., 7 days after cutover).
During this window:
- Source DB continues to receive reverse CDC events
- Source DB can serve production traffic at any time
- Both databases are operational
After the rollback window:
- Confidence is high → decommission source
- Stop reverse CDC
- Archive source DB (cold storage, 90-day retention for compliance)
- Migration is officially complete
Feature flag design for zero-downtime cutover:
Migration Library (in application code):

read_target = feature_flag("migration_users_read")    # "source" | "destination" | "shadow"
write_target = feature_flag("migration_users_write")  # "source" | "destination"

def get_user(user_id):
    if read_target == "source":
        return source_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "destination":
        return dest_db.query("SELECT * FROM users WHERE id = ?", user_id)
    elif read_target == "shadow":
        result = source_db.query(...)  # serve this
        # async: compare with dest_db.query(...) and log discrepancies
        return result
This allows:
1. Gradual read migration (1% → 100%)
2. Instant rollback (flip flag to "source")
3. Shadow mode for validation
4. Per-table control (users might migrate before orders)
7. Extensions (2 min)
- Cross-region migration: Migrate data from one AWS region to another. Same approach, but CDC traffic goes over a cross-region replication link. Higher latency (50-100ms) means cutover requires a longer pause or accepting a brief inconsistency window. Use DynamoDB Global Tables or Aurora Global Database for managed cross-region sync.
- Multi-tenancy aware migration: Migrate tenants one at a time (shard-level migration). Each tenant is independent — if one tenant’s migration fails, others are unaffected. Supports canary deployments: migrate 1 tenant, validate, then batch-migrate the rest.
- Schema evolution during migration: Support not just a single schema change but ongoing schema evolution. Use a schema registry (Avro/Protobuf) to version transformation logic. If the destination schema changes mid-migration, update the transformer without restarting the backfill.
- Data masking and compliance: During migration, apply PII masking or encryption transformations. Useful when migrating from a legacy system without encryption to a new system with column-level encryption. Integrate with a secrets manager for key management.
- Automated migration testing: Before running on production, replay the migration against a staging copy of the source DB. Validate outputs, measure performance, and identify edge cases (null values, Unicode, maximum field lengths) in a safe environment.