1. Requirements & Scope (5 min)
Functional Requirements
- Service Discovery: Services register themselves on startup and discover other services by name. The registry converges to the set of actually running instances within seconds of a change.
- Configuration Management: Centrally manage and distribute configuration to all services. Support versioning, rollback, and environment-specific overrides.
- Health Checking: Continuously monitor service health. Automatically remove unhealthy instances from the service registry. Support liveness and readiness probes.
- Rolling Deployments: Orchestrate zero-downtime deployments by gradually replacing old instances with new ones, with automatic rollback on failure.
- Load Balancing Policy: Define and enforce load balancing policies (round-robin, least-connections, weighted) and circuit breaking rules at the control plane level.
Non-Functional Requirements
- Availability: 99.999% — if the control plane goes down, services can’t discover each other, configs can’t update, and deployments halt. Data plane must continue to function independently during control plane outages.
- Latency: Service discovery lookups < 5ms. Config pushes reach all nodes within 30 seconds. Health check detection < 10 seconds.
- Consistency: Service registry must be strongly consistent (a deregistered service must never receive traffic). Config updates must be atomically applied (no partial config states).
- Scale: 50K service instances across 500 services. 10 regions. 1M service discovery lookups/sec. 100K config reads/sec.
- Partition Tolerance: Control plane must handle network partitions gracefully. The data plane (actual service-to-service traffic) must continue even if the control plane is unreachable.
2. Estimation (3 min)
Service Registry
- 500 services × 100 instances each = 50K registered instances
- Each registration: ~1KB (service name, host, port, metadata, health status, version)
- Total registry size: 50K × 1KB = 50MB — trivially fits in memory on every node
- Registration/deregistration events: ~10K/hour (deploys, scaling, failures)
Service Discovery
- 50K instances, each resolving other services ~20 times/sec = 1M lookups/sec
- With local caching (refresh every 5-10 seconds): actual control plane queries = 50K instances / 5 sec = 10K queries/sec — very manageable
Health Checking
- 50K instances, health checked every 5 seconds = 10K health checks/sec
- Each health check: ~200 bytes response
- Network: 10K × 200 bytes = 2MB/sec — trivial
Configuration
- 500 services × 5 config keys each = 2,500 config entries
- Total config data: 2,500 × 10KB average = 25MB
- Config change events: ~50/day (most configs rarely change)
- Config reads: 50K instances poll every 30 seconds = 1,700 reads/sec
Key Insight
The control plane is a low-throughput, high-availability system. Data volumes are small (< 100MB total state). The hard problems are availability, consistency, and graceful degradation — not scale.
3. API Design (3 min)
Service Registration
POST /v1/services/{service_name}/instances
Body: {
"instance_id": "i-abc123",
"host": "10.0.1.42",
"port": 8080,
"metadata": {
"version": "2.3.1",
"region": "us-east-1",
"az": "us-east-1a",
"canary": false
},
"health_check": {
"type": "http",
"path": "/healthz",
"interval_seconds": 5,
"timeout_seconds": 2,
"unhealthy_threshold": 3
}
}
Response 201: { "instance_id": "i-abc123", "lease_ttl": 30 }
// Heartbeat (renew lease)
PUT /v1/services/{service_name}/instances/{instance_id}/heartbeat
Response 200: { "lease_ttl": 30 }
// If heartbeat not sent within lease_ttl → instance auto-deregistered
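The register-then-heartbeat flow above can be sketched as a small client. The transport is injected so the example stays self-contained; `RegistrationClient`, the `send` callable, and the renewal cadence are illustrative, not part of the API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegistrationClient:
    """Sketch of an instance's registration/heartbeat loop (hypothetical transport)."""
    send: Callable          # (method, path, body) -> dict response
    service: str
    instance_id: str
    lease_ttl: int = 0

    def register(self, host: str, port: int) -> None:
        body = {"instance_id": self.instance_id, "host": host, "port": port}
        resp = self.send("POST", f"/v1/services/{self.service}/instances", body)
        self.lease_ttl = resp["lease_ttl"]

    def heartbeat_interval(self) -> float:
        # Renew well inside the lease window so one missed beat doesn't expire us.
        return self.lease_ttl / 3

    def heartbeat(self) -> None:
        path = f"/v1/services/{self.service}/instances/{self.instance_id}/heartbeat"
        self.lease_ttl = self.send("PUT", path, None)["lease_ttl"]
```

With a 30-second lease, the client renews every 10 seconds, tolerating two missed heartbeats before auto-deregistration.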
Service Discovery
GET /v1/services/{service_name}/instances
Query: ?healthy=true&region=us-east-1&version=2.3.x
Response: {
"instances": [
{"instance_id": "i-abc", "host": "10.0.1.42", "port": 8080, "weight": 100, ...},
{"instance_id": "i-def", "host": "10.0.1.43", "port": 8080, "weight": 100, ...}
],
"version": 42, // for long-polling / watch
"lb_policy": "least_connections"
}
// Watch (long-poll for changes):
GET /v1/services/{service_name}/instances?watch=true&after_version=42
→ Hangs until version > 42, then returns new instance list
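A minimal server-side sketch of the long-poll semantics: watchers block on a condition variable until the registry version advances past `after_version`. `WatchableRegistry` is a toy single-service model, not the real registry.

```python
import threading

class WatchableRegistry:
    """Long-poll watch: callers block until the version advances past
    after_version, or until the timeout elapses (then return current state)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0
        self._instances = []

    def update(self, instances):
        with self._cond:
            self._instances = list(instances)
            self._version += 1
            self._cond.notify_all()      # wake all parked long-pollers

    def watch(self, after_version, timeout=30.0):
        with self._cond:
            self._cond.wait_for(lambda: self._version > after_version, timeout)
            return self._version, list(self._instances)
```

On timeout the call returns the unchanged version, which the client simply resubmits — the standard long-poll loop.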
Configuration
PUT /v1/config/{service_name}/{key}
Body: { "value": "new_value", "comment": "increase timeout for holiday traffic" }
Response 200: { "version": 15, "previous_value": "old_value" }
GET /v1/config/{service_name}
Response: { "configs": { "db_pool_size": "20", "timeout_ms": "5000", ... }, "version": 15 }
// Watch for config changes:
GET /v1/config/{service_name}?watch=true&after_version=15
Deployment
POST /v1/deployments
Body: {
"service_name": "user-service",
"target_version": "2.4.0",
"strategy": "rolling",
"max_unavailable": "25%",
"max_surge": "25%",
"health_check_grace_period_seconds": 30,
"auto_rollback": true
}
Response 201: { "deployment_id": "dep_xyz", "status": "in_progress" }
GET /v1/deployments/{deployment_id}
Response: {
"status": "in_progress", // in_progress, completed, rolling_back, failed
"progress": { "desired": 100, "updated": 45, "available": 95, "unavailable": 5 },
"events": [...]
}
4. Data Model (3 min)
Service Registry (etcd / Consul KV)
// Stored as key-value pairs in etcd:
Key: /services/{service_name}/instances/{instance_id}
Value: {
"host": "10.0.1.42",
"port": 8080,
"metadata": { "version": "2.3.1", "region": "us-east-1", ... },
"health_status": "healthy", // healthy, unhealthy, unknown
"last_heartbeat": "2026-02-22T10:30:00Z",
"registered_at": "2026-02-22T08:00:00Z"
}
Lease TTL: 30 seconds (auto-delete if heartbeat not renewed)
Configuration (etcd / Consul KV)
Key: /config/{service_name}/{key}
Value: {
"value": "20",
"version": 15,
"updated_by": "[email protected]",
"updated_at": "2026-02-22T09:00:00Z",
"comment": "increase pool size for holiday traffic"
}
// etcd provides native versioning via revision numbers
// Watch API notifies subscribers of any change under a prefix
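To make the versioning and rollback semantics concrete, here is an in-memory sketch. The real design would lean on etcd revisions; `ConfigStore` and its append-only history are purely illustrative.

```python
class ConfigStore:
    """In-memory sketch of versioned config with rollback."""
    def __init__(self):
        self._history = {}   # (service, key) -> list of values (index = version - 1)

    def put(self, service, key, value):
        hist = self._history.setdefault((service, key), [])
        previous = hist[-1] if hist else None
        hist.append(value)
        return {"version": len(hist), "previous_value": previous}

    def get(self, service, key, version=None):
        hist = self._history[(service, key)]
        return hist[-1] if version is None else hist[version - 1]

    def rollback(self, service, key):
        # Rollback writes the previous value as a NEW version; history stays append-only.
        hist = self._history[(service, key)]
        return self.put(service, key, hist[-2])
```

Note that rollback produces version 3, not a deletion of version 2 — the audit trail is never rewritten.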
Deployment State (PostgreSQL — for durability and complex queries)
Table: deployments
deployment_id (PK) | uuid
service_name | varchar(100)
target_version | varchar(50)
strategy | enum('rolling', 'blue_green', 'canary')
status | enum('pending', 'in_progress', 'completed', 'rolling_back', 'failed')
desired_instances | int
config | jsonb -- max_unavailable, max_surge, etc.
started_at | timestamp
completed_at | timestamp
Table: deployment_events
event_id (PK) | uuid
deployment_id (FK) | uuid
event_type | varchar(50) -- instance_started, health_check_passed, rollback_triggered
instance_id | varchar(50)
message | text
timestamp | timestamp
Why These Choices
- etcd for service registry and config: Purpose-built for this. Raft consensus for strong consistency. Watch API for real-time change notifications. TTL leases for automatic deregistration. Used by Kubernetes, proven at scale.
- PostgreSQL for deployment state: Deployments are complex state machines that benefit from ACID transactions, complex queries (“show all failed deployments in the last 7 days”), and durability.
- Not ZooKeeper: etcd has a simpler API (gRPC + REST), better performance for watches, and a more modern operational model. ZooKeeper requires managing sessions and ephemeral nodes — etcd leases are cleaner.
5. High-Level Design (12 min)
Architecture
┌─────────────────────────────────┐
│ Control Plane │
│ │
│ ┌───────────┐ ┌─────────────┐ │
│ │ Registry │ │ Config │ │
│ │ Service │ │ Service │ │
│ └─────┬─────┘ └──────┬──────┘ │
│ │ │ │
│ ┌─────┴───────────────┴──────┐ │
│ │ etcd Cluster │ │
│ │ (3 or 5 node Raft) │ │
│ └────────────────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Health │ │ Deployment │ │
│ │ Checker │ │ Controller │ │
│ └────────────┘ └────────────┘ │
└──────────────┬───────────────────┘
│
───────────────┼───────────────────
│
┌──────────────┴───────────────────┐
│ Data Plane │
│ │
│ Service A Service B ... │
│ (sidecar) (sidecar) │
│ instance 1 instance 1 │
│ instance 2 instance 2 │
│ ... ... │
└───────────────────────────────────┘
Control Plane vs Data Plane Separation
Control Plane (our design):
→ Manages the "what should be" state
→ Service registry, config, health policy, deployment orchestration
→ Pushes state to data plane sidecars
→ Can go down briefly without affecting running traffic
Data Plane (sidecars / service mesh):
→ Handles actual service-to-service traffic
→ Uses locally cached routing tables from control plane
→ Makes load balancing and circuit breaking decisions locally
→ Continues to function with stale data during control plane outages
Service Discovery with Caching
Service A wants to call Service B:
1. Sidecar proxy on Service A checks local cache for Service B instances
2. If cache is fresh (< 10 seconds): route to cached instance list
3. If cache is stale: async refresh from control plane
→ GET /v1/services/service-b/instances?watch=true&after_version=42
→ Long-poll returns immediately if there's a new version
→ Update local cache
4. If control plane is unreachable: use stale cache (graceful degradation)
→ Log warning, alert ops
5. Load balance across healthy instances using configured policy
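The cache-with-stale-fallback logic in steps 1-4 can be sketched as follows, assuming a pluggable `fetch` callable and an injectable clock (both illustrative):

```python
import time

class DiscoveryCache:
    """Sidecar-side cache: serve fresh entries, refetch stale ones, and fall
    back to stale data when the control plane is unreachable."""
    def __init__(self, fetch, ttl=10.0, clock=time.monotonic):
        self.fetch = fetch          # (service) -> instance list; may raise
        self.ttl = ttl
        self.clock = clock
        self._cache = {}            # service -> (fetched_at, instances)

    def lookup(self, service):
        entry = self._cache.get(service)
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[1]                      # fresh: serve from cache
        try:
            instances = self.fetch(service)
            self._cache[service] = (self.clock(), instances)
            return instances
        except ConnectionError:
            if entry:                            # control plane down: degrade to stale
                return entry[1]                  # (log warning, alert ops)
            raise                                # no cache at all: surface the error
```

The key design point: a control plane outage degrades discovery to stale-but-usable data rather than failing requests outright.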
Components
- etcd Cluster (3 or 5 nodes): Core state store. Raft consensus. Holds service registry and configuration. Provides watch API for change notifications.
- Registry Service: API layer over etcd for service registration/discovery. Manages lease renewal. Provides aggregated views (all instances of a service).
- Config Service: API layer for configuration management. Supports versioning, rollback, environment overrides. Validates configs before applying.
- Health Checker: Actively probes all registered instances. Updates health status in registry. Supports HTTP, TCP, and gRPC health checks.
- Deployment Controller: Orchestrates rolling deployments. State machine managing deployment lifecycle. Integrates with health checker for readiness gates.
- Sidecar Proxy (data plane): Runs alongside each service instance. Caches service registry and config locally. Handles load balancing, circuit breaking, retries, and mTLS.
- Admin Dashboard: Visualize service topology, health status, config diffs, deployment progress.
6. Deep Dives (15 min)
Deep Dive 1: Leader Election and Consensus
Why it matters: The control plane itself is distributed (3-5 nodes). Only one node should run the Health Checker and Deployment Controller at a time (to avoid duplicate health checks and conflicting deployment decisions). This requires leader election.
etcd-based leader election:
Leader Election Protocol:
1. Each control plane node creates a lease in etcd (TTL = 15 seconds)
2. Attempt to create a key with the lease:
PUT /election/health-checker
Value: "node-1"
LeaseID: lease_abc
IfNotExists: true // only succeeds if key doesn't exist
3. If successful → this node is the leader
→ Start running the Health Checker
→ Renew lease every 5 seconds (keepalive)
4. If key already exists → this node is a follower
→ Watch the key for deletion
→ On deletion (leader's lease expired) → attempt to acquire
5. If leader crashes → lease expires in 15 seconds → key deleted
→ Follower nodes race to acquire
→ Exactly one wins (etcd's atomic compare-and-swap)
→ New leader starts within 15-20 seconds of failure
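The acquire-and-race step can be illustrated with a toy create-if-absent store. `FakeEtcd` is a stand-in; real code would attach an etcd lease to the key and use a transaction for the compare-and-swap.

```python
class FakeEtcd:
    """Tiny stand-in for etcd's create-if-absent semantics."""
    def __init__(self):
        self.keys = {}

    def put_if_absent(self, key, value):
        # Atomic in etcd via a txn comparing key version == 0.
        if key in self.keys:
            return False
        self.keys[key] = value
        return True

    def delete(self, key):
        self.keys.pop(key, None)     # lease expiry deletes the key

def try_acquire(store, role, node_id):
    """One election round: exactly one contender wins the role's key."""
    return store.put_if_absent(f"/election/{role}", node_id)
```

Followers that lose the race watch the key and retry `try_acquire` when it is deleted, exactly as in steps 4-5 above.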
Raft consensus within etcd:
etcd uses Raft for internal replication:
→ 5-node cluster tolerates 2 node failures
→ Writes require majority (3/5) agreement
→ Reads can be served from any node (with linearizable reads from leader)
Write path:
Client → etcd Leader → replicate to followers (2 of 4, for a 3/5 majority) → commit → ack to client
Latency: ~5-10ms (within same region)
In a network partition:
→ Partition with majority (3+ nodes) continues to operate
→ Partition with minority (2 nodes) rejects writes
→ On heal: minority nodes catch up from leader's log
Multiple leader roles (avoid single point of failure):
Instead of one "leader" for everything:
→ Health Checker Leader: runs health probes
→ Deployment Controller Leader: manages deployments
→ Config Sync Leader: pushes config to edge caches
Each role has independent leader election.
If node-1 is Health Checker leader and crashes:
→ node-2 takes over Health Checker
→ node-3 remains Deployment Controller leader (unaffected)
Deep Dive 2: Health Checking and Failure Detection
Challenge: Distinguishing between “service is down” and “network is unreliable.” A health check failure from one checker doesn’t mean the service is unhealthy — the checker’s network might be the problem.
Multi-probe health checking:
Health Check Architecture:
→ 3 health checker instances (in different AZs)
→ Each checks every instance every 5 seconds
→ Instance is marked unhealthy only if 2/3 checkers agree
Decision matrix:
Checker A: healthy   | Checker B: unhealthy | Checker C: healthy   → HEALTHY (1 checker has network issue)
Checker A: unhealthy | Checker B: unhealthy | Checker C: healthy   → UNHEALTHY (2/3 agree)
Checker A: unhealthy | Checker B: unhealthy | Checker C: unhealthy → UNHEALTHY (all agree)
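The 2-of-3 rule reduces to a small vote-counting function; a minimal sketch:

```python
def quorum_health(votes, quorum=2):
    """Mark unhealthy only if at least `quorum` checkers report unhealthy,
    so a single checker's network issue can't evict a healthy instance."""
    unhealthy = sum(1 for v in votes if v == "unhealthy")
    return "unhealthy" if unhealthy >= quorum else "healthy"
```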
Health check types:
1. Liveness probe (is the process alive?):
→ TCP connect to port
→ HTTP GET /healthz → expects 200
→ Failure → restart the instance
2. Readiness probe (can it serve traffic?):
→ HTTP GET /ready → checks DB connection, cache connection, etc.
→ Failure → remove from load balancer (but don't restart)
→ Used during startup (warm-up period) and during degraded states
3. Deep health check (is it performing well?):
→ HTTP GET /health/deep → checks latency percentiles, error rates
→ Returns degraded state if p99 > threshold
→ Used for load shedding (reduce traffic to degraded instances)
Graceful removal and drain:
When an instance is detected as unhealthy:
1. Mark as "draining" in service registry (timestamp: now)
2. Stop sending NEW requests to this instance
3. Wait for in-flight requests to complete (drain period: 30 seconds)
4. After drain: mark as "unhealthy"
5. If health recovers during drain: cancel removal, mark healthy again
This prevents abrupt connection drops for in-flight requests.
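A sketch of the healthy → draining → unhealthy transitions, with recovery cancelling the drain. `InstanceState` and the injected clock are illustrative; a real implementation would also track in-flight request counts.

```python
import time

class InstanceState:
    """Graceful-removal state machine: healthy -> draining -> unhealthy."""
    def __init__(self, drain_seconds=30, clock=time.monotonic):
        self.state = "healthy"
        self.drain_seconds = drain_seconds
        self.clock = clock
        self._drain_started = None

    def on_check(self, result):
        if result == "unhealthy" and self.state == "healthy":
            self.state, self._drain_started = "draining", self.clock()
        elif result == "healthy" and self.state == "draining":
            self.state, self._drain_started = "healthy", None   # recovered mid-drain
        elif self.state == "draining" and self.clock() - self._drain_started >= self.drain_seconds:
            self.state = "unhealthy"

    def accepts_new_requests(self):
        # Draining instances finish in-flight work but take nothing new.
        return self.state == "healthy"
```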
Deep Dive 3: Circuit Breaking and Load Balancing Policies
Circuit breaker (configured at control plane, enforced at data plane):
Control Plane defines policy:
PUT /v1/policies/circuit-breaker/user-service
Body: {
"error_threshold_percent": 50, // open circuit if >50% errors
"window_seconds": 10, // in a 10-second window
"min_requests": 20, // need at least 20 requests to evaluate
"open_duration_seconds": 30, // stay open for 30 seconds
"half_open_requests": 5 // allow 5 probe requests in half-open
}
Sidecar enforces locally:
State machine: CLOSED → OPEN → HALF_OPEN → CLOSED
CLOSED (normal):
→ Track error rate in sliding window
→ If error_rate > 50% AND requests > 20: → OPEN
OPEN (circuit tripped):
→ Reject all requests immediately (fail-fast, return 503)
→ After 30 seconds → HALF_OPEN
HALF_OPEN (testing recovery):
→ Allow 5 requests through
→ If all succeed → CLOSED (service recovered)
→ If any fail → OPEN (still unhealthy, wait another 30 seconds)
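The state machine above maps directly to code. A minimal sketch using a counts-based window rather than a true 10-second sliding window (class and field names, and the injected clock, are illustrative):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN sketch using the policy fields above."""
    def __init__(self, error_threshold=0.5, min_requests=20,
                 open_duration=30.0, half_open_requests=5, clock=time.monotonic):
        self.state = "CLOSED"
        self.error_threshold = error_threshold
        self.min_requests = min_requests
        self.open_duration = open_duration
        self.half_open_limit = half_open_requests
        self.clock = clock
        self._requests = self._errors = self._half_open_ok = 0
        self._opened_at = None

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self._opened_at >= self.open_duration:
                self.state, self._half_open_ok = "HALF_OPEN", 0
            else:
                return False                 # fail fast while circuit is open
        return True

    def record(self, success):
        if self.state == "HALF_OPEN":
            if not success:
                self._trip()                 # any probe failure re-opens
            else:
                self._half_open_ok += 1
                if self._half_open_ok >= self.half_open_limit:
                    self.state = "CLOSED"
                    self._requests = self._errors = 0
            return
        self._requests += 1
        self._errors += 0 if success else 1
        if (self._requests >= self.min_requests
                and self._errors / self._requests > self.error_threshold):
            self._trip()

    def _trip(self):
        self.state, self._opened_at = "OPEN", self.clock()
```

A production version (e.g. what Envoy does) would use a rolling time window and decay old samples, but the state transitions are the same.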
Load balancing policies (control plane configurable):
Per-service policy configured in control plane, enforced by sidecars:
1. Round-robin (default):
→ Simple rotation across healthy instances
→ Good for homogeneous instances
2. Least connections:
→ Route to instance with fewest active requests
→ Better for heterogeneous latency (slow queries vs fast)
3. Weighted round-robin:
→ Instances with higher weight get more traffic
→ Used for canary deployments: new version weight=10, old version weight=90
4. Locality-aware:
→ Prefer instances in same AZ (lowest latency)
→ Fallback to same region, then cross-region
→ Configurable: "same-az: 80%, same-region: 15%, cross-region: 5%"
5. Consistent hashing:
→ Route requests with same key to same instance
→ Useful for stateful services or cache affinity
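Weighted round-robin, the policy used for canary traffic splits, can be sketched by expanding each instance by its weight and cycling. (A production sidecar would use smooth weighted round-robin to avoid bursts; this expansion form is fine for small weights.)

```python
import itertools

def weighted_round_robin(instances):
    """Expand each instance by its weight and rotate over the result."""
    expanded = [inst for inst in instances for _ in range(inst["weight"])]
    return itertools.cycle(expanded)

# canary split: new version gets 1 in 10 requests
picker = weighted_round_robin([
    {"host": "old-1", "weight": 9},
    {"host": "new-1", "weight": 1},
])
```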
Rolling deployment orchestration:
Deployment Controller state machine:
1. Validate: target version image exists, health check endpoint responds
2. Scale up: add max_surge% new instances (new version)
3. Wait for readiness: new instances must pass readiness probe
4. Shift traffic: update weights in load balancer (canary: 5% → 25% → 50% → 100%)
5. Monitor: check error rate, latency, and guardrail metrics at each step
6. If degradation detected:
→ Auto-rollback: shift traffic back to old version
→ Scale down new instances
→ Alert on-call
7. Scale down: remove old version instances
8. Complete: update service version in registry
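Steps 2 and 7 are bounded by max_surge and max_unavailable. The per-step limits reduce to simple arithmetic; the rounding conventions here (surge rounds up, unavailable rounds down, as Kubernetes does) are an assumption of this sketch.

```python
import math

def rolling_step_limits(desired, max_surge_pct, max_unavailable_pct):
    """Per-step instance bounds from the deployment request's percentages."""
    surge = math.ceil(desired * max_surge_pct / 100)
    unavailable = math.floor(desired * max_unavailable_pct / 100)
    return {
        "max_total": desired + surge,           # old + new instances allowed at once
        "min_available": desired - unavailable, # instances that must stay serving
    }
```

Rounding surge up and unavailable down guarantees progress even for tiny fleets: a 3-instance service with 25%/25% still gets one surge slot and keeps all 3 serving.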
7. Extensions (2 min)
- Service mesh integration: The control plane naturally extends into a full service mesh (like Istio/Linkerd). Add mTLS between services (automatic certificate rotation via control plane), distributed tracing injection, and request-level access policies.
- Multi-cluster federation: Federate control planes across multiple Kubernetes clusters or data centers. A global control plane aggregates service registries from regional control planes. Services can discover and route to instances in other regions.
- Chaos engineering integration: Control plane orchestrates chaos experiments — inject latency into specific routes, kill random instances, simulate network partitions. Validates that circuit breakers and fallbacks work correctly.
- GitOps configuration: Store all configuration in a Git repository. Control plane watches the repo and auto-applies changes on merge. Provides audit trail, peer review, and easy rollback (git revert).
- Adaptive rate limiting: Control plane monitors global request rates and dynamically adjusts per-service rate limits based on current capacity. During an incident, automatically shed load to protect core services.