1. Requirements & Scope (5 min)

Functional Requirements

  1. Service Discovery: Services register themselves on startup and discover other services by name. The registry reflects the current set of running instances within seconds of any change.
  2. Configuration Management: Centrally manage and distribute configuration to all services. Support versioning, rollback, and environment-specific overrides.
  3. Health Checking: Continuously monitor service health. Automatically remove unhealthy instances from the service registry. Support liveness and readiness probes.
  4. Rolling Deployments: Orchestrate zero-downtime deployments by gradually replacing old instances with new ones, with automatic rollback on failure.
  5. Load Balancing Policy: Define and enforce load balancing policies (round-robin, least-connections, weighted) and circuit breaking rules at the control plane level.

Non-Functional Requirements

  • Availability: 99.999% — if the control plane goes down, services can’t discover each other, configs can’t update, and deployments halt. Data plane must continue to function independently during control plane outages.
  • Latency: Service discovery lookups < 5ms. Config pushes reach all nodes within 30 seconds. Health check detection < 10 seconds.
  • Consistency: Service registry must be strongly consistent (a deregistered service must never receive traffic). Config updates must be atomically applied (no partial config states).
  • Scale: 50K service instances across 500 services. 10 regions. 1M service discovery lookups/sec. 100K config reads/sec.
  • Partition Tolerance: Control plane must handle network partitions gracefully. The data plane (actual service-to-service traffic) must continue even if the control plane is unreachable.

2. Estimation (3 min)

Service Registry

  • 500 services × 100 instances each = 50K registered instances
  • Each registration: ~1KB (service name, host, port, metadata, health status, version)
  • Total registry size: 50K × 1KB = 50MB — trivially fits in memory on every node
  • Registration/deregistration events: ~10K/hour (deploys, scaling, failures)

Service Discovery

  • 50K instances, each resolving other services ~20 times/sec = 1M lookups/sec
  • With local caching (refresh every 5-10 seconds): actual control plane queries = 50K instances / 5 sec = 10K queries/sec — very manageable

Health Checking

  • 50K instances, health checked every 5 seconds = 10K health checks/sec
  • Each health check: ~200 bytes response
  • Network: 10K × 200 bytes = 2MB/sec — trivial

Configuration

  • 500 services × 5 config keys each = 2,500 config entries
  • Total config data: 2,500 × 10KB average = 25MB
  • Config change events: ~50/day (most configs rarely change)
  • Config reads: 50K instances polling every 30 seconds ≈ 1,700 reads/sec

Key Insight

The control plane is a low-throughput, high-availability system. Data volumes are small (< 100MB total state). The hard problems are availability, consistency, and graceful degradation — not scale.
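
These estimates are simple enough to sanity-check in a few lines of Python (the constants are just the assumptions stated above):

```python
# Back-of-envelope check of the numbers above (all inputs are the
# assumptions stated in this section).
SERVICES = 500
INSTANCES = SERVICES * 100                       # 50K instances

registry_mb = INSTANCES * 1_024 / 1e6            # ~1KB per registration
lookups_per_sec = INSTANCES * 20                 # each resolves ~20x/sec
cp_queries_per_sec = INSTANCES // 5              # one cache refresh per 5s
health_checks_per_sec = INSTANCES // 5           # probed every 5 seconds
config_reads_per_sec = INSTANCES // 30           # poll every 30 seconds

print(round(registry_mb))                        # 51 (≈50 MB of state)
print(lookups_per_sec)                           # 1000000
print(cp_queries_per_sec, health_checks_per_sec) # 10000 10000
print(config_reads_per_sec)                      # 1666 (≈1,700 reads/sec)
```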


3. API Design (3 min)

Service Registration

POST /v1/services/{service_name}/instances
  Body: {
    "instance_id": "i-abc123",
    "host": "10.0.1.42",
    "port": 8080,
    "metadata": {
      "version": "2.3.1",
      "region": "us-east-1",
      "az": "us-east-1a",
      "canary": false
    },
    "health_check": {
      "type": "http",
      "path": "/healthz",
      "interval_seconds": 5,
      "timeout_seconds": 2,
      "unhealthy_threshold": 3
    }
  }
  Response 201: { "instance_id": "i-abc123", "lease_ttl": 30 }

// Heartbeat (renew lease)
PUT /v1/services/{service_name}/instances/{instance_id}/heartbeat
  Response 200: { "lease_ttl": 30 }

// If heartbeat not sent within lease_ttl → instance auto-deregistered
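
The heartbeat contract can be sketched as a client-side renewal loop. This is illustrative only — `send_heartbeat` stands in for the PUT call above, and renewing at one third of the TTL is a common convention, not something this API mandates:

```python
def heartbeat_loop(send_heartbeat, lease_ttl, stop, sleep):
    """Renew the lease well inside its TTL (every ttl/3) so that a couple
    of lost heartbeats don't cause a spurious deregistration.

    send_heartbeat() models PUT .../instances/{id}/heartbeat and returns
    the (possibly updated) lease TTL from the response.
    """
    while not stop():
        try:
            lease_ttl = send_heartbeat()
            interval = lease_ttl / 3
        except ConnectionError:
            interval = 1.0           # retry sooner; the lease has slack
        sleep(interval)

# Simulated run: three successful renewals, then stop.
beats = []
def fake_send():
    beats.append(1)
    return 30
heartbeat_loop(fake_send, 30, stop=lambda: len(beats) >= 3,
               sleep=lambda s: None)
print(len(beats))  # 3
```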

Service Discovery

GET /v1/services/{service_name}/instances
  Query: ?healthy=true&region=us-east-1&version=2.3.x
  Response: {
    "instances": [
      {"instance_id": "i-abc", "host": "10.0.1.42", "port": 8080, "weight": 100, ...},
      {"instance_id": "i-def", "host": "10.0.1.43", "port": 8080, "weight": 100, ...}
    ],
    "version": 42,            // for long-polling / watch
    "lb_policy": "least_connections"
  }

// Watch (long-poll for changes):
GET /v1/services/{service_name}/instances?watch=true&after_version=42
  → Hangs until version > 42, then returns new instance list
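
A client consumes the watch endpoint with a loop like the following sketch, where `long_poll` stands in for the blocking GET above:

```python
def watch_instances(long_poll, apply_update, start_version, stop):
    """Long-poll watch loop for GET ...?watch=true&after_version=N.

    long_poll(after_version) blocks until the registry version exceeds
    after_version, then returns (new_version, instance_list).
    """
    version = start_version
    while not stop():
        new_version, instances = long_poll(version)
        if new_version > version:          # ignore spurious wakeups
            apply_update(instances)
            version = new_version
    return version

# Simulated server that bumps the version once per call.
state = {"v": 42}
def fake_poll(after):
    state["v"] += 1
    return state["v"], [f"instance-at-v{state['v']}"]

seen = []
final = watch_instances(fake_poll, seen.append, 42,
                        stop=lambda: len(seen) >= 2)
print(final, seen)  # 44 [['instance-at-v43'], ['instance-at-v44']]
```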

Configuration

PUT /v1/config/{service_name}/{key}
  Body: { "value": "new_value", "comment": "increase timeout for holiday traffic" }
  Response 200: { "version": 15, "previous_value": "old_value" }

GET /v1/config/{service_name}
  Response: { "configs": { "db_pool_size": "20", "timeout_ms": "5000", ... }, "version": 15 }

// Watch for config changes:
GET /v1/config/{service_name}?watch=true&after_version=15

Deployment

POST /v1/deployments
  Body: {
    "service_name": "user-service",
    "target_version": "2.4.0",
    "strategy": "rolling",
    "max_unavailable": "25%",
    "max_surge": "25%",
    "health_check_grace_period_seconds": 30,
    "auto_rollback": true
  }
  Response 201: { "deployment_id": "dep_xyz", "status": "in_progress" }

GET /v1/deployments/{deployment_id}
  Response: {
    "status": "in_progress",       // in_progress, completed, rolling_back, failed
    "progress": { "desired": 100, "updated": 45, "available": 95, "unavailable": 5 },
    "events": [...]
  }

4. Data Model (3 min)

Service Registry (etcd / Consul KV)

// Stored as key-value pairs in etcd:
Key: /services/{service_name}/instances/{instance_id}
Value: {
  "host": "10.0.1.42",
  "port": 8080,
  "metadata": { "version": "2.3.1", "region": "us-east-1", ... },
  "health_status": "healthy",       // healthy, unhealthy, unknown
  "last_heartbeat": "2026-02-22T10:30:00Z",
  "registered_at": "2026-02-22T08:00:00Z"
}
Lease TTL: 30 seconds (auto-delete if heartbeat not renewed)

Configuration (etcd / Consul KV)

Key: /config/{service_name}/{key}
Value: {
  "value": "20",
  "version": 15,
  "updated_by": "[email protected]",
  "updated_at": "2026-02-22T09:00:00Z",
  "comment": "increase pool size for holiday traffic"
}
// etcd provides native versioning via revision numbers
// Watch API notifies subscribers of any change under a prefix

Deployment State (PostgreSQL — for durability and complex queries)

Table: deployments
  deployment_id    (PK) | uuid
  service_name           | varchar(100)
  target_version         | varchar(50)
  strategy               | enum('rolling', 'blue_green', 'canary')
  status                 | enum('pending', 'in_progress', 'completed', 'rolling_back', 'failed')
  desired_instances      | int
  config                 | jsonb  -- max_unavailable, max_surge, etc.
  started_at             | timestamp
  completed_at           | timestamp

Table: deployment_events
  event_id         (PK) | uuid
  deployment_id    (FK) | uuid
  event_type             | varchar(50)  -- instance_started, health_check_passed, rollback_triggered
  instance_id            | varchar(50)
  message                | text
  timestamp              | timestamp

Why These Choices

  • etcd for service registry and config: Purpose-built for this. Raft consensus for strong consistency. Watch API for real-time change notifications. TTL leases for automatic deregistration. Used by Kubernetes, proven at scale.
  • PostgreSQL for deployment state: Deployments are complex state machines that benefit from ACID transactions, complex queries (“show all failed deployments in the last 7 days”), and durability.
  • Not ZooKeeper: etcd has a simpler API (gRPC + REST), better performance for watches, and a more modern operational model. ZooKeeper requires managing sessions and ephemeral nodes — etcd leases are cleaner.

5. High-Level Design (12 min)

Architecture

                    ┌──────────────────────────────────────┐
                    │            Control Plane             │
                    │                                      │
                    │  ┌────────────┐    ┌────────────┐    │
                    │  │  Registry  │    │   Config   │    │
                    │  │  Service   │    │  Service   │    │
                    │  └─────┬──────┘    └─────┬──────┘    │
                    │        │                 │           │
                    │  ┌─────┴─────────────────┴────────┐  │
                    │  │          etcd Cluster          │  │
                    │  │      (3- or 5-node Raft)       │  │
                    │  └────────────────────────────────┘  │
                    │                                      │
                    │  ┌────────────┐    ┌────────────┐    │
                    │  │   Health   │    │ Deployment │    │
                    │  │  Checker   │    │ Controller │    │
                    │  └────────────┘    └────────────┘    │
                    └──────────────────┬───────────────────┘
                                       │
                    ───────────────────┼────────────────────
                                       │
                    ┌──────────────────┴───────────────────┐
                    │              Data Plane              │
                    │                                      │
                    │  Service A      Service B     ...    │
                    │  (sidecar)      (sidecar)            │
                    │  instance 1     instance 1           │
                    │  instance 2     instance 2           │
                    │  ...            ...                  │
                    └──────────────────────────────────────┘

Control Plane vs Data Plane Separation

Control Plane (our design):
  → Manages the "what should be" state
  → Service registry, config, health policy, deployment orchestration
  → Pushes state to data plane sidecars
  → Can go down briefly without affecting running traffic

Data Plane (sidecars / service mesh):
  → Handles actual service-to-service traffic
  → Uses locally cached routing tables from control plane
  → Makes load balancing and circuit breaking decisions locally
  → Continues to function with stale data during control plane outages

Service Discovery with Caching

Service A wants to call Service B:

1. Sidecar proxy on Service A checks local cache for Service B instances
2. If cache is fresh (< 10 seconds): route to cached instance list
3. If cache is stale: async refresh from control plane
   → GET /v1/services/service-b/instances?watch=true&after_version=42
   → Long-poll returns immediately if there's a new version
   → Update local cache
4. If control plane is unreachable: use stale cache (graceful degradation)
   → Log warning, alert ops
5. Load balance across healthy instances using configured policy
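
The cache-with-stale-fallback behavior in steps 1-4 can be sketched as follows (the `refresh` callback stands in for the control-plane watch API; the 10-second freshness window is the assumption from step 2):

```python
import time

class DiscoveryCache:
    """Sidecar-local cache of a service's instance list (sketch)."""

    def __init__(self, refresh, fresh_for=10.0, clock=time.monotonic):
        self.refresh, self.fresh_for, self.clock = refresh, fresh_for, clock
        self.instances, self.fetched_at = [], float("-inf")

    def lookup(self):
        age = self.clock() - self.fetched_at
        if age >= self.fresh_for:
            try:
                self.instances = self.refresh()
                self.fetched_at = self.clock()
            except ConnectionError:
                pass  # control plane unreachable: serve stale cache
        return self.instances

# Usage: first refresh succeeds, later the control plane is down.
now = [0.0]
cache = DiscoveryCache(lambda: ["10.0.1.42:8080"], clock=lambda: now[0])
print(cache.lookup())        # ['10.0.1.42:8080']
cache.refresh = lambda: (_ for _ in ()).throw(ConnectionError())
now[0] = 60.0                # cache is stale AND the refresh fails...
print(cache.lookup())        # ['10.0.1.42:8080']  (graceful degradation)
```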

Components

  1. etcd Cluster (3 or 5 nodes): Core state store. Raft consensus. Holds service registry and configuration. Provides watch API for change notifications.
  2. Registry Service: API layer over etcd for service registration/discovery. Manages lease renewal. Provides aggregated views (all instances of a service).
  3. Config Service: API layer for configuration management. Supports versioning, rollback, environment overrides. Validates configs before applying.
  4. Health Checker: Actively probes all registered instances. Updates health status in registry. Supports HTTP, TCP, and gRPC health checks.
  5. Deployment Controller: Orchestrates rolling deployments. State machine managing deployment lifecycle. Integrates with health checker for readiness gates.
  6. Sidecar Proxy (data plane): Runs alongside each service instance. Caches service registry and config locally. Handles load balancing, circuit breaking, retries, and mTLS.
  7. Admin Dashboard: Visualize service topology, health status, config diffs, deployment progress.

6. Deep Dives (15 min)

Deep Dive 1: Leader Election and Consensus

Why it matters: The control plane itself is distributed (3-5 nodes). Only one node should run the Health Checker and Deployment Controller at a time (to avoid duplicate health checks and conflicting deployment decisions). This requires leader election.

etcd-based leader election:

Leader Election Protocol:
  1. Each control plane node creates a lease in etcd (TTL = 15 seconds)
  2. Attempt to create a key with the lease:
     PUT /election/health-checker
     Value: "node-1"
     LeaseID: lease_abc
     IfNotExists: true    // only succeeds if key doesn't exist

  3. If successful → this node is the leader
     → Start running the Health Checker
     → Renew lease every 5 seconds (keepalive)

  4. If key already exists → this node is a follower
     → Watch the key for deletion
     → On deletion (leader's lease expired) → attempt to acquire

  5. If leader crashes → lease expires in 15 seconds → key deleted
     → Follower nodes race to acquire
     → Exactly one wins (etcd's atomic compare-and-swap)
     → New leader starts within 15-20 seconds of failure
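
The acquire step is an atomic create-if-absent. The sketch below models it with an in-memory stand-in for etcd; a real client would express the same operation as an etcd transaction whose compare clause checks `create_revision == 0`, with the key attached to a lease:

```python
class FakeEtcd:
    """In-memory stand-in for etcd's compare-and-swap create (sketch)."""
    def __init__(self):
        self.kv = {}

    def put_if_absent(self, key, value):
        if key in self.kv:          # atomic in real etcd (single txn)
            return False
        self.kv[key] = value
        return True

    def delete(self, key):
        self.kv.pop(key, None)      # models the leader's lease expiring

def campaign(etcd, key, node_id):
    return etcd.put_if_absent(key, node_id)   # True → this node leads

etcd = FakeEtcd()
print(campaign(etcd, "/election/health-checker", "node-1"))  # True
print(campaign(etcd, "/election/health-checker", "node-2"))  # False
etcd.delete("/election/health-checker")                      # lease expired
print(campaign(etcd, "/election/health-checker", "node-2"))  # True
```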

Raft consensus within etcd:

etcd uses Raft for internal replication:
  → 5-node cluster tolerates 2 node failures
  → Writes require majority (3/5) agreement
  → Linearizable reads are confirmed through the leader; serializable (possibly stale) reads can be served by any node

Write path:
  Client → etcd Leader → replicate to followers (2 of 4, for a 3/5 majority) → commit → ack to client
  Latency: ~5-10ms (within same region)

In a network partition:
  → Partition with majority (3+ nodes) continues to operate
  → Partition with minority (2 nodes) rejects writes
  → On heal: minority nodes catch up from leader's log

Multiple leader roles (limit the blast radius of a leader failure):

Instead of one "leader" for everything:
  → Health Checker Leader: runs health probes
  → Deployment Controller Leader: manages deployments
  → Config Sync Leader: pushes config to edge caches

Each role has independent leader election.
If node-1 is Health Checker leader and crashes:
  → node-2 takes over Health Checker
  → node-3 remains Deployment Controller leader (unaffected)

Deep Dive 2: Health Checking and Failure Detection

Challenge: Distinguishing between “service is down” and “network is unreliable.” A health check failure from one checker doesn’t mean the service is unhealthy — the checker’s network might be the problem.

Multi-probe health checking:

Health Check Architecture:
  → 3 health checker instances (in different AZs)
  → Each checks every instance every 5 seconds
  → Instance is marked unhealthy only if 2/3 checkers agree

Decision matrix:
  Checker A: healthy   | Checker B: unhealthy | Checker C: healthy   → HEALTHY   (1 checker has a network issue)
  Checker A: unhealthy | Checker B: unhealthy | Checker C: healthy   → UNHEALTHY (2/3 agree)
  Checker A: unhealthy | Checker B: unhealthy | Checker C: unhealthy → UNHEALTHY (all agree)
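
The decision matrix reduces to a small quorum function; a sketch:

```python
def quorum_health(votes, quorum=2):
    """Mark an instance unhealthy only if at least `quorum` of the
    independent checkers (one per AZ) report it unhealthy."""
    unhealthy_votes = sum(1 for v in votes if v == "unhealthy")
    return "unhealthy" if unhealthy_votes >= quorum else "healthy"

print(quorum_health(["healthy", "unhealthy", "healthy"]))      # healthy
print(quorum_health(["unhealthy", "unhealthy", "healthy"]))    # unhealthy
print(quorum_health(["unhealthy", "unhealthy", "unhealthy"]))  # unhealthy
```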

Health check types:

1. Liveness probe (is the process alive?):
   → TCP connect to port
   → HTTP GET /healthz → expects 200
   → Failure → restart the instance

2. Readiness probe (can it serve traffic?):
   → HTTP GET /ready → checks DB connection, cache connection, etc.
   → Failure → remove from load balancer (but don't restart)
   → Used during startup (warm-up period) and during degraded states

3. Deep health check (is it performing well?):
   → HTTP GET /health/deep → checks latency percentiles, error rates
   → Returns degraded state if p99 > threshold
   → Used for load shedding (reduce traffic to degraded instances)

Graceful removal and drain:

When an instance is detected as unhealthy:
  1. Mark as "draining" in service registry (timestamp: now)
  2. Stop sending NEW requests to this instance
  3. Wait for in-flight requests to complete (drain period: 30 seconds)
  4. After drain: mark as "unhealthy"
  5. If health recovers during drain: cancel removal, mark healthy again

This prevents abrupt connection drops for in-flight requests.
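
The drain sequence above is a small state machine; a sketch, with times passed in explicitly so it is easy to test:

```python
class InstanceState:
    """healthy → draining → unhealthy, with recovery cancelling the drain
    (sketch; times are seconds from an arbitrary epoch)."""

    def __init__(self, drain_period=30.0):
        self.status, self.drain_period = "healthy", drain_period
        self.draining_since = None

    def on_unhealthy(self, now):
        if self.status == "healthy":
            self.status, self.draining_since = "draining", now

    def on_recovered(self):
        if self.status == "draining":          # cancel removal
            self.status, self.draining_since = "healthy", None

    def tick(self, now):
        if (self.status == "draining"
                and now - self.draining_since >= self.drain_period):
            self.status = "unhealthy"

s = InstanceState()
s.on_unhealthy(now=0.0)
s.tick(now=10.0)
print(s.status)   # draining (still within the 30s drain period)
s.tick(now=31.0)
print(s.status)   # unhealthy
```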

Deep Dive 3: Circuit Breaking and Load Balancing Policies

Circuit breaker (configured at control plane, enforced at data plane):

Control Plane defines policy:
  PUT /v1/policies/circuit-breaker/user-service
  Body: {
    "error_threshold_percent": 50,     // open circuit if >50% errors
    "window_seconds": 10,              // in a 10-second window
    "min_requests": 20,                // need at least 20 requests to evaluate
    "open_duration_seconds": 30,       // stay open for 30 seconds
    "half_open_requests": 5            // allow 5 probe requests in half-open
  }

Sidecar enforces locally:
  State machine: CLOSED → OPEN → HALF_OPEN → CLOSED

  CLOSED (normal):
    → Track error rate in sliding window
    → If error_rate > 50% AND requests > 20: → OPEN

  OPEN (circuit tripped):
    → Reject all requests immediately (fail-fast, return 503)
    → After 30 seconds → HALF_OPEN

  HALF_OPEN (testing recovery):
    → Allow 5 requests through
    → If all succeed → CLOSED (service recovered)
    → If any fail → OPEN (still unhealthy, wait another 30 seconds)
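
The state machine above can be sketched as follows. For brevity this uses a simple cumulative counter rather than a true 10-second sliding window:

```python
class CircuitBreaker:
    """CLOSED → OPEN → HALF_OPEN → CLOSED sketch of the policy above."""

    def __init__(self, error_pct=50, min_requests=20,
                 open_secs=30, half_open_max=5):
        self.error_pct, self.min_requests = error_pct, min_requests
        self.open_secs, self.half_open_max = open_secs, half_open_max
        self.state, self.requests, self.errors = "CLOSED", 0, 0
        self.opened_at, self.probes, self.probe_failures = None, 0, 0

    def allow(self, now):
        if self.state == "OPEN":
            if now - self.opened_at >= self.open_secs:
                self.state, self.probes, self.probe_failures = "HALF_OPEN", 0, 0
            else:
                return False           # fail fast (caller returns 503)
        if self.state == "HALF_OPEN" and self.probes >= self.half_open_max:
            return False
        return True

    def record(self, success, now):
        if self.state == "HALF_OPEN":
            self.probes += 1
            if not success:
                self.probe_failures += 1
            if self.probe_failures:
                self.state, self.opened_at = "OPEN", now
            elif self.probes == self.half_open_max:
                self.state = "CLOSED"          # service recovered
            return
        self.requests += 1
        self.errors += 0 if success else 1
        if (self.requests >= self.min_requests
                and 100 * self.errors / self.requests > self.error_pct):
            self.state, self.opened_at = "OPEN", now
            self.requests = self.errors = 0

cb = CircuitBreaker()
for _ in range(20):
    cb.record(success=False, now=0.0)     # 100% errors over 20 requests
print(cb.state, cb.allow(now=5.0))        # OPEN False
print(cb.allow(now=35.0), cb.state)       # True HALF_OPEN
```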

Load balancing policies (control plane configurable):

Per-service policy configured in control plane, enforced by sidecars:

1. Round-robin (default):
   → Simple rotation across healthy instances
   → Good for homogeneous instances

2. Least connections:
   → Route to instance with fewest active requests
   → Better for heterogeneous latency (slow queries vs fast)

3. Weighted round-robin:
   → Instances with higher weight get more traffic
   → Used for canary deployments: new version weight=10, old version weight=90

4. Locality-aware:
   → Prefer instances in same AZ (lowest latency)
   → Fallback to same region, then cross-region
   → Configurable: "same-az: 80%, same-region: 15%, cross-region: 5%"

5. Consistent hashing:
   → Route requests with same key to same instance
   → Useful for stateful services or cache affinity
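
Of these policies, consistent hashing is the least obvious to implement. A minimal ring with virtual nodes might look like this (the MD5 hash and 100 vnodes are arbitrary illustrative choices):

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash routing: same key → same instance, with virtual
    nodes so load stays even when instances join or leave (sketch)."""

    def __init__(self, instances, vnodes=100):
        self.ring = sorted((_h(f"{inst}#{v}"), inst)
                           for inst in instances for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def route(self, key):
        # First ring point clockwise from the key's hash (wraps around).
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["10.0.1.42:8080", "10.0.1.43:8080", "10.0.1.44:8080"])
print(ring.route("user:12345") == ring.route("user:12345"))  # True (sticky)
```

Removing an instance only remaps the keys that were on it; everything else keeps its assignment, which is exactly the cache-affinity property the policy is for.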

Rolling deployment orchestration:

Deployment Controller state machine:
  1. Validate: target version image exists, health check endpoint responds
  2. Scale up: add max_surge% new instances (new version)
  3. Wait for readiness: new instances must pass readiness probe
  4. Shift traffic: update weights in load balancer (canary: 5% → 25% → 50% → 100%)
  5. Monitor: check error rate, latency, and guardrail metrics at each step
  6. If degradation detected:
     → Auto-rollback: shift traffic back to old version
     → Scale down new instances
     → Alert on-call
  7. Scale down: remove old version instances
  8. Complete: update service version in registry
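
The traffic-shifting and rollback logic of steps 4-6 can be sketched as a loop over canary stages; `set_weight` and `guardrails_ok` are hypothetical callbacks into the load balancer and the metrics system:

```python
def run_canary(set_weight, guardrails_ok, stages=(5, 25, 50, 100)):
    """Shift traffic to the new version in stages, checking guardrail
    metrics after each shift; on degradation, roll all traffic back."""
    for weight in stages:
        set_weight(weight)             # % of traffic to the new version
        if not guardrails_ok():
            set_weight(0)              # auto-rollback
            return "rolled_back"
    return "completed"

history = []
print(run_canary(history.append, lambda: True))          # completed
print(history)                                           # [5, 25, 50, 100]

history.clear()
checks = iter([True, False])                             # degrade at 25%
print(run_canary(history.append, lambda: next(checks)))  # rolled_back
print(history)                                           # [5, 25, 0]
```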

7. Extensions (2 min)

  • Service mesh integration: The control plane naturally extends into a full service mesh (like Istio/Linkerd). Add mTLS between services (automatic certificate rotation via control plane), distributed tracing injection, and request-level access policies.
  • Multi-cluster federation: Federate control planes across multiple Kubernetes clusters or data centers. A global control plane aggregates service registries from regional control planes. Services can discover and route to instances in other regions.
  • Chaos engineering integration: Control plane orchestrates chaos experiments — inject latency into specific routes, kill random instances, simulate network partitions. Validates that circuit breakers and fallbacks work correctly.
  • GitOps configuration: Store all configuration in a Git repository. Control plane watches the repo and auto-applies changes on merge. Provides audit trail, peer review, and easy rollback (git revert).
  • Adaptive rate limiting: Control plane monitors global request rates and dynamically adjusts per-service rate limits based on current capacity. During an incident, automatically shed load to protect core services.