1. Requirements & Scope (5 min)

Functional Requirements

  1. Service Discovery: Services register themselves on startup and discover other services by name. The registry reflects the current set of running instances within seconds of any change.
  2. Configuration Management: Centrally manage and distribute configuration to all services. Support versioning, rollback, and environment-specific overrides.
  3. Health Checking: Continuously monitor service health. Automatically remove unhealthy instances from the service registry. Support liveness and readiness probes.
  4. Rolling Deployments: Orchestrate zero-downtime deployments by gradually replacing old instances with new ones, with automatic rollback on failure.
  5. Load Balancing Policy: Define and enforce load balancing policies (round-robin, least-connections, weighted) and circuit breaking rules at the control plane level.

Non-Functional Requirements

  • Availability: 99.999% — if the control plane goes down, services can’t discover each other, configs can’t update, and deployments halt. Data plane must continue to function independently during control plane outages.
  • Latency: Service discovery lookups < 5ms. Config pushes reach all nodes within 30 seconds. Health check detection < 10 seconds.
  • Consistency: Service registry must be strongly consistent (a deregistered service must never receive traffic). Config updates must be atomically applied (no partial config states).
  • Scale: 50K service instances across 500 services. 10 regions. 1M service discovery lookups/sec. 100K config reads/sec.
  • Partition Tolerance: Control plane must handle network partitions gracefully. The data plane (actual service-to-service traffic) must continue even if the control plane is unreachable.

2. Estimation (3 min)

Service Registry

  • 500 services × 100 instances each = 50K registered instances
  • Each registration: ~1KB (service name, host, port, metadata, health status, version)
  • Total registry size: 50K × 1KB = 50MB — trivially fits in memory on every node
  • Registration/deregistration events: ~10K/hour (deploys, scaling, failures)

Service Discovery

  • 50K instances, each resolving other services ~20 times/sec = 1M lookups/sec
  • With local caching (refresh every 5-10 seconds): actual control plane queries = 50K instances / 5 sec = 10K queries/sec — very manageable

Health Checking

  • 50K instances, health checked every 5 seconds = 10K health checks/sec
  • Each health check: ~200 bytes response
  • Network: 10K × 200 bytes = 2MB/sec — trivial

Configuration

  • 500 services × 5 config keys each = 2,500 config entries
  • Total config data: 2,500 × 10KB average = 25MB
  • Config change events: ~50/day (most configs rarely change)
  • Config reads: 50K instances polling every 30 seconds ≈ 1,700 reads/sec

Key Insight

The control plane is a low-throughput, high-availability system. Data volumes are small (< 100MB total state). The hard problems are availability, consistency, and graceful degradation — not scale.
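
These estimates are simple enough to sanity-check in a few lines of Python (the constants are just the assumptions stated above):

```python
# Back-of-envelope check of the numbers above (all inputs are the
# assumptions stated in this section).
SERVICES = 500
INSTANCES = SERVICES * 100                       # 50K instances

registry_mb = INSTANCES * 1_024 / 1e6            # ~1KB per registration
lookups_per_sec = INSTANCES * 20                 # each resolves ~20x/sec
cp_queries_per_sec = INSTANCES // 5              # one cache refresh per 5s
health_checks_per_sec = INSTANCES // 5           # probed every 5 seconds
config_reads_per_sec = INSTANCES // 30           # poll every 30 seconds

print(round(registry_mb))                        # 51 (≈50 MB of state)
print(lookups_per_sec)                           # 1000000
print(cp_queries_per_sec, health_checks_per_sec) # 10000 10000
print(config_reads_per_sec)                      # 1666 (≈1,700 reads/sec)
```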


3. API Design (3 min)

Service Registration

POST /v1/services/{service_name}/instances
  Body: {
    "instance_id": "i-abc123",
    "host": "10.0.1.42",
    "port": 8080,
    "metadata": {
      "version": "2.3.1",
      "region": "us-east-1",
      "az": "us-east-1a",
      "canary": false
    },
    "health_check": {
      "type": "http",
      "path": "/healthz",
      "interval_seconds": 5,
      "timeout_seconds": 2,
      "unhealthy_threshold": 3
    }
  }
  Response 201: { "instance_id": "i-abc123", "lease_ttl": 30 }

// Heartbeat (renew lease)
PUT /v1/services/{service_name}/instances/{instance_id}/heartbeat
  Response 200: { "lease_ttl": 30 }

// If heartbeat not sent within lease_ttl → instance auto-deregistered
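
The heartbeat contract can be sketched as a client-side renewal loop. This is illustrative only — `send_heartbeat` stands in for the PUT call above, and renewing at one third of the TTL is a common convention, not something this API mandates:

```python
def heartbeat_loop(send_heartbeat, lease_ttl, stop, sleep):
    """Renew the lease well inside its TTL (every ttl/3) so that a couple
    of lost heartbeats don't cause a spurious deregistration.

    send_heartbeat() models PUT .../instances/{id}/heartbeat and returns
    the (possibly updated) lease TTL from the response.
    """
    while not stop():
        try:
            lease_ttl = send_heartbeat()
            interval = lease_ttl / 3
        except ConnectionError:
            interval = 1.0           # retry sooner; the lease has slack
        sleep(interval)

# Simulated run: three successful renewals, then stop.
beats = []
def fake_send():
    beats.append(1)
    return 30
heartbeat_loop(fake_send, 30, stop=lambda: len(beats) >= 3,
               sleep=lambda s: None)
print(len(beats))  # 3
```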

Service Discovery

GET /v1/services/{service_name}/instances
  Query: ?healthy=true&region=us-east-1&version=2.3.x
  Response: {
    "instances": [
      {"instance_id": "i-abc", "host": "10.0.1.42", "port": 8080, "weight": 100, ...},
      {"instance_id": "i-def", "host": "10.0.1.43", "port": 8080, "weight": 100, ...}
    ],
    "version": 42,            // for long-polling / watch
    "lb_policy": "least_connections"
  }

// Watch (long-poll for changes):
GET /v1/services/{service_name}/instances?watch=true&after_version=42
  → Hangs until version > 42, then returns new instance list
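
A client consumes the watch endpoint with a loop like the following sketch, where `long_poll` stands in for the blocking GET above:

```python
def watch_instances(long_poll, apply_update, start_version, stop):
    """Long-poll watch loop for GET ...?watch=true&after_version=N.

    long_poll(after_version) blocks until the registry version exceeds
    after_version, then returns (new_version, instance_list).
    """
    version = start_version
    while not stop():
        new_version, instances = long_poll(version)
        if new_version > version:          # ignore spurious wakeups
            apply_update(instances)
            version = new_version
    return version

# Simulated server that bumps the version once per call.
state = {"v": 42}
def fake_poll(after):
    state["v"] += 1
    return state["v"], [f"instance-at-v{state['v']}"]

seen = []
final = watch_instances(fake_poll, seen.append, 42,
                        stop=lambda: len(seen) >= 2)
print(final, seen)  # 44 [['instance-at-v43'], ['instance-at-v44']]
```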

Configuration

PUT /v1/config/{service_name}/{key}
  Body: { "value": "new_value", "comment": "increase timeout for holiday traffic" }
  Response 200: { "version": 15, "previous_value": "old_value" }

GET /v1/config/{service_name}
  Response: { "configs": { "db_pool_size": "20", "timeout_ms": "5000", ... }, "version": 15 }

// Watch for config changes:
GET /v1/config/{service_name}?watch=true&after_version=15

Deployment

POST /v1/deployments
  Body: {
    "service_name": "user-service",
    "target_version": "2.4.0",
    "strategy": "rolling",
    "max_unavailable": "25%",
    "max_surge": "25%",
    "health_check_grace_period_seconds": 30,
    "auto_rollback": true
  }
  Response 201: { "deployment_id": "dep_xyz", "status": "in_progress" }

GET /v1/deployments/{deployment_id}
  Response: {
    "status": "in_progress",       // in_progress, completed, rolling_back, failed
    "progress": { "desired": 100, "updated": 45, "available": 95, "unavailable": 5 },
    "events": [...]
  }

4. Data Model (3 min)

Service Registry (etcd / Consul KV)

// Stored as key-value pairs in etcd:
Key: /services/{service_name}/instances/{instance_id}
Value: {
  "host": "10.0.1.42",
  "port": 8080,
  "metadata": { "version": "2.3.1", "region": "us-east-1", ... },
  "health_status": "healthy",       // healthy, unhealthy, unknown
  "last_heartbeat": "2026-02-22T10:30:00Z",
  "registered_at": "2026-02-22T08:00:00Z"
}
Lease TTL: 30 seconds (auto-delete if heartbeat not renewed)

Configuration (etcd / Consul KV)

Key: /config/{service_name}/{key}
Value: {
  "value": "20",
  "version": 15,
  "updated_by": "[email protected]",
  "updated_at": "2026-02-22T09:00:00Z",
  "comment": "increase pool size for holiday traffic"
}
// etcd provides native versioning via revision numbers
// Watch API notifies subscribers of any change under a prefix

Deployment State (PostgreSQL — for durability and complex queries)

Table: deployments
  deployment_id    (PK) | uuid
  service_name           | varchar(100)
  target_version         | varchar(50)
  strategy               | enum('rolling', 'blue_green', 'canary')
  status                 | enum('pending', 'in_progress', 'completed', 'rolling_back', 'failed')
  desired_instances      | int
  config                 | jsonb  -- max_unavailable, max_surge, etc.
  started_at             | timestamp
  completed_at           | timestamp

Table: deployment_events
  event_id         (PK) | uuid
  deployment_id    (FK) | uuid
  event_type             | varchar(50)  -- instance_started, health_check_passed, rollback_triggered
  instance_id            | varchar(50)
  message                | text
  timestamp              | timestamp

Why These Choices

  • etcd for service registry and config: Purpose-built for this. Raft consensus for strong consistency. Watch API for real-time change notifications. TTL leases for automatic deregistration. Used by Kubernetes, proven at scale.
  • PostgreSQL for deployment state: Deployments are complex state machines that benefit from ACID transactions, complex queries (“show all failed deployments in the last 7 days”), and durability.
  • Not ZooKeeper: etcd has a simpler API (gRPC + REST), better performance for watches, and a more modern operational model. ZooKeeper requires managing sessions and ephemeral nodes — etcd leases are cleaner.

5. High-Level Design (12 min)

Architecture

                    ┌──────────────────────────────────────┐
                    │            Control Plane             │
                    │                                      │
                    │  ┌────────────┐    ┌────────────┐    │
                    │  │  Registry  │    │   Config   │    │
                    │  │  Service   │    │  Service   │    │
                    │  └─────┬──────┘    └─────┬──────┘    │
                    │        │                 │           │
                    │  ┌─────┴─────────────────┴────────┐  │
                    │  │          etcd Cluster          │  │
                    │  │      (3- or 5-node Raft)       │  │
                    │  └────────────────────────────────┘  │
                    │                                      │
                    │  ┌────────────┐    ┌────────────┐    │
                    │  │   Health   │    │ Deployment │    │
                    │  │  Checker   │    │ Controller │    │
                    │  └────────────┘    └────────────┘    │
                    └──────────────────┬───────────────────┘
                                       │
                    ───────────────────┼────────────────────
                                       │
                    ┌──────────────────┴───────────────────┐
                    │              Data Plane              │
                    │                                      │
                    │  Service A      Service B     ...    │
                    │  (sidecar)      (sidecar)            │
                    │  instance 1     instance 1           │
                    │  instance 2     instance 2           │
                    │  ...            ...                  │
                    └──────────────────────────────────────┘

Control Plane vs Data Plane Separation

Control Plane (our design):
  → Manages the "what should be" state
  → Service registry, config, health policy, deployment orchestration
  → Pushes state to data plane sidecars
  → Can go down briefly without affecting running traffic

Data Plane (sidecars / service mesh):
  → Handles actual service-to-service traffic
  → Uses locally cached routing tables from control plane
  → Makes load balancing and circuit breaking decisions locally
  → Continues to function with stale data during control plane outages

Service Discovery with Caching

Service A wants to call Service B:

1. Sidecar proxy on Service A checks local cache for Service B instances
2. If cache is fresh (< 10 seconds): route to cached instance list
3. If cache is stale: async refresh from control plane
   → GET /v1/services/service-b/instances?watch=true&after_version=42
   → Long-poll returns immediately if there's a new version
   → Update local cache
4. If control plane is unreachable: use stale cache (graceful degradation)
   → Log warning, alert ops
5. Load balance across healthy instances using configured policy
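
The cache-with-stale-fallback behavior in steps 1-4 can be sketched as follows (the `refresh` callback stands in for the control-plane watch API; the 10-second freshness window is the assumption from step 2):

```python
import time

class DiscoveryCache:
    """Sidecar-local cache of a service's instance list (sketch)."""

    def __init__(self, refresh, fresh_for=10.0, clock=time.monotonic):
        self.refresh, self.fresh_for, self.clock = refresh, fresh_for, clock
        self.instances, self.fetched_at = [], float("-inf")

    def lookup(self):
        age = self.clock() - self.fetched_at
        if age >= self.fresh_for:
            try:
                self.instances = self.refresh()
                self.fetched_at = self.clock()
            except ConnectionError:
                pass  # control plane unreachable: serve stale cache
        return self.instances

# Usage: first refresh succeeds, later the control plane is down.
now = [0.0]
cache = DiscoveryCache(lambda: ["10.0.1.42:8080"], clock=lambda: now[0])
print(cache.lookup())        # ['10.0.1.42:8080']
cache.refresh = lambda: (_ for _ in ()).throw(ConnectionError())
now[0] = 60.0                # cache is stale AND the refresh fails...
print(cache.lookup())        # ['10.0.1.42:8080']  (graceful degradation)
```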

Components

  1. etcd Cluster (3 or 5 nodes): Core state store. Raft consensus. Holds service registry and configuration. Provides watch API for change notifications.
  2. Registry Service: API layer over etcd for service registration/discovery. Manages lease renewal. Provides aggregated views (all instances of a service).
  3. Config Service: API layer for configuration management. Supports versioning, rollback, environment overrides. Validates configs before applying.
  4. Health Checker: Actively probes all registered instances. Updates health status in registry. Supports HTTP, TCP, and gRPC health checks.
  5. Deployment Controller: Orchestrates rolling deployments. State machine managing deployment lifecycle. Integrates with health checker for readiness gates.
  6. Sidecar Proxy (data plane): Runs alongside each service instance. Caches service registry and config locally. Handles load balancing, circuit breaking, retries, and mTLS.
  7. Admin Dashboard: Visualize service topology, health status, config diffs, deployment progress.

6. Deep Dives (15 min)

Deep Dive 1: Leader Election and Consensus

Why it matters: The control plane itself is distributed (3-5 nodes). Only one node should run the Health Checker and Deployment Controller at a time (to avoid duplicate health checks and conflicting deployment decisions). This requires leader election.

etcd-based leader election:

Leader Election Protocol:
  1. Each control plane node creates a lease in etcd (TTL = 15 seconds)
  2. Attempt to create a key with the lease:
     PUT /election/health-checker
     Value: "node-1"
     LeaseID: lease_abc
     IfNotExists: true    // only succeeds if key doesn't exist

  3. If successful → this node is the leader
     → Start running the Health Checker
     → Renew lease every 5 seconds (keepalive)

  4. If key already exists → this node is a follower
     → Watch the key for deletion
     → On deletion (leader's lease expired) → attempt to acquire

  5. If leader crashes → lease expires in 15 seconds → key deleted
     → Follower nodes race to acquire
     → Exactly one wins (etcd's atomic compare-and-swap)
     → New leader starts within 15-20 seconds of failure
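
The acquire step is an atomic create-if-absent. The sketch below models it with an in-memory stand-in for etcd; a real client would express the same operation as an etcd transaction whose compare clause checks `create_revision == 0`, with the key attached to a lease:

```python
class FakeEtcd:
    """In-memory stand-in for etcd's compare-and-swap create (sketch)."""
    def __init__(self):
        self.kv = {}

    def put_if_absent(self, key, value):
        if key in self.kv:          # atomic in real etcd (single txn)
            return False
        self.kv[key] = value
        return True

    def delete(self, key):
        self.kv.pop(key, None)      # models the leader's lease expiring

def campaign(etcd, key, node_id):
    return etcd.put_if_absent(key, node_id)   # True → this node leads

etcd = FakeEtcd()
print(campaign(etcd, "/election/health-checker", "node-1"))  # True
print(campaign(etcd, "/election/health-checker", "node-2"))  # False
etcd.delete("/election/health-checker")                      # lease expired
print(campaign(etcd, "/election/health-checker", "node-2"))  # True
```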

Raft consensus within etcd:

etcd uses Raft for internal replication:
  → 5-node cluster tolerates 2 node failures
  → Writes require majority (3/5) agreement
  → Linearizable reads are confirmed through the leader; serializable (possibly stale) reads can be served by any node

Write path:
  Client → etcd Leader → replicate to followers (2 of 4, for a 3/5 majority) → commit → ack to client
  Latency: ~5-10ms (within same region)

In a network partition:
  → Partition with majority (3+ nodes) continues to operate
  → Partition with minority (2 nodes) rejects writes
  → On heal: minority nodes catch up from leader's log

Multiple leader roles (limit the blast radius of a leader failure):

Instead of one "leader" for everything:
  → Health Checker Leader: runs health probes
  → Deployment Controller Leader: manages deployments
  → Config Sync Leader: pushes config to edge caches

Each role has independent leader election.
If node-1 is Health Checker leader and crashes:
  → node-2 takes over Health Checker
  → node-3 remains Deployment Controller leader (unaffected)

Deep Dive 2: Health Checking and Failure Detection

Challenge: Distinguishing between “service is down” and “network is unreliable.” A health check failure from one checker doesn’t mean the service is unhealthy — the checker’s network might be the problem.

Multi-probe health checking:

Health Check Architecture:
  → 3 health checker instances (in different AZs)
  → Each checks every instance every 5 seconds
  → Instance is marked unhealthy only if 2/3 checkers agree

Decision matrix:
  Checker A: healthy   | Checker B: unhealthy | Checker C: healthy   → HEALTHY   (1 checker has a network issue)
  Checker A: unhealthy | Checker B: unhealthy | Checker C: healthy   → UNHEALTHY (2/3 agree)
  Checker A: unhealthy | Checker B: unhealthy | Checker C: unhealthy → UNHEALTHY (all agree)
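
The decision matrix reduces to a small quorum function; a sketch:

```python
def quorum_health(votes, quorum=2):
    """Mark an instance unhealthy only if at least `quorum` of the
    independent checkers (one per AZ) report it unhealthy."""
    unhealthy_votes = sum(1 for v in votes if v == "unhealthy")
    return "unhealthy" if unhealthy_votes >= quorum else "healthy"

print(quorum_health(["healthy", "unhealthy", "healthy"]))      # healthy
print(quorum_health(["unhealthy", "unhealthy", "healthy"]))    # unhealthy
print(quorum_health(["unhealthy", "unhealthy", "unhealthy"]))  # unhealthy
```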

Health check types:

1. Liveness probe (is the process alive?):
   → TCP connect to port
   → HTTP GET /healthz → expects 200
   → Failure → restart the instance

2. Readiness probe (can it serve traffic?):
   → HTTP GET /ready → checks DB connection, cache connection, etc.
   → Failure → remove from load balancer (but don't restart)
   → Used during startup (warm-up period) and during degraded states

3. Deep health check (is it performing well?):
   → HTTP GET /health/deep → checks latency percentiles, error rates
   → Returns degraded state if p99 > threshold
   → Used for load shedding (reduce traffic to degraded instances)

Graceful removal and drain:

When an instance is detected as unhealthy:
  1. Mark as "draining" in service registry (timestamp: now)
  2. Stop sending NEW requests to this instance
  3. Wait for in-flight requests to complete (drain period: 30 seconds)
  4. After drain: mark as "unhealthy"
  5. If health recovers during drain: cancel removal, mark healthy again

This prevents abrupt connection drops for in-flight requests.
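
The drain sequence above is a small state machine; a sketch, with times passed in explicitly so it is easy to test:

```python
class InstanceState:
    """healthy → draining → unhealthy, with recovery cancelling the drain
    (sketch; times are seconds from an arbitrary epoch)."""

    def __init__(self, drain_period=30.0):
        self.status, self.drain_period = "healthy", drain_period
        self.draining_since = None

    def on_unhealthy(self, now):
        if self.status == "healthy":
            self.status, self.draining_since = "draining", now

    def on_recovered(self):
        if self.status == "draining":          # cancel removal
            self.status, self.draining_since = "healthy", None

    def tick(self, now):
        if (self.status == "draining"
                and now - self.draining_since >= self.drain_period):
            self.status = "unhealthy"

s = InstanceState()
s.on_unhealthy(now=0.0)
s.tick(now=10.0)
print(s.status)   # draining (still within the 30s drain period)
s.tick(now=31.0)
print(s.status)   # unhealthy
```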

Deep Dive 3: Circuit Breaking and Load Balancing Policies

Circuit breaker (configured at control plane, enforced at data plane):

Control Plane defines policy:
  PUT /v1/policies/circuit-breaker/user-service
  Body: {
    "error_threshold_percent": 50,     // open circuit if >50% errors
    "window_seconds": 10,              // in a 10-second window
    "min_requests": 20,                // need at least 20 requests to evaluate
    "open_duration_seconds": 30,       // stay open for 30 seconds
    "half_open_requests": 5            // allow 5 probe requests in half-open
  }

Sidecar enforces locally:
  State machine: CLOSED → OPEN → HALF_OPEN → CLOSED

  CLOSED (normal):
    → Track error rate in sliding window
    → If error_rate > 50% AND requests > 20: → OPEN

  OPEN (circuit tripped):
    → Reject all requests immediately (fail-fast, return 503)
    → After 30 seconds → HALF_OPEN

  HALF_OPEN (testing recovery):
    → Allow 5 requests through
    → If all succeed → CLOSED (service recovered)
    → If any fail → OPEN (still unhealthy, wait another 30 seconds)
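
The state machine above can be sketched as follows. For brevity this uses a simple cumulative counter rather than a true 10-second sliding window:

```python
class CircuitBreaker:
    """CLOSED → OPEN → HALF_OPEN → CLOSED sketch of the policy above."""

    def __init__(self, error_pct=50, min_requests=20,
                 open_secs=30, half_open_max=5):
        self.error_pct, self.min_requests = error_pct, min_requests
        self.open_secs, self.half_open_max = open_secs, half_open_max
        self.state, self.requests, self.errors = "CLOSED", 0, 0
        self.opened_at, self.probes, self.probe_failures = None, 0, 0

    def allow(self, now):
        if self.state == "OPEN":
            if now - self.opened_at >= self.open_secs:
                self.state, self.probes, self.probe_failures = "HALF_OPEN", 0, 0
            else:
                return False           # fail fast (caller returns 503)
        if self.state == "HALF_OPEN" and self.probes >= self.half_open_max:
            return False
        return True

    def record(self, success, now):
        if self.state == "HALF_OPEN":
            self.probes += 1
            if not success:
                self.probe_failures += 1
            if self.probe_failures:
                self.state, self.opened_at = "OPEN", now
            elif self.probes == self.half_open_max:
                self.state = "CLOSED"          # service recovered
            return
        self.requests += 1
        self.errors += 0 if success else 1
        if (self.requests >= self.min_requests
                and 100 * self.errors / self.requests > self.error_pct):
            self.state, self.opened_at = "OPEN", now
            self.requests = self.errors = 0

cb = CircuitBreaker()
for _ in range(20):
    cb.record(success=False, now=0.0)     # 100% errors over 20 requests
print(cb.state, cb.allow(now=5.0))        # OPEN False
print(cb.allow(now=35.0), cb.state)       # True HALF_OPEN
```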

Load balancing policies (control plane configurable):

Per-service policy configured in control plane, enforced by sidecars:

1. Round-robin (default):
   → Simple rotation across healthy instances
   → Good for homogeneous instances

2. Least connections:
   → Route to instance with fewest active requests
   → Better for heterogeneous latency (slow queries vs fast)

3. Weighted round-robin:
   → Instances with higher weight get more traffic
   → Used for canary deployments: new version weight=10, old version weight=90

4. Locality-aware:
   → Prefer instances in same AZ (lowest latency)
   → Fallback to same region, then cross-region
   → Configurable: "same-az: 80%, same-region: 15%, cross-region: 5%"

5. Consistent hashing:
   → Route requests with same key to same instance
   → Useful for stateful services or cache affinity
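
Of these policies, consistent hashing is the least obvious to implement. A minimal ring with virtual nodes might look like this (the MD5 hash and 100 vnodes are arbitrary illustrative choices):

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash routing: same key → same instance, with virtual
    nodes so load stays even when instances join or leave (sketch)."""

    def __init__(self, instances, vnodes=100):
        self.ring = sorted((_h(f"{inst}#{v}"), inst)
                           for inst in instances for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def route(self, key):
        # First ring point clockwise from the key's hash (wraps around).
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["10.0.1.42:8080", "10.0.1.43:8080", "10.0.1.44:8080"])
print(ring.route("user:12345") == ring.route("user:12345"))  # True (sticky)
```

Removing an instance only remaps the keys that were on it; everything else keeps its assignment, which is exactly the cache-affinity property the policy is for.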

Rolling deployment orchestration:

Deployment Controller state machine:
  1. Validate: target version image exists, health check endpoint responds
  2. Scale up: add max_surge% new instances (new version)
  3. Wait for readiness: new instances must pass readiness probe
  4. Shift traffic: update weights in load balancer (canary: 5% → 25% → 50% → 100%)
  5. Monitor: check error rate, latency, and guardrail metrics at each step
  6. If degradation detected:
     → Auto-rollback: shift traffic back to old version
     → Scale down new instances
     → Alert on-call
  7. Scale down: remove old version instances
  8. Complete: update service version in registry
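
The traffic-shifting and rollback logic of steps 4-6 can be sketched as a loop over canary stages; `set_weight` and `guardrails_ok` are hypothetical callbacks into the load balancer and the metrics system:

```python
def run_canary(set_weight, guardrails_ok, stages=(5, 25, 50, 100)):
    """Shift traffic to the new version in stages, checking guardrail
    metrics after each shift; on degradation, roll all traffic back."""
    for weight in stages:
        set_weight(weight)             # % of traffic to the new version
        if not guardrails_ok():
            set_weight(0)              # auto-rollback
            return "rolled_back"
    return "completed"

history = []
print(run_canary(history.append, lambda: True))          # completed
print(history)                                           # [5, 25, 50, 100]

history.clear()
checks = iter([True, False])                             # degrade at 25%
print(run_canary(history.append, lambda: next(checks)))  # rolled_back
print(history)                                           # [5, 25, 0]
```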

7. Extensions (2 min)

  • Service mesh integration: The control plane naturally extends into a full service mesh (like Istio/Linkerd). Add mTLS between services (automatic certificate rotation via control plane), distributed tracing injection, and request-level access policies.
  • Multi-cluster federation: Federate control planes across multiple Kubernetes clusters or data centers. A global control plane aggregates service registries from regional control planes. Services can discover and route to instances in other regions.
  • Chaos engineering integration: Control plane orchestrates chaos experiments — inject latency into specific routes, kill random instances, simulate network partitions. Validates that circuit breakers and fallbacks work correctly.
  • GitOps configuration: Store all configuration in a Git repository. Control plane watches the repo and auto-applies changes on merge. Provides audit trail, peer review, and easy rollback (git revert).
  • Adaptive rate limiting: Control plane monitors global request rates and dynamically adjusts per-service rate limits based on current capacity. During an incident, automatically shed load to protect core services.