1. Requirements & Scope (5 min)

Functional Requirements

  1. Authentication: Support email/password login, OAuth 2.0/OIDC (Google, GitHub, etc.), and multi-factor authentication (TOTP, SMS, WebAuthn/passkeys)
  2. Session Management: Issue, validate, refresh, and revoke sessions. Support “remember me” (long-lived) and “sign out everywhere” (revoke all sessions).
  3. Authorization: Role-Based Access Control (RBAC) with hierarchical roles and Attribute-Based Access Control (ABAC) for fine-grained policies
  4. Single Sign-On (SSO): Act as an identity provider (IdP) — users authenticate once and access multiple applications without re-authenticating
  5. Account security: Rate limit login attempts, detect credential stuffing, support password reset flows, and enforce password policies

Non-Functional Requirements

  • Availability: 99.999% — if auth is down, no user can log into any application. This is the single most critical shared service.
  • Latency: Login < 500ms (includes password hashing). Token validation < 5ms (must be in the hot path of every API call). Authorization check < 10ms.
  • Security: Passwords hashed with Argon2id (memory-hard). Tokens encrypted in transit (TLS 1.3) and at rest. No plaintext secrets in logs. PII encrypted at rest.
  • Scale: 500M registered users. 50M daily logins. 10B token validations/day (every API call). 1B authorization checks/day.
  • Compliance: GDPR (right to delete, data portability), SOC 2, support for data residency requirements.

2. Estimation (3 min)

Authentication Traffic

  • 50M logins/day = 580 logins/sec average, 5K/sec peak (morning login surge)
  • Each login: password hash verification (Argon2id: ~250ms CPU per attempt) + session creation
  • CPU: 580 logins/sec × 250ms = 145 CPU-seconds/sec → ~150 CPU cores on average just for password hashing; the 5K/sec peak needs ~1,250 cores (5,000 × 0.25s)

Token Validation

  • 10B validations/day = 115K validations/sec
  • JWT validation: verify signature + check expiry + decode claims = < 0.1ms (no network call)
  • With opaque tokens: Redis lookup = ~1ms per validation

Authorization

  • 1B checks/day = 11.5K checks/sec
  • Each check: look up user’s roles/permissions, evaluate policy rules

Storage

  • User accounts: 500M × 2KB = 1TB
  • Sessions: 200M active sessions × 500 bytes = 100GB (fits in Redis)
  • Roles/permissions: 1000 roles × 100 permissions = 100K entries — trivial
  • Audit logs: 50M logins/day × 500 bytes = 25GB/day, 9TB/year
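
The arithmetic above can be reproduced with a few lines of Python. This is only a sanity-check sketch using the figures already stated in this section; the variable names are illustrative, and decimal units (1 KB = 1,000 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


# Traffic (average rates)
logins_per_sec = per_second(50e6)        # about 580
validations_per_sec = per_second(10e9)   # about 115K
authz_per_sec = per_second(1e9)          # about 11.5K

# Password hashing CPU: each verification costs ~250 ms of CPU time
HASH_CPU_SECONDS = 0.250
avg_hash_cores = logins_per_sec * HASH_CPU_SECONDS   # about 145 cores
peak_hash_cores = 5_000 * HASH_CPU_SECONDS           # 1,250 cores at the 5K/sec peak

# Storage
users_tb = 500e6 * 2_000 / 1e12                      # 1 TB of user accounts
sessions_gb = 200e6 * 500 / 1e9                      # 100 GB of active sessions
audit_gb_per_day = 50e6 * 500 / 1e9                  # 25 GB/day of audit logs
audit_tb_per_year = audit_gb_per_day * 365 / 1e3     # about 9 TB/year
```

The gap between the average and peak core counts is why password hashing gets its own worker pool rather than sharing capacity with the token path.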

Key Insight

Token validation is the hottest path (115K/sec). It MUST be local (no network call) → JWT with local signature verification. Password hashing is CPU-intensive → dedicated worker pool with rate limiting. Authorization checks need low latency → cache policies locally, evaluate in-process.


3. API Design (3 min)

Authentication

// Email/password login
POST /auth/login
  Body: { "email": "user@example.com", "password": "..." }
  Response 200: {
    "access_token": "eyJ...",           // JWT, short-lived (15 min)
    "refresh_token": "rt_abc123...",    // opaque, long-lived (30 days)
    "token_type": "Bearer",
    "expires_in": 900,
    "user": { "id": "u_123", "email": "...", "roles": ["admin"] }
  }
  Response 401: { "error": "invalid_credentials" }
  Response 429: { "error": "too_many_attempts", "retry_after": 300 }

// MFA challenge (returned when MFA is enabled)
  Response 200: {
    "mfa_required": true,
    "mfa_token": "mfa_xyz...",          // temporary token for MFA flow
    "mfa_methods": ["totp", "webauthn"]
  }

POST /auth/mfa/verify
  Body: { "mfa_token": "mfa_xyz...", "method": "totp", "code": "123456" }
  Response 200: { "access_token": "eyJ...", "refresh_token": "rt_..." }

// OAuth 2.0 / OIDC
GET /auth/oauth/authorize?provider=google&redirect_uri=...&state=...
  → Redirects to Google's consent screen
GET /auth/oauth/callback?code=...&state=...
  → Exchanges code for tokens, creates/links account, returns our tokens

// Token refresh
POST /auth/token/refresh
  Body: { "refresh_token": "rt_abc123..." }
  Response 200: { "access_token": "eyJ...", "refresh_token": "rt_new..." }
  // Refresh token rotation: old refresh token invalidated

// Logout
POST /auth/logout
  Body: { "refresh_token": "rt_abc123..." }
  → Revokes refresh token and associated sessions
POST /auth/logout-all
  → Revokes ALL refresh tokens and sessions for the user

Authorization

// Check permission (called by API gateway or services)
POST /authz/check
  Body: {
    "subject": "u_123",
    "action": "documents:write",
    "resource": "doc_456",
    "context": { "ip": "10.0.1.1", "time": "2026-02-22T14:00:00Z" }
  }
  Response 200: { "allowed": true, "reason": "role:editor grants documents:write" }

// Batch check (multiple permissions at once)
POST /authz/check-batch
  Body: {
    "subject": "u_123",
    "checks": [
      { "action": "documents:write", "resource": "doc_456" },
      { "action": "documents:delete", "resource": "doc_456" },
      { "action": "admin:users:list" }
    ]
  }
  Response 200: {
    "results": [
      { "allowed": true },
      { "allowed": false, "reason": "no permission: documents:delete" },
      { "allowed": false, "reason": "role:editor does not include admin:*" }
    ]
  }

Key Decisions

  • JWT for access tokens: Validated locally without a network call. Contains user ID, roles, and scopes. Short-lived (15 min) to limit blast radius of token theft.
  • Opaque refresh tokens: Stored server-side. Enables revocation. Rotated on each use (one-time use tokens to detect token theft).
  • Separate auth and authz endpoints: Authentication (who are you?) and authorization (can you do this?) are independently scalable concerns.
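
Refresh token rotation with theft detection can be sketched as follows. This is a minimal in-memory stand-in for the server-side store, not a definitive implementation: the class name, method names, and the rule "reuse of an already-rotated token revokes the whole family" are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import secrets


class RefreshTokenStore:
    """In-memory stand-in for the server-side refresh token store (hypothetical)."""

    def __init__(self) -> None:
        self._active: dict[str, dict] = {}   # token_hash -> {"user_id", "family"}
        self._used: dict[str, str] = {}      # token_hash -> family, for rotated tokens
        self._revoked_families: set[str] = set()

    @staticmethod
    def _hash(token: str) -> str:
        # Only a hash of the token is stored, never the token itself.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: str, family: str | None = None) -> str:
        token = "rt_" + secrets.token_urlsafe(32)
        family = family or "fam_" + secrets.token_hex(8)
        self._active[self._hash(token)] = {"user_id": user_id, "family": family}
        return token

    def rotate(self, token: str) -> str | None:
        """Exchange a refresh token for a new one; return None if it is rejected."""
        token_hash = self._hash(token)
        if token_hash in self._used:
            # A one-time token was presented twice: assume theft, revoke the family.
            self._revoke_family(self._used[token_hash])
            return None
        record = self._active.pop(token_hash, None)
        if record is None or record["family"] in self._revoked_families:
            return None
        self._used[token_hash] = record["family"]
        return self.issue(record["user_id"], record["family"])

    def _revoke_family(self, family: str) -> None:
        self._revoked_families.add(family)
        self._active = {
            h: r for h, r in self._active.items() if r["family"] != family
        }
```

If an attacker and the legitimate client both hold the same refresh token, whichever of them rotates second triggers the reuse branch, and every token descended from that login stops working.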

4. Data Model (3 min)

Users (PostgreSQL)

Table: users
  user_id          (PK) | uuid
  email            (UQ) | varchar(255) -- encrypted at rest
  email_hash       (UQ) | varchar(64)  -- for lookups without decrypting
  password_hash          | varchar(255) -- Argon2id hash
  mfa_enabled            | boolean
  mfa_secret             | bytea (encrypted) -- TOTP secret
  status                 | enum('active', 'locked', 'suspended', 'deleted')
  failed_login_count     | int
  locked_until           | timestamp
  created_at             | timestamp
  updated_at             | timestamp

Table: oauth_accounts
  id               (PK) | uuid
  user_id          (FK) | uuid
  provider               | varchar(50) -- google, github, etc.
  provider_user_id       | varchar(255)
  access_token           | bytea (encrypted)
  refresh_token          | bytea (encrypted)
  UNIQUE(provider, provider_user_id)

Table: webauthn_credentials
  credential_id    (PK) | bytea
  user_id          (FK) | uuid
  public_key             | bytea
  sign_count             | int
  created_at             | timestamp

Sessions and Tokens (Redis + PostgreSQL)

// Redis (fast lookups, TTL-based expiry)
Key: session:{session_id}
Value: { "user_id": "u_123", "created_at": "...", "ip": "...", "user_agent": "..." }
TTL: 30 days (or until explicit logout)

Key: refresh_token:{token_hash}
Value: { "user_id": "u_123", "session_id": "s_abc", "family": "fam_1" }
TTL: 30 days

// PostgreSQL (audit trail, "sign out everywhere")
Table: sessions
  session_id       (PK) | uuid
  user_id          (FK) | uuid
  refresh_token_family   | uuid  -- for rotation detection
  ip_address             | inet
  user_agent             | varchar(500)
  created_at             | timestamp
  last_used_at           | timestamp
  revoked_at             | timestamp

Roles and Permissions (PostgreSQL)

Table: roles
  role_id          (PK) | uuid
  name             (UQ) | varchar(100) -- admin, editor, viewer
  parent_role_id   (FK) | uuid (nullable) -- role hierarchy
  description            | text

Table: permissions
  permission_id    (PK) | uuid
  name             (UQ) | varchar(200) -- documents:write, admin:users:list
  description            | text

Table: role_permissions
  role_id          (FK) | uuid
  permission_id    (FK) | uuid
  PRIMARY KEY: (role_id, permission_id)

Table: user_roles
  user_id          (FK) | uuid
  role_id          (FK) | uuid
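  resource_scope         | varchar(200) NOT NULL DEFAULT 'global' -- 'global'=unscoped, "org:123"=scoped to org
  granted_at             | timestamp
  granted_by             | uuid
  PRIMARY KEY: (user_id, role_id, resource_scope) -- PK columns cannot be NULL in PostgreSQL, hence the 'global' sentinel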

Why These Choices

  • PostgreSQL for users and RBAC: ACID for critical operations (user creation, role assignment). Complex queries (find all users with permission X). Mature encryption-at-rest support.
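  • Redis for sessions/tokens: Sub-millisecond lookups. TTL for automatic expiry. Supports mass revocation, provided a per-user index of session and refresh-token keys is kept (for example, a set per user_id), because the keys above are addressed by session_id and token_hash rather than user_id.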
  • Argon2id for passwords: Memory-hard (resists GPU/ASIC attacks), tunable (can increase memory/time as hardware improves), recommended by OWASP. Parameters: memory=64MB, iterations=3, parallelism=4.

5. High-Level Design (12 min)

Authentication Flow (Email/Password + MFA)

Client → API Gateway → Auth Service:
  1. Rate limit check (Redis): user IP + email → under threshold?
     → If exceeded: return 429 (locked for 5 minutes)
  2. Look up user by email_hash
     → If not found: return 401, after running a dummy hash verification so the response time matches the wrong-password path (prevents user enumeration)
  3. Verify password: Argon2id.verify(password, stored_hash)
     → If wrong: increment failed_login_count, return 401
     → If failed_count >= 5: lock account for 15 minutes
  4. If MFA enabled:
     → Generate temporary MFA token (stored in Redis, TTL 5 min)
     → Return mfa_required response
     → Client submits MFA code → verify TOTP or WebAuthn
  5. Create session:
     → Generate session_id, refresh_token
     → Store in Redis + PostgreSQL
  6. Generate JWT access token:
     → Payload: { sub: user_id, roles: [...], exp: now+15min, iss: "auth-service" }
     → Sign with RS256 (RSA private key, rotated every 90 days)
  7. Return tokens to client
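
Steps 2 and 3 of the flow above might look roughly like this. It is a sketch under stated assumptions: PBKDF2 from the standard library stands in for Argon2id (which needs a third-party package), the `users` dict stands in for the users table, and all function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
import os
import time

ITERATIONS = 50_000        # PBKDF2 cost; a stand-in for Argon2id parameters
MAX_FAILURES = 5           # lock after 5 failed attempts
LOCK_SECONDS = 15 * 60     # lock duration: 15 minutes

users: dict[str, dict] = {}  # stand-in for the users table, keyed by email


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


# Precomputed dummy credentials so unknown emails cost the same as known ones.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("dummy-password", _DUMMY_SALT)


def register(email: str, password: str) -> None:
    salt = os.urandom(16)
    users[email] = {
        "salt": salt,
        "hash": hash_password(password, salt),
        "failed_login_count": 0,
        "locked_until": 0.0,
    }


def login(email: str, password: str, now: float | None = None) -> str:
    now = time.time() if now is None else now
    user = users.get(email)
    if user is None:
        # Burn the same hashing work so timing does not reveal account existence.
        hmac.compare_digest(hash_password(password, _DUMMY_SALT), _DUMMY_HASH)
        return "invalid_credentials"
    if now < user["locked_until"]:
        return "locked"
    candidate = hash_password(password, user["salt"])
    if not hmac.compare_digest(candidate, user["hash"]):
        user["failed_login_count"] += 1
        if user["failed_login_count"] >= MAX_FAILURES:
            user["locked_until"] = now + LOCK_SECONDS
            user["failed_login_count"] = 0
        return "invalid_credentials"
    user["failed_login_count"] = 0
    return "ok"
```

Rate limiting, MFA, and session creation (steps 1 and 4 to 7) are left out to keep the sketch focused on the credential check.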

Token Validation (every API call)

Client sends: Authorization: Bearer eyJ...

API Gateway (or service middleware):
  1. Decode JWT header → get kid (key ID)
  2. Look up public key from local cache (JWKS)
     → Cache refreshed every 5 minutes from GET /auth/.well-known/jwks.json
  3. Verify signature (RS256) → tamper-proof
  4. Check exp claim → not expired?
  5. Check iss claim → issued by our auth service?
  6. Extract claims (user_id, roles, scopes)
  → NO NETWORK CALL NEEDED — pure CPU, < 0.1ms

If token is expired:
  → Client calls /auth/token/refresh with refresh_token
  → Auth service validates refresh_token in Redis
  → Issues new access_token + new refresh_token (rotation)
  → Invalidates old refresh_token
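
The validation steps above can be sketched with the standard library alone. One substitution is made and should be read as such: HS256 (HMAC with a shared secret) stands in for RS256, because RSA verification needs a third-party library; the step order and claim checks are otherwise the same. The `keys` dict stands in for the cached JWKS, and the function names are hypothetical.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time


def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, key: bytes, kid: str) -> str:
    """Demo-only issuer so the validator below has something to check."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = ".".join(
        b64url_encode(json.dumps(part).encode("utf-8")) for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def validate_token(token: str, keys: dict[str, bytes], issuer: str,
                   now: float | None = None) -> dict | None:
    """Return the claims if the token is valid, otherwise None."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        # Steps 1-2: pick the key by kid from the local cache; pin the algorithm.
        key = keys.get(header["kid"])
        if key is None or header.get("alg") != "HS256":
            return None
        # Step 3: verify the signature over "header.payload" in constant time.
        expected = hmac.new(
            key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
            return None
        claims = json.loads(b64url_decode(payload_b64))
        # Steps 4-5: reject expired tokens and tokens from another issuer.
        if claims["exp"] <= now or claims["iss"] != issuer:
            return None
    except (ValueError, KeyError, TypeError, AttributeError):
        return None
    # Step 6: the caller reads user_id, roles, and scopes from the claims.
    return claims
```

Pinning the accepted algorithm on the validator side, rather than trusting the `alg` header, is a deliberate choice here: it closes off tokens that claim a weaker or different algorithm.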

Authorization Check

Service or API Gateway → Authorization Service:
  1. Load user's roles (cached in Redis, TTL 5 min):
     → Direct roles + inherited roles (role hierarchy)
     → e.g., admin inherits editor inherits viewer
  2. Load role's permissions (cached, rarely changes)
  3. Evaluate RBAC:
     → Does any of user's roles grant the requested permission?
  4. If ABAC policies exist:
     → Evaluate attribute-based rules:
       "allow documents:write IF user.department == resource.department
        AND time.hour BETWEEN 9 AND 17"
  5. Return allow/deny with reason
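
A compact sketch of the check above, assuming the role hierarchy and the example ABAC rule from this section. The in-memory dicts stand in for the cached roles and permissions, and every name here is illustrative rather than part of a real policy engine such as OPA.

```python
from __future__ import annotations

# Role hierarchy: each role points at the role it inherits from.
ROLE_PARENT = {"admin": "editor", "editor": "viewer", "viewer": None}

ROLE_PERMISSIONS = {
    "viewer": {"documents:read"},
    "editor": {"documents:write"},
    "admin": {"admin:users:list"},
}


def effective_roles(direct_roles: list[str]) -> set[str]:
    """Expand direct roles with everything they inherit (step 1)."""
    seen: set[str] = set()
    for role in direct_roles:
        while role is not None and role not in seen:
            seen.add(role)
            role = ROLE_PARENT.get(role)
    return seen


def documents_write_rule(user: dict, resource: dict, context: dict) -> bool:
    # "allow documents:write IF user.department == resource.department
    #  AND time.hour BETWEEN 9 AND 17" (BETWEEN is inclusive)
    return (user["department"] == resource["department"]
            and 9 <= context["hour"] <= 17)


ABAC_RULES = {"documents:write": documents_write_rule}


def check(user: dict, action: str, resource: dict, context: dict) -> tuple[bool, str]:
    """RBAC first (steps 2-3), then any ABAC rule for the action (step 4)."""
    for role in sorted(effective_roles(user["roles"])):
        if action in ROLE_PERMISSIONS.get(role, set()):
            rule = ABAC_RULES.get(action)
            if rule is not None and not rule(user, resource, context):
                return False, f"abac rule denied {action}"
            return True, f"role:{role} grants {action}"
    return False, f"no permission: {action}"
```

For example, a user holding only `admin` is allowed `documents:write` through the inherited `editor` role, but only when the department and time-of-day attributes also match.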

Components

  1. Auth Service: Handles login, registration, password reset, MFA, OAuth flows. Stateless (all state in DB/Redis). Horizontally scaled.
  2. Token Service: Issues and refreshes JWTs. Manages signing keys. Publishes JWKS endpoint. Key rotation every 90 days.
  3. Authorization Service: Evaluates RBAC/ABAC policies. Caches user roles and permissions locally. Policy engine (OPA or custom).
  4. Session Store (Redis Cluster): Active sessions and refresh tokens. Supports mass revocation.
  5. User Database (PostgreSQL): User accounts, credentials, OAuth links, roles, permissions. Encrypted at rest.
  6. Audit Log Service: Records all auth events (logins, failures, permission changes). Immutable append-only log. Required for compliance.
  7. Brute Force Protection: Rate limiting (Redis), account lockout, CAPTCHA integration, IP reputation.

6. Deep Dives (15 min)

Deep Dive 1: JWT vs Opaque Tokens — Trade-offs and Revocation

JWT (JSON Web Token):

Structure: header.payload.signature (Base64URL encoded)

Header:  { "alg": "RS256", "kid": "key-2026-02" }
Payload: {
  "sub": "u_123",
  "email": "user@example.com",
  "roles": ["editor"],
  "scopes": ["read", "write"],
  "iss": "auth.company.com",
  "aud": "api.company.com",
  "iat": 1708617600,
  "exp": 1708618500     // 15 minutes from iat
}
Signature: RS256(header + "." + payload, private_key)

Pros of JWT:

  • Validation requires no network call (just verify signature with public key)
  • Self-contained: carries user claims, reducing DB lookups
  • Stateless: validation capacity scales horizontally with the number of validating nodes, with no shared store on the read path

Cons of JWT:

  • Cannot be revoked before expiry without extra server-side state; this is the main drawback
  • If a token is stolen, it’s valid for up to 15 minutes
  • Payload is not encrypted (only signed) — don’t put sensitive data in it

Revocation strategies:

Strategy 1: Short-lived tokens + refresh (our primary approach)

Access token TTL: 15 minutes (short enough that revocation is rarely needed)
Refresh token: stored in Redis, can be revoked instantly

Timeline of a compromised token:
  T+0: Token stolen
  T+5: User notices, clicks "sign out everywhere"
  T+5: All refresh tokens revoked in Redis
  T+15: Stolen access token expires naturally
  → Maximum exposure: 15 minutes (acceptable for most use cases)

Strategy 2: Token blocklist (for high-security scenarios)

On revocation:
  → Add token's jti (JWT ID) to a Redis blocklist
  → TTL = remaining token lifetime (max 15 min)

On validation:
  → After verifying signature: check if jti is in blocklist
  → Adds 1 Redis lookup (~1ms) to every API call

Blocklist size: at most (revocations per 15 min) entries
  → Typically < 1000 entries — trivially small
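
A blocklist with per-entry expiry can be sketched like this. The dict is a stand-in for Redis keys with a TTL, and the class and method names are hypothetical.

```python
from __future__ import annotations

import time


class TokenBlocklist:
    """In-memory stand-in for a Redis blocklist keyed by jti (hypothetical)."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}  # jti -> token exp (epoch seconds)

    def revoke(self, jti: str, token_exp: float) -> None:
        # Keep the entry only until the token would have expired anyway.
        self._expires_at[jti] = token_exp

    def is_revoked(self, jti: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        expires_at = self._expires_at.get(jti)
        if expires_at is None:
            return False
        if expires_at <= now:
            # Mimic TTL expiry: an expired token fails the exp check on its own.
            del self._expires_at[jti]
            return False
        return True

    def __len__(self) -> int:
        return len(self._expires_at)
```

Because each entry lives no longer than the token it blocks, the structure stays bounded by the revocation rate over one token lifetime.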

Strategy 3: Token versioning

User record has a token_version counter.
JWT payload includes token_version.
On "sign out everywhere": increment token_version.
On validation: compare JWT's token_version with DB/cache.
  → Stale version → reject token

Requires 1 cache lookup per validation, but only for the user's version counter.
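
Token versioning reduces to a single integer comparison per request. In this sketch the `token_versions` dict stands in for the per-user counter held in the DB or cache, and the function names are assumptions.

```python
from __future__ import annotations

token_versions: dict[str, int] = {"u_123": 1}  # stand-in for the per-user counter


def issue_claims(user_id: str) -> dict:
    """Embed the user's current version in the JWT payload."""
    return {"sub": user_id, "token_version": token_versions[user_id]}


def sign_out_everywhere(user_id: str) -> None:
    """Bump the counter; every previously issued token becomes stale."""
    token_versions[user_id] += 1


def is_current(claims: dict) -> bool:
    """Reject tokens whose version no longer matches the stored counter."""
    current = token_versions.get(claims.get("sub"))
    return current is not None and claims.get("token_version") == current
```

Unlike the blocklist, this invalidates all of a user's tokens at once, so it suits "sign out everywhere" better than revoking a single token.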

Our recommendation: Short-lived JWT (15 min) + refresh token rotation. Add blocklist only for admin/financial applications where 15-minute exposure is unacceptable.

Deep Dive 2: Password Hashing and Credential Stuffing Protection

Password hashing with Argon2id:

Why Argon2id (not bcrypt)?
  - Memory-hard: requires 64MB per hash attempt
  - GPU/ASIC resistant: memory bandwidth is the bottleneck, not compute
  - Bcrypt uses ~4KB memory → easily parallelized on GPUs
  - Argon2id at 64MB × 1000 parallel attempts = 64GB GPU memory needed

Parameters (RFC 9106's second recommended option; above the OWASP minimum of m=19 MiB, t=2, p=1):
  - Memory: 64MB (m=65536)
  - Iterations: 3 (t=3)
  - Parallelism: 4 threads (p=4)
  - Salt: 16 bytes random (per user)
  - Output: 32 bytes

Resulting hash time: ~250ms per attempt on a single core
  → Attacker with 1000 GPUs: ~$100 per 1 billion attempts
  → A strong password (80+ bits entropy): untouchable

Upgrade path: when a user logs in with a bcrypt password:
  → Verify against old bcrypt hash
  → Re-hash with Argon2id
  → Update stored hash (transparent to user)
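
The upgrade path above follows a general "verify with the old scheme, re-hash with the new one" pattern, sketched below. Both schemes are stand-ins chosen so the example runs on the standard library: PBKDF2-SHA1 with a low cost plays the legacy bcrypt hash, and PBKDF2-SHA256 with a higher cost plays Argon2id. The record layout and function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
import os


def _pbkdf2(algorithm: str, password: str, salt: bytes, iterations: int) -> bytes:
    return hashlib.pbkdf2_hmac(algorithm, password.encode("utf-8"), salt, iterations)


# Scheme name -> hashing function. Stand-ins only; see the note above.
SCHEMES = {
    "legacy": lambda pw, salt: _pbkdf2("sha1", pw, salt, 1_000),      # plays bcrypt
    "current": lambda pw, salt: _pbkdf2("sha256", pw, salt, 50_000),  # plays Argon2id
}


def make_record(scheme: str, password: str) -> dict:
    salt = os.urandom(16)  # 16 random bytes per user
    return {"scheme": scheme, "salt": salt, "hash": SCHEMES[scheme](password, salt)}


def verify_and_upgrade(record: dict, password: str) -> tuple[bool, dict]:
    """Verify against the stored scheme; on success, re-hash if it is outdated.

    Returns (ok, record_to_store). The record only changes on a successful
    login, because that is the only moment the plaintext password is available.
    """
    expected = SCHEMES[record["scheme"]](password, record["salt"])
    if not hmac.compare_digest(expected, record["hash"]):
        return False, record
    if record["scheme"] != "current":
        record = make_record("current", password)
    return True, record
```

Users who never log in again keep their legacy hash, so a migration like this usually needs a cutoff policy as well; that policy is outside this sketch.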

Credential stuffing protection:

Attackers use leaked username/password databases from other breaches to try logging in.

Layer 1: Rate limiting per IP (Redis)
  → 10 failed attempts per IP per 5 minutes → CAPTCHA required
  → 50 failed attempts per IP per hour → block IP for 1 hour
  Implementation: INCR login:fail:ip:{ip}:{5min_bucket}

Layer 2: Rate limiting per account
  → 5 failed attempts per account per 15 minutes → account locked (15 min)
  → 20 failed attempts per account per day → account locked (24 hours, email notification)

Layer 3: Compromised password detection
  → On registration and login: check password against Have I Been Pwned API
  → Uses k-anonymity: send only the first 5 hex characters of SHA-1(password); receive the suffixes of all breached hashes sharing that prefix and compare locally
  → Never send full password to third party
  → If compromised: force password change

Layer 4: Anomaly detection
  → Track: login time, IP, user agent, geolocation
  → Flag anomalies: new device + new country + failed MFA = high risk
  → Trigger step-up auth: "We noticed a login from a new device in Germany. Please verify."

Layer 5: Proof of Work (extreme cases)
  → During active attack: require client to solve a computational puzzle before login
  → Adds 1-2 seconds per attempt → makes large-scale attacks economically infeasible
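
Two of the layers above lend themselves to a short sketch: the fixed-window failure counter from Layer 1 and the k-anonymity split from Layer 3. The `defaultdict` stands in for Redis `INCR` counters, no network call is made, and the range response is assumed to arrive as `SUFFIX:COUNT` lines; function names are illustrative.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict

WINDOW_SECONDS = 5 * 60      # 5-minute bucket
CAPTCHA_THRESHOLD = 10       # failed attempts per IP per window

_fail_counts: defaultdict[str, int] = defaultdict(int)  # stand-in for Redis


def bucket_key(ip: str, now: float) -> str:
    """Key layout from Layer 1: login:fail:ip:{ip}:{5min_bucket}."""
    return f"login:fail:ip:{ip}:{int(now // WINDOW_SECONDS)}"


def record_failure(ip: str, now: float) -> bool:
    """Count one failed login; return True once a CAPTCHA should be required."""
    key = bucket_key(ip, now)
    _fail_counts[key] += 1   # Redis equivalent: INCR key, plus an EXPIRE
    return _fail_counts[key] >= CAPTCHA_THRESHOLD


def hibp_prefix_suffix(password: str) -> tuple[str, str]:
    """Split SHA-1(password) into the 5-char prefix sent and the suffix kept local."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def is_compromised(password: str, range_response: str) -> bool:
    """Compare the local suffix against the suffixes returned for the prefix."""
    _, suffix = hibp_prefix_suffix(password)
    for line in range_response.splitlines():
        candidate, _, _count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return True
    return False
```

A fixed window is the simplest option and allows short bursts across a bucket boundary; a sliding window would close that gap at the cost of more state per key.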

Deep Dive 3: OAuth 2.0 / OIDC and Single Sign-On

Our system as both an OAuth client (consuming Google/GitHub) and an OAuth provider (SSO for internal apps).

As OAuth Client (social login):

Authorization Code Flow with PKCE:

1. Client generates code_verifier (random 43-128 chars) and
   code_challenge = BASE64URL(SHA256(code_verifier)), without padding

2. Redirect to Google:
   GET https://accounts.google.com/o/oauth2/v2/auth?
     client_id=our_client_id&
     redirect_uri=https://auth.company.com/oauth/callback&
     response_type=code&
     scope=openid email profile&
     state={random_csrf_token}&
     code_challenge={challenge}&
     code_challenge_method=S256

3. User consents → Google redirects back (reject the callback unless state matches the value sent in step 2):
   GET https://auth.company.com/oauth/callback?
     code=auth_code_from_google&
     state={same_csrf_token}

4. Our backend exchanges code for tokens:
   POST https://oauth2.googleapis.com/token
     Body (form-encoded): grant_type=authorization_code, code, client_id, client_secret, redirect_uri, code_verifier
   Response: { access_token, id_token (JWT with user info), refresh_token }

5. Validate Google's id_token (signature against Google's JWKS, iss, aud, exp), then extract email, name, picture
6. Find or create user in our DB (link oauth_accounts)
7. Issue OUR tokens (JWT access + opaque refresh)
8. Redirect user to app with our tokens
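
Step 1 of the flow, generating the PKCE pair with the S256 method, can be sketched with the standard library. The function names are illustrative; `verify_pkce` shows the check an authorization server would make at the token endpoint.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random verifier: 32 bytes encode to 43 URL-safe chars (allowed range 43-128)."""
    return secrets.token_urlsafe(32)


def make_code_challenge(verifier: str) -> str:
    """S256 method: base64url(SHA-256(verifier)) with the padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(verifier: str, challenge: str) -> bool:
    """Server side: recompute the challenge from the verifier and compare."""
    return secrets.compare_digest(make_code_challenge(verifier), challenge)
```

The challenge travels in the redirect in step 2, while the verifier is only sent in the back-channel exchange in step 4, so an intercepted authorization code cannot be redeemed without it.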

As OAuth Provider (SSO for internal apps):

App A (dashboard.company.com) and App B (admin.company.com) both use our auth system.

1. User visits App A → not logged in → redirect to auth.company.com/login
2. User logs in at auth.company.com → session cookie set on auth.company.com
3. Redirect back to App A with authorization code
4. App A exchanges code for tokens → user is logged in to App A

5. User visits App B → not logged in → redirect to auth.company.com/login
6. Auth.company.com sees existing session cookie → user already authenticated!
7. Skip login, redirect back to App B with authorization code
8. App B exchanges code for tokens → user is logged in to App B
   → No password entry needed for App B — this is SSO

Key: The session cookie on auth.company.com is the "SSO session."
Logging out of auth.company.com → revoke session → all apps lose access on next token refresh.

OIDC (OpenID Connect) — identity layer on top of OAuth 2.0:

We implement these OIDC endpoints:
  GET  /.well-known/openid-configuration  → discovery document
  GET  /.well-known/jwks.json             → public signing keys
  GET  /oauth/authorize                   → authorization endpoint
  POST /oauth/token                       → token endpoint
  GET  /oauth/userinfo                    → user info endpoint

id_token (JWT) issued by our system:
  {
    "iss": "https://auth.company.com",
    "sub": "u_123",
    "aud": "app_a_client_id",
    "email": "user@example.com",
    "name": "Alice Smith",
    "roles": ["editor"],
    "iat": 1708617600,
    "exp": 1708618500,
    "nonce": "abc123"      // replay protection
  }

7. Extensions (2 min)

  • Passwordless authentication: Support magic links (email a one-time login URL) and passkeys (WebAuthn/FIDO2). Passkeys are phishing-resistant: the credential is bound to the domain, so a look-alike site cannot request it, and using it requires biometric/PIN verification on the device.
  • Step-up authentication: For sensitive operations (changing password, transferring money), require re-authentication even if the user has a valid session. Time-limited elevated session (5 min) for the sensitive action.
  • Delegated authorization: Support OAuth 2.0 scopes so third-party apps can request limited access. “App X wants to read your documents but not delete them.” Fine-grained consent management.
  • Audit log and compliance: Immutable audit trail of all auth events. Support GDPR data export (all data about a user) and right to deletion (anonymize user records while preserving audit integrity).
  • Adaptive MFA: Use risk scoring to decide when to require MFA. Low risk (same device, same location) → skip MFA. High risk (new device, foreign IP) → require MFA. Reduces friction for legitimate users while catching attacks.