Design an Authentication and Authorization System

Table of Contents

1. Requirements & Scope (5 min)
- Functional Requirements
- Non-Functional Requirements
2. Estimation (3 min)
3. API Design (3 min)
4. Data Model (3 min)
5. High-Level Design (12 min)
6. Deep Dives (15 min)
7. Extensions (2 min)


1. Requirements & Scope (5 min)

Functional Requirements

Authentication: Support email/password login, OAuth 2.0/OIDC (Google, GitHub, etc.), and multi-factor authentication (TOTP, SMS, WebAuthn/passkeys)
Session Management: Issue, validate, refresh, and revoke sessions. Support “remember me” (long-lived) and “sign out everywhere” (revoke all sessions).
Authorization: Role-Based Access Control (RBAC) with hierarchical roles and Attribute-Based Access Control (ABAC) for fine-grained policies
Single Sign-On (SSO): Act as an identity provider (IdP) — users authenticate once and access multiple applications without re-authenticating
Account security: Rate limit login attempts, detect credential stuffing, support password reset flows, and enforce password policies

Non-Functional Requirements

Availability: 99.999% — if auth is down, no user can log into any application. This is the single most critical shared service.
Latency: Login < 500ms (includes password hashing). Token validation < 5ms (must be in the hot path of every API call). Authorization check < 10ms.
Security: Passwords hashed with Argon2id (memory-hard). Tokens encrypted in transit (TLS 1.3) and at rest. No plaintext secrets in logs. PII encrypted at rest.
Scale: 500M registered users. 50M daily logins. 10B token validations/day (every API call). 1B authorization checks/day.
Compliance: GDPR (right to delete, data portability), SOC 2, support for data residency requirements.

2. Estimation (3 min)

Authentication Traffic

50M logins/day = 580 logins/sec average, 5K/sec peak (morning login surge)
Each login: password hash verification (Argon2id: ~250ms CPU per attempt) + session creation
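CPU: 580 logins/sec × 250ms = 145 CPU-seconds/sec → need ~150 CPU cores on average just for password hashing. At the 5K/sec peak the same math gives 5,000 × 250ms = 1,250 cores, so the hashing pool must scale out or queue during the surge.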

Token Validation

10B validations/day = 115K validations/sec
JWT validation: verify signature + check expiry + decode claims = < 0.1ms (no network call)
With opaque tokens: Redis lookup = ~1ms per validation

Authorization

1B checks/day = 11.5K checks/sec
Each check: look up user’s roles/permissions, evaluate policy rules

Storage

User accounts: 500M × 2KB = 1TB
Sessions: 200M active sessions × 500 bytes = 100GB (fits in Redis)
Roles/permissions: 1000 roles × 100 permissions = 100K entries — trivial
Audit logs: 50M logins/day × 500 bytes = 25GB/day, 9TB/year
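
The figures above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the constant and variable names are illustrative, and the inputs are the ones stated in this section (decimal units, 250ms of CPU per Argon2id verification).

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Convert a daily volume into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


# Traffic (inputs taken from the estimates in this section)
logins_avg = per_second(50e6)      # ~580 logins/sec
validations = per_second(10e9)     # ~115K token validations/sec
authz_checks = per_second(1e9)     # ~11.5K authorization checks/sec

# Password hashing CPU: each verification costs ~250ms of CPU time
ARGON2_CPU_SECONDS = 0.250
cores_avg = logins_avg * ARGON2_CPU_SECONDS   # ~145 cores on average
cores_peak = 5_000 * ARGON2_CPU_SECONDS       # 1,250 cores at the 5K/sec peak

# Storage (decimal units: 1 KB = 1,000 bytes)
users_tb = 500e6 * 2_000 / 1e12
sessions_gb = 200e6 * 500 / 1e9
audit_gb_day = 50e6 * 500 / 1e9
audit_tb_year = audit_gb_day * 365 / 1_000

print(f"logins/sec avg: {logins_avg:.0f}, validations/sec: {validations:.0f}")
print(f"hashing cores avg: {cores_avg:.0f}, peak: {cores_peak:.0f}")
print(f"users: {users_tb} TB, sessions: {sessions_gb} GB, audit: {audit_tb_year} TB/year")
```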

Key Insight

Token validation is the hottest path (115K/sec). It MUST be local (no network call) → JWT with local signature verification. Password hashing is CPU-intensive → dedicated worker pool with rate limiting. Authorization checks need low latency → cache policies locally, evaluate in-process.

3. API Design (3 min)

Authentication

// Email/password login
POST /auth/login
  Body: { "email": "user@example.com", "password": "..." }
  Response 200: {
    "access_token": "eyJ...",           // JWT, short-lived (15 min)
    "refresh_token": "rt_abc123...",    // opaque, long-lived (30 days)
    "token_type": "Bearer",
    "expires_in": 900,
    "user": { "id": "u_123", "email": "...", "roles": ["admin"] }
  }
  Response 401: { "error": "invalid_credentials" }
  Response 429: { "error": "too_many_attempts", "retry_after": 300 }

// MFA challenge (returned when MFA is enabled)
  Response 200: {
    "mfa_required": true,
    "mfa_token": "mfa_xyz...",          // temporary token for MFA flow
    "mfa_methods": ["totp", "webauthn"]
  }

POST /auth/mfa/verify
  Body: { "mfa_token": "mfa_xyz...", "method": "totp", "code": "123456" }
  Response 200: { "access_token": "eyJ...", "refresh_token": "rt_..." }

// OAuth 2.0 / OIDC
GET /auth/oauth/authorize?provider=google&redirect_uri=...&state=...
  → Redirects to Google's consent screen
GET /auth/oauth/callback?code=...&state=...
  → Exchanges code for tokens, creates/links account, returns our tokens

// Token refresh
POST /auth/token/refresh
  Body: { "refresh_token": "rt_abc123..." }
  Response 200: { "access_token": "eyJ...", "refresh_token": "rt_new..." }
  // Refresh token rotation: old refresh token invalidated

// Logout
POST /auth/logout
  Body: { "refresh_token": "rt_abc123..." }
  → Revokes refresh token and associated sessions
POST /auth/logout-all
  → Revokes ALL refresh tokens and sessions for the user

Authorization

// Check permission (called by API gateway or services)
POST /authz/check
  Body: {
    "subject": "u_123",
    "action": "documents:write",
    "resource": "doc_456",
    "context": { "ip": "10.0.1.1", "time": "2026-02-22T14:00:00Z" }
  }
  Response 200: { "allowed": true, "reason": "role:editor grants documents:write" }

// Batch check (multiple permissions at once)
POST /authz/check-batch
  Body: {
    "subject": "u_123",
    "checks": [
      { "action": "documents:write", "resource": "doc_456" },
      { "action": "documents:delete", "resource": "doc_456" },
      { "action": "admin:users:list" }
    ]
  }
  Response 200: {
    "results": [
      { "allowed": true },
      { "allowed": false, "reason": "no permission: documents:delete" },
      { "allowed": false, "reason": "role:editor does not include admin:*" }
    ]
  }

Key Decisions

JWT for access tokens: Validated locally without a network call. Contains user ID, roles, and scopes. Short-lived (15 min) to limit blast radius of token theft.
Opaque refresh tokens: Stored server-side. Enables revocation. Rotated on each use (one-time use tokens to detect token theft).
Separate auth and authz endpoints: Authentication (who are you?) and authorization (can you do this?) are independently scalable concerns.
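
To make the one-time-use refresh token decision concrete, here is a minimal sketch of rotation with reuse detection. It assumes an in-memory dictionary standing in for the server-side store, and the class and method names (`RefreshTokenStore`, `issue`, `rotate`) are hypothetical. Presenting an already-used token is treated as a theft signal and revokes the whole token family.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class RefreshRecord:
    user_id: str
    family: str
    used: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the server-side refresh token store (assumption)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, RefreshRecord] = {}
        self._revoked_families: Set[str] = set()

    @staticmethod
    def _hash(token: str) -> str:
        # Only a hash of the token is stored, never the token itself.
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: str, family: Optional[str] = None) -> str:
        token = "rt_" + secrets.token_urlsafe(32)
        family = family or "fam_" + secrets.token_hex(8)
        self._by_hash[self._hash(token)] = RefreshRecord(user_id, family)
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one in the same family."""
        record = self._by_hash.get(self._hash(token))
        if record is None or record.family in self._revoked_families:
            raise PermissionError("invalid refresh token")
        if record.used:
            # A one-time token was presented twice: assume theft and
            # revoke every token descended from the same login.
            self._revoked_families.add(record.family)
            raise PermissionError("refresh token reuse detected; family revoked")
        record.used = True
        return self.issue(record.user_id, record.family)
```

After a reuse is detected, even the most recently issued token in that family stops working, which forces both the attacker and the legitimate user back through login.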

4. Data Model (3 min)

Users (PostgreSQL)

Table: users
  user_id          (PK) | uuid
  email            (UQ) | varchar(255) -- encrypted at rest
  email_hash       (UQ) | varchar(64)  -- for lookups without decrypting
  password_hash          | varchar(255) -- Argon2id hash
  mfa_enabled            | boolean
  mfa_secret             | bytea (encrypted) -- TOTP secret
  status                 | enum('active', 'locked', 'suspended', 'deleted')
  failed_login_count     | int
  locked_until           | timestamp
  created_at             | timestamp
  updated_at             | timestamp

Table: oauth_accounts
  id               (PK) | uuid
  user_id          (FK) | uuid
  provider               | varchar(50) -- google, github, etc.
  provider_user_id       | varchar(255)
  access_token           | bytea (encrypted)
  refresh_token          | bytea (encrypted)
  UNIQUE(provider, provider_user_id)

Table: webauthn_credentials
  credential_id    (PK) | bytea
  user_id          (FK) | uuid
  public_key             | bytea
  sign_count             | int
  created_at             | timestamp

Sessions and Tokens (Redis + PostgreSQL)

// Redis (fast lookups, TTL-based expiry)
Key: session:{session_id}
Value: { "user_id": "u_123", "created_at": "...", "ip": "...", "user_agent": "..." }
TTL: 30 days (or until explicit logout)

Key: refresh_token:{token_hash}
Value: { "user_id": "u_123", "session_id": "s_abc", "family": "fam_1" }
TTL: 30 days

// PostgreSQL (audit trail, "sign out everywhere")
Table: sessions
  session_id       (PK) | uuid
  user_id          (FK) | uuid
  refresh_token_family   | uuid  -- for rotation detection
  ip_address             | inet
  user_agent             | varchar(500)
  created_at             | timestamp
  last_used_at           | timestamp
  revoked_at             | timestamp

Roles and Permissions (PostgreSQL)

Table: roles
  role_id          (PK) | uuid
  name             (UQ) | varchar(100) -- admin, editor, viewer
  parent_role_id   (FK) | uuid (nullable) -- role hierarchy
  description            | text

Table: permissions
  permission_id    (PK) | uuid
  name             (UQ) | varchar(200) -- documents:write, admin:users:list
  description            | text

Table: role_permissions
  role_id          (FK) | uuid
  permission_id    (FK) | uuid
  PRIMARY KEY: (role_id, permission_id)

Table: user_roles
  user_id          (FK) | uuid
  role_id          (FK) | uuid
  resource_scope         | varchar(200) -- "*"=global, "org:123"=scoped to org (NOT NULL: PostgreSQL primary-key columns cannot be null)
  granted_at             | timestamp
  granted_by             | uuid
  PRIMARY KEY: (user_id, role_id, resource_scope)

Why These Choices

PostgreSQL for users and RBAC: ACID for critical operations (user creation, role assignment). Complex queries (find all users with permission X). Mature encryption-at-rest support.
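Redis for sessions/tokens: Sub-millisecond lookups. TTL for automatic expiry. Supports mass revocation, provided each user's session and token keys are tracked in a per-user index (for example a set keyed by user_id), since the keys above are keyed by session ID and token hash rather than by user.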
Argon2id for passwords: Memory-hard (resists GPU/ASIC attacks), tunable (can increase memory/time as hardware improves), recommended by OWASP. Parameters: memory=64MB, iterations=3, parallelism=4 (the memory-constrained profile from RFC 9106, which is well above OWASP's minimum settings).

5. High-Level Design (12 min)

Authentication Flow (Email/Password + MFA)

Client → API Gateway → Auth Service:
  1. Rate limit check (Redis): user IP + email → under threshold?
     → If exceeded: return 429 (locked for 5 minutes)
  2. Look up user by email_hash
     → If not found: still run a dummy Argon2id verification, then return 401 (keeps response time constant to prevent user enumeration)
  3. Verify password: Argon2id.verify(password, stored_hash)
     → If wrong: increment failed_login_count, return 401
     → If failed_count >= 5: lock account for 15 minutes
  4. If MFA enabled:
     → Generate temporary MFA token (stored in Redis, TTL 5 min)
     → Return mfa_required response
     → Client submits MFA code → verify TOTP or WebAuthn
  5. Create session:
     → Generate session_id, refresh_token
     → Store in Redis + PostgreSQL
  6. Generate JWT access token:
     → Payload: { sub: user_id, roles: [...], exp: now+15min, iss: "auth-service" }
     → Sign with RS256 (RSA private key, rotated every 90 days)
  7. Return tokens to client
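
The TOTP verification in step 4 can be sketched with the standard library alone. This is a minimal illustration of RFC 6238 (HMAC-SHA1, 30-second steps, 6 digits); the function names and the one-step drift window are assumptions, and replay protection (rejecting a code that was already accepted) is left out.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over the counter, then dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(
    secret: bytes,
    submitted: str,
    now: Optional[float] = None,
    step: int = 30,
    window: int = 1,
) -> bool:
    """Accept the code for the current time step, plus/minus `window` steps
    to tolerate clock drift between server and authenticator app."""
    now = time.time() if now is None else now
    counter = int(now // step)
    for drift in range(-window, window + 1):
        if counter + drift < 0:
            continue
        expected = hotp(secret, counter + drift)
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(expected.encode(), submitted.encode()):
            return True
    return False
```

With the RFC test secret `b"12345678901234567890"`, the 6-digit code at Unix time 59 is `287082`.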

Token Validation (every API call)

Client sends: Authorization: Bearer eyJ...

API Gateway (or service middleware):
  1. Decode JWT header → get kid (key ID)
  2. Look up public key from local cache (JWKS)
     → Cache refreshed every 5 minutes from GET /auth/.well-known/jwks.json
  3. Verify signature (RS256) → tamper-proof
  4. Check exp claim → not expired?
  5. Check iss and aud claims → issued by our auth service and intended for this API?
  6. Extract claims (user_id, roles, scopes)
  → NO NETWORK CALL NEEDED — pure CPU, < 0.1ms

If token is expired:
  → Client calls /auth/token/refresh with refresh_token
  → Auth service validates refresh_token in Redis
  → Issues new access_token + new refresh_token (rotation)
  → Invalidates old refresh_token
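
The local validation steps above (look up the key by kid, verify the signature, check exp and iss) can be sketched as follows. One substitution to be aware of: the Python standard library has no RSA support, so this sketch signs with HS256 (HMAC-SHA256) instead of the RS256 used in the design. The `keys` dictionary stands in for the cached JWKS, and all function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Dict, Optional


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, key: bytes, kid: str) -> str:
    """Issue a token (HS256 stand-in for the RS256 signing in the design)."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(signature)


def validate_jwt(
    token: str, keys: Dict[str, bytes], issuer: str, now: Optional[float] = None
) -> dict:
    """Validate locally: no network call, only the cached key set."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(signature_b64)
        if not isinstance(header, dict):
            raise ValueError("header is not an object")
    except ValueError as exc:
        raise PermissionError("malformed token") from exc
    # Pin the algorithm; never trust the header to choose it.
    if header.get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    key = keys.get(header.get("kid"))
    if key is None:
        raise PermissionError("unknown key id")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    if claims.get("iss") != issuer:
        raise PermissionError("wrong issuer")
    return claims
```

The signature is checked before the payload is parsed, so unauthenticated input never reaches the claim checks.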

Authorization Check

Service or API Gateway → Authorization Service:
  1. Load user's roles (cached in Redis, TTL 5 min):
     → Direct roles + inherited roles (role hierarchy)
     → e.g., admin inherits editor inherits viewer
  2. Load role's permissions (cached, rarely changes)
  3. Evaluate RBAC:
     → Does any of user's roles grant the requested permission?
  4. If ABAC policies exist:
     → Evaluate attribute-based rules:
       "allow documents:write IF user.department == resource.department
        AND time.hour BETWEEN 9 AND 17"
  5. Return allow/deny with reason
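
A minimal in-process sketch of this check, assuming small hard-coded tables in place of the cached roles and permissions. The table contents, the policy function, and the `check` signature are illustrative only; the ABAC rule mirrors the department and business-hours example above.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Role hierarchy: each role inherits from its parent (admin > editor > viewer).
ROLE_PARENT: Dict[str, Optional[str]] = {
    "admin": "editor",
    "editor": "viewer",
    "viewer": None,
}

# Permissions granted directly by each role (illustrative data).
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"documents:read"},
    "editor": {"documents:write"},
    "admin": {"admin:users:list"},
}


def expand_roles(direct_roles: Iterable[str]) -> Set[str]:
    """Return the direct roles plus everything inherited through the hierarchy."""
    seen: Set[str] = set()
    for role in direct_roles:
        current: Optional[str] = role
        while current is not None and current not in seen:
            seen.add(current)
            current = ROLE_PARENT.get(current)
    return seen


def same_department_in_business_hours(user: dict, resource: dict, context: dict) -> bool:
    """ABAC rule: user.department == resource.department AND hour BETWEEN 9 AND 17."""
    return user["department"] == resource["department"] and 9 <= context["hour"] <= 17


Policy = Callable[[dict, dict, dict], bool]
ABAC_POLICIES: Dict[str, List[Policy]] = {
    "documents:write": [same_department_in_business_hours],
}


def check(user: dict, action: str, resource: dict, context: dict) -> Tuple[bool, str]:
    """RBAC first, then any ABAC policies attached to the action."""
    roles = expand_roles(user["roles"])
    granting = sorted(r for r in roles if action in ROLE_PERMISSIONS.get(r, set()))
    if not granting:
        return False, f"no permission: {action}"
    for policy in ABAC_POLICIES.get(action, []):
        if not policy(user, resource, context):
            return False, f"policy {policy.__name__} denied {action}"
    return True, f"role:{granting[0]} grants {action}"
```

An admin in the same department as the document is allowed to write at 10:00 through the inherited editor role, and denied at 20:00 by the ABAC rule.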

Components

Auth Service: Handles login, registration, password reset, MFA, OAuth flows. Stateless (all state in DB/Redis). Horizontally scaled.
Token Service: Issues and refreshes JWTs. Manages signing keys. Publishes JWKS endpoint. Key rotation every 90 days.
Authorization Service: Evaluates RBAC/ABAC policies. Caches user roles and permissions locally. Policy engine (OPA or custom).
Session Store (Redis Cluster): Active sessions and refresh tokens. Supports mass revocation.
User Database (PostgreSQL): User accounts, credentials, OAuth links, roles, permissions. Encrypted at rest.
Audit Log Service: Records all auth events (logins, failures, permission changes). Immutable append-only log. Required for compliance.
Brute Force Protection: Rate limiting (Redis), account lockout, CAPTCHA integration, IP reputation.

6. Deep Dives (15 min)

Deep Dive 1: JWT vs Opaque Tokens — Trade-offs and Revocation

JWT (JSON Web Token):

Structure: header.payload.signature (Base64URL encoded)

Header:  { "alg": "RS256", "kid": "key-2026-02" }
Payload: {
  "sub": "u_123",
  "email": "user@example.com",
  "roles": ["editor"],
  "scopes": ["read", "write"],
  "iss": "auth.company.com",
  "aud": "api.company.com",
  "iat": 1708617600,
  "exp": 1708618500     // 15 minutes from iat
}
Signature: RS256(header + "." + payload, private_key)

Pros of JWT:

Validation requires no network call (just verify signature with public key)
Self-contained: carries user claims, reducing DB lookups
Stateless: validators share no state, so validation capacity scales horizontally with the number of services

Cons of JWT:

Cannot be revoked before expiry — this is the big problem
If a token is stolen, it’s valid for up to 15 minutes
Payload is not encrypted (only signed) — don’t put sensitive data in it

Revocation strategies:

Strategy 1: Short-lived tokens + refresh (our primary approach)

Access token TTL: 15 minutes (short enough that revocation is rarely needed)
Refresh token: stored in Redis, can be revoked instantly

Timeline of a compromised token:
  T+0: Token stolen
  T+5: User notices, clicks "sign out everywhere"
  T+5: All refresh tokens revoked in Redis
  T+15: Stolen access token expires naturally
  → Maximum exposure: 15 minutes (acceptable for most use cases)

Strategy 2: Token blocklist (for high-security scenarios)

On revocation:
  → Add token's jti (JWT ID) to a Redis blocklist
  → TTL = remaining token lifetime (max 15 min)

On validation:
  → After verifying signature: check if jti is in blocklist
  → Adds 1 Redis lookup (~1ms) to every API call

Blocklist size: at most (revocations per 15 min) entries
  → Typically < 1000 entries — trivially small
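
The blocklist behaviour can be illustrated with an in-memory stand-in for Redis. The class name and the dictionary-based expiry are assumptions made for the sketch; in Redis the entry would simply be written with a TTL equal to the token's remaining lifetime.

```python
import time
from typing import Dict, Optional


class TokenBlocklist:
    """In-memory stand-in for a Redis blocklist keyed by jti (assumption)."""

    def __init__(self) -> None:
        # jti -> time at which the token would have expired anyway
        self._expires_at: Dict[str, float] = {}

    def revoke(self, jti: str, token_exp: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        # An already-expired token needs no entry: validation rejects it on exp.
        if token_exp > now:
            self._expires_at[jti] = token_exp

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._expires_at.get(jti)
        if expiry is None:
            return False
        if expiry <= now:
            # Mimic TTL eviction: the entry outlived the token it blocks.
            del self._expires_at[jti]
            return False
        return True
```

Because every entry disappears when its token expires, the blocklist never holds more than one token lifetime's worth of revocations.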

Strategy 3: Token versioning

User record has a token_version counter.
JWT payload includes token_version.
On "sign out everywhere": increment token_version.
On validation: compare JWT's token_version with DB/cache.
  → Stale version → reject token

Requires 1 cache lookup per validation, but only for the user's version counter.
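
A small sketch of the versioning check, assuming a dictionary in place of the cached per-user counter. The class and method names are hypothetical.

```python
from typing import Dict


class TokenVersions:
    """Per-user version counter; a dict stands in for the DB/cache (assumption)."""

    def __init__(self) -> None:
        self._current: Dict[str, int] = {}

    def current(self, user_id: str) -> int:
        return self._current.get(user_id, 0)

    def sign_out_everywhere(self, user_id: str) -> int:
        """Bump the counter so every previously issued token becomes stale."""
        self._current[user_id] = self.current(user_id) + 1
        return self._current[user_id]

    def is_current(self, claims: dict) -> bool:
        """Run after signature and expiry checks have already passed."""
        return claims.get("token_version") == self.current(claims["sub"])
```

Unlike the blocklist, one counter bump invalidates all of a user's outstanding tokens at once, without needing to know their individual jti values.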

Our recommendation: Short-lived JWT (15 min) + refresh token rotation. Add blocklist only for admin/financial applications where 15-minute exposure is unacceptable.

Deep Dive 2: Password Hashing and Credential Stuffing Protection

Password hashing with Argon2id:

Why Argon2id (not bcrypt)?
  - Memory-hard: requires 64MB per hash attempt
  - GPU/ASIC resistant: memory bandwidth is the bottleneck, not compute
  - Bcrypt uses ~4KB memory → easily parallelized on GPUs
  - Argon2id at 64MB × 1000 parallel attempts = 64GB GPU memory needed

Parameters (RFC 9106 memory-constrained profile; above OWASP's minimums):
  - Memory: 64MB (m=65536)
  - Iterations: 3 (t=3)
  - Parallelism: 4 threads (p=4)
  - Salt: 16 bytes random (per user)
  - Output: 32 bytes

Resulting hash time: ~250ms per attempt on a single core
  → Attacker with 1000 GPUs: ~$100 per 1 billion attempts
  → A strong password (80+ bits entropy): untouchable

Upgrade path: when a user logs in with a bcrypt password:
  → Verify against old bcrypt hash
  → Re-hash with Argon2id
  → Update stored hash (transparent to user)
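
The upgrade-on-login pattern can be sketched with the standard library. Important substitution: Python ships neither bcrypt nor Argon2id, so this sketch uses PBKDF2-HMAC-SHA256 as a stand-in for the legacy bcrypt hash and scrypt (also memory-hard) as a stand-in for Argon2id. The stored format `scheme$salt$digest`, the cost parameters, and the function names are assumptions for illustration only.

```python
import hashlib
import hmac
import os
from typing import Tuple


def _derive(scheme: str, password: str, salt: bytes) -> bytes:
    if scheme == "legacy":
        # Stand-in for the old bcrypt hash (assumption: stdlib has no bcrypt).
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    if scheme == "current":
        # Stand-in for Argon2id (assumption: stdlib has no Argon2).
        return hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)
    raise ValueError(f"unknown scheme: {scheme}")


def hash_password(password: str, scheme: str = "current") -> str:
    salt = os.urandom(16)  # 16 random bytes per user, as in the parameters above
    return f"{scheme}${salt.hex()}${_derive(scheme, password, salt).hex()}"


def verify_and_upgrade(password: str, stored: str) -> Tuple[bool, str]:
    """Verify against whatever scheme the stored hash uses.

    Returns (ok, hash_to_store). On a successful login against a legacy
    hash, hash_to_store is a fresh hash under the current scheme.
    """
    scheme, salt_hex, digest_hex = stored.split("$")
    candidate = _derive(scheme, password, bytes.fromhex(salt_hex))
    ok = hmac.compare_digest(candidate, bytes.fromhex(digest_hex))
    if ok and scheme != "current":
        return True, hash_password(password)
    return ok, stored
```

The re-hash can only happen at login because that is the one moment the server holds the plaintext password. Accounts that never log in keep their legacy hash.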

Credential stuffing protection:

Attackers use leaked username/password databases from other breaches to try logging in.

Layer 1: Rate limiting per IP (Redis)
  → 10 failed attempts per IP per 5 minutes → CAPTCHA required
  → 50 failed attempts per IP per hour → block IP for 1 hour
  Implementation: INCR login:fail:ip:{ip}:{5min_bucket}, with an EXPIRE on each key so old buckets are evicted

Layer 2: Rate limiting per account
  → 5 failed attempts per account per 15 minutes → account locked (15 min)
  → 20 failed attempts per account per day → account locked (24 hours, email notification)

Layer 3: Compromised password detection
  → On registration and login: check password against Have I Been Pwned API
  → Uses k-anonymity: send the first 5 hex characters of SHA-1(password), receive the matching hash suffixes, compare locally
  → Never send full password to third party
  → If compromised: force password change

Layer 4: Anomaly detection
  → Track: login time, IP, user agent, geolocation
  → Flag anomalies: new device + new country + failed MFA = high risk
  → Trigger step-up auth: "We noticed a login from a new device in Germany. Please verify."

Layer 5: Proof of Work (extreme cases)
  → During active attack: require client to solve a computational puzzle before login
  → Adds 1-2 seconds per attempt → makes large-scale attacks economically infeasible
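
The per-IP and per-account counters from Layers 1 and 2 can be sketched as fixed-window buckets, mirroring the `INCR login:fail:ip:{ip}:{5min_bucket}` key shape. A dictionary stands in for Redis, only the first threshold of each layer is modelled, and the class name and return values are hypothetical. Fixed windows are an approximation: a burst straddling a bucket boundary can briefly exceed the limit.

```python
from collections import defaultdict
from typing import DefaultDict


class FailedLoginLimiter:
    """Fixed-window failure counters; a dict stands in for Redis (assumption)."""

    IP_WINDOW_SECONDS = 5 * 60
    IP_CAPTCHA_THRESHOLD = 10          # Layer 1: 10 failures per IP per 5 min
    ACCOUNT_WINDOW_SECONDS = 15 * 60
    ACCOUNT_LOCK_THRESHOLD = 5         # Layer 2: 5 failures per account per 15 min

    def __init__(self) -> None:
        self._counts: DefaultDict[str, int] = defaultdict(int)

    def _ip_key(self, ip: str, now: float) -> str:
        return f"login:fail:ip:{ip}:{int(now // self.IP_WINDOW_SECONDS)}"

    def _account_key(self, email_hash: str, now: float) -> str:
        return f"login:fail:acct:{email_hash}:{int(now // self.ACCOUNT_WINDOW_SECONDS)}"

    def record_failure(self, ip: str, email_hash: str, now: float) -> None:
        # Equivalent to two INCR calls; in Redis each key would also get a TTL.
        self._counts[self._ip_key(ip, now)] += 1
        self._counts[self._account_key(email_hash, now)] += 1

    def decision(self, ip: str, email_hash: str, now: float) -> str:
        """Return 'locked', 'captcha', or 'allow' for the next login attempt."""
        if self._counts.get(self._account_key(email_hash, now), 0) >= self.ACCOUNT_LOCK_THRESHOLD:
            return "locked"
        if self._counts.get(self._ip_key(ip, now), 0) >= self.IP_CAPTCHA_THRESHOLD:
            return "captcha"
        return "allow"
```

Counting per account as well as per IP matters for credential stuffing, where an attacker spreads attempts against one account across many addresses.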

Deep Dive 3: OAuth 2.0 / OIDC and Single Sign-On

Our system as both an OAuth client (consuming Google/GitHub) and an OAuth provider (SSO for internal apps).

As OAuth Client (social login):

Authorization Code Flow with PKCE:

1. Client generates code_verifier (random 43-128 chars) and
   code_challenge = BASE64URL(SHA256(code_verifier)), without padding

2. Redirect to Google:
   GET https://accounts.google.com/o/oauth2/v2/auth?
     client_id=our_client_id&
     redirect_uri=https://auth.company.com/oauth/callback&
     response_type=code&
     scope=openid email profile&
     state={random_csrf_token}&
     code_challenge={challenge}&
     code_challenge_method=S256

3. User consents → Google redirects back:
   GET https://auth.company.com/oauth/callback?
     code=auth_code_from_google&
     state={same_csrf_token}

4. Our backend exchanges code for tokens:
   POST https://oauth2.googleapis.com/token
     Body: { grant_type=authorization_code, code, client_id, client_secret, redirect_uri, code_verifier }
   Response: { access_token, id_token (JWT with user info), refresh_token }

5. Verify Google's id_token (signature against Google's JWKS, plus iss, aud, and exp) → extract email, name, picture
6. Find or create user in our DB (link oauth_accounts)
7. Issue OUR tokens (JWT access + opaque refresh)
8. Redirect user to app with our tokens
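
Step 1 of the flow above, generating the PKCE pair, can be sketched as follows. The function names are illustrative; the S256 transform itself follows RFC 7636, where the challenge is the unpadded base64url encoding of the verifier's SHA-256 digest.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """32 random bytes, base64url-encoded without padding: 43 characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(code_verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), no padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
# The challenge goes into the authorize redirect (step 2); the verifier is
# kept server-side and sent only in the token exchange (step 4).
print(len(verifier), challenge)
```

The example in RFC 7636 Appendix B maps the verifier `dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk` to the challenge `E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM`, which is a convenient check for any implementation.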

As OAuth Provider (SSO for internal apps):

App A (dashboard.company.com) and App B (admin.company.com) both use our auth system.

1. User visits App A → not logged in → redirect to auth.company.com/login
2. User logs in at auth.company.com → session cookie set on auth.company.com
3. Redirect back to App A with authorization code
4. App A exchanges code for tokens → user is logged in to App A

5. User visits App B → not logged in → redirect to auth.company.com/login
6. Auth.company.com sees existing session cookie → user already authenticated!
7. Skip login, redirect back to App B with authorization code
8. App B exchanges code for tokens → user is logged in to App B
   → No password entry needed for App B — this is SSO

Key: The session cookie on auth.company.com is the "SSO session."
Logging out of auth.company.com → revoke session → all apps lose access on next token refresh.

OIDC (OpenID Connect) — identity layer on top of OAuth 2.0:

We implement these OIDC endpoints:
  GET  /.well-known/openid-configuration  → discovery document
  GET  /.well-known/jwks.json             → public signing keys
  GET  /oauth/authorize                   → authorization endpoint
  POST /oauth/token                       → token endpoint
  GET  /oauth/userinfo                    → user info endpoint

id_token (JWT) issued by our system:
  {
    "iss": "https://auth.company.com",
    "sub": "u_123",
    "aud": "app_a_client_id",
    "email": "user@example.com",
    "name": "Alice Smith",
    "roles": ["editor"],
    "iat": 1708617600,
    "exp": 1708618500,
    "nonce": "abc123"      // replay protection
  }

7. Extensions (2 min)

Passwordless authentication: Support magic links (email a one-time login URL) and passkeys (WebAuthn/FIDO2). Passkeys are phishing-resistant: the credential is bound to the domain, so a look-alike site cannot use it, and it requires biometric/PIN verification on the device.
Step-up authentication: For sensitive operations (changing password, transferring money), require re-authentication even if the user has a valid session. Time-limited elevated session (5 min) for the sensitive action.
Delegated authorization: Support OAuth 2.0 scopes so third-party apps can request limited access. “App X wants to read your documents but not delete them.” Fine-grained consent management.
Audit log and compliance: Immutable audit trail of all auth events. Support GDPR data export (all data about a user) and right to deletion (anonymize user records while preserving audit integrity).
Adaptive MFA: Use risk scoring to decide when to require MFA. Low risk (same device, same location) → skip MFA. High risk (new device, foreign IP) → require MFA. Reduces friction for legitimate users while catching attacks.
