1. Requirements & Scope (5 min)
Functional Requirements
- Authentication: Support email/password login, OAuth 2.0/OIDC (Google, GitHub, etc.), and multi-factor authentication (TOTP, SMS, WebAuthn/passkeys)
- Session Management: Issue, validate, refresh, and revoke sessions. Support “remember me” (long-lived) and “sign out everywhere” (revoke all sessions).
- Authorization: Role-Based Access Control (RBAC) with hierarchical roles and Attribute-Based Access Control (ABAC) for fine-grained policies
- Single Sign-On (SSO): Act as an identity provider (IdP) — users authenticate once and access multiple applications without re-authenticating
- Account security: Rate limit login attempts, detect credential stuffing, support password reset flows, and enforce password policies
Non-Functional Requirements
- Availability: 99.999% — if auth is down, no user can log into any application. This is the single most critical shared service.
- Latency: Login < 500ms (includes password hashing). Token validation < 5ms (must be in the hot path of every API call). Authorization check < 10ms.
- Security: Passwords hashed with Argon2id (memory-hard). Tokens encrypted in transit (TLS 1.3) and at rest. No plaintext secrets in logs. PII encrypted at rest.
- Scale: 500M registered users. 50M daily logins. 10B token validations/day (every API call). 1B authorization checks/day.
- Compliance: GDPR (right to delete, data portability), SOC 2, support for data residency requirements.
2. Estimation (3 min)
Authentication Traffic
- 50M logins/day = 580 logins/sec average, 5K/sec peak (morning login surge)
- Each login: password hash verification (Argon2id: ~250ms CPU per attempt) + session creation
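- CPU: 580 logins/sec × 250ms = 145 CPU-seconds/sec → ~150 CPU cores for password hashing at average load; the 5K/sec peak needs 5,000 × 250ms = ~1,250 cores (or queueing and rate limiting to smooth the surge)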
Token Validation
- 10B validations/day = 115K validations/sec
- JWT validation: verify signature + check expiry + decode claims = < 0.1ms (no network call)
- With opaque tokens: Redis lookup = ~1ms per validation
Authorization
- 1B checks/day = 11.5K checks/sec
- Each check: look up user’s roles/permissions, evaluate policy rules
Storage
- User accounts: 500M × 2KB = 1TB
- Sessions: 200M active sessions × 500 bytes = 100GB (fits in Redis)
- Roles/permissions: 1000 roles × 100 permissions = 100K entries — trivial
- Audit logs: 50M logins/day × 500 bytes = 25GB/day, 9TB/year
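The arithmetic above can be reproduced with a short script. This is only a back-of-envelope sketch: every input is one of the estimates stated in this section, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Inputs copied from the estimates above.
logins_per_day = 50_000_000
peak_logins_per_sec = 5_000
hash_seconds = 0.250                    # Argon2id CPU time per attempt
validations_per_day = 10_000_000_000
authz_checks_per_day = 1_000_000_000

# Request rates.
avg_logins_per_sec = logins_per_day / SECONDS_PER_DAY
validations_per_sec = validations_per_day / SECONDS_PER_DAY
authz_per_sec = authz_checks_per_day / SECONDS_PER_DAY

# Cores needed = CPU-seconds of hashing generated per wall-clock second.
avg_hash_cores = avg_logins_per_sec * hash_seconds
peak_hash_cores = peak_logins_per_sec * hash_seconds

# Storage (decimal units: 1 KB = 1,000 bytes).
user_storage_tb = 500_000_000 * 2_000 / 1e12
session_storage_gb = 200_000_000 * 500 / 1e9
audit_gb_per_day = logins_per_day * 500 / 1e9
audit_tb_per_year = audit_gb_per_day * 365 / 1e3

print(f"logins/sec (avg):      {avg_logins_per_sec:,.0f}")
print(f"hash cores (avg/peak): {avg_hash_cores:,.0f} / {peak_hash_cores:,.0f}")
print(f"validations/sec:       {validations_per_sec:,.0f}")
print(f"authz checks/sec:      {authz_per_sec:,.0f}")
print(f"users: {user_storage_tb} TB, sessions: {session_storage_gb} GB")
print(f"audit: {audit_gb_per_day} GB/day, {audit_tb_per_year:.1f} TB/year")
```

The peak figure is the one that sizes the hashing worker pool; the average only bounds steady-state cost.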
Key Insight
Token validation is the hottest path (115K/sec). It MUST be local (no network call) → JWT with local signature verification. Password hashing is CPU-intensive → dedicated worker pool with rate limiting. Authorization checks need low latency → cache policies locally, evaluate in-process.
3. API Design (3 min)
Authentication
// Email/password login
POST /auth/login
Body: { "email": "user@example.com", "password": "..." }
Response 200: {
"access_token": "eyJ...", // JWT, short-lived (15 min)
"refresh_token": "rt_abc123...", // opaque, long-lived (30 days)
"token_type": "Bearer",
"expires_in": 900,
"user": { "id": "u_123", "email": "...", "roles": ["admin"] }
}
Response 401: { "error": "invalid_credentials" }
Response 429: { "error": "too_many_attempts", "retry_after": 300 }
// MFA challenge (returned when MFA is enabled)
Response 200: {
"mfa_required": true,
"mfa_token": "mfa_xyz...", // temporary token for MFA flow
"mfa_methods": ["totp", "webauthn"]
}
POST /auth/mfa/verify
Body: { "mfa_token": "mfa_xyz...", "method": "totp", "code": "123456" }
Response 200: { "access_token": "eyJ...", "refresh_token": "rt_..." }
// OAuth 2.0 / OIDC
GET /auth/oauth/authorize?provider=google&redirect_uri=...&state=...
→ Redirects to Google's consent screen
GET /auth/oauth/callback?code=...&state=...
→ Exchanges code for tokens, creates/links account, returns our tokens
// Token refresh
POST /auth/token/refresh
Body: { "refresh_token": "rt_abc123..." }
Response 200: { "access_token": "eyJ...", "refresh_token": "rt_new..." }
// Refresh token rotation: old refresh token invalidated
// Logout
POST /auth/logout
Body: { "refresh_token": "rt_abc123..." }
→ Revokes refresh token and associated sessions
POST /auth/logout-all
→ Revokes ALL refresh tokens and sessions for the user
Authorization
// Check permission (called by API gateway or services)
POST /authz/check
Body: {
"subject": "u_123",
"action": "documents:write",
"resource": "doc_456",
"context": { "ip": "10.0.1.1", "time": "2026-02-22T14:00:00Z" }
}
Response 200: { "allowed": true, "reason": "role:editor grants documents:write" }
// Batch check (multiple permissions at once)
POST /authz/check-batch
Body: {
"subject": "u_123",
"checks": [
{ "action": "documents:write", "resource": "doc_456" },
{ "action": "documents:delete", "resource": "doc_456" },
{ "action": "admin:users:list" }
]
}
Response 200: {
"results": [
{ "allowed": true },
{ "allowed": false, "reason": "no permission: documents:delete" },
{ "allowed": false, "reason": "role:editor does not include admin:*" }
]
}
Key Decisions
- JWT for access tokens: Validated locally without a network call. Contains user ID, roles, and scopes. Short-lived (15 min) to limit blast radius of token theft.
- Opaque refresh tokens: Stored server-side. Enables revocation. Rotated on each use (one-time use tokens to detect token theft).
- Separate auth and authz endpoints: Authentication (who are you?) and authorization (can you do this?) are independently scalable concerns.
4. Data Model (3 min)
Users (PostgreSQL)
Table: users
user_id (PK) | uuid
email (UQ) | varchar(255) -- encrypted at rest
email_hash (UQ) | varchar(64) -- for lookups without decrypting
password_hash | varchar(255) -- Argon2id hash
mfa_enabled | boolean
mfa_secret | bytea (encrypted) -- TOTP secret
status | enum('active', 'locked', 'suspended', 'deleted')
failed_login_count | int
locked_until | timestamp
created_at | timestamp
updated_at | timestamp
Table: oauth_accounts
id (PK) | uuid
user_id (FK) | uuid
provider | varchar(50) -- google, github, etc.
provider_user_id | varchar(255)
access_token | bytea (encrypted)
refresh_token | bytea (encrypted)
UNIQUE(provider, provider_user_id)
Table: webauthn_credentials
credential_id (PK) | bytea
user_id (FK) | uuid
public_key | bytea
sign_count | int
created_at | timestamp
Sessions and Tokens (Redis + PostgreSQL)
// Redis (fast lookups, TTL-based expiry)
Key: session:{session_id}
Value: { "user_id": "u_123", "created_at": "...", "ip": "...", "user_agent": "..." }
TTL: 30 days (or until explicit logout)
Key: refresh_token:{token_hash}
Value: { "user_id": "u_123", "session_id": "s_abc", "family": "fam_1" }
TTL: 30 days
// PostgreSQL (audit trail, "sign out everywhere")
Table: sessions
session_id (PK) | uuid
user_id (FK) | uuid
refresh_token_family | uuid -- for rotation detection
ip_address | inet
user_agent | varchar(500)
created_at | timestamp
last_used_at | timestamp
revoked_at | timestamp
Roles and Permissions (PostgreSQL)
Table: roles
role_id (PK) | uuid
name (UQ) | varchar(100) -- admin, editor, viewer
parent_role_id (FK) | uuid (nullable) -- role hierarchy
description | text
Table: permissions
permission_id (PK) | uuid
name (UQ) | varchar(200) -- documents:write, admin:users:list
description | text
Table: role_permissions
role_id (FK) | uuid
permission_id (FK) | uuid
PRIMARY KEY: (role_id, permission_id)
Table: user_roles
user_id (FK) | uuid
role_id (FK) | uuid
resource_scope | varchar(200) -- "*"=global, "org:123"=scoped to org (must be NOT NULL: PostgreSQL primary key columns cannot be null)
granted_at | timestamp
granted_by | uuid
PRIMARY KEY: (user_id, role_id, resource_scope)
Why These Choices
- PostgreSQL for users and RBAC: ACID for critical operations (user creation, role assignment). Complex queries (find all users with permission X). Mature encryption-at-rest support.
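- Redis for sessions/tokens: Sub-millisecond lookups. TTL for automatic expiry. Mass revocation needs a per-user index (for example, a set of session IDs per user_id), because the session and refresh-token keys above are not keyed by user_id and pattern scans over a cluster are slow.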
- Argon2id for passwords: Memory-hard (resists GPU/ASIC attacks), tunable (can increase memory/time as hardware improves), recommended by OWASP. Parameters: memory=64MB, iterations=3, parallelism=4.
5. High-Level Design (12 min)
Authentication Flow (Email/Password + MFA)
Client → API Gateway → Auth Service:
1. Rate limit check (Redis): user IP + email → under threshold?
→ If exceeded: return 429 (locked for 5 minutes)
2. Look up user by email_hash
→ If not found: verify against a dummy hash anyway, then return 401 (keeps response time constant to prevent user enumeration)
3. Verify password: Argon2id.verify(password, stored_hash)
→ If wrong: increment failed_login_count, return 401
→ If failed_count >= 5: lock account for 15 minutes
4. If MFA enabled:
→ Generate temporary MFA token (stored in Redis, TTL 5 min)
→ Return mfa_required response
→ Client submits MFA code → verify TOTP or WebAuthn
5. Create session:
→ Generate session_id, refresh_token
→ Store in Redis + PostgreSQL
6. Generate JWT access token:
→ Payload: { sub: user_id, roles: [...], exp: now+15min, iss: "auth-service" }
→ Sign with RS256 (RSA private key, rotated every 90 days)
7. Return tokens to client
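Step 4's TOTP check can be sketched with the standard library alone. This is a minimal illustration of RFC 6238 (HMAC-SHA1, 30-second steps, 6 digits), not the service's actual code; `verify_totp` and its `window` parameter are hypothetical names, and a real deployment would also rate limit attempts and reject reuse of an already-accepted code.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the counter, then dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                       # low nibble picks the offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, now: Optional[float] = None,
                step: int = 30, window: int = 1) -> bool:
    """Accept the code for the current time step or `window` steps either side."""
    now = time.time() if now is None else now
    counter = int(now // step)
    for drift in range(-window, window + 1):
        if counter + drift < 0:
            continue
        expected = hotp(secret, counter + drift)
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(expected.encode(), submitted.encode()):
            return True
    return False
```

The one-step window tolerates modest clock drift between the server and the authenticator app, at the cost of each code being valid for up to 90 seconds.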
Token Validation (every API call)
Client sends: Authorization: Bearer eyJ...
API Gateway (or service middleware):
1. Decode JWT header → get kid (key ID)
2. Look up public key from local cache (JWKS)
→ Cache refreshed every 5 minutes from GET /auth/.well-known/jwks.json
3. Verify signature (RS256) → tamper-proof
4. Check exp claim → not expired?
5. Check iss claim → issued by our auth service?
6. Extract claims (user_id, roles, scopes)
→ NO NETWORK CALL NEEDED — pure CPU, < 0.1ms
If token is expired:
→ Client calls /auth/token/refresh with refresh_token
→ Auth service validates refresh_token in Redis
→ Issues new access_token + new refresh_token (rotation)
→ Invalidates old refresh_token
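The refresh rotation above, including theft detection through the token "family" from the data model, might look roughly like this. It is a sketch under assumptions: plain dictionaries stand in for Redis, and `issue_refresh_token`, `rotate`, and the `used` flag are hypothetical names rather than part of the design above.

```python
import hashlib
import secrets
from typing import Dict, Optional, Set

# In-memory stand-ins for Redis (hypothetical structures).
_tokens: Dict[str, dict] = {}          # token_hash -> {"user_id", "family", "used"}
_revoked_families: Set[str] = set()


def _hash(token: str) -> str:
    # Only the hash is stored, so a leaked store does not leak usable tokens.
    return hashlib.sha256(token.encode()).hexdigest()


def issue_refresh_token(user_id: str, family: Optional[str] = None) -> str:
    token = "rt_" + secrets.token_urlsafe(32)
    family = family or "fam_" + secrets.token_hex(8)
    _tokens[_hash(token)] = {"user_id": user_id, "family": family, "used": False}
    return token


def rotate(token: str) -> Optional[str]:
    """Exchange a refresh token for a new one; return None if it is rejected."""
    record = _tokens.get(_hash(token))
    if record is None or record["family"] in _revoked_families:
        return None
    if record["used"]:
        # A one-time token was presented twice: assume theft and revoke
        # every token descended from the same login.
        _revoked_families.add(record["family"])
        return None
    record["used"] = True
    return issue_refresh_token(record["user_id"], record["family"])
```

Keeping a spent token's record (marked used) instead of deleting it is what makes reuse detectable: a second presentation of the same token revokes the whole family, logging out both the attacker and the legitimate client.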
Authorization Check
Service or API Gateway → Authorization Service:
1. Load user's roles (cached in Redis, TTL 5 min):
→ Direct roles + inherited roles (role hierarchy)
→ e.g., admin inherits editor inherits viewer
2. Load role's permissions (cached, rarely changes)
3. Evaluate RBAC:
→ Do any of the user's roles grant the requested permission?
4. If ABAC policies exist:
→ Evaluate attribute-based rules:
"allow documents:write IF user.department == resource.department
AND time.hour BETWEEN 9 AND 17"
5. Return allow/deny with reason
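A compact in-process version of the five steps above could look like the following. The role data, the ABAC rule, and all function names are hypothetical sample values chosen to mirror the examples in this section; a production system would load them from the database or a policy engine.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical sample data: each role inherits from at most one other role.
INHERITS_FROM: Dict[str, Optional[str]] = {
    "admin": "editor", "editor": "viewer", "viewer": None,
}
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"documents:read"},
    "editor": {"documents:write"},
    "admin": {"admin:users:list"},
}


def effective_roles(direct_roles: Iterable[str]) -> Set[str]:
    """Direct roles plus everything reachable through the hierarchy."""
    seen: Set[str] = set()
    for role in direct_roles:
        current: Optional[str] = role
        while current is not None and current not in seen:
            seen.add(current)
            current = INHERITS_FROM.get(current)
    return seen


def same_department_business_hours(user: dict, resource: dict, context: dict) -> bool:
    # Mirrors the sample rule: same department AND hour between 9 and 17 inclusive.
    return (user["department"] == resource["department"]
            and 9 <= context["hour"] <= 17)


ABAC_RULES: Dict[str, List[Callable[[dict, dict, dict], bool]]] = {
    "documents:write": [same_department_business_hours],
}


def check(user: dict, action: str, resource: dict, context: dict) -> Tuple[bool, str]:
    roles = effective_roles(user["roles"])
    granting = sorted(r for r in roles if action in ROLE_PERMISSIONS.get(r, set()))
    if not granting:
        return False, f"no permission: {action}"
    for rule in ABAC_RULES.get(action, []):
        if not rule(user, resource, context):
            return False, f"denied by policy: {rule.__name__}"
    return True, f"role:{granting[0]} grants {action}"
```

RBAC runs first because it is the cheaper test; ABAC rules only narrow a grant that a role has already given.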
Components
- Auth Service: Handles login, registration, password reset, MFA, OAuth flows. Stateless (all state in DB/Redis). Horizontally scaled.
- Token Service: Issues and refreshes JWTs. Manages signing keys. Publishes JWKS endpoint. Key rotation every 90 days.
- Authorization Service: Evaluates RBAC/ABAC policies. Caches user roles and permissions locally. Policy engine (OPA or custom).
- Session Store (Redis Cluster): Active sessions and refresh tokens. Supports mass revocation.
- User Database (PostgreSQL): User accounts, credentials, OAuth links, roles, permissions. Encrypted at rest.
- Audit Log Service: Records all auth events (logins, failures, permission changes). Immutable append-only log. Required for compliance.
- Brute Force Protection: Rate limiting (Redis), account lockout, CAPTCHA integration, IP reputation.
6. Deep Dives (15 min)
Deep Dive 1: JWT vs Opaque Tokens — Trade-offs and Revocation
JWT (JSON Web Token):
Structure: header.payload.signature (Base64URL encoded)
Header: { "alg": "RS256", "kid": "key-2026-02" }
Payload: {
"sub": "u_123",
"email": "user@example.com",
"roles": ["editor"],
"scopes": ["read", "write"],
"iss": "auth.company.com",
"aud": "api.company.com",
"iat": 1708617600,
"exp": 1708618500 // 15 minutes from iat
}
Signature: RS256(header + "." + payload, private_key)
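To make the header.payload.signature structure concrete, here is a standard-library sketch of signing and validating a token. One deliberate substitution: it uses HS256 (HMAC-SHA256 with a shared key) because Python's standard library has no RSA signing, whereas the design here uses RS256 with a public/private key pair. The encoding and the claim checks are the same; only the signature primitive differs. Function names are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(payload: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}   # HS256 stand-in for RS256
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_jwt(token: str, key: bytes, issuer: str,
               now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        if not isinstance(header, dict) or not isinstance(payload, dict):
            raise ValueError("header and payload must be JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":          # pin the algorithm server-side
        raise ValueError("unexpected algorithm")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    now = time.time() if now is None else now
    if payload.get("exp", 0) <= now:
        raise ValueError("token expired")
    if payload.get("iss") != issuer:
        raise ValueError("wrong issuer")
    return payload
```

Pinning the expected algorithm in the verifier, instead of trusting the token's own `alg` header, closes off the well-known algorithm-confusion attacks.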
Pros of JWT:
- Validation requires no network call (just verify signature with public key)
- Self-contained: carries user claims, reducing DB lookups
- Stateless: validation needs no shared state, so read capacity scales horizontally with the number of validators
Cons of JWT:
- Cannot be revoked before expiry — this is the big problem
- If a token is stolen, it’s valid for up to 15 minutes
- Payload is not encrypted (only signed) — don’t put sensitive data in it
Revocation strategies:
Strategy 1: Short-lived tokens + refresh (our primary approach)
Access token TTL: 15 minutes (short enough that revocation is rarely needed)
Refresh token: stored in Redis, can be revoked instantly
Timeline of a compromised token:
T+0: Token stolen
T+5: User notices, clicks "sign out everywhere"
T+5: All refresh tokens revoked in Redis
T+15: Stolen access token expires naturally
→ Maximum exposure: 15 minutes (acceptable for most use cases)
Strategy 2: Token blocklist (for high-security scenarios)
On revocation:
→ Add token's jti (JWT ID) to a Redis blocklist
→ TTL = remaining token lifetime (max 15 min)
On validation:
→ After verifying signature: check if jti is in blocklist
→ Adds 1 Redis lookup (~1ms) to every API call
Blocklist size: at most (revocations per 15 min) entries
→ Typically < 1000 entries — trivially small
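A minimal sketch of the blocklist follows, with a dictionary standing in for Redis and expiry emulated by hand. The class and method names are hypothetical; in Redis the same effect would come from setting a key with a TTL equal to the token's remaining lifetime.

```python
import time
from typing import Dict, Optional


class JtiBlocklist:
    """In-memory stand-in for the Redis blocklist: jti -> token expiry time."""

    def __init__(self) -> None:
        self._entries: Dict[str, float] = {}

    def revoke(self, jti: str, token_exp: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if token_exp > now:                  # expired tokens need no entry
            self._entries[jti] = token_exp   # entry lives as long as the token

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._entries.get(jti)
        if expiry is None:
            return False
        if expiry <= now:                    # emulate Redis TTL expiry
            del self._entries[jti]
            return False
        return True
```

Because each entry disappears when its token would have expired anyway, the blocklist never grows beyond the revocations made within one access-token lifetime.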
Strategy 3: Token versioning
User record has a token_version counter.
JWT payload includes token_version.
On "sign out everywhere": increment token_version.
On validation: compare JWT's token_version with DB/cache.
→ Stale version → reject token
Requires 1 cache lookup per validation, but only for the user's version counter.
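Token versioning reduces to a single integer comparison per request. The sketch below assumes an in-memory dictionary in place of the cached counter, and the helper names are illustrative.

```python
from typing import Dict

# Hypothetical stand-in for the cached per-user counter.
token_versions: Dict[str, int] = {}


def claims_for(user_id: str) -> dict:
    """Claims embedded in a newly issued JWT."""
    return {"sub": user_id, "token_version": token_versions.setdefault(user_id, 1)}


def sign_out_everywhere(user_id: str) -> None:
    # Bumping the counter invalidates every token issued before this call.
    token_versions[user_id] = token_versions.get(user_id, 1) + 1


def is_current(claims: dict) -> bool:
    current = token_versions.get(claims.get("sub", ""))
    return current is not None and claims.get("token_version") == current
```

Unlike a blocklist, this revokes all of a user's tokens at once and cannot target a single token.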
Our recommendation: Short-lived JWT (15 min) + refresh token rotation. Add blocklist only for admin/financial applications where 15-minute exposure is unacceptable.
Deep Dive 2: Password Hashing and Credential Stuffing Protection
Password hashing with Argon2id:
Why Argon2id (not bcrypt)?
- Memory-hard: requires 64MB per hash attempt
- GPU/ASIC resistant: memory bandwidth is the bottleneck, not compute
- Bcrypt uses ~4KB memory → easily parallelized on GPUs
- Argon2id at 64MB × 1000 parallel attempts = 64GB GPU memory needed
Parameters (RFC 9106's low-memory recommended profile; OWASP's published minimums are lower):
- Memory: 64MB (m=65536)
- Iterations: 3 (t=3)
- Parallelism: 4 threads (p=4)
- Salt: 16 bytes random (per user)
- Output: 32 bytes
Resulting hash time: ~250ms per attempt on a single core
→ Attacker with 1000 GPUs (rough estimate): ~$100 per 1 billion attempts
→ A strong password (80+ bits entropy) needs ~2^80 ≈ 1.2 × 10^24 guesses, on the order of $10^17 at that rate: untouchable
Upgrade path: when a user logs in with a bcrypt password:
→ Verify against old bcrypt hash
→ Re-hash with Argon2id
→ Update stored hash (transparent to user)
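The upgrade-on-login path can be sketched as below. Important caveat: neither bcrypt nor Argon2id is in Python's standard library, so this example substitutes PBKDF2 for the legacy algorithm and scrypt for the new one to stay runnable. The `scheme$salt$hash` storage format and all function names are assumptions made for illustration; the control flow is the point.

```python
import hashlib
import hmac
import os
from typing import Tuple


def _derive(scheme: str, password: str, salt: bytes) -> bytes:
    if scheme == "legacy":    # stand-in for bcrypt
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    if scheme == "current":   # stand-in for Argon2id (scrypt is also memory-hard)
        return hashlib.scrypt(password.encode(), salt=salt,
                              n=2 ** 14, r=8, p=1, dklen=32)
    raise ValueError(f"unknown scheme: {scheme}")


def hash_password(password: str, scheme: str = "current") -> str:
    salt = os.urandom(16)     # fresh random salt per user
    return f"{scheme}${salt.hex()}${_derive(scheme, password, salt).hex()}"


def verify_and_upgrade(password: str, stored: str) -> Tuple[bool, str]:
    """Return (password_ok, hash_to_store); the hash changes only on upgrade."""
    scheme, salt_hex, digest_hex = stored.split("$")
    candidate = _derive(scheme, password, bytes.fromhex(salt_hex))
    ok = hmac.compare_digest(candidate, bytes.fromhex(digest_hex))
    if ok and scheme != "current":
        # The plaintext is only available at login, so this is the one
        # moment the hash can be migrated without a password reset.
        stored = hash_password(password)
    return ok, stored
```

The caller persists the returned hash whenever it differs from the stored one. Accounts that never log in keep their legacy hash, so a cutoff date with a forced reset is usually needed to finish the migration.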
Credential stuffing protection:
Attackers use leaked username/password databases from other breaches to try logging in.
Layer 1: Rate limiting per IP (Redis)
→ 10 failed attempts per IP per 5 minutes → CAPTCHA required
→ 50 failed attempts per IP per hour → block IP for 1 hour
Implementation: INCR login:fail:ip:{ip}:{5min_bucket}
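The bucketed counter in Layer 1 is a fixed-window rate limiter. A rough sketch, with a dictionary standing in for Redis and only the CAPTCHA threshold shown; the function names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict

WINDOW_SECONDS = 5 * 60
CAPTCHA_THRESHOLD = 10

_counters: DefaultDict[str, int] = defaultdict(int)   # stand-in for Redis


def _bucket_key(ip: str, now: float) -> str:
    # Same key layout as above: one counter per IP per 5-minute bucket.
    return f"login:fail:ip:{ip}:{int(now // WINDOW_SECONDS)}"


def record_failure(ip: str, now: float) -> int:
    key = _bucket_key(ip, now)
    _counters[key] += 1    # in Redis: INCR key, plus EXPIRE on first increment
    return _counters[key]


def captcha_required(ip: str, now: float) -> bool:
    return _counters.get(_bucket_key(ip, now), 0) >= CAPTCHA_THRESHOLD
```

Fixed windows are simple but allow a burst of nearly twice the threshold across a bucket boundary; a sliding window or token bucket closes that gap if it matters.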
Layer 2: Rate limiting per account
→ 5 failed attempts per account per 15 minutes → account locked (15 min)
→ 20 failed attempts per account per day → account locked (24 hours, email notification)
Layer 3: Compromised password detection
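→ On registration and login: check password against the Have I Been Pwned Pwned Passwords range API
→ Uses k-anonymity: send the first 5 hex characters of SHA-1(password), receive the suffixes of all breached hashes sharing that prefix, and compare locally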
→ Never send full password to third party
→ If compromised: force password change
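The client side of the k-anonymity lookup is small enough to show in full. This sketch makes no network call: `range_response` is assumed to be the text body returned by the Pwned Passwords range endpoint for the 5-character prefix, one `SUFFIX:COUNT` pair per line. Function names are illustrative.

```python
import hashlib
from typing import Tuple


def hibp_prefix_suffix(password: str) -> Tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into (5-char prefix, 35-char suffix)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(suffix: str, range_response: str) -> int:
    """Return how many times our suffix appears in the range response (0 if absent)."""
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


prefix, suffix = hibp_prefix_suffix("password")
# Only `prefix` ever leaves the server; the comparison below happens locally.
sample_response = f"0018A45C4D1DEF81644B54AB7F969B88D65:3\n{suffix}:42\n"
print(prefix, breach_count(suffix, sample_response))
```

The third party sees only a prefix shared by many unrelated passwords, so it cannot tell which one was checked.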
Layer 4: Anomaly detection
→ Track: login time, IP, user agent, geolocation
→ Flag anomalies: new device + new country + failed MFA = high risk
→ Trigger step-up auth: "We noticed a login from a new device in Germany. Please verify."
Layer 5: Proof of Work (extreme cases)
→ During active attack: require client to solve a computational puzzle before login
→ Adds 1-2 seconds per attempt → makes large-scale attacks economically infeasible
Deep Dive 3: OAuth 2.0 / OIDC and Single Sign-On
Our system acts as both an OAuth client (consuming Google/GitHub) and an OAuth provider (SSO for internal apps).
As OAuth Client (social login):
Authorization Code Flow with PKCE:
1. Client generates code_verifier (random 43-128 chars) and
code_challenge = BASE64URL(SHA256(code_verifier)), without padding
2. Redirect to Google:
GET https://accounts.google.com/o/oauth2/v2/auth?
client_id=our_client_id&
redirect_uri=https://auth.company.com/oauth/callback&
response_type=code&
scope=openid email profile&
state={random_csrf_token}&
code_challenge={challenge}&
code_challenge_method=S256
3. User consents → Google redirects back:
GET https://auth.company.com/oauth/callback?
code=auth_code_from_google&
state={same_csrf_token}
4. Our backend exchanges code for tokens:
POST https://oauth2.googleapis.com/token
Body: { grant_type: "authorization_code", code, client_id, client_secret, redirect_uri, code_verifier }
Response: { access_token, id_token (JWT with user info), refresh_token }
5. Validate Google's id_token (signature, iss, aud, exp) → extract email, name, picture
6. Find or create user in our DB (link oauth_accounts)
7. Issue OUR tokens (JWT access + opaque refresh)
8. Redirect user to app with our tokens
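Steps 1 to 3 depend on two random values: the PKCE verifier/challenge pair and the CSRF `state`. A minimal sketch of generating and checking them follows, based on RFC 7636's S256 method, where the challenge is the unpadded base64url encoding of the SHA-256 digest. The helper names are hypothetical.

```python
import base64
import hashlib
import hmac
import secrets
from typing import Tuple


def make_code_challenge(code_verifier: str) -> str:
    """S256 method: BASE64URL(SHA256(verifier)) with the '=' padding removed."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def new_pkce_pair() -> Tuple[str, str]:
    # 32 random bytes encode to 43 URL-safe characters, the minimum length.
    code_verifier = secrets.token_urlsafe(32)
    return code_verifier, make_code_challenge(code_verifier)


def new_state() -> str:
    """CSRF token: stored with the user's session, echoed back by the provider."""
    return secrets.token_urlsafe(16)


def state_matches(expected: str, received: str) -> bool:
    return hmac.compare_digest(expected.encode(), received.encode())


verifier, challenge = new_pkce_pair()
print(len(verifier), challenge == make_code_challenge(verifier))
```

The verifier stays server-side until the token exchange in step 4; only the challenge travels through the browser, so an intercepted authorization code cannot be redeemed without it.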
As OAuth Provider (SSO for internal apps):
App A (dashboard.company.com) and App B (admin.company.com) both use our auth system.
1. User visits App A → not logged in → redirect to auth.company.com/login
2. User logs in at auth.company.com → session cookie set on auth.company.com
3. Redirect back to App A with authorization code
4. App A exchanges code for tokens → user is logged in to App A
5. User visits App B → not logged in → redirect to auth.company.com/login
6. Auth.company.com sees existing session cookie → user already authenticated!
7. Skip login, redirect back to App B with authorization code
8. App B exchanges code for tokens → user is logged in to App B
→ No password entry needed for App B — this is SSO
Key: The session cookie on auth.company.com is the "SSO session."
Logging out of auth.company.com → revoke session → all apps lose access on next token refresh.
OIDC (OpenID Connect) — identity layer on top of OAuth 2.0:
We implement these OIDC endpoints:
GET /.well-known/openid-configuration → discovery document
GET /.well-known/jwks.json → public signing keys
GET /oauth/authorize → authorization endpoint
POST /oauth/token → token endpoint
GET /oauth/userinfo → user info endpoint
id_token (JWT) issued by our system:
{
"iss": "https://auth.company.com",
"sub": "u_123",
"aud": "app_a_client_id",
"email": "user@example.com",
"name": "Alice Smith",
"roles": ["editor"],
"iat": 1708617600,
"exp": 1708618500,
"nonce": "abc123" // replay protection
}
7. Extensions (2 min)
- Passwordless authentication: Support magic links (email a one-time login URL) and passkeys (WebAuthn/FIDO2). Passkeys are phishing-resistant: the credential is bound to the domain, so a look-alike site cannot request it, and use requires biometric/PIN verification on the device.
- Step-up authentication: For sensitive operations (changing password, transferring money), require re-authentication even if the user has a valid session. Time-limited elevated session (5 min) for the sensitive action.
- Delegated authorization: Support OAuth 2.0 scopes so third-party apps can request limited access. “App X wants to read your documents but not delete them.” Fine-grained consent management.
- Audit log and compliance: Immutable audit trail of all auth events. Support GDPR data export (all data about a user) and right to deletion (anonymize user records while preserving audit integrity).
- Adaptive MFA: Use risk scoring to decide when to require MFA. Low risk (same device, same location) → skip MFA. High risk (new device, foreign IP) → require MFA. Reduces friction for legitimate users while catching attacks.