1. Requirements & Scope (5 min)

Functional Requirements

  1. Provide current weather conditions and 7-day forecasts for any location worldwide (by coordinates, city name, or ZIP code)
  2. Ingest and process data from multiple sources: weather stations (100K+), satellites, radar, and third-party NWS/ECMWF model data
  3. Support geospatial queries: “weather at 40.7128,-74.0060” and reverse geocoding: “weather in New York, NY”
  4. Severe weather alert system: tornado warnings, flood alerts, heat advisories — push notifications to affected users within minutes
  5. Provide historical weather data warehouse for trend analysis, agriculture, insurance, and research use cases

Non-Functional Requirements

  • Availability: 99.99% for the API. Weather data is safety-critical — aviation, maritime, emergency services depend on it.
  • Latency: Current conditions API < 100ms. Forecast API < 200ms. Alert delivery < 2 minutes from NWS issuance.
  • Freshness: Current conditions updated every 5-15 minutes. Forecasts updated every 6 hours (aligned with model runs). Alerts delivered in real-time.
  • Scale: 1B API requests/day across 10M registered developers. 50K concurrent data ingestion streams. 100TB historical data.
  • Accuracy: Forecast accuracy comparable to top providers. Temperature within +/- 2 degrees F for 24-hour forecasts, +/- 5 degrees for 7-day.

2. Estimation (3 min)

API Traffic

  • 1B requests/day = 11.5K requests/sec average
  • Peak: 5x during severe weather events = ~60K requests/sec
  • Breakdown: 60% current conditions, 30% forecasts, 10% historical/alerts

Data Ingestion

  • Weather stations: 100K stations reporting every 5-15 minutes = 400K-1.2M observations/hour
  • Satellite data: 10 satellite feeds, each producing ~50GB/day = 500GB/day
  • Radar: 200 radar stations, 5-minute sweeps, each ~10MB = ~58K images/day ≈ 0.6TB/day
  • NWS/ECMWF model output: 4 model runs/day × 50GB each = 200GB/day

Storage

  • Current conditions cache: 100K stations × 2KB = 200MB — fits entirely in Redis
  • Forecast grid data: global 0.25-degree grid = 1,440 × 720 ≈ 1M grid points × 7 days × 24 hours × 100 bytes = 17GB per model run → served from a tiered cache (hot grid points in memory)
  • Historical data: 100K stations × 365 days × 288 observations/day (every 5 min) × 200 bytes = 2TB/year station data + satellite/radar archives
  • Total historical warehouse: ~100TB (5 years of all sources)
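The sizing arithmetic above can be sanity-checked in a few lines (the 100-byte and 200-byte per-record figures are the assumed values from the bullets):

```python
GRID_POINTS = 1_440 * 720            # 0.25-degree global grid (~1M points)
HOURS = 7 * 24                       # 168 hourly forecast steps
BYTES_PER_POINT_HOUR = 100

forecast_per_run = GRID_POINTS * HOURS * BYTES_PER_POINT_HOUR   # ~17 GB per model run

STATIONS = 100_000
OBS_PER_DAY = 288                    # one observation every 5 minutes
BYTES_PER_OBS = 200
station_year = STATIONS * 365 * OBS_PER_DAY * BYTES_PER_OBS     # ~2.1 TB/year

current_cache = STATIONS * 2_000     # ~200 MB of current conditions — fits in Redis
```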

Key Insight

The read pattern is highly cacheable — millions of users in New York all get the same weather. The cache hit rate should be > 95%. The hard problem is ingesting, processing, and gridding heterogeneous data sources into a unified model.


3. API Design (3 min)

Current Weather

GET /v1/weather/current?lat=40.7128&lon=-74.0060
  // or: ?city=New+York&state=NY&country=US
  // or: ?zip=10001&country=US

  Response 200: {
    "location": {
      "lat": 40.7128, "lon": -74.0060,
      "city": "New York", "state": "NY", "timezone": "America/New_York"
    },
    "current": {
      "timestamp": "2026-02-22T14:30:00Z",
      "temp_f": 42, "temp_c": 5.6,
      "feels_like_f": 36, "feels_like_c": 2.2,
      "humidity": 65,
      "wind_speed_mph": 12, "wind_direction": "NW",
      "pressure_mb": 1018,
      "visibility_miles": 10,
      "uv_index": 3,
      "condition": "Partly Cloudy",
      "icon": "partly-cloudy-day"
    },
    "source": "station_KNYC",
    "updated_at": "2026-02-22T14:25:00Z"
  }

Forecast

GET /v1/weather/forecast?lat=40.7128&lon=-74.0060&days=7

  Response 200: {
    "location": { ... },
    "forecast": {
      "hourly": [
        {"time": "2026-02-22T15:00:00Z", "temp_f": 43, "precip_prob": 20, "condition": "Partly Cloudy", ...},
        {"time": "2026-02-22T16:00:00Z", "temp_f": 41, "precip_prob": 35, "condition": "Cloudy", ...},
        ...
      ],
      "daily": [
        {"date": "2026-02-22", "high_f": 45, "low_f": 32, "precip_prob": 40, "sunrise": "06:42", "sunset": "17:38", ...},
        {"date": "2026-02-23", "high_f": 38, "low_f": 28, "precip_prob": 80, "condition": "Snow", ...},
        ...
      ]
    },
    "model": "GFS",
    "model_run": "2026-02-22T06:00:00Z"
  }

Alerts

GET /v1/weather/alerts?lat=40.7128&lon=-74.0060

  Response 200: {
    "alerts": [
      {
        "alert_id": "NWS-WinterStorm-2026022201",
        "type": "Winter Storm Warning",
        "severity": "severe",
        "headline": "Heavy snow expected Saturday",
        "description": "6-10 inches of snow expected...",
        "affected_zones": ["NYZ072", "NYZ073"],
        "effective": "2026-02-23T00:00:00Z",
        "expires": "2026-02-24T06:00:00Z",
        "source": "NWS"
      }
    ]
  }

Rate Limiting

Response headers on every call:
  X-RateLimit-Limit: 1000           // per minute
  X-RateLimit-Remaining: 847
  X-RateLimit-Reset: 1708617660

Tiers:
  Free: 1,000 req/min, current + 3-day forecast
  Pro: 10,000 req/min, current + 7-day + historical
  Enterprise: 100,000 req/min, all features + SLA
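One way the per-key limits could be enforced is a continuously refilled token bucket — a sketch, not any specific gateway's API (class name and refill policy are illustrative):

```python
import time

class TokenBucket:
    """Per-API-key limiter; capacity mirrors the tier's req/min quota."""

    def __init__(self, limit_per_min, now=None):
        self.capacity = limit_per_min
        self.tokens = float(limit_per_min)
        self.rate = limit_per_min / 60.0          # tokens refilled per second
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds 429, with X-RateLimit-* headers either way
```

`self.tokens` (rounded down) is what would be surfaced as X-RateLimit-Remaining.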

4. Data Model (3 min)

Observation Data (TimescaleDB)

Table: observations (hypertable, partitioned by time)
  station_id             | varchar(20)
  observed_at            | timestamptz (partition key)
  lat                    | float
  lon                    | float
  temp_c                 | float
  humidity               | float
  wind_speed_ms          | float
  wind_direction_deg     | int
  pressure_hpa           | float
  precipitation_mm       | float
  visibility_m           | float
  condition_code         | int
  raw_data               | jsonb  -- full observation for audit

  INDEX: (station_id, observed_at DESC)
  INDEX: GIST on (lat, lon) using PostGIS  -- geospatial queries

Forecast Grid (Redis + Object Storage)

// Redis cache (active forecasts)
Key: forecast:{model}:{run_time}:{grid_lat}:{grid_lon}
Value: MessagePack-encoded hourly forecast array (168 hours = 7 days)
TTL: 12 hours (until next model run replaces it)

// Object storage (S3) for full model output
Bucket: weather-forecasts/
  Path: {model}/{run_date}/{run_hour}/grid_{lat}_{lon}.parquet

Alerts (PostgreSQL + Redis)

Table: weather_alerts
  alert_id         (PK) | varchar(100)
  source                 | varchar(20) -- NWS, Environment Canada, etc.
  type                   | varchar(100)
  severity               | enum('minor', 'moderate', 'severe', 'extreme')
  headline               | text
  description            | text
  affected_zones         | text[] -- zone IDs
  affected_polygon       | geometry (PostGIS) -- geofence for spatial queries
  effective_at           | timestamptz
  expires_at             | timestamptz
  created_at             | timestamptz

// Redis (fast geospatial lookup for active alerts)
Key: active_alerts
Type: Geo set with alert_ids
  → GEOSEARCH active_alerts FROMLONLAT -74.006 40.713 BYRADIUS 50 km

Why These Choices

  • TimescaleDB for observations: Time-series optimized, automatic partitioning, efficient range queries (“all observations for station X in the last 24 hours”). PostGIS extension for spatial queries.
  • Redis for current conditions and forecasts: Sub-millisecond reads. 95%+ of API calls are cache hits. Geo commands for nearest-station lookups.
  • S3/Parquet for historical and model data: Columnar format for efficient analytical queries. Cost-effective for 100TB+ storage. Query via Athena/Presto.

5. High-Level Design (12 min)

Data Ingestion Pipeline

Data Sources:
  Weather Stations (METAR, SYNOP) → MQTT / HTTP push
  Satellite Feeds (GOES, Meteosat) → FTP / direct download
  Radar Data (NEXRAD) → S3 bucket subscription (AWS)
  NWS Model Output (GFS, NAM) → NOAA data feeds

→ Ingestion Workers (per source type):
    1. Parse raw format (METAR text, GRIB2 binary, NetCDF)
    2. Validate: range checks, duplicate detection, outlier flagging
    3. Normalize to common schema (SI units, UTC timestamps)
    4. Publish to Kafka topic: raw_observations

→ Processing Pipeline (Flink):
    1. Quality control: cross-validate with nearby stations
       → If station X reports 100F but 5 neighbors report 40F → flag as bad
    2. Spatial interpolation: fill gaps between stations using IDW or Kriging
    3. Grid the data: snap observations to nearest 0.25-degree grid point
    4. Update current conditions cache (Redis)
    5. Write to TimescaleDB (historical storage)
    6. Trigger alert evaluation if thresholds exceeded
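Step 1 (cross-validation against neighbors) can be as simple as a median deviation test; the 10°C threshold here is an assumed tuning knob, not a fixed standard:

```python
from statistics import median

def qc_flag(obs_temp_c, neighbor_temps_c, max_dev_c=10.0):
    """Flag an observation whose temperature deviates from the median of
    nearby stations by more than max_dev_c (assumed threshold)."""
    if not neighbor_temps_c:
        return False  # no neighbors to compare against; pass through
    return abs(obs_temp_c - median(neighbor_temps_c)) > max_dev_c
```

The 100°F-vs-40°F example above (37.8°C vs ~4.4°C neighbors) would be flagged.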

Forecast Model Pipeline

Every 6 hours (00Z, 06Z, 12Z, 18Z UTC):
  1. NWS releases new GFS/NAM model run data
  2. Download full model output (~50GB GRIB2)
  3. Parse and extract relevant variables (temp, precip, wind, pressure)
  4. Apply post-processing:
     → Statistical downscaling (global model at 0.25° → local 0.01° for cities)
     → Bias correction using recent observations (model says 40F, actual was 42F → adjust)
     → Ensemble blending (GFS + ECMWF + NAM → weighted average for better accuracy)
  5. Generate forecast grids → write to Redis (active) + S3 (archive)
  6. Invalidate CDN caches for forecast endpoints

API Serving Path

Client → CDN (CloudFront) → API Gateway → Weather API Service

For current conditions:
  1. Geocode input (city name → lat/lon) using geocoding service (cached)
  2. Find nearest weather station: Redis GEOSEARCH by lat/lon
  3. Fetch observation from Redis cache: GET obs:{station_id}
     → Cache hit (99%): return immediately
     → Cache miss: query TimescaleDB, populate cache
  4. Return formatted response

For forecasts:
  1. Geocode → lat/lon
  2. Snap to nearest grid point (round to 0.25-degree)
  3. Fetch from Redis: GET forecast:GFS:{run}:{grid_lat}:{grid_lon}
     → Cache hit: return immediately
  4. Return formatted response with daily/hourly breakdown
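Steps 2-3 of the forecast path — grid snapping and cache-key construction — can be sketched as:

```python
GRID_STEP = 0.25  # global model resolution in degrees

def snap(coord):
    """Round a coordinate to the nearest 0.25-degree grid point."""
    return round(coord / GRID_STEP) * GRID_STEP

def forecast_key(model, run, lat, lon):
    """Redis key for step 3, matching the scheme in the data model."""
    return f"forecast:{model}:{run}:{snap(lat)}:{snap(lon)}"
```

So a request for 40.7128, -74.0060 is served from the grid point at 40.75, -74.0.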

Components

  1. Ingestion Workers: Source-specific parsers (METAR, GRIB2, BUFR). Horizontally scaled per source. Fault-tolerant with Kafka-backed retries.
  2. Processing Pipeline (Flink): Quality control, spatial interpolation, gridding. Stateful stream processing with checkpointing.
  3. Forecast Engine: Downloads and post-processes model output. Statistical downscaling and bias correction. Runs on GPU instances for large matrix operations.
  4. Weather API Service: Stateless API servers. Geocoding, station lookup, cache reads. Auto-scaled based on request rate.
  5. Cache Layer (Redis Cluster): Current conditions, forecasts, active alerts. Geo commands for spatial queries. 200MB-20GB depending on resolution.
  6. Historical Data Warehouse (TimescaleDB + S3/Athena): All raw observations and model outputs. Serves historical API and analytics.
  7. Alert Service: Monitors NWS/international alert feeds. Matches alerts to user locations. Pushes notifications via FCM/APNs/email.
  8. CDN (CloudFront): Caches API responses by location. TTL: 5 min for current, 30 min for forecasts. Handles 90%+ of traffic.

6. Deep Dives (15 min)

Deep Dive 1: Geospatial Queries and Nearest-Station Lookup

The problem: Given a user’s coordinates (40.7128, -74.0060), find the nearest weather station with a recent observation. There are 100K stations worldwide, unevenly distributed (dense in cities, sparse in oceans/deserts).

Approach 1: Redis Geo (our primary approach)

// At ingestion time: register station location
GEOADD stations -74.006 40.713 "KNYC"
GEOADD stations -73.869 40.777 "KLGA"
GEOADD stations -74.169 40.683 "KEWR"

// At query time: find nearest stations
GEOSEARCH stations FROMLONLAT -74.006 40.713 BYRADIUS 50 km COUNT 5 ASC
→ Returns: ["KNYC" (0.1km), "KEWR" (15.2km), "KLGA" (17.8km), ...]

// Pick the nearest station with a valid recent observation
for station in results:
    obs = GET obs:{station}
    if obs.age < 30 minutes:
        return obs

Complexity: Redis documents GEOSEARCH as O(N + log(M)), where N is the number of elements in the grid-aligned bounding box around the search area and M is the number of items inside the shape. With BYRADIUS 50 km and COUNT 5, it's effectively constant time for our use case.
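The selection loop above, written as a testable function — fetch_obs stands in for a Redis GET obs:{station_id}, and the observation shape (a dict with an observed_at unix timestamp) is an assumption:

```python
import time

FRESHNESS_SECONDS = 30 * 60  # 30-minute staleness cutoff, per the loop above

def pick_fresh_station(ranked_stations, fetch_obs, now=None):
    """Given station IDs ordered by distance (e.g. GEOSEARCH ... ASC),
    return the first observation younger than 30 minutes, else None.
    fetch_obs(station_id) -> observation dict or None."""
    now = time.time() if now is None else now
    for station_id in ranked_stations:
        obs = fetch_obs(station_id)  # in production: redis GET obs:{station_id}
        if obs and now - obs["observed_at"] < FRESHNESS_SECONDS:
            return obs
    return None
```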

Approach 2: Geohash-based bucketing (for extreme scale)

Divide the world into geohash cells (precision 4 = ~40km cells):
  Geohash("40.7128, -74.0060") = "dr5r"

Store: stations_by_geohash:dr5r = ["KNYC", "KLGA", ...]

Lookup:
  1. Compute geohash for query location
  2. Get stations in that cell + 8 neighboring cells (handles edge cases)
  3. Calculate exact distance to each candidate
  4. Return nearest

Advantage: works with any key-value store, no geo commands needed.
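For illustration, a from-scratch geohash encoder — the standard bit-interleaving over the base-32 alphabet; precision 4 yields the ~40km cells mentioned above:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_encode(lat, lon, precision=4):
    """Encode a lat/lon to a geohash by alternately bisecting longitude
    and latitude ranges, 5 bits per output character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # even bit positions refine longitude, odd refine latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            if lon >= mid: lon_lo = mid
            else: lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            if lat >= mid: lat_lo = mid
            else: lat_hi = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for b in bits[i:i + 5]:
            value = value * 2 + b
        chars.append(BASE32[value])
    return "".join(chars)
```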

Interpolation for locations far from any station:

If nearest station is > 50km away:
  → Use Inverse Distance Weighting (IDW) from 3-5 nearest stations:
    T(query) = Σ(wi × Ti) / Σ(wi)
    where wi = 1 / distance(query, station_i)^2

  → Accounts for elevation difference:
    T_adjusted = T_interpolated - (lapse_rate × elevation_diff)
    lapse_rate ≈ 6.5°C per 1000m

This is how we provide weather for locations without a nearby station
(mountains, rural areas, open ocean).
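The IDW formula and lapse-rate correction transcribed directly (distance/temperature pairs are caller-supplied; 6.5°C per 1000m is the standard lapse rate cited above):

```python
def idw_temp(samples, power=2):
    """Inverse Distance Weighting: samples is a list of (distance_km, temp_c)
    for the 3-5 nearest stations. T = sum(w_i * T_i) / sum(w_i), w_i = 1/d^power."""
    num = den = 0.0
    for dist, temp in samples:
        w = 1.0 / (dist ** power)
        num += w * temp
        den += w
    return num / den

def elevation_adjust(temp_c, elevation_diff_m, lapse_rate_c_per_km=6.5):
    """Correct an interpolated temperature for the elevation difference
    between the query point and the stations."""
    return temp_c - lapse_rate_c_per_km * (elevation_diff_m / 1000.0)
```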

Deep Dive 2: Forecast Model Serving and Caching

The challenge: A global 0.25-degree grid has 1M grid points. Each grid point has 168 hourly forecasts (7 days). Total forecast data: 1M × 168 × 100 bytes = 16.8GB per model run. This is too large for naive Redis caching but critical for low-latency serving.

Tiered caching strategy:

Tier 1: CDN (CloudFront) — 90% of requests
  → Cache key: /forecast/{lat_rounded}/{lon_rounded}
  → TTL: 30 minutes (forecasts don't change between model runs)
  → 90% hit rate (top 10K locations cover 90% of queries)
  → Response time: < 20ms (edge location)

Tier 2: Redis — 9% of requests (CDN miss)
  → Only cache "hot" grid points (requested in last hour)
  → Lazy loading: first request populates cache
  → 50K hot grid points × 16KB = 800MB — fits easily
  → Response time: < 5ms

Tier 3: S3 + in-memory grid file — 1% of requests (Redis miss)
  → Full model output stored as Parquet files in S3
  → API server holds a memory-mapped grid index
  → Index: grid_point → S3 offset for direct range read
  → Response time: < 100ms (S3 GET with byte range)

Model run transitions:

New model run available (e.g., 12Z GFS):
  1. Download and process new model output (takes 30-60 minutes)
  2. Write new forecasts to Redis with NEW run time key:
     forecast:GFS:2026022212:{grid} (new)
     forecast:GFS:2026022206:{grid} (old, still serving)
  3. Atomic switch: update "current run" pointer in Redis
     SET forecast:GFS:current_run "2026022212"
  4. API servers read current_run pointer → serve new data
  5. Old run data expires naturally (TTL 12 hours)

This is zero-downtime model switching. No request ever sees partial new data.
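A toy model of the pointer switch — a plain dict stands in for Redis here, but the key scheme matches steps 2-4 above:

```python
store = {}  # stands in for Redis; in production these are SET/GET calls

def publish_run(model, run_id, grids):
    """Write all grid forecasts under the new run's keys, then flip the pointer."""
    for grid, forecast in grids.items():
        store[f"forecast:{model}:{run_id}:{grid}"] = forecast  # step 2
    store[f"forecast:{model}:current_run"] = run_id            # step 3: atomic switch

def read_forecast(model, grid):
    """API servers always dereference the pointer first (step 4)."""
    run_id = store[f"forecast:{model}:current_run"]
    return store[f"forecast:{model}:{run_id}:{grid}"]
```

Because the pointer flips only after every grid key is written, a reader sees either the complete old run or the complete new run, never a mix.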

Multi-model ensemble:

For each grid point, blend multiple models:
  GFS (global, 0.25°, good for 5-7 day)
  NAM (North America, 0.1°, good for 1-3 day)
  ECMWF (global, 0.1°, best overall but expensive)
  HRRR (US only, 3km, best for 0-18 hours)

Blending weights (vary by lead time):
  0-6 hours:   HRRR: 0.5, NAM: 0.3, GFS: 0.1, ECMWF: 0.1
  6-24 hours:  NAM: 0.4, ECMWF: 0.3, GFS: 0.2, HRRR: 0.1
  1-3 days:    ECMWF: 0.5, GFS: 0.3, NAM: 0.2
  3-7 days:    ECMWF: 0.6, GFS: 0.4

Weights are tuned using historical forecast verification:
  → Compare each model's 30-day forecast vs actual observations
  → Models with lower recent error get higher weights
  → Retune weights weekly
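The lead-time-dependent blend reduces to a table lookup plus a weighted sum — weights copied from the table above; in production they would come from the weekly retuning:

```python
BLEND_WEIGHTS = [  # (max lead hours, model weights), from the table above
    (6,   {"HRRR": 0.5, "NAM": 0.3, "GFS": 0.1, "ECMWF": 0.1}),
    (24,  {"NAM": 0.4, "ECMWF": 0.3, "GFS": 0.2, "HRRR": 0.1}),
    (72,  {"ECMWF": 0.5, "GFS": 0.3, "NAM": 0.2}),
    (168, {"ECMWF": 0.6, "GFS": 0.4}),
]

def blend(lead_hours, model_values):
    """Weighted average of per-model forecast values for one grid point."""
    for max_lead, weights in BLEND_WEIGHTS:
        if lead_hours <= max_lead:
            return sum(w * model_values[m] for m, w in weights.items())
    raise ValueError("lead time beyond the 7-day horizon")
```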

Deep Dive 3: Severe Weather Alert System

Requirements: When the NWS issues a tornado warning, every affected user must be notified within 2 minutes. Lives depend on this.

Alert ingestion and matching:

NWS Alert Feed → CAP (Common Alerting Protocol) XML
  → Poll every 30 seconds (or WebSocket subscription where available)
  → Parse alert: type, severity, affected polygon (geofence), effective/expires times

Alert polygon example (tornado warning):
  POLYGON((-97.5 35.2, -97.3 35.2, -97.3 35.0, -97.5 35.0, -97.5 35.2))

User alert subscriptions:
  Table: alert_subscriptions
    user_id     | uuid
    location    | geometry (point)
    alert_types | text[]  -- ['tornado', 'flood', 'winter_storm']
    channels    | text[]  -- ['push', 'email', 'sms']

Matching (spatial join):
  SELECT user_id, channels
  FROM alert_subscriptions
  WHERE ST_Contains(alert_polygon, location)
    AND alert_type = ANY(alert_types);

For 10M subscribed users, this spatial query could be slow. Optimization:

Pre-index: bucket users by NWS zone (there are ~3,500 forecast zones in the US)
  Map each user's location → NWS zone at subscription time
  Store: zone_subscribers:{zone_id} = [user_ids]

NWS alerts reference zones: "affected_zones": ["OKC001", "OKC002"]

Fast path:
  For each affected zone:
    users = SMEMBERS zone_subscribers:{zone_id}
    → Typically 500-50K users per zone
  → Enqueue notifications for all matched users

This replaces a spatial query across 10M users with a set lookup across 3,500 zones.
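The zone fan-out in miniature — a dict of sets stands in for the Redis zone_subscribers:{zone_id} sets (SADD at subscription time, SMEMBERS at alert time):

```python
from collections import defaultdict

zone_subscribers = defaultdict(set)  # stands in for Redis sets keyed by zone

def subscribe(user_id, zone_id):
    """At subscription time: map the user's location to its NWS zone once."""
    zone_subscribers[zone_id].add(user_id)

def fan_out(affected_zones):
    """At alert time: union the subscriber sets of the affected zones.
    Replaces a spatial join over 10M rows with a few set lookups."""
    matched = set()
    for zone in affected_zones:
        matched |= zone_subscribers[zone]
    return matched
```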

Notification delivery priority:

Severity → delivery SLA:
  Extreme (tornado, tsunami):  < 60 seconds, push + SMS
  Severe (severe thunderstorm): < 2 minutes, push
  Moderate (winter storm):      < 10 minutes, push + email
  Minor (frost advisory):       < 30 minutes, email only

For extreme alerts:
  1. Skip all batching and rate limiting
  2. Use dedicated high-priority notification queue
  3. Pre-warmed connections to FCM/APNs
  4. SMS via multiple providers (Twilio + Vonage) for redundancy
  5. If push delivery fails → fall back to SMS immediately (no waiting)

7. Extensions (2 min)

  • Hyperlocal weather: Use data from personal weather stations (Weather Underground network, 250K+ stations) to provide block-level accuracy. Crowdsourced data requires aggressive quality filtering but dramatically improves urban coverage.
  • Air quality index (AQI): Ingest EPA AirNow data and satellite aerosol measurements. Combine with weather data for health-relevant forecasts: “Tomorrow’s AQI will be 150 (unhealthy for sensitive groups) due to wildfire smoke.”
  • Agricultural weather: Specialized endpoints for farming: growing degree days, soil moisture estimates, frost probability, evapotranspiration rates. Partner with USDA for crop-specific forecasts.
  • ML-based nowcasting: Use radar data and computer vision (ConvLSTM networks) to predict precipitation 0-2 hours ahead at 1km resolution — much better than NWP models for short-range.
  • Historical weather API: Enable queries like “average temperature in Chicago in July over the last 30 years” or “all days with > 2 inches of rain in Miami in 2025.” Powered by the columnar data warehouse with pre-computed aggregates.