1. Requirements & Scope (5 min)
Functional Requirements
- Provide current weather conditions and 7-day forecasts for any location worldwide (by coordinates, city name, or ZIP code)
- Ingest and process data from multiple sources: weather stations (100K+), satellites, radar, and third-party NWS/ECMWF model data
- Support geospatial queries: “weather at 40.7128,-74.0060” and reverse geocoding: “weather in New York, NY”
- Severe weather alert system: tornado warnings, flood alerts, heat advisories — push notifications to affected users within minutes
- Provide historical weather data warehouse for trend analysis, agriculture, insurance, and research use cases
Non-Functional Requirements
- Availability: 99.99% for the API. Weather data is safety-critical — aviation, maritime, emergency services depend on it.
- Latency: Current conditions API < 100ms. Forecast API < 200ms. Alert delivery < 2 minutes from NWS issuance.
- Freshness: Current conditions updated every 5-15 minutes. Forecasts updated every 6 hours (aligned with model runs). Alerts delivered in real-time.
- Scale: 1B API requests/day across 10M registered developers. 50K concurrent data ingestion streams. 100TB historical data.
- Accuracy: Forecast accuracy comparable to top providers. Temperature within +/- 2 degrees F for 24-hour forecasts, +/- 5 degrees for 7-day.
2. Estimation (3 min)
API Traffic
- 1B requests/day = 11.5K requests/sec average
- Peak: 5x during severe weather events = ~60K requests/sec
- Breakdown: 60% current conditions, 30% forecasts, 10% historical/alerts
Data Ingestion
- Weather stations: 100K stations reporting every 5-15 minutes = ~400K-1.2M observations/hour
- Satellite data: 10 satellite feeds, each producing ~50GB/day = 500GB/day
- Radar: 200 radar stations, 5-minute sweeps (288/day each), each ~10MB = ~58K images/day = ~0.6TB/day
- NWS/ECMWF model output: 4 model runs/day × 50GB each = 200GB/day
Storage
- Current conditions cache: 100K stations × 2KB = 200MB — fits entirely in Redis
- Forecast grid data: global 0.25-degree grid = 1,440 × 720 ≈ 1M grid points × 7 days × 24 hours × 100 bytes = 17GB per model run → hot subset cached in memory, full run in object storage
- Historical data: 100K stations × 365 days × 288 observations/day (every 5 min) × 200 bytes = 2TB/year station data + satellite/radar archives
- Total historical warehouse: ~100TB (5 years of all sources)
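The headline numbers above can be sanity-checked with a few lines of arithmetic; the constants below simply mirror the stated assumptions.

```python
# Back-of-envelope check of the estimates above.
avg_rps = 1_000_000_000 / 86_400                        # API requests/sec
grid_points = 1_440 * 720                               # 0.25-degree global grid
forecast_gb = grid_points * 7 * 24 * 100 / 1e9          # GB per model run
station_tb_per_year = 100_000 * 365 * 288 * 200 / 1e12  # 5-min obs, ~200 B each

print(f"avg {avg_rps:,.0f} req/s; forecast run {forecast_gb:.1f} GB; "
      f"station history {station_tb_per_year:.1f} TB/yr")
```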
Key Insight
The read pattern is highly cacheable — millions of users in New York all get the same weather. The cache hit rate should be > 95%. The hard problem is ingesting, processing, and gridding heterogeneous data sources into a unified model.
3. API Design (3 min)
Current Weather
GET /v1/weather/current?lat=40.7128&lon=-74.0060
// or: ?city=New+York&state=NY&country=US
// or: ?zip=10001&country=US
Response 200: {
"location": {
"lat": 40.7128, "lon": -74.0060,
"city": "New York", "state": "NY", "timezone": "America/New_York"
},
"current": {
"timestamp": "2026-02-22T14:30:00Z",
"temp_f": 42, "temp_c": 5.6,
"feels_like_f": 36, "feels_like_c": 2.2,
"humidity": 65,
"wind_speed_mph": 12, "wind_direction": "NW",
"pressure_mb": 1018,
"visibility_miles": 10,
"uv_index": 3,
"condition": "Partly Cloudy",
"icon": "partly-cloudy-day"
},
"source": "station_KNYC",
"updated_at": "2026-02-22T14:25:00Z"
}
Forecast
GET /v1/weather/forecast?lat=40.7128&lon=-74.0060&days=7
Response 200: {
"location": { ... },
"forecast": {
"hourly": [
{"time": "2026-02-22T15:00:00Z", "temp_f": 43, "precip_prob": 20, "condition": "Partly Cloudy", ...},
{"time": "2026-02-22T16:00:00Z", "temp_f": 41, "precip_prob": 35, "condition": "Cloudy", ...},
...
],
"daily": [
{"date": "2026-02-22", "high_f": 45, "low_f": 32, "precip_prob": 40, "sunrise": "06:42", "sunset": "17:38", ...},
{"date": "2026-02-23", "high_f": 38, "low_f": 28, "precip_prob": 80, "condition": "Snow", ...},
...
]
},
"model": "GFS",
"model_run": "2026-02-22T06:00:00Z"
}
Alerts
GET /v1/weather/alerts?lat=40.7128&lon=-74.0060
Response 200: {
"alerts": [
{
"alert_id": "NWS-WinterStorm-2026022201",
"type": "Winter Storm Warning",
"severity": "severe",
"headline": "Heavy snow expected Saturday",
"description": "6-10 inches of snow expected...",
"affected_zones": ["NYZ072", "NYZ073"],
"effective": "2026-02-23T00:00:00Z",
"expires": "2026-02-24T06:00:00Z",
"source": "NWS"
}
]
}
Rate Limiting
Response headers on every call:
X-RateLimit-Limit: 1000 // per minute
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1708617660
Tiers:
Free: 1,000 req/min, current + 3-day forecast
Pro: 10,000 req/min, current + 7-day + historical
Enterprise: 100,000 req/min, all features + SLA
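A per-key token bucket is one way to enforce these tiers and produce the headers shown above. The sketch below is in-memory and single-node for illustration; a production deployment would keep the counters in Redis so all API servers share state.

```python
import time

class TokenBucket:
    """Per-key token bucket; illustrative in-memory sketch. Production
    would store counters in Redis so all API servers share state."""

    def __init__(self, limit_per_min: int):
        self.limit = limit_per_min
        self.tokens = float(limit_per_min)
        self.updated = time.monotonic()

    def allow(self) -> tuple[bool, dict]:
        now = time.monotonic()
        # Refill proportionally to elapsed time (limit tokens per 60s)
        self.tokens = min(self.limit,
                          self.tokens + (now - self.updated) * self.limit / 60)
        self.updated = now
        allowed = self.tokens >= 1
        if allowed:
            self.tokens -= 1
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(int(self.tokens)),
        }
        return allowed, headers

bucket = TokenBucket(limit_per_min=1000)   # Free tier
ok, headers = bucket.allow()
```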
4. Data Model (3 min)
Observation Data (TimescaleDB)
Table: observations (hypertable, partitioned by time)
station_id | varchar(20)
observed_at | timestamptz (partition key)
lat | float
lon | float
temp_c | float
humidity | float
wind_speed_ms | float
wind_direction_deg | int
pressure_hpa | float
precipitation_mm | float
visibility_m | float
condition_code | int
raw_data | jsonb -- full observation for audit
INDEX: (station_id, observed_at DESC)
INDEX: GIST on (lat, lon) using PostGIS -- geospatial queries
Forecast Grid (Redis + Object Storage)
// Redis cache (active forecasts)
Key: forecast:{model}:{run_time}:{grid_lat}:{grid_lon}
Value: MessagePack-encoded hourly forecast array (168 hours = 7 days)
TTL: 12 hours (until next model run replaces it)
// Object storage (S3) for full model output
Bucket: weather-forecasts/
Path: {model}/{run_date}/{run_hour}/grid_{lat}_{lon}.parquet
Alerts (PostgreSQL + Redis)
Table: weather_alerts
alert_id (PK) | varchar(100)
source | varchar(20) -- NWS, Environment Canada, etc.
type | varchar(100)
severity | enum('minor', 'moderate', 'severe', 'extreme')
headline | text
description | text
affected_zones | text[] -- zone IDs
affected_polygon | geometry (PostGIS) -- geofence for spatial queries
effective_at | timestamptz
expires_at | timestamptz
created_at | timestamptz
// Redis (fast geospatial lookup for active alerts)
Key: active_alerts
Type: Geo set with alert_ids
→ GEOSEARCH active_alerts FROMLONLAT -74.006 40.713 BYRADIUS 50 km
Why These Choices
- TimescaleDB for observations: Time-series optimized, automatic partitioning, efficient range queries (“all observations for station X in the last 24 hours”). PostGIS extension for spatial queries.
- Redis for current conditions and forecasts: Sub-millisecond reads. 95%+ of API calls are cache hits. Geo commands for nearest-station lookups.
- S3/Parquet for historical and model data: Columnar format for efficient analytical queries. Cost-effective for 100TB+ storage. Query via Athena/Presto.
5. High-Level Design (12 min)
Data Ingestion Pipeline
Data Sources:
Weather Stations (METAR, SYNOP) → MQTT / HTTP push
Satellite Feeds (GOES, Meteosat) → FTP / direct download
Radar Data (NEXRAD) → S3 bucket subscription (AWS)
NWS Model Output (GFS, NAM) → NOAA data feeds
→ Ingestion Workers (per source type):
1. Parse raw format (METAR text, GRIB2 binary, NetCDF)
2. Validate: range checks, duplicate detection, outlier flagging
3. Normalize to common schema (SI units, UTC timestamps)
4. Publish to Kafka topic: raw_observations
→ Processing Pipeline (Flink):
1. Quality control: cross-validate with nearby stations
→ If station X reports 100F but 5 neighbors report 40F → flag as bad
2. Spatial interpolation: fill gaps between stations using IDW or Kriging
3. Grid the data: snap observations to nearest 0.25-degree grid point
4. Update current conditions cache (Redis)
5. Write to TimescaleDB (historical storage)
6. Trigger alert evaluation if thresholds exceeded
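The neighbor cross-validation in step 1 can be sketched as a median-deviation check. The 10°C threshold and the minimum-neighbor count here are illustrative assumptions, not tuned values.

```python
def flag_outlier(value_c: float, neighbor_values_c: list,
                 max_dev_c: float = 10.0) -> bool:
    """Flag an observation that deviates far from the median of nearby
    stations. Threshold and min-neighbor count are illustrative."""
    if len(neighbor_values_c) < 3:
        return False                    # too few neighbors to judge
    ordered = sorted(neighbor_values_c)
    median = ordered[len(ordered) // 2]
    return abs(value_c - median) > max_dev_c

# Station reports 38C while five neighbors sit around 4-5C -> flagged bad
print(flag_outlier(38.0, [4.0, 4.5, 5.0, 4.2, 4.8]))   # True
```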
Forecast Model Pipeline
Every 6 hours (00Z, 06Z, 12Z, 18Z UTC):
1. NWS releases new GFS/NAM model run data
2. Download full model output (~50GB GRIB2)
3. Parse and extract relevant variables (temp, precip, wind, pressure)
4. Apply post-processing:
→ Statistical downscaling (global model at 0.25° → local 0.01° for cities)
→ Bias correction using recent observations (model says 40F, actual was 42F → adjust)
→ Ensemble blending (GFS + ECMWF + NAM → weighted average for better accuracy)
5. Generate forecast grids → write to Redis (active) + S3 (archive)
6. Invalidate CDN caches for forecast endpoints
API Serving Path
Client → CDN (CloudFront) → API Gateway → Weather API Service
For current conditions:
1. Geocode input (city name → lat/lon) using geocoding service (cached)
2. Find nearest weather station: Redis GEOSEARCH by lat/lon
3. Fetch observation from Redis cache: GET obs:{station_id}
→ Cache hit (99%): return immediately
→ Cache miss: query TimescaleDB, populate cache
4. Return formatted response
For forecasts:
1. Geocode → lat/lon
2. Snap to nearest grid point (round to 0.25-degree)
3. Fetch from Redis: GET forecast:GFS:{run}:{grid_lat}:{grid_lon}
→ Cache hit: return immediately
4. Return formatted response with daily/hourly breakdown
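The grid-snapping in step 2 is just rounding to the nearest multiple of the grid resolution; the cache-key format below follows the scheme in the data model.

```python
def snap_to_grid(lat: float, lon: float, resolution: float = 0.25):
    """Round coordinates to the nearest grid point (step 2 above)."""
    return (round(lat / resolution) * resolution,
            round(lon / resolution) * resolution)

lat, lon = snap_to_grid(40.7128, -74.0060)         # (40.75, -74.0)
key = f"forecast:GFS:2026022206:{lat}:{lon}"       # Redis cache key
```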
Components
- Ingestion Workers: Source-specific parsers (METAR, GRIB2, BUFR). Horizontally scaled per source. Fault-tolerant with Kafka-backed retries.
- Processing Pipeline (Flink): Quality control, spatial interpolation, gridding. Stateful stream processing with checkpointing.
- Forecast Engine: Downloads and post-processes model output. Statistical downscaling and bias correction. Runs on GPU instances for large matrix operations.
- Weather API Service: Stateless API servers. Geocoding, station lookup, cache reads. Auto-scaled based on request rate.
- Cache Layer (Redis Cluster): Current conditions, forecasts, active alerts. Geo commands for spatial queries. 200MB-20GB depending on resolution.
- Historical Data Warehouse (TimescaleDB + S3/Athena): All raw observations and model outputs. Serves historical API and analytics.
- Alert Service: Monitors NWS/international alert feeds. Matches alerts to user locations. Pushes notifications via FCM/APNs/email.
- CDN (CloudFront): Caches API responses by location. TTL: 5 min for current, 30 min for forecasts. Handles 90%+ of traffic.
6. Deep Dives (15 min)
Deep Dive 1: Geospatial Queries and Nearest-Station Lookup
The problem: Given a user’s coordinates (40.7128, -74.0060), find the nearest weather station with a recent observation. There are 100K stations worldwide, unevenly distributed (dense in cities, sparse in oceans/deserts).
Approach 1: Redis Geo (our primary approach)
// At ingestion time: register station location
GEOADD stations -74.006 40.713 "KNYC"
GEOADD stations -73.869 40.777 "KLGA"
GEOADD stations -74.169 40.683 "KEWR"
// At query time: find nearest stations
GEOSEARCH stations FROMLONLAT -74.006 40.713 BYRADIUS 50 km COUNT 5 ASC
→ Returns: ["KNYC" (0.1km), "KEWR" (15.2km), "KLGA" (17.8km), ...]
// Pick the nearest station with a valid recent observation
for station in results:
obs = GET obs:{station}
if obs.age < 30 minutes:
return obs
Complexity: GEOSEARCH is O(N + log(M)), where N is the number of elements inside the grid-aligned bounding box around the search area and M is the number of items inside the shape. With a 50km radius and COUNT 5, it's effectively constant time for our use case.
Approach 2: Geohash-based bucketing (for extreme scale)
Divide the world into geohash cells (precision 4 = ~40km cells):
Geohash("40.7128, -74.0060") = "dr5r"
Store: stations_by_geohash:dr5r = ["KNYC", "KLGA", ...]
Lookup:
1. Compute geohash for query location
2. Get stations in that cell + 8 neighboring cells (handles edge cases)
3. Calculate exact distance to each candidate
4. Return nearest
Advantage: works with any key-value store, no geo commands needed.
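The geohash computation in step 1 is a standard bit-interleaving of longitude and latitude into base32; the sketch below reproduces the "dr5r" example above.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 4) -> str:
    """Standard geohash encoding: alternately bisect the longitude and
    latitude ranges, emitting one base32 character per 5 bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    out, bits, ch, even = [], 0, 0, True
    while len(out) < precision:
        if even:                                   # longitude bit
            mid = (lon_lo + lon_hi) / 2
            ch = (ch << 1) | (lon >= mid)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:                                      # latitude bit
            mid = (lat_lo + lat_hi) / 2
            ch = (ch << 1) | (lat >= mid)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
        bits += 1
        if bits == 5:
            out.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

print(geohash(40.7128, -74.0060))   # "dr5r", matching the example above
```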
Interpolation for locations far from any station:
If nearest station is > 50km away:
→ Use Inverse Distance Weighting (IDW) from 3-5 nearest stations:
T(query) = Σ(wi × Ti) / Σ(wi)
where wi = 1 / distance(query, station_i)^2
→ Accounts for elevation difference:
T_adjusted = T_interpolated - (lapse_rate × elevation_diff)
lapse_rate ≈ 6.5°C per 1000m
This is how we provide weather for locations without a nearby station
(mountains, rural areas, open ocean).
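The IDW formula with the lapse-rate correction can be sketched as follows. The helper's shape — taking (distance, temperature, elevation) tuples — is an illustrative assumption; each station temperature is first reduced to the query elevation, then distance-weighted.

```python
def idw_temperature(query_elev_m: float, stations: list,
                    power: float = 2.0, lapse_rate: float = 0.0065) -> float:
    """Inverse Distance Weighting with elevation correction, per the
    formulas above. `stations` holds (distance_km, temp_c, elevation_m)
    tuples; this signature is an illustrative assumption."""
    num = den = 0.0
    for dist_km, temp_c, elev_m in stations:
        # Reduce station temp to the query elevation (~6.5C per 1000m)
        adjusted = temp_c - lapse_rate * (query_elev_m - elev_m)
        w = 1.0 / dist_km ** power                 # wi = 1/d^2 by default
        num += w * adjusted
        den += w
    return num / den

# Two stations at equal distance and sea level, query also at sea level:
idw_temperature(0.0, [(10.0, 10.0, 0.0), (10.0, 12.0, 0.0)])   # 11.0
```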
Deep Dive 2: Forecast Model Serving and Caching
The challenge: A global 0.25-degree grid has 1M grid points. Each grid point has 168 hourly forecasts (7 days). Total forecast data: 1M × 168 × 100 bytes = 16.8GB per model run. This is too large for naive Redis caching but critical for low-latency serving.
Tiered caching strategy:
Tier 1: CDN (CloudFront) — 90% of requests
→ Cache key: /forecast/{lat_rounded}/{lon_rounded}
→ TTL: 30 minutes (forecasts don't change between model runs)
→ 90% hit rate (top 10K locations cover 90% of queries)
→ Response time: < 20ms (edge location)
Tier 2: Redis — 9% of requests (CDN miss)
→ Only cache "hot" grid points (requested in last hour)
→ Lazy loading: first request populates cache
→ 50K hot grid points × 16KB = 800MB — fits easily
→ Response time: < 5ms
Tier 3: S3 + in-memory grid file — 1% of requests (Redis miss)
→ Full model output stored as Parquet files in S3
→ API server holds a memory-mapped grid index
→ Index: grid_point → S3 offset for direct range read
→ Response time: < 100ms (S3 GET with byte range)
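The Tier 2 → Tier 3 fall-through (the CDN tier sits upstream of the API server) can be sketched with injected stand-ins for the Redis and S3 clients; the callable names here are illustrative, not a specific client API.

```python
def get_forecast(grid_key, redis_get, redis_set, s3_range_read):
    """Tier fall-through after a CDN miss. The three callables are
    injected stand-ins for real Redis/S3 clients."""
    data = redis_get(grid_key)
    if data is not None:
        return data                      # Tier 2: Redis hit
    data = s3_range_read(grid_key)       # Tier 3: byte-range read from S3
    redis_set(grid_key, data)            # lazy-load into the hot set
    return data

# Dict-backed fakes for illustration
hot_cache = {}
archive = {"forecast:GFS:2026022212:40.75:-74.0": b"...168h forecast..."}

value = get_forecast("forecast:GFS:2026022212:40.75:-74.0",
                     hot_cache.get, hot_cache.__setitem__,
                     archive.__getitem__)
```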
Model run transitions:
New model run available (e.g., 12Z GFS):
1. Download and process new model output (takes 30-60 minutes)
2. Write new forecasts to Redis with NEW run time key:
forecast:GFS:2026022212:{grid} (new)
forecast:GFS:2026022206:{grid} (old, still serving)
3. Atomic switch: update "current run" pointer in Redis
SET forecast:GFS:current_run "2026022212"
4. API servers read current_run pointer → serve new data
5. Old run data expires naturally (TTL 12 hours)
This is zero-downtime model switching. No request ever sees partial new data.
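Steps 2-4 above can be sketched with a dict standing in for Redis; readers always follow the pointer, so they see either the old run or the new run in full, never a mix.

```python
# Zero-downtime model-run switch; a dict stands in for Redis and the
# key names follow the scheme above.
store = {}

def publish_run(model: str, run: str, grids: dict) -> None:
    for grid, forecast in grids.items():
        store[f"forecast:{model}:{run}:{grid}"] = forecast   # step 2: write new keys
    store[f"forecast:{model}:current_run"] = run             # step 3: atomic pointer flip

def read_forecast(model: str, grid: str):
    run = store[f"forecast:{model}:current_run"]             # step 4: follow pointer
    return store[f"forecast:{model}:{run}:{grid}"]

publish_run("GFS", "2026022206", {"40.75:-74.0": "old"})
publish_run("GFS", "2026022212", {"40.75:-74.0": "new"})
print(read_forecast("GFS", "40.75:-74.0"))                   # "new"
```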
Multi-model ensemble:
For each grid point, blend multiple models:
GFS (global, 0.25°, good for 5-7 day)
NAM (North America, 0.1°, good for 1-3 day)
ECMWF (global, 0.1°, best overall but expensive)
HRRR (US only, 3km, best for 0-18 hours)
Blending weights (vary by lead time):
0-6 hours: HRRR: 0.5, NAM: 0.3, GFS: 0.1, ECMWF: 0.1
6-24 hours: NAM: 0.4, ECMWF: 0.3, GFS: 0.2, HRRR: 0.1
1-3 days: ECMWF: 0.5, GFS: 0.3, NAM: 0.2
3-7 days: ECMWF: 0.6, GFS: 0.4
Weights are tuned using historical forecast verification:
→ Compare each model's 30-day forecast vs actual observations
→ Models with lower recent error get higher weights
→ Retune weights weekly
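The blend itself is a weighted average per grid point and lead-time bucket; the table below mirrors the weights above, and the sketch renormalizes in case a model run is missing or late.

```python
# Weighted multi-model blend; weight table mirrors the one above.
WEIGHTS = {            # lead-time bucket -> {model: weight}
    "0-6h":  {"HRRR": 0.5, "NAM": 0.3, "GFS": 0.1, "ECMWF": 0.1},
    "6-24h": {"NAM": 0.4, "ECMWF": 0.3, "GFS": 0.2, "HRRR": 0.1},
    "1-3d":  {"ECMWF": 0.5, "GFS": 0.3, "NAM": 0.2},
    "3-7d":  {"ECMWF": 0.6, "GFS": 0.4},
}

def blend(bucket: str, model_temps: dict) -> float:
    """Blend per-model temperatures for one grid point and lead time,
    renormalizing the weights if a model run is missing."""
    w = {m: WEIGHTS[bucket][m] for m in model_temps if m in WEIGHTS[bucket]}
    total = sum(w.values())
    return sum(w[m] * model_temps[m] for m in w) / total

blend("3-7d", {"ECMWF": 40.0, "GFS": 45.0})   # 0.6*40 + 0.4*45 = 42.0
```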
Deep Dive 3: Severe Weather Alert System
Requirements: When the NWS issues a tornado warning, every affected user must be notified within 2 minutes. Lives depend on this.
Alert ingestion and matching:
NWS Alert Feed → CAP (Common Alerting Protocol) XML
→ Poll every 30 seconds (or WebSocket subscription where available)
→ Parse alert: type, severity, affected polygon (geofence), effective/expires times
Alert polygon example (tornado warning):
POLYGON((-97.5 35.2, -97.3 35.2, -97.3 35.0, -97.5 35.0, -97.5 35.2))
User alert subscriptions:
Table: alert_subscriptions
user_id | uuid
location | geometry (point)
alert_types | text[] -- ['tornado', 'flood', 'winter_storm']
channels | text[] -- ['push', 'email', 'sms']
Matching (spatial join):
SELECT user_id, channels
FROM alert_subscriptions
WHERE ST_Contains(alert_polygon, location)
AND alert_type = ANY(alert_types);
For 10M subscribed users, this spatial query could be slow. Optimization:
Pre-index: bucket users by NWS zone (there are ~3,500 forecast zones in the US)
Map each user's location → NWS zone at subscription time
Store: zone_subscribers:{zone_id} = [user_ids]
NWS alerts reference zones: "affected_zones": ["OKC001", "OKC002"]
Fast path:
For each affected zone:
users = SMEMBERS zone_subscribers:{zone_id}
→ Typically 500-50K users per zone
→ Enqueue notifications for all matched users
This replaces a spatial query across 10M users with a set lookup across 3,500 zones.
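The fast path can be sketched with dicts and sets standing in for the Redis structures; mapping a user's point to an NWS zone at subscription time is assumed to exist upstream.

```python
# Zone fan-out sketch: a dict of sets stands in for the Redis
# zone_subscribers:{zone_id} keys.
zone_subscribers = {
    "OKC001": {"user-1", "user-2"},
    "OKC002": {"user-2", "user-3"},
}

def match_alert(affected_zones: list) -> set:
    """Replace the 10M-row spatial join with one set lookup per zone."""
    users = set()
    for zone in affected_zones:
        users |= zone_subscribers.get(zone, set())
    return users

print(match_alert(["OKC001", "OKC002"]))   # {'user-1', 'user-2', 'user-3'}
```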
Notification delivery priority:
Severity → delivery SLA:
Extreme (tornado, tsunami): < 60 seconds, push + SMS
Severe (severe thunderstorm): < 2 minutes, push
Moderate (winter storm): < 10 minutes, push + email
Minor (frost advisory): < 30 minutes, email only
For extreme alerts:
1. Skip all batching and rate limiting
2. Use dedicated high-priority notification queue
3. Pre-warmed connections to FCM/APNs
4. SMS via multiple providers (Twilio + Vonage) for redundancy
5. If push delivery fails → fall back to SMS immediately (no waiting)
7. Extensions (2 min)
- Hyperlocal weather: Use data from personal weather stations (Weather Underground network, 250K+ stations) to provide block-level accuracy. Crowdsourced data requires aggressive quality filtering but dramatically improves urban coverage.
- Air quality index (AQI): Ingest EPA AirNow data and satellite aerosol measurements. Combine with weather data for health-relevant forecasts: “Tomorrow’s AQI will be 150 (unhealthy for sensitive groups) due to wildfire smoke.”
- Agricultural weather: Specialized endpoints for farming: growing degree days, soil moisture estimates, frost probability, evapotranspiration rates. Partner with USDA for crop-specific forecasts.
- ML-based nowcasting: Use radar data and computer vision (ConvLSTM networks) to predict precipitation 0-2 hours ahead at 1km resolution — much better than NWP models for short-range.
- Historical weather API: Enable queries like “average temperature in Chicago in July over the last 30 years” or “all days with > 2 inches of rain in Miami in 2025.” Powered by the columnar data warehouse with pre-computed aggregates.