Datadog’s pricing page has a special quality: it looks reasonable until you actually calculate what you will pay. $15 per host per month for infrastructure metrics. $1.27 per million log events indexed, on top of a per-GB ingestion charge and retention costs. APM at $31 per host. Add custom metrics, synthetics, RUM, and incident management, and a mid-sized engineering organization pays $30,000-100,000 per month.
That is the number that starts the conversation about self-hosted observability.
What Datadog Does Well
Before the case for switching, the honest case for Datadog:
It works. It requires minimal setup. The agent handles metrics, logs, traces, and profiling from a single installation. The dashboards are polished. The alert system is mature. Notebooks and incident management are genuinely good. Integrations for every cloud provider, database, and framework exist and are maintained by Datadog.
If your company can afford $50K/month for observability and your engineers’ time is worth more to you than that bill, Datadog is the correct answer. Do not switch.
The case for self-hosted observability is primarily financial. Secondary benefits exist - data ownership, no ingestion limits, ability to customize deeply - but the primary driver is cost.
The Self-Hosted Stack
The modern self-hosted observability stack has three signal layers (metrics, logs, traces) plus a shared visualization layer:
Metrics: Prometheus + Thanos. Prometheus scrapes metrics from your services and stores them locally. Thanos adds long-term storage by shipping metrics to object storage (S3, GCS), and enables global querying across multiple Prometheus instances.
Logs: Loki + Promtail. Grafana Loki is a log aggregation system designed to be cost-efficient. Unlike Elasticsearch, Loki indexes only log labels (similar to Prometheus labels), not the full log content. Storage costs are dramatically lower because logs are stored compressed in object storage. Promtail ships logs from your hosts to Loki.
Traces: Tempo + OpenTelemetry. Grafana Tempo stores distributed traces. OpenTelemetry provides standardized instrumentation - vendor-neutral SDKs for your applications that export traces to Tempo, metrics to Prometheus, and logs to Loki through a single collector.
Visualization: Grafana. Grafana connects to all three data sources and provides dashboards, alerting, and exploration. The same Grafana instance visualizes metrics, logs, and traces with correlation between them.
| Component | Datadog Equivalent | Self-Hosted |
|---|---|---|
| Infrastructure metrics | Datadog Agent | Prometheus + Node Exporter |
| Application metrics | StatsD / DogStatsD | Prometheus client libraries |
| Logs | Log Management | Loki + Promtail |
| Distributed traces | APM | Tempo + OpenTelemetry |
| Dashboards | Dashboards | Grafana |
| Alerting | Monitors | Grafana Alerting / Alertmanager |
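To make the metrics layer more concrete: Thanos components read their object storage settings from a small YAML file, passed with the --objstore.config-file flag. The following is a minimal sketch for S3, not a production configuration. The bucket name, endpoint, and region are placeholders chosen for illustration, and credentials are assumed to come from the environment (for example an IAM role) rather than from the file.

```yaml
# objstore.yml - object storage settings for Thanos (sketch)
type: S3
config:
  bucket: thanos-metrics                # hypothetical bucket name
  endpoint: s3.us-east-1.amazonaws.com  # assumption: AWS S3 in us-east-1
  region: us-east-1
```

The same file can be shared by the sidecar that uploads blocks and by the components that read them back, which is what makes querying across Prometheus instances possible.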
The Real Cost Comparison
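For a team running 50 Kubernetes pods with moderate log volume (10 GB/day ingested). Datadog bills per host rather than per pod, so the figures below assume the worst case of one billable host per pod:
Datadog:
- Infrastructure: 50 hosts x $23/month = $1,150 ($23 is the higher Enterprise-tier rate; at the $15 rate quoted earlier it would be $750)
- Logs: 300 GB/month ingested (10 GB/day x 30 days) x $0.10/GB = $30, plus indexing and retention
- APM: 50 hosts x $31 = $1,550
- Subtotal of the line items above: $2,730
- Monthly total once log indexing, retention, and add-ons such as custom metrics are included: ~$4,000-6,000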
Self-hosted:
- 2x t3.medium for Prometheus + Thanos: ~$100
- S3 for Thanos + Loki object storage: ~$30
- Loki + Tempo: ~$50 (small instances, logs stored in S3)
- Grafana instance: ~$20
- Monthly total: ~$200-300
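On infrastructure spend alone, the cost reduction is roughly 15-20x ($4,000-6,000 against $200-300 per month). For a larger team with higher log volume the absolute savings grow, because Datadog’s per-host and per-event charges scale with usage while compressed logs in object storage stay cheap.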
The Hidden Costs of Self-Hosted
The cost comparison above is accurate but incomplete.
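Engineering time. Setting up the stack takes 40-80 hours for a competent engineer, a one-time cost of $6,000-12,000 at a loaded engineering cost of $150/hour. Maintaining it, upgrading components, debugging when something breaks - call it 4-8 hours per month ongoing, or $600-1,200 per month at the same rate.
Even accounting for this, self-hosted is cheaper at most scales: in the 50-pod example, $200-300 of infrastructure plus $600-1,200 of engineering time comes to roughly $800-1,500 per month against $4,000-6,000 for Datadog. That is closer to 3-7x than 15-20x, so the time cost is real and should be in the calculation.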
Operational responsibility. When Datadog has an outage, you open a ticket. When your self-hosted stack has an outage, you page your own team. Observability infrastructure going down when you most need it (during an incident) is a real risk.
Missing features. Datadog’s error tracking, security monitoring, continuous profiler, and synthetic monitoring are genuinely good products that do not have direct open-source equivalents at the same quality level. If you rely on these, the migration path is less clear.
What the Setup Actually Looks Like
With the Prometheus Operator for Kubernetes, the base setup is declarative:
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app
    spec:
      selector:
        matchLabels:
          app: my-app
      endpoints:
        - port: metrics
          interval: 30s
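This tells Prometheus to scrape the port named metrics on the pods behind any Service labeled app: my-app, every 30 seconds. (A ServiceMonitor selects Services, not pods; to select pods directly by label, use a PodMonitor.) No agent configuration needed.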
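For the ServiceMonitor above to match anything, a Service carrying the app: my-app label and a port named metrics has to exist. A minimal sketch of such a Service follows. The port number 8080 and the assumption that the pods carry the same app: my-app label are illustrative and not taken from any particular deployment.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app          # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: my-app          # assumption: the application's pods carry this label
  ports:
    - name: metrics      # the ServiceMonitor refers to this port by name
      port: 8080         # assumption: hypothetical metrics port
      targetPort: 8080
```

Referring to the port by name rather than by number means the application can move its metrics endpoint to another port without the ServiceMonitor changing.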
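For logs, Promtail’s Kubernetes discovery finds the pods automatically; a relabel rule maps each discovered pod to its log files on the node, and pipeline stages parse the lines:
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Promtail tails files, so map each discovered pod to its log path
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
        pipeline_stages:
          # Unwrap the container runtime format first (docker: {} on Docker nodes)
          - cri: {}
          # Then parse the application's JSON log line
          - json:
              expressions:
                level: level
                message: message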
For traces, OpenTelemetry auto-instrumentation in Python, Java, and Node.js requires zero code changes - just adding the auto-instrumentation package and pointing the exporter to your Tempo endpoint.
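As an illustration of what "zero code changes" means for a Python service, the fragment below sketches the container section of a Deployment. It assumes the image already has the OpenTelemetry Python distro and OTLP exporter packages installed, that the application starts with python app.py, and that Tempo accepts OTLP over gRPC at the address shown. The image name, start command, and endpoint are all hypothetical.

```yaml
# Fragment of a Deployment (spec.template.spec), not a complete manifest
containers:
  - name: my-app
    image: registry.example.com/my-app:latest    # hypothetical image
    # opentelemetry-instrument wraps the unchanged start command
    command: ["opentelemetry-instrument", "python", "app.py"]
    env:
      - name: OTEL_SERVICE_NAME
        value: my-app
      - name: OTEL_TRACES_EXPORTER
        value: otlp
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://tempo.monitoring.svc:4317  # assumption: Tempo's OTLP gRPC receiver
```

Everything that points at the backend lives in environment variables, so switching the destination later is a deployment change rather than an application change.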
OpenTelemetry as the Foundation
The switch I would recommend regardless of where you end up: adopt OpenTelemetry for instrumentation now. OpenTelemetry is vendor-neutral. Your application emits signals (metrics, logs, traces) to an OpenTelemetry Collector, which routes them to any backend - Datadog, self-hosted, or both.
This means you can run self-hosted and Datadog in parallel during migration, and switch backends without changing application code if you ever want to change direction.
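The parallel-run setup can be sketched as a Collector configuration with one receiver and two exporters. This is a minimal sketch, assuming the contrib distribution of the Collector (which ships the datadog exporter), a Tempo service reachable at tempo:4317 without TLS, and a Datadog API key supplied through the DD_API_KEY environment variable. The host names are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317          # assumption: in-cluster Tempo, OTLP over gRPC
    tls:
      insecure: true              # assumption: no TLS inside the cluster
  datadog:
    api:
      key: ${env:DD_API_KEY}      # read from the environment, never hard-coded

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Same traces go to both backends during the migration
      exporters: [otlp/tempo, datadog]
```

Ending the migration in either direction then means deleting one entry from the exporters list; the applications keep sending OTLP to the same Collector address.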
Bottom Line
The switch from Datadog to self-hosted observability is financially compelling at most scales - a 15-20x reduction in infrastructure spend is achievable, and the gap stays wide even after engineering time is counted. The hidden costs are engineering time and operational responsibility for your own infrastructure. The stack (Prometheus + Loki + Tempo + Grafana, instrumented with OpenTelemetry) is mature and production-ready. Start with OpenTelemetry for instrumentation regardless of which backend you choose - it is the exit ramp from vendor lock-in that makes future decisions reversible.