Why Switch From Datadog to a Self-Hosted Observability Stack

Datadog’s pricing page has a special quality: it looks reasonable until you actually calculate what you will pay. Infrastructure monitoring starts at $15 per host per month ($23 on the Enterprise tier). Logs cost $0.10 per GB ingested, plus roughly $1.27 per million events indexed, depending on retention. APM is $31 per host. Add custom metrics, synthetics, RUM, and incident management, and a mid-sized engineering organization can pay $30,000-100,000 per month.

That is the number that starts the conversation about self-hosted observability.

What Datadog Does Well

Before the case for switching, the honest case for Datadog:

It works. It requires minimal setup. The agent handles metrics, logs, traces, and profiling from a single installation. The dashboards are polished. The alert system is mature. Notebooks and incident management are genuinely good. Integrations for every cloud provider, database, and framework exist and are maintained by Datadog.

If your company has $50K/month for observability and your engineers’ time is worth more to you than that money, Datadog is the correct answer. Do not switch.

The case for self-hosted observability is primarily financial. Secondary benefits exist - data ownership, no ingestion limits, ability to customize deeply - but the primary driver is cost.

The Self-Hosted Stack

The modern self-hosted observability stack has three signal layers plus a visualization layer:

Metrics: Prometheus + Thanos

Prometheus scrapes metrics from your services and stores them locally. Thanos adds long-term storage by shipping metrics to object storage (S3, GCS), and enables global querying across multiple Prometheus instances.

Logs: Loki + Promtail

Grafana Loki is a log aggregation system designed to be cost-efficient. Unlike Elasticsearch, Loki indexes only log labels (similar to Prometheus labels), not the full log content. Storage costs are dramatically lower because logs are stored compressed in object storage. Promtail ships logs from your hosts to Loki.
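To make the label-only indexing idea concrete, here is a toy sketch in Python. It is not Loki's implementation: the class name, the in-memory storage, and the sample log lines are all invented for illustration. It only shows the two-step shape of a query: narrow by labels first, then scan the raw lines of the matching streams.

```python
from collections import defaultdict


class ToyLabelIndexedStore:
    """Toy model of label-only indexing (hypothetical, not Loki's code)."""

    def __init__(self):
        # Stream key (sorted label pairs) -> raw log lines for that stream.
        self.chunks = defaultdict(list)

    def push(self, labels, line):
        # Only the label set forms the key; line content is never indexed.
        self.chunks[tuple(sorted(labels.items()))].append(line)

    def query(self, selector, contains=""):
        results = []
        # Step 1: select streams by label. A linear scan over stream keys
        # stands in for the index lookup in this toy.
        for stream, lines in self.chunks.items():
            stream_labels = dict(stream)
            if all(stream_labels.get(k) == v for k, v in selector.items()):
                # Step 2: brute-force substring scan of the selected lines.
                results.extend(line for line in lines if contains in line)
        return results


store = ToyLabelIndexedStore()
store.push({"app": "checkout", "env": "prod"}, 'level=error msg="card declined"')
store.push({"app": "checkout", "env": "prod"}, 'level=info msg="order placed"')
store.push({"app": "search", "env": "prod"}, 'level=error msg="timeout"')

errors = store.query({"app": "checkout"}, contains="level=error")
print(errors)
```

The query in the sketch corresponds loosely to the LogQL expression {app="checkout"} |= "level=error". The trade-off is that the index stays small and cheap, while searches over line content cost a scan of every selected stream, so narrow label selectors matter.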

Traces: Tempo + OpenTelemetry

Grafana Tempo stores distributed traces. OpenTelemetry provides standardized instrumentation - vendor-neutral SDKs for your applications that export traces to Tempo, metrics to Prometheus, and logs to Loki through a single collector.

Visualization: Grafana

Grafana connects to all three data sources and provides dashboards, alerting, and exploration. The same Grafana instance visualizes metrics, logs, and traces with correlation between them.

Component	Datadog Equivalent	Self-Hosted
Infrastructure metrics	Datadog Agent	Prometheus + Node Exporter
Application metrics	StatsD / DogStatsD	Prometheus client libraries
Logs	Log Management	Loki + Promtail
Distributed traces	APM	Tempo + OpenTelemetry
Dashboards	Dashboards	Grafana
Alerting	Monitors	Grafana Alerting / Alertmanager

The Real Cost Comparison

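For a team running 50 Kubernetes nodes (Datadog bills per host, not per pod) with moderate log volume (10 GB/day ingested):

Datadog:

Infrastructure: 50 hosts x $23/month (Enterprise tier) = $1,150
Logs: 10 GB/day x 30 days = 300 GB ingested x $0.10/GB = $30, plus indexing and retention
APM: 50 hosts x $31/month = $1,550
Subtotal of the line items above: $2,730
Monthly total once log indexing, retention, and custom metrics are added: ~$4,000-6,000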

Self-hosted:

2x t3.medium for Prometheus + Thanos: ~$100
S3 for Thanos + Loki object storage: ~$30
Loki + Tempo: ~$50 (small instances, logs stored in S3)
Grafana instance: ~$20
Monthly total: ~$200-300

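Comparing the two ranges, the cost reduction falls between roughly 13x ($4,000 vs. $300) and 30x ($6,000 vs. $200), or about 20x at the midpoints. For a larger team with higher log volume, the savings are proportionally larger.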
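The arithmetic behind this comparison can be reproduced with a short script. This is only a sketch using the example figures from this section; the variable names are mine, and the prices are the article's illustrative numbers rather than a current quote.

```python
# Example figures from this section (assumptions, not current list prices).
HOSTS = 50
LOG_GB_PER_DAY = 10
DAYS_PER_MONTH = 30

datadog_items = {
    "infrastructure": HOSTS * 23,                          # $23 per host
    "log_ingest": LOG_GB_PER_DAY * DAYS_PER_MONTH * 0.10,  # $0.10 per GB
    "apm": HOSTS * 31,                                     # $31 per host
}
datadog_subtotal = sum(datadog_items.values())

self_hosted_items = {
    "prometheus_thanos": 100,
    "s3_storage": 30,
    "loki_tempo": 50,
    "grafana": 20,
}
self_hosted_subtotal = sum(self_hosted_items.values())

# Monthly ranges quoted in the text, including costs not itemized above.
datadog_range = (4000, 6000)
self_hosted_range = (200, 300)

# Least favorable and most favorable pairings of the two ranges.
low_ratio = datadog_range[0] / self_hosted_range[1]
high_ratio = datadog_range[1] / self_hosted_range[0]

print(f"Datadog itemized subtotal: ${datadog_subtotal:,.0f}")
print(f"Self-hosted itemized subtotal: ${self_hosted_subtotal:,.0f}")
print(f"Cost ratio: {low_ratio:.1f}x to {high_ratio:.1f}x")
```

The itemized Datadog subtotal lands well below the quoted monthly range because log indexing, retention, and custom metrics are not priced line by line here.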

The Hidden Costs of Self-Hosted

The cost comparison above is accurate but incomplete.

Engineering time. Setting up the stack takes 40-80 hours for a competent engineer. Maintaining it, upgrading components, debugging when something breaks - call it 4-8 hours per month ongoing. At a loaded engineering cost of $150/hour, that is a one-time $6,000-12,000 for setup and $600-1,200 per month in ongoing engineering time.

Even accounting for this, self-hosted is cheaper at most scales. But the time cost is real and should be in the calculation.
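To put numbers on that claim, the sketch below folds engineering time into the monthly figure and estimates how long the setup effort takes to pay back. It uses only the figures from this article ($150/hour, 40-80 setup hours, 4-8 maintenance hours, $200-300 of infrastructure, $4,000-6,000 for Datadog); the variable names and the simple payback model are my own assumptions.

```python
HOURLY_RATE = 150                 # loaded engineering cost, $/hour
SETUP_HOURS = (40, 80)            # one-time
MAINTENANCE_HOURS = (4, 8)        # per month
SELF_HOSTED_INFRA = (200, 300)    # $/month
DATADOG_MONTHLY = (4000, 6000)    # $/month

setup_cost = tuple(h * HOURLY_RATE for h in SETUP_HOURS)
maintenance_cost = tuple(h * HOURLY_RATE for h in MAINTENANCE_HOURS)

# Full monthly cost of self-hosting: infrastructure plus engineering time.
self_hosted_monthly = (
    SELF_HOSTED_INFRA[0] + maintenance_cost[0],
    SELF_HOSTED_INFRA[1] + maintenance_cost[1],
)

# Pessimistic case: cheapest Datadog bill against the priciest self-hosting.
worst_saving = DATADOG_MONTHLY[0] - self_hosted_monthly[1]
# Optimistic case: priciest Datadog bill against the cheapest self-hosting.
best_saving = DATADOG_MONTHLY[1] - self_hosted_monthly[0]

# Months until the one-time setup cost is recovered.
slow_payback_months = setup_cost[1] / worst_saving
fast_payback_months = setup_cost[0] / best_saving

print(f"Self-hosted monthly: ${self_hosted_monthly[0]:,}-{self_hosted_monthly[1]:,}")
print(f"Monthly saving: ${worst_saving:,}-{best_saving:,}")
print(f"Setup payback: {fast_payback_months:.1f}-{slow_payback_months:.1f} months")
```

Under these assumptions the monthly saving shrinks from the headline figure but stays positive, and the setup cost is recovered within a few months. The model ignores incident time and the cost of any missing features.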

Operational responsibility. When Datadog has an outage, you open a ticket. When your self-hosted stack has an outage, you page your own team. Observability infrastructure going down when you most need it (during an incident) is a real risk.

Missing features. Datadog’s error tracking, security monitoring, continuous profiler, and synthetic monitoring are genuinely good products that do not have direct open-source equivalents at the same quality level. If you rely on these, the migration path is less clear.

What the Setup Actually Looks Like

With the Prometheus Operator for Kubernetes, the base setup is declarative:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s

This tells Prometheus to find every Service labeled app: my-app and scrape the port named metrics on the pods behind it every 30 seconds. No per-host agent configuration is needed.

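For logs, Promtail’s Kubernetes discovery finds pods automatically. A relabel rule maps each discovered pod to its log files on the node, and pipeline stages parse the lines:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Map each discovered pod to its log files on the node.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
    pipeline_stages:
      # Unwrap the CRI log format (containerd / CRI-O) before parsing JSON.
      - cri: {}
      - json:
          expressions:
            level: level
            message: message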

For traces, OpenTelemetry auto-instrumentation in Python, Java, and Node.js requires zero code changes - just adding the auto-instrumentation package and pointing the exporter to your Tempo endpoint.
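For Python, the "zero code changes" path usually means wrapping the normal start command with the opentelemetry-instrument launcher and configuring it through standard OTEL_* environment variables. The sketch below only builds that environment and command. The service name, the app.py entry point, and the Tempo address are hypothetical, and it assumes the OpenTelemetry distro and OTLP exporter packages are installed and that Tempo's OTLP gRPC receiver is enabled on port 4317.

```python
import os
import shlex


def otel_environment(service_name, otlp_endpoint):
    """Return a copy of the current environment with OTel settings added."""
    env = dict(os.environ)
    env.update({
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,
    })
    return env


def instrumented_command(app_command):
    """Prefix the app's usual start command with the auto-instrumentation launcher."""
    return ["opentelemetry-instrument", *app_command]


# Hypothetical service name and in-cluster Tempo address.
env = otel_environment("checkout", "http://tempo.monitoring.svc:4317")
cmd = instrumented_command(["python", "app.py"])

print(shlex.join(cmd))
# To actually launch the service:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Because the configuration lives entirely in the environment, the same application image can point at a different backend by changing one variable in the deployment manifest.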

OpenTelemetry as the Foundation

The switch I would recommend regardless of where you end up: adopt OpenTelemetry for instrumentation now. OpenTelemetry is vendor-neutral. Your application emits signals (metrics, logs, traces) to an OpenTelemetry Collector, which routes them to any backend - Datadog, self-hosted, or both.

This means you can run self-hosted and Datadog in parallel during migration, and switch backends without changing application code if you ever want to change direction.

Bottom Line

The switch from Datadog to self-hosted observability is financially compelling at most scales: roughly 20x on the infrastructure bill in the example above, and still severalfold once ongoing engineering time is counted. The hidden costs are engineering time and operational responsibility for your own infrastructure. The stack (Prometheus + Loki + Tempo + Grafana, instrumented with OpenTelemetry) is mature and production-ready. Start with OpenTelemetry for instrumentation regardless of which backend you choose - it is the exit ramp from vendor lock-in that makes future decisions reversible.
