Three years ago, the question was whether OpenTelemetry was ready for production. In 2026, the question is why you would choose anything else. OTel has become the POSIX of observability - the standard interface that everything implements, regardless of which backend you use.

This did not happen because of hype. It happened because the alternative - vendor lock-in through proprietary agents - costs too much money and creates too much friction.

Why Proprietary Agents Lost

Datadog, New Relic, and Dynatrace all have excellent products. Their agents are easy to install and the dashboards are polished. The problem is economic and architectural.

The cost problem: Datadog’s pricing scales with hosts, custom metrics, log volume, and trace retention. A mid-size company running 50 microservices can easily spend $30,000-80,000/month. When your observability bill rivals your compute bill, something is wrong.

The lock-in problem: Each vendor’s agent generates data in a proprietary format. Switching from Datadog to Grafana Cloud means re-instrumenting every service. That migration takes months and carries risk.

The multi-vendor problem: Some teams want Grafana for dashboards, PagerDuty for alerts, and Jaeger for trace analysis. Proprietary agents make this painful because data flows through the vendor’s pipeline, not yours.

OpenTelemetry solves all three. You instrument once, and send data wherever you want. Switching backends is a configuration change, not a code change.
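As a sketch of what “configuration, not code” means: the OTel SDKs read spec-defined environment variables at startup, so repointing a service at a new backend is an env change. The endpoint value below is a placeholder, not a real backend:

```python
import os

# Swapping backends is configuration: every OTel SDK honors these
# spec-defined variables. No re-instrumentation, no redeploy of new code.
os.environ.update({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://new-backend:4317",
    "OTEL_SERVICE_NAME": "payment-service",
})
```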

The Three Signals, Unified

OTel’s core value is unifying the three observability signals under one SDK and one protocol (OTLP):

Signal  | What it captures                 | Example
--------+----------------------------------+---------------------------------------------------
Traces  | Request flow across services     | User request -> API -> Auth -> DB -> Response
Metrics | Numerical measurements over time | Request latency p99, error rate, queue depth
Logs    | Discrete events with context     | “Payment failed: insufficient funds” with trace ID

The key insight is correlation. When a trace shows high latency on a database call, you can jump to the metrics for that database instance and then to the logs from that exact time window. All three signals share the same trace ID and resource attributes.

Before OTel, you would use Jaeger for traces, Prometheus for metrics, and ELK for logs. Correlating them required manual work - matching timestamps, grepping for request IDs across systems. OTel makes this automatic.
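A minimal sketch of what that correlation amounts to, using plain Python dicts to stand in for exported signals (field names mirror OTel conventions; the data is invented):

```python
# Three signals from one request, all stamped with the same trace ID.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

spans = [{"trace_id": trace_id, "name": "SELECT orders", "duration_ms": 840}]
logs = [
    {"trace_id": trace_id, "body": "slow query on orders table"},
    {"trace_id": "0af7651916cd43dd8448eb211c80319c", "body": "unrelated event"},
]

# Jumping from a slow span to its logs is a join on trace_id --
# no timestamp matching, no grepping across three systems.
slow = max(spans, key=lambda s: s["duration_ms"])
related = [entry for entry in logs if entry["trace_id"] == slow["trace_id"]]
# related contains only the "slow query on orders table" entry
```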

Real Setup: Go Service

Here is a production-ready OTel setup for a Go HTTP service. This is not a toy example - it includes proper resource attributes, batch processing, and graceful shutdown.

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter,
            sdktrace.WithBatchTimeout(5*time.Second),
            sdktrace.WithMaxExportBatchSize(512),
        ),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("payment-service"),
            semconv.ServiceVersion("1.4.2"),
            semconv.DeploymentEnvironment("production"),
        )),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // Sample 10%
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatal(err)
    }
    // Flush buffered spans on exit. Note: log.Fatal skips deferred
    // calls, so the server error below is logged rather than fatal.
    defer func() {
        if err := tp.Shutdown(ctx); err != nil {
            log.Printf("tracer shutdown: %v", err)
        }
    }()

    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Your handler logic - otelhttp creates a span for each request
        w.Write([]byte("ok"))
    })

    wrappedHandler := otelhttp.NewHandler(handler, "payment-api")
    if err := http.ListenAndServe(":8080", wrappedHandler); err != nil {
        log.Print(err)
    }
}

Key decisions in this setup:

  • Batch exporter: Buffers spans and sends them in batches every 5 seconds. Unbatched export kills performance
  • Sampling at 10%: Production services generating thousands of requests per second do not need 100% trace capture. Sample judiciously
  • Resource attributes: Service name, version, and environment travel with every span. This is how you filter in your backend
  • Parent-based sampling: If an upstream service decided to sample a request, this service honors that decision. Consistent sampling across the trace
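Trace-ID ratio sampling is deterministic: the decision is a pure function of the trace ID, so every service that sees the same trace makes the same call with no coordination. A simplified Python sketch of the idea (the real SDKs compare bits of the trace ID against a similar threshold; exact details vary by implementation):

```python
import random

def should_sample(trace_id: int, ratio: float) -> bool:
    # Compare the low 63 bits of the trace ID against a threshold
    # derived from the ratio. Same trace ID in -> same decision out.
    bound = int(ratio * (1 << 63))
    return (trace_id & ((1 << 63) - 1)) < bound

# A 10% ratio keeps roughly 10% of randomly generated 128-bit trace IDs.
random.seed(7)
kept = sum(should_sample(random.getrandbits(128), 0.1) for _ in range(10_000))
```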

Real Setup: Python Service

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI

resource = Resource.create({
    SERVICE_NAME: "user-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

# Traces
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(tracer_provider)

# Metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=30000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

The Python SDK follows the same pattern. Instrument once, export to the collector, let auto-instrumentation handle the common cases.

The Collector Architecture

The OTel Collector is where the architecture becomes powerful. It sits between your applications and your backends, acting as a telemetry pipeline:

Services -> OTel Collector -> Backends
                |
          +-----------+
          | Receivers  |  (OTLP, Prometheus, Jaeger, Zipkin)
          | Processors |  (Batch, Filter, Transform, Tail Sampling)
          | Exporters  |  (OTLP, Prometheus, Loki, Tempo, Datadog)
          +-----------+

A production collector config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  # The dedicated Loki exporter is deprecated; Loki 3.x ingests OTLP natively
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

This config does several important things:

  • Filters out health check spans: /healthz traces are noise. Drop them at the collector, not in your code
  • Memory limiting: Prevents the collector from OOMing during traffic spikes
  • Multi-backend export: Traces go to Tempo, metrics to Mimir, logs to Loki. All Grafana stack, all open source
  • Protocol flexibility: Accepts both gRPC and HTTP. Some SDKs prefer one over the other
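The health-check filter is just a predicate over span attributes. In Python terms, the OTTL condition above is equivalent to (spans here are plain dicts standing in for real spans):

```python
def keep_span(span: dict) -> bool:
    # Mirrors the OTTL condition: drop when http.target == "/healthz".
    return span.get("attributes", {}).get("http.target") != "/healthz"

spans = [
    {"name": "GET /healthz", "attributes": {"http.target": "/healthz"}},
    {"name": "POST /pay", "attributes": {"http.target": "/pay"}},
]
kept = [s for s in spans if keep_span(s)]
# only the /pay span survives the filter
```

Doing this at the collector means one config line covers every service, instead of N code changes.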

The Cost Comparison

Running the Grafana stack (Tempo + Mimir + Loki + Grafana) on your own infrastructure versus Datadog:

                | Datadog (50 services, 200 hosts) | Self-hosted Grafana + OTel
----------------+----------------------------------+--------------------------------------
Monthly cost    | $40,000-80,000                   | $3,000-6,000 (compute + storage)
Setup effort    | Low (install agent)              | Medium (deploy collector + backends)
Maintenance     | None (SaaS)                      | Ongoing (upgrades, scaling)
Vendor lock-in  | High                             | None
Data retention  | Limited by plan                  | Limited by your storage

The self-hosted path is not free. You need someone who understands the Grafana stack. But the 10x cost difference pays for a full-time SRE and then some.
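Back-of-the-envelope, using the midpoints of the ranges above (the SRE salary figure is an assumption for illustration, not a source number):

```python
datadog_monthly = 60_000      # midpoint of the $40k-80k/month range above
self_hosted_monthly = 4_500   # midpoint of the $3k-6k/month range above
sre_annual = 250_000          # fully loaded senior SRE cost; assumed figure

annual_savings = (datadog_monthly - self_hosted_monthly) * 12
net_after_sre = annual_savings - sre_annual
# annual_savings is 666,000; even after funding a dedicated SRE,
# roughly 416,000/year remains
```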

When To Still Use a Vendor

Self-hosting is not always the right call:

  • Teams smaller than 10 engineers should use a SaaS backend and just instrument with OTel SDKs. You get vendor portability without ops burden
  • If your company already pays for Datadog and the cost is acceptable, use OTel SDKs with the Datadog exporter. You get the best dashboards with vendor portability as insurance
  • Compliance-heavy environments where managed services simplify audit requirements

The point is not “never use Datadog.” The point is “never instrument with Datadog’s proprietary SDK when OTel exists.” Instrument with OTel, export wherever you want. That is the standard now, and it is not going back.