Observability for LLM Apps - You Can't Fix What You Can't Trace

When your web service throws a 500, you have a stack trace. When your LLM app returns a bad answer, the status code is 200, the latency looks normal, and you have no idea what happened.

That is the problem. Standard observability - error rate, latency percentiles, throughput - tells you nothing about the most common failure mode in LLM applications: the model returned something plausible but wrong. You need a different class of instrumentation, one that captures the full context of every inference call - what you sent, what the model returned, how many tokens it used, and which tools it called along the way.

This post covers what to log, how to attribute cost accurately, and how to actually debug a bad answer three days after a user reported it.

Why LLM Observability Is Different From Regular Service Observability

Traditional service observability is built on the assumption that failures are exceptional. You monitor for errors and slow responses because in a healthy system, most requests succeed quickly. Failure is the outlier.

LLM apps break this assumption in three ways.

The failure mode is semantic, not structural. A request that returns a hallucinated answer is indistinguishable from a correct one at the HTTP layer. No exception is thrown. No error code is returned. Your existing alerting will not catch it. You find out when a user reports it, or you do not find out at all.

The input space is unbounded. A traditional service has a finite set of inputs defined by its schema. An LLM app accepts natural language, which means every request is potentially unique. A prompt that works perfectly in testing may fail on a subtly different user phrasing in production. You need to log the exact prompt that triggered the failure.

Cost is a per-request concern. Compute cost in a traditional service is fairly uniform per request type. In an LLM app, a single user interaction can cost anywhere from $0.001 to $5.00 depending on context size, model, and how many tool calls the model decides to make. Without token-level accounting, your cost attribution is a guess.

The Minimum Viable Trace

Before reaching for a tracing library, know what data you actually need. A minimal LLM trace captures five things:

The exact prompt - system message plus user messages - sent to the model
Model name, temperature, max tokens, and any other request parameters
The full response from the model
Token counts: input tokens, output tokens, and cache tokens separately
Latency: time to first token and total response time

Here is a lightweight wrapper that captures this for Anthropic calls:

import time
import uuid
import logging
from anthropic import Anthropic

logger = logging.getLogger("llm.traces")
client = Anthropic()

def traced_chat(
    messages: list,
    system: str = None,
    model: str = "claude-sonnet-4-6",
    **kwargs
) -> dict:
    trace_id = str(uuid.uuid4())
    request_params = {
        "model": model,
        "messages": messages,
        "max_tokens": kwargs.get("max_tokens", 2048),
        "temperature": kwargs.get("temperature", 1.0),
    }
    if system:
        request_params["system"] = system

    start = time.monotonic()
    response = client.messages.create(**request_params)
    latency_ms = (time.monotonic() - start) * 1000

    usage = response.usage
    trace = {
        "trace_id": trace_id,
        "model": model,
        "system": system,
        "messages": messages,
        "response_text": response.content[0].text,
        "stop_reason": response.stop_reason,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", 0),
        "cache_write_tokens": getattr(usage, "cache_creation_input_tokens", 0),
        "latency_ms": latency_ms,
        "params": {k: v for k, v in request_params.items() if k not in ("messages", "system")},
    }

    logger.info("llm_trace", extra={"trace": trace})
    return trace

Log this as structured JSON, not as plain text. Structured logs are queryable. Plain text logs are not. The difference matters the moment you need to find all traces where input_tokens > 8000 or where stop_reason == "max_tokens".

Token Accounting

Token counts are the single metric that determines your LLM spend, and most teams do not track them per request. They look at the monthly bill and try to reason backward from one number. This does not work once you have more than a handful of features all hitting the same model.

Track these fields separately for every call:

Field	What it measures	Why it matters
input_tokens	Tokens in the prompt and messages	Scales with context size and retrieval results
output_tokens	Tokens in the model response	Scales with verbosity and response length
cache_read_tokens	Tokens served from the prompt cache	Significantly cheaper than full input tokens
cache_write_tokens	Tokens written to the prompt cache	One-time cost to warm the cache prefix

Prompt caching is the biggest lever most teams ignore. If your system prompt is 2000 tokens and you cache it, you pay about 10% of regular input token cost on cache hits. Tracking cache hits versus misses per feature is the difference between a $500 and $50 monthly bill for the same workload.

Calculate cost per request and store it alongside the trace:

# Verify against current Anthropic pricing page before shipping
PRICING_PER_MTK = {
    "claude-sonnet-4-6": {
        "input": 3.00,
        "output": 15.00,
        "cache_read": 0.30,
        "cache_write": 3.75,
    }
}

def calculate_cost(model: str, usage: dict) -> float:
    p = PRICING_PER_MTK.get(model, PRICING_PER_MTK["claude-sonnet-4-6"])
    cost = (
        usage["input_tokens"] * p["input"] / 1_000_000
        + usage["output_tokens"] * p["output"] / 1_000_000
        + usage.get("cache_read_tokens", 0) * p["cache_read"] / 1_000_000
        + usage.get("cache_write_tokens", 0) * p["cache_write"] / 1_000_000
    )
    return round(cost, 6)

Aggregate by user, feature, and session. Within a week you will know which parts of your app are expensive and whether the cost is proportional to the value they deliver.

Tracing Tool Calls

If your app uses tool-calling, a single user interaction may involve multiple LLM inference calls separated by tool executions. This is where traces get complex and where most teams lose visibility entirely.

The naive approach - log each LLM call independently - loses the relationship between calls. When debugging an agent that made three tool calls before producing a wrong answer, you need to see the full sequence as a unit, not three orphaned log entries.

Use a parent-child span model: one parent trace per user interaction, child spans for each LLM call and each tool execution.

[User request - trace_id: abc123]
  [LLM call - turn 1]
    input_tokens: 850
    tool_calls: ["search_docs", "get_account_info"]
    stop_reason: "tool_use"
  [Tool: search_docs]
    query: "refund policy for annual subscriptions"
    results: 3 documents
    latency_ms: 142
  [Tool: get_account_info]
    user_id: "u_9912"
    latency_ms: 23
  [LLM call - turn 2]
    input_tokens: 1420  (grew because tool results were injected)
    output_tokens: 310
    stop_reason: "end_turn"
  [Total: 2 LLM calls, 2 tool calls, $0.024, 1890ms]

A concrete implementation using a session context object:

import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMTrace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: Optional[str] = None
    user_id: Optional[str] = None
    feature: Optional[str] = None
    turns: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost: float = 0.0
    _start: float = field(default_factory=time.monotonic)

    def add_llm_turn(self, model: str, usage: dict, stop_reason: str, latency_ms: float):
        self.turns.append({
            "model": model,
            "input_tokens": usage.get("input_tokens", 0),
            "output_tokens": usage.get("output_tokens", 0),
            "stop_reason": stop_reason,
            "latency_ms": latency_ms,
        })
        self.total_input_tokens += usage.get("input_tokens", 0)
        self.total_output_tokens += usage.get("output_tokens", 0)

    def add_tool_call(self, name: str, inputs: dict, result, latency_ms: float):
        self.tool_calls.append({
            "tool": name,
            "inputs": inputs,
            "result_preview": str(result)[:500],
            "latency_ms": latency_ms,
        })

    def finalize(self) -> dict:
        return {
            "trace_id": self.trace_id,
            "session_id": self.session_id,
            "user_id": self.user_id,
            "feature": self.feature,
            "turns": self.turns,
            "tool_calls": self.tool_calls,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "total_cost": self.total_cost,
            "total_latency_ms": (time.monotonic() - self._start) * 1000,
        }

Emit the finalized trace object at the end of the interaction, not piecemeal. One structured log entry per user interaction is far easier to query than reconstructing a conversation from dozens of scattered log lines.

The OpenTelemetry GenAI Semantic Conventions

The OpenTelemetry project published semantic conventions for GenAI spans in late 2025. These define standard attribute names for LLM observability, which matters if you want your data to work with off-the-shelf dashboards.

Key attributes from the spec:

gen_ai.system                 = "anthropic" | "openai" | "google"
gen_ai.request.model          = "claude-sonnet-4-6"
gen_ai.request.temperature    = 1.0
gen_ai.request.max_tokens     = 2048
gen_ai.response.model         = "claude-sonnet-4-6-20260101"
gen_ai.response.finish_reasons = ["end_turn"]
gen_ai.usage.input_tokens     = 850
gen_ai.usage.output_tokens    = 310

If you are already using OpenTelemetry for your services, instrument LLM calls as child spans with these attributes. Auto-instrumentation packages handle most of this:

pip install opentelemetry-instrumentation-anthropic
pip install opentelemetry-instrumentation-openai

With auto-instrumentation, every API call becomes a span that inherits the trace ID of the parent HTTP request. The full flow becomes visible in one trace: HTTP handler -> RAG retrieval -> vector DB query -> LLM call -> tool execution -> LLM call -> response.

The practical reason to use the standard attributes: traces exported via OTLP to any backend (Grafana Tempo, Honeycomb, Jaeger, Datadog) use the same field names. Queries and dashboards written once work everywhere, and you avoid the problem of having LLM data in a different schema than your service traces.

Debugging a Bad Answer After the Fact

A user reports: “your chatbot told me the wrong cancellation policy yesterday around 3 PM.” Here is the diagnostic process - and which steps fail if you did not log the right data.

Step 1 - Find the trace. Query your log storage for the user and time window. If you logged user_id, session_id, and a timestamp, this is one query. If you did not, you are guessing.

Step 2 - Replay the exact prompt. The trace contains the full system message and conversation history that was sent to the model. Replay this against the same model and parameters. Does it reproduce the bad answer? If yes, the problem is in the prompt or the retrieved context. If no, the problem is non-determinism from a high temperature setting, or the model version changed between then and now.

Step 3 - Inspect the retrieved context. If you use RAG, what documents were included in the prompt? Log the retrieval results as part of the trace. A wrong answer often traces back to a wrong retrieval - the model answered accurately from the context it received, but the context was outdated or mismatched to the query.

Step 4 - Check the token counts. Was the context window near the limit? When input_tokens approaches the model’s context window, models tend to compress or skip information from the middle of the prompt. If you are routinely hitting 80-90% of max context, that is a structural issue with how you build prompts.

Step 5 - Check the temperature. High temperature (above 0.7) increases creativity and hallucination probability proportionally. A support chatbot at temperature 1.0 will produce wrong answers more often than the same chatbot at temperature 0.2. This should be visible in the trace.

Steps 2 through 5 only work if step 1 works. The most common failure when teams do this exercise: “we log the response but not the full prompt.” That is the wrong tradeoff. The response is what the user saw. The prompt is what caused it. Log both, always.

Cost Attribution and Anomaly Detection

Once you have per-request cost data, two things become possible that are not possible otherwise.

Cost by feature. Aggregate total cost by the endpoint or feature that triggered the LLM call. Most teams discover that 20% of features account for 80% of LLM spend, and some of that expensive 20% is low-value automation that could be replaced with a smaller model, a shorter prompt, or a non-LLM approach entirely.

Cost anomaly detection. Establish a baseline cost per request type - for example, support ticket summarization averages $0.008 per call. Alert when the rolling average exceeds 2x the baseline. A spike usually means context is growing unchecked: a conversation history that is never truncated, a retrieval step returning 20 documents instead of 5, or a bug injecting redundant data into every prompt.

def check_cost_anomaly(feature: str, cost: float, multiplier: float = 2.0) -> bool:
    baseline = get_p50_cost(feature, window_days=7)
    if baseline and cost > baseline * multiplier:
        logger.warning(
            "cost_anomaly",
            extra={
                "feature": feature,
                "cost": cost,
                "baseline": baseline,
                "ratio": cost / baseline,
            }
        )
        return True
    return False

Run this check in the hot path after logging the trace. A spike in per-request cost at 2 AM is better caught at 2 AM than at month-end billing.

What Actually Works

Log the full prompt and full response for every production call. Storage is cheap. Debugging without the prompt is not possible. There is no argument for sampling here - unlike HTTP traces where sampling at 10% is reasonable because errors are rare, a semantically wrong LLM answer happens 1-5% of the time and you need to catch every instance.

Track input, output, and cache tokens separately per request. Aggregate daily totals are useless for cost attribution. You need per-call and per-feature granularity.

Link LLM spans to the parent request trace via trace ID. This lets you answer “which user action triggered this LLM call and what was the end-to-end latency” instead of having disconnected islands of observability.

Use structured JSON for all LLM trace logs. Plain text logs cannot be queried by field. You will need to filter by user ID, by token count range, by feature, and by cost - all of which require structured fields.

What does not work: relying on model provider dashboards. OpenAI and Anthropic dashboards show aggregate token usage across your entire account. They do not show which user interaction was expensive, what prompt caused a bad answer, or which feature is responsible for 40% of your bill.

What to skip for now: the commercial LLM observability platforms (Langfuse, Arize, W&B Weave) are useful once you have volume and need dashboards and eval pipelines. Start with structured logging and a Postgres or ClickHouse table. The fundamentals - logging the right data, linking it to user context, tracking token costs - are things you need to understand regardless of which platform you eventually use.

The rule is simple: if you cannot answer “what prompt caused this bad answer, and how much did it cost,” your observability is not set up yet.

Why LLM Observability Is Different From Regular Service Observability#

The Minimum Viable Trace#

Token Accounting#

Tracing Tool Calls#

The OpenTelemetry GenAI Semantic Conventions#

Debugging a Bad Answer After the Fact#

Cost Attribution and Anomaly Detection#

What Actually Works#

Comments