LangSmith is excellent if you're using LangChain and you're okay with your traces living in LangChain Inc.'s infrastructure. But if your org runs Grafana and Datadog, or if you have compliance requirements about where LLM inputs and outputs can go, you need a different approach.
OpenTelemetry gives you vendor-neutral observability that plugs into whatever backend you already use. Here's how to instrument your LLM application properly — traces for every call, token cost attribution, latency dashboards, and alerts that fire before your users notice problems.
OTel basics in one paragraph
OpenTelemetry is a CNCF standard for distributed tracing, metrics, and logs. Your application emits spans (units of work with a start time, end time, and attributes). Spans compose into traces (the full call tree for a request). An OTLP exporter sends these to your backend (Grafana Tempo, Datadog APM, Jaeger, Honeycomb — they all accept OTLP). You instrument once; you can switch backends without changing code.
LLM-specific semantic conventions
The OpenTelemetry project defines semantic conventions for generative AI calls under gen_ai.*. They are still marked as in development, so attribute names can change between releases, but using them keeps your traces compatible with any tool that understands the spec:
| Attribute | Example value |
|---|---|
| gen_ai.system | "anthropic" |
| gen_ai.request.model | "claude-sonnet-4-5" |
| gen_ai.request.max_tokens | 2000 |
| gen_ai.response.model | "claude-sonnet-4-5" |
| gen_ai.usage.input_tokens | 1547 |
| gen_ai.usage.output_tokens | 312 |
| gen_ai.response.finish_reasons | ["end_turn"] |
| gen_ai.usage.cost_usd | 0.009321 (computed) |
The spec doesn't include cost_usd natively — you compute it from token counts and add it as a custom attribute. Worth doing: it's the metric your finance team will ask for.
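As a worked example of that computation, the sketch below prices a single request with 1547 input tokens and 312 output tokens. The per-million-token prices are assumptions that mirror the Sonnet figures used later in this article, and cost_usd is an illustrative helper name rather than part of any SDK.

```python
# Prices in USD per million tokens; assumed values that will drift as
# pricing changes.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the request cost in USD for the prices above."""
    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return round(input_cost + output_cost, 6)

request_cost = cost_usd(1547, 312)
```

The input side contributes about 0.0046 USD and the output side about 0.0047 USD, so the 312 output tokens cost roughly as much as the 1547 input tokens.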
Setup: instrumenting the Anthropic SDK
The Anthropic SDK doesn't have automatic OTel instrumentation yet (unlike some LangChain components). You wrap the client:
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc anthropic
import time
import anthropic
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# Configure tracer
resource = Resource.create({"service.name": "my-llm-app", "service.version": "1.0.0"})
provider = TracerProvider(resource=resource)
# OTLP exporter: change endpoint for your backend
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-instrumentation")
# Pricing table (update as models/prices change)
COST_PER_MILLION_TOKENS = {
    "claude-opus-4-5": {"input": 15.00, "output": 75.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}
def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = COST_PER_MILLION_TOKENS.get(model, {"input": 3.00, "output": 15.00})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
class InstrumentedAnthropic:
    """
    Wraps anthropic.Anthropic and emits OTel spans for every messages.create call.
    Drop-in replacement for the bare client.
    """
    def __init__(self, **kwargs):
        self._client = anthropic.Anthropic(**kwargs)
        self.messages = self._InstrumentedMessages(self._client, tracer)
    class _InstrumentedMessages:
        def __init__(self, client, tracer):
            self._client = client
            self._tracer = tracer
        def create(self, **kwargs) -> anthropic.types.Message:
            model = kwargs.get("model", "unknown")
            with self._tracer.start_as_current_span("anthropic.messages.create") as span:
                # Set request attributes
                span.set_attribute("gen_ai.system", "anthropic")
                span.set_attribute("gen_ai.request.model", model)
                span.set_attribute("gen_ai.request.max_tokens", kwargs.get("max_tokens", 0))
                # Include system prompt length for debugging (don't log full content)
                if "system" in kwargs:
                    span.set_attribute("gen_ai.request.system_prompt_length", len(kwargs["system"]))
                span.set_attribute("gen_ai.request.message_count", len(kwargs.get("messages", [])))
                # perf_counter is monotonic, so clock adjustments cannot skew the duration
                start_time = time.perf_counter()
                try:
                    response = self._client.messages.create(**kwargs)
                    latency_ms = (time.perf_counter() - start_time) * 1000
                    # Set response attributes
                    input_tokens = response.usage.input_tokens
                    output_tokens = response.usage.output_tokens
                    cost = compute_cost(model, input_tokens, output_tokens)
                    span.set_attribute("gen_ai.response.model", response.model)
                    span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
                    span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
                    span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason or "unknown"])
                    span.set_attribute("gen_ai.usage.cost_usd", round(cost, 6))
                    span.set_attribute("gen_ai.latency_ms", round(latency_ms, 2))
                    return response
                except anthropic.RateLimitError as e:
                    span.set_attribute("error.type", "rate_limit")
                    span.record_exception(e)
                    span.set_status(trace.StatusCode.ERROR, "Rate limit exceeded")
                    raise
                except anthropic.APIError as e:
                    span.set_attribute("error.type", "api_error")
                    # Connection errors carry no HTTP status, so fall back to 0
                    span.set_attribute("error.status_code", getattr(e, "status_code", 0))
                    span.record_exception(e)
                    span.set_status(trace.StatusCode.ERROR, str(e))
                    raise
# Usage: same as the bare client
client = InstrumentedAnthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=500,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Explain caching in one paragraph."}]
)
Every call now emits a span with all the attributes you need for dashboards and alerts. The wrapper is transparent — the return type is identical to the bare client.
Connecting to Grafana
Point the OTLP exporter at Grafana Alloy (Grafana's distribution of the OTel collector), which forwards to Tempo (traces) and Mimir (metrics), or send directly to the Grafana Cloud OTLP gateway:
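# For Grafana Cloud: the gateway speaks OTLP over HTTP, so use the HTTP exporter
# (pip install opentelemetry-exporter-otlp-proto-http)
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
otlp_exporter = OTLPSpanExporter(
    # The HTTP exporter uses this URL as-is, so include the /v1/traces path
    endpoint="https://otlp-gateway-prod-eu-west-0.grafana.net/otlp/v1/traces",
    headers={
        "Authorization": f"Basic {GRAFANA_API_KEY_BASE64}"
    }
)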
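The snippet above leaves GRAFANA_API_KEY_BASE64 undefined. A minimal sketch of building that value follows, assuming the gateway uses standard HTTP Basic auth with your stack's instance ID as the username and an access token as the password; check the OTLP details page of your Grafana Cloud stack to confirm. The helper name and the credentials are placeholders.

```python
import base64

def grafana_basic_auth(instance_id: str, api_token: str) -> str:
    """Build a Basic auth header value from 'instance_id:api_token'."""
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Placeholder credentials for illustration only; load real ones from the
# environment or a secret manager, never from source code.
headers = {"Authorization": grafana_basic_auth("123456", "glc_example_token")}
```

The resulting headers dict can be passed to the exporter's headers argument in place of the hand-built f-string.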
Or self-hosted with a local collector:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
Connecting to Datadog
The Datadog Agent can ingest OTLP directly, but the receiver is off by default and has to be enabled in datadog.yaml. Point the exporter at the agent:
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317"  # Datadog agent OTLP receiver
)
In datadog.yaml:
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
Datadog shows gen_ai.* span attributes as tags on APM traces. Cost and token counts only appear under "Metrics" if you also emit them as OTel metrics or generate metrics from spans in Datadog; the wrapper above produces spans only.
Key metrics to track
Once spans are flowing, build dashboards around these:
Latency:
- p50, p95, p99 latency by model and by API endpoint
- Time-to-first-token for streaming responses (requires streaming instrumentation: add a separate span for the first chunk)
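Time-to-first-token can be measured without touching the streaming code itself by wrapping the chunk iterator. The sketch below is SDK-agnostic: measure_first_chunk and fake_stream are illustrative names, and fake_stream stands in for a real streaming response. In a real application the callback would set a span attribute or start the first-chunk span.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def measure_first_chunk(
    chunks: Iterable[T],
    on_first_chunk: Callable[[float], None],
) -> Iterator[T]:
    """Yield chunks unchanged, reporting milliseconds until the first one."""
    start = time.perf_counter()  # runs when iteration begins, not at call time
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            first_seen = True
            on_first_chunk((time.perf_counter() - start) * 1000)
        yield chunk

# Stand-in for a streaming response; a real stream would come from the SDK.
def fake_stream() -> Iterator[str]:
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hello"
    yield ", world"

ttft_ms: list[float] = []
text = "".join(measure_first_chunk(fake_stream(), ttft_ms.append))
```

Because the wrapper is a generator, the clock starts when the caller begins iterating, so create the wrapper immediately before consuming the stream.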
Throughput and errors:
- Requests per minute by model
- Error rate by gen_ai.response.finish_reasons (a max_tokens finish reason means you're truncating output)
- Rate limit errors per hour (indicates you need to request a quota increase or add retry logic)
Cost:
- Daily and hourly total cost (sum of gen_ai.usage.cost_usd)
- Cost per API endpoint or feature (add app.feature as a span attribute)
- Input vs output token ratio (a high output/input ratio means the model is generating a lot, which could indicate missing max_tokens caps)
Cache performance (if using prompt caching):
- Cache hit rate: cache_read_input_tokens / (input_tokens + cache_read_input_tokens)
- Add gen_ai.usage.cache_read_tokens and gen_ai.usage.cache_write_tokens as span attributes
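The cache attributes and hit rate from the list above can be derived from the usage object on each response. This is a sketch under assumptions: it reads the cache_read_input_tokens and cache_creation_input_tokens fields that the Anthropic API reports when prompt caching is active, treats missing values as zero, and uses a made-up app.cache_hit_rate attribute name. A SimpleNamespace stands in for response.usage.

```python
from types import SimpleNamespace

def cache_attributes(usage) -> dict:
    """Map an Anthropic-style usage object to cache-related span attributes."""
    # Cache fields can be missing or None when caching is not in use.
    cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", None) or 0
    denominator = usage.input_tokens + cache_read
    hit_rate = cache_read / denominator if denominator else 0.0
    return {
        "gen_ai.usage.cache_read_tokens": cache_read,
        "gen_ai.usage.cache_write_tokens": cache_write,
        "app.cache_hit_rate": round(hit_rate, 4),  # hypothetical attribute name
    }

# Stand-in for response.usage from a request that mostly hit the cache
usage = SimpleNamespace(
    input_tokens=200,
    cache_read_input_tokens=1800,
    cache_creation_input_tokens=0,
)
attrs = cache_attributes(usage)
```

Inside the wrapper, each key/value pair would be passed to span.set_attribute alongside the token counts.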
Alerting rules
Configure these alerts before going to production:
# Grafana alerting rules (PromQL)
# p95 latency > 10 seconds
- alert: LLMHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(gen_ai_latency_ms_bucket[5m]))) > 10000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "LLM p95 latency above 10s"
# Daily cost exceeding 150% of 7-day average
- alert: LLMCostSpike
  expr: |
    sum(increase(gen_ai_usage_cost_usd_total[24h])) >
    sum(increase(gen_ai_usage_cost_usd_total[7d])) / 7 * 1.5
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "LLM cost spike: daily spend >150% of 7-day average"
# Error rate > 5% (sum both sides so the ratio is errors over all requests,
# not the error series divided by itself)
- alert: LLMHighErrorRate
  expr: |
    sum(rate(gen_ai_requests_total{status="error"}[5m])) /
    sum(rate(gen_ai_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
# Rate limit errors > 10/hour
- alert: LLMRateLimiting
  expr: increase(gen_ai_errors_total{error_type="rate_limit"}[1h]) > 10
  for: 0m
  labels:
    severity: warning
The cost spike alert is the one that saves you from runaway loops. Without it, an agent stuck in a retry loop can burn hundreds of dollars before anyone notices.
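The arithmetic behind the cost spike rule is easy to check offline. This sketch mirrors the rule's comparison in plain Python, assuming seven daily totals with the most recent day last; the function name and the sample figures are invented for illustration.

```python
def is_cost_spike(daily_costs_usd: list[float], factor: float = 1.5) -> bool:
    """Return True when the latest day exceeds factor times the 7-day average.

    daily_costs_usd holds the last seven daily totals, oldest first. Like the
    alert rule, the average includes the latest day.
    """
    if len(daily_costs_usd) != 7:
        raise ValueError("expected exactly seven daily totals")
    average = sum(daily_costs_usd) / 7
    return daily_costs_usd[-1] > average * factor

normal_week = [40.0, 42.0, 39.0, 41.0, 43.0, 40.0, 44.0]
runaway_week = [40.0, 42.0, 39.0, 41.0, 43.0, 40.0, 180.0]

normal_fires = is_cost_spike(normal_week)
runaway_fires = is_cost_spike(runaway_week)
```

Because the spike day is part of its own baseline, the rule effectively fires when today's spend exceeds about 1.64 times the average of the previous six days, not 1.5 times. Lower the factor if you want a tighter threshold.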
OTel vs LangSmith vs Braintrust
These tools have different purposes. You'll likely use all three:
| Tool | Focus | When to use |
|---|---|---|
| OpenTelemetry | Ops/infra — latency, errors, cost, availability | Production monitoring, alerting, cost attribution |
| LangSmith | ML eval — prompt traces, comparison, regression testing | Prompt development, debugging unexpected outputs |
| Braintrust | ML eval — eval dataset runs, score tracking over time | Systematic eval tracking across prompt versions |
OTel answers "is the system healthy?" LangSmith and Braintrust answer "is the output quality good?" You need both. The agent observability guide on LangSmith and Braintrust covers the eval side.
Adding context to spans
The instrumentation wrapper above logs model-level attributes. Add application-level context as span attributes to make traces actionable:
from opentelemetry import trace
tracer = trace.get_tracer("llm-instrumentation")
def handle_support_query(user_id: str, query: str, feature: str):
    # Start a span for the request so the attributes land on a recording span;
    # the LLM span created inside client.messages.create becomes its child
    with tracer.start_as_current_span("handle_support_query") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.feature", feature)  # e.g. "support_bot"
        span.set_attribute("app.query_length", len(query))
        response = client.messages.create(
            model="claude-haiku-3-5",
            max_tokens=500,
            messages=[{"role": "user", "content": query}]
        )
        return response
Now you can answer: "Which feature is driving the most cost?" and "Which users are hitting rate limits?" — questions your ops team will ask within a week of launch.
For the full production readiness picture, the agent production checklist covers everything you need before go-live. The FastAPI + Claude production patterns guide shows how to structure the application layer that these traces feed from.
OTel isn't glamorous infrastructure, but it's what separates a toy LLM app from something you can operate confidently at scale.



