LLM Observability with OpenTelemetry — Beyond LangSmith

LangSmith is excellent if you're using LangChain and you're okay with your traces living in LangChain Inc.'s infrastructure. But if your org runs Grafana or Datadog, or if you have compliance requirements about where LLM inputs and outputs can go, you need a different approach.

OpenTelemetry gives you vendor-neutral observability that plugs into whatever backend you already use. Here's how to instrument your LLM application properly — traces for every call, token cost attribution, latency dashboards, and alerts that fire before your users notice problems.

OTel basics in one paragraph

OpenTelemetry is a CNCF project that defines a vendor-neutral standard for distributed tracing, metrics, and logs. Your application emits spans (units of work with a start time, end time, and attributes). Spans compose into traces (the full call tree for a request). An OTLP exporter sends these to your backend; Grafana Tempo, Datadog APM, Jaeger, and Honeycomb all accept OTLP. You instrument once, and switching backends means changing exporter configuration rather than instrumentation code.

LLM-specific semantic conventions

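The OpenTelemetry project defines attribute names for LLM calls under gen_ai.*. These semantic conventions are still evolving, so check names against the version of the spec your tooling targets. Using them means your traces are compatible with any tool that understands the spec: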

Attribute	Example value
`gen_ai.system`	`"anthropic"`
`gen_ai.request.model`	`"claude-sonnet-4-5"`
`gen_ai.request.max_tokens`	`2000`
`gen_ai.response.model`	`"claude-sonnet-4-5"`
`gen_ai.usage.input_tokens`	`1547`
`gen_ai.usage.output_tokens`	`312`
`gen_ai.response.finish_reasons`	`["end_turn"]`
`gen_ai.usage.cost_usd`	`0.0093` (computed: 1547 input and 312 output tokens at $3 and $15 per million)

The spec doesn't include cost_usd natively — you compute it from token counts and add it as a custom attribute. Worth doing: it's the metric your finance team will ask for.

Setup: instrumenting the Anthropic SDK

The Anthropic SDK doesn't ship its own OTel instrumentation yet (unlike some LangChain components), so you wrap the client yourself:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc anthropic

import time
import anthropic
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure tracer
resource = Resource.create({"service.name": "my-llm-app", "service.version": "1.0.0"})
provider = TracerProvider(resource=resource)

# OTLP exporter — change endpoint for your backend
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-instrumentation")

# Pricing table in USD per million tokens (update as models/prices change).
# Keys must match the `model` string you pass to messages.create.
COST_PER_MILLION_TOKENS = {
    "claude-opus-4-5": {"input": 15.00, "output": 75.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}

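def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to Sonnet pricing, so treat the result as an estimate.
    # Prompt-cache read/write tokens are billed at different rates and are not counted here.
    pricing = COST_PER_MILLION_TOKENS.get(model, {"input": 3.00, "output": 15.00})
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000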


class InstrumentedAnthropic:
    """
    Wraps anthropic.Anthropic and emits OTel spans for every messages.create call.
    Drop-in replacement for the bare client.
    """
    
    def __init__(self, **kwargs):
        self._client = anthropic.Anthropic(**kwargs)
        self.messages = self._InstrumentedMessages(self._client, tracer)
    
    class _InstrumentedMessages:
        def __init__(self, client, tracer):
            self._client = client
            self._tracer = tracer
        
        def create(self, **kwargs) -> anthropic.types.Message:
            model = kwargs.get("model", "unknown")
            
            with self._tracer.start_as_current_span("anthropic.messages.create") as span:
                # Set request attributes
                span.set_attribute("gen_ai.system", "anthropic")
                span.set_attribute("gen_ai.request.model", model)
                span.set_attribute("gen_ai.request.max_tokens", kwargs.get("max_tokens", 0))
                
                # Include system prompt length for debugging (don't log full content)
                if "system" in kwargs:
                    span.set_attribute("gen_ai.request.system_prompt_length", len(kwargs["system"]))
                
                span.set_attribute("gen_ai.request.message_count", len(kwargs.get("messages", [])))
                
                # perf_counter is monotonic, so clock adjustments can't skew the latency
                start_time = time.perf_counter()
                
                try:
                    response = self._client.messages.create(**kwargs)
                    
                    latency_ms = (time.perf_counter() - start_time) * 1000
                    
                    # Set response attributes
                    input_tokens = response.usage.input_tokens
                    output_tokens = response.usage.output_tokens
                    cost = compute_cost(model, input_tokens, output_tokens)
                    
                    span.set_attribute("gen_ai.response.model", response.model)
                    span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
                    span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
                    span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason or "unknown"])
                    span.set_attribute("gen_ai.usage.cost_usd", round(cost, 6))
                    span.set_attribute("gen_ai.latency_ms", round(latency_ms, 2))
                    
                    return response
                    
                except anthropic.RateLimitError as e:
                    span.set_attribute("error.type", "rate_limit")
                    span.record_exception(e)
                    span.set_status(trace.StatusCode.ERROR, "Rate limit exceeded")
                    raise
                    
                except anthropic.APIError as e:
                    span.set_attribute("error.type", "api_error")
                    span.set_attribute("error.status_code", getattr(e, "status_code", 0))
                    span.record_exception(e)
                    span.set_status(trace.StatusCode.ERROR, str(e))
                    raise


# Usage — same as bare client
client = InstrumentedAnthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=500,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Explain caching in one paragraph."}]
)

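Every call now emits a span with the attributes you need for dashboards and alerts. For non-streaming calls the wrapper is transparent: the return value is the same Message object the bare client returns. It wraps messages.create only, and a call with stream=True returns a stream object with no usage field, so streaming needs its own instrumentation.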

Connecting to Grafana

Point the OTLP exporter at Grafana Alloy (Grafana's distribution of the OpenTelemetry Collector), which forwards traces to Tempo and metrics to Mimir:

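# For Grafana Cloud: the OTLP gateway speaks OTLP over HTTP, not gRPC
# (pip install opentelemetry-exporter-otlp-proto-http)
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# GRAFANA_API_KEY_BASE64 is base64("<instance_id>:<api_token>"), e.g. loaded from an env var
otlp_exporter = OTLPSpanExporter(
    endpoint="https://otlp-gateway-prod-eu-west-0.grafana.net/otlp/v1/traces",
    headers={
        "Authorization": f"Basic {GRAFANA_API_KEY_BASE64}"
    }
)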

Or self-hosted with a local collector:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Connecting to Datadog

The Datadog Agent can ingest OTLP directly, but the OTLP receiver is off by default. Point the exporter at the agent:

otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317"  # Datadog agent OTLP receiver
)

Then enable the receiver in datadog.yaml:

otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

Datadog shows gen_ai.* span attributes as tags on APM traces. Cost and token counts appear under "Metrics" only if you also emit them as metrics or generate span-based metrics from the traces; span attributes alone don't create metric series.

Key metrics to track

Once spans are flowing, build dashboards around these:

Latency:

p50, p95, p99 latency by model and by API endpoint
Time-to-first-token for streaming responses (requires streaming instrumentation — add a separate span for the first chunk)

Throughput and errors:

Requests per minute by model
Share of responses by gen_ai.response.finish_reasons: a rising share of max_tokens means output is being truncated
Rate limit errors per hour (indicates you need to request quota increase or add retry logic)

Cost:

Daily and hourly total cost (sum of gen_ai.usage.cost_usd)
Cost per API endpoint or feature (add app.feature as a span attribute)
Input vs output token ratio (high output/input ratio = model is generating a lot; could indicate missing max_tokens caps)

Cache performance (if using prompt caching):

Cache hit rate: cache_read_input_tokens / (input_tokens + cache_creation_input_tokens + cache_read_input_tokens), since input_tokens counts only uncached tokens
Add gen_ai.usage.cache_read_tokens and gen_ai.usage.cache_write_tokens as span attributes
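
As a worked example of the token ratios above, the sketch below computes them from per-call usage counts. The `Usage` dataclass and the function names are hypothetical; the field names mirror the token counters the Anthropic API reports, and the denominator assumes total input is the sum of uncached tokens, cache writes, and cache reads.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int                      # uncached input tokens
    output_tokens: int
    cache_creation_input_tokens: int = 0   # tokens written to the cache
    cache_read_input_tokens: int = 0       # tokens served from the cache


def total_input_tokens(u: Usage) -> int:
    """All input tokens the model processed, cached or not."""
    return u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens


def cache_hit_rate(u: Usage) -> float:
    """Fraction of input tokens that were read from the prompt cache."""
    total = total_input_tokens(u)
    return u.cache_read_input_tokens / total if total else 0.0


def output_input_ratio(u: Usage) -> float:
    """Output tokens per input token; a high value suggests long generations."""
    total = total_input_tokens(u)
    return u.output_tokens / total if total else 0.0


usage = Usage(input_tokens=200, output_tokens=300, cache_read_input_tokens=1800)
print(cache_hit_rate(usage))      # 1800 / 2000 = 0.9
print(output_input_ratio(usage))  # 300 / 2000 = 0.15
```

For dashboards, aggregate the numerators and denominators separately over a time window and divide afterwards; averaging per-call ratios would weight tiny calls the same as large ones.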

Alerting rules

Configure these alerts before going to production:

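# Prometheus-style alerting rules (PromQL); place them under a rule group when loading into Mimir or Prometheus.
# The metric names assume your app exports matching counters and a latency histogram;
# span attributes alone do not produce these series.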

# p95 latency > 10 seconds
- alert: LLMHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(gen_ai_latency_ms_bucket[5m]))) > 10000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "LLM p95 latency above 10s"

# Daily cost exceeding 150% of 7-day average
- alert: LLMCostSpike
  expr: |
    sum(increase(gen_ai_usage_cost_usd_total[24h])) > 
    sum(increase(gen_ai_usage_cost_usd_total[7d])) / 7 * 1.5
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "LLM cost spike — daily spend >150% of 7-day average"

# Error rate > 5%
- alert: LLMHighErrorRate
  expr: |
    sum(rate(gen_ai_requests_total{status="error"}[5m])) /
    sum(rate(gen_ai_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical

# Rate limit errors > 10/hour
- alert: LLMRateLimiting
  expr: increase(gen_ai_errors_total{error_type="rate_limit"}[1h]) > 10
  for: 0m
  labels:
    severity: warning

The cost spike alert is the one that saves you from runaway loops. Without it, an agent stuck in a retry loop can burn hundreds of dollars before anyone notices.
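
To see what the cost spike rule evaluates, here is the same comparison in plain Python. The function name and the sample figures are made up for illustration; the 1.5 multiplier and the 7-day window come from the rule above.

```python
def is_cost_spike(daily_costs_usd: list[float], multiplier: float = 1.5) -> bool:
    """Return True if the most recent day exceeds `multiplier` times the 7-day average.

    daily_costs_usd holds the last 7 daily totals, oldest first. The final entry
    is the most recent 24 hours and, as in the PromQL rule, it is part of the
    7-day window it is compared against.
    """
    if len(daily_costs_usd) != 7:
        raise ValueError("expected exactly 7 daily totals")
    last_24h = daily_costs_usd[-1]
    seven_day_average = sum(daily_costs_usd) / 7
    return last_24h > seven_day_average * multiplier


normal_week = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 44.0]
runaway_loop = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 310.0]

print(is_cost_spike(normal_week))   # False: 44 is below 1.5 x 40.57
print(is_cost_spike(runaway_loop))  # True: 310 is above 1.5 x 78.57
```

Because the most recent day is included in its own baseline, a spike raises the average it is measured against. In effect the rule fires once the last 24 hours exceed roughly 1.64 times the average of the previous six days, which is worth knowing when you tune the multiplier.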

OTel vs LangSmith vs Braintrust

These tools have different purposes. You'll likely use all three:

Tool	Focus	When to use
OpenTelemetry	Ops/infra — latency, errors, cost, availability	Production monitoring, alerting, cost attribution
LangSmith	ML eval — prompt traces, comparison, regression testing	Prompt development, debugging unexpected outputs
Braintrust	ML eval — eval dataset runs, score tracking over time	Systematic eval tracking across prompt versions

OTel answers "is the system healthy?" LangSmith and Braintrust answer "is the output quality good?" You need both kinds of tooling. The agent observability guide on LangSmith and Braintrust covers the eval side.

Adding context to spans

The instrumentation wrapper above logs model-level attributes. Add application-level context as span attributes to make traces actionable:

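from opentelemetry import trace

tracer = trace.get_tracer("support-app")

def handle_support_query(user_id: str, query: str, feature: str):
    # trace.get_current_span() here would return the caller's span (or a no-op span
    # if none is active), not the LLM span, which only exists inside
    # client.messages.create. Open a request span instead; the LLM span becomes
    # its child in the same trace.
    with tracer.start_as_current_span("handle_support_query") as span:
        # Add business context to the request span
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.feature", feature)  # e.g. "support_bot"
        span.set_attribute("app.query_length", len(query))

        response = client.messages.create(
            model="claude-haiku-3-5",
            max_tokens=500,
            messages=[{"role": "user", "content": query}]
        )
        return response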

Now you can answer: "Which feature is driving the most cost?" and "Which users are hitting rate limits?" — questions your ops team will ask within a week of launch.
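
If you want the application attributes on the LLM span itself rather than on a surrounding span, one possible approach is to stash them in a `contextvars.ContextVar` and have the wrapper copy them onto each span it creates. Everything in this sketch is hypothetical: `app_context`, `apply_app_context`, and the `FakeSpan` stand-in that keeps it runnable without a tracer. In the real wrapper you would call `apply_app_context(span)` right after the request attributes are set.

```python
import contextvars
from contextlib import contextmanager

# Holds the app-level attributes for the current request (thread- and async-safe).
_app_attributes = contextvars.ContextVar("app_attributes", default={})


@contextmanager
def app_context(**attributes):
    """Attach app-level attributes to every LLM span created inside the block."""
    merged = {**_app_attributes.get(), **attributes}
    token = _app_attributes.set(merged)
    try:
        yield
    finally:
        # Restore whatever was set before entering the block.
        _app_attributes.reset(token)


def apply_app_context(span) -> None:
    """Copy the current app attributes onto a span (any object with set_attribute)."""
    for key, value in _app_attributes.get().items():
        span.set_attribute(f"app.{key}", value)


class FakeSpan:
    """Stand-in for an OTel span so the sketch runs without a tracer."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


# Inside the block, spans pick up the request's attributes.
with app_context(feature="support_bot", user_id="u-123"):
    span = FakeSpan()
    apply_app_context(span)

# Outside the block, nothing is attached.
outside = FakeSpan()
apply_app_context(outside)

print(span.attributes)
print(outside.attributes)
```

With the attributes on the LLM span, cost per feature becomes a single group-by on one span type instead of a join between parent and child spans.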

For the full production readiness picture, the agent production checklist covers everything you need before go-live. The FastAPI + Claude production patterns guide shows how to structure the application layer that these traces feed from.

OTel isn't glamorous infrastructure, but it's what separates a toy LLM app from something you can operate confidently at scale.

