I've shipped two production agents with LangChain. I debugged both of them with print statements because there was no other way. The abstractions swallowed the errors, the runtime types were lies, and the docs were three versions behind.
Pydantic AI is what I switched to. It's built by the Pydantic team, led by Samuel Colvin, the creator of Pydantic, and the philosophy shows. If you know Python, you can understand every line of a Pydantic AI agent. No magic, no hidden chains, no runtime surprises.
This post covers everything you need to go from zero to a production-ready agent: typed outputs, tools, dependency injection, multi-turn conversations, streaming, and observability.
Why Pydantic AI exists
LangChain and LangGraph are powerful. They're also genuinely frustrating to work with for any Python developer who has opinions about type safety.
The core problem: LLM responses are strings. Everything in a LangChain chain is Any. Your IDE can't help you. Your tests are hard to write because there's nothing concrete to assert against. Runtime errors happen deep in abstraction layers.
Pydantic AI's answer: make the output a Pydantic model. When you pass result_type=MyModel to the agent, the framework validates the model's reply and hands you back a typed instance, or raises an error once its retries are exhausted. The IDE knows what .issues and .suggestions are. The test asserts against real attributes.
pip install pydantic-ai anthropic
The core primitives
Pydantic AI has four concepts. That's it.
Agent: the main orchestrator. Holds the model, system prompt, tools, and result_type.
Tool: a Python function the model can call. Decorated with @agent.tool (receives a RunContext) or @agent.tool_plain (no context). Has full type annotations.
RunContext: the dependency injection container. Passed as the first argument to every @agent.tool function. Carries your dependencies (DB connections, HTTP clients, config, anything a tool needs) on ctx.deps.
ModelRetry: raise this from a tool to tell the model its input was wrong and it should try again with corrected parameters.
Build a code reviewer agent
Here's a complete, runnable example. The agent reviews code and returns a structured result with typed fields.
import asyncio

from pydantic import BaseModel, Field
from pydantic_ai import Agent


class ReviewResult(BaseModel):
    issues: list[str] = Field(description="Bugs, security issues, or correctness problems")
    suggestions: list[str] = Field(description="Style, performance, or readability improvements")
    score: int = Field(ge=1, le=10, description="Overall code quality score")
    safe_to_merge: bool = Field(description="Whether the code is safe to merge as-is")


agent = Agent(
    "anthropic:claude-sonnet-4-6",  # provider-prefixed model name
    result_type=ReviewResult,
    system_prompt=(
        "You are a senior engineer reviewing Python code. "
        "Be specific about issues: include line numbers or variable names where relevant. "
        "Score 1-10 where 7+ means the code is production-ready."
    ),
)


async def main() -> None:
    result = await agent.run(
        "def get_user(id): return db.query(f'SELECT * FROM users WHERE id={id}')"
    )
    # Fully typed, so IDE autocomplete works here
    print(result.data.issues)  # e.g. ['SQL injection via f-string interpolation']
    print(result.data.score)  # e.g. 2
    print(result.data.safe_to_merge)  # e.g. False


asyncio.run(main())
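The result.data attribute is a validated ReviewResult instance. If the model returns something that doesn't validate, Pydantic AI sends the validation error back to the model and retries automatically (up to a configurable limit). Newer Pydantic AI releases rename result_type to output_type and result.data to result.output; the examples here use the older names.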
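The validate-then-retry idea can be sketched with the standard library alone. This is a conceptual illustration, not Pydantic AI's internals: Review, parse_review, run_with_retries, and the stubbed model replies are all hypothetical names introduced for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    """Stdlib stand-in for the ReviewResult model (hypothetical)."""
    issues: list
    score: int


def parse_review(raw: str) -> Review:
    """Parse and validate one model reply; raise ValueError on bad data."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    issues, score = data.get("issues"), data.get("score")
    if not isinstance(issues, list):
        raise ValueError("issues must be a list")
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 10:
        raise ValueError("score must be an integer between 1 and 10")
    return Review(issues=issues, score=score)


def run_with_retries(ask_model: Callable[[str], str], prompt: str, max_retries: int = 1) -> Review:
    """Call the model, validate the reply, and feed errors back on failure."""
    message = prompt
    last_error = None
    for _attempt in range(max_retries + 1):
        raw = ask_model(message)
        try:
            return parse_review(raw)
        except ValueError as exc:
            last_error = exc
            # Tell the model what was wrong so it can correct itself
            message = f"{prompt}\n\nYour previous reply was rejected: {exc}"
    raise RuntimeError(f"gave up after {max_retries + 1} attempts: {last_error}")


# Hypothetical model stub: the first reply is out of range, the second is valid
replies = iter([
    '{"issues": ["SQL injection"], "score": 11}',
    '{"issues": ["SQL injection"], "score": 2}',
])
review = run_with_retries(lambda _msg: next(replies), "Review this code")
print(review)
```

Each rejected reply costs another model call, which is why the retry limit matters for both latency and token spend.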
Tools with dependency injection
The dependency injection system is the feature that makes testing actually possible. Instead of accessing a database through a global or a closure, you pass it through RunContext.
import os
from dataclasses import dataclass

import httpx
from pydantic_ai import Agent, RunContext


@dataclass
class ReviewDeps:
    http_client: httpx.AsyncClient
    github_token: str


agent = Agent(
    "anthropic:claude-sonnet-4-6",
    result_type=ReviewResult,
    deps_type=ReviewDeps,
    system_prompt="You are a code reviewer with access to GitHub PR diffs.",
)


@agent.tool
async def fetch_pr_diff(ctx: RunContext[ReviewDeps], pr_url: str) -> str:
    """Fetch the diff for a GitHub pull request URL."""
    # Extract owner/repo/number from a URL like https://github.com/<owner>/<repo>/pull/<number>
    parts = pr_url.rstrip("/").split("/")
    owner, repo, pr_number = parts[-4], parts[-3], parts[-1]
    response = await ctx.deps.http_client.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/files",
        headers={"Authorization": f"Bearer {ctx.deps.github_token}"},
    )
    files = response.json()
    # Some entries (binary files, for example) carry no "patch" key, so skip them
    return "\n".join(f["patch"] for f in files if f.get("patch"))


# Running it
async def review_pr(pr_url: str):
    async with httpx.AsyncClient() as client:
        deps = ReviewDeps(
            http_client=client,
            github_token=os.environ["GITHUB_TOKEN"],
        )
        result = await agent.run(
            f"Review this PR: {pr_url}",
            deps=deps,
        )
        return result.data
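The positional slicing in fetch_pr_diff assumes the URL ends exactly at the PR number, so a link to the "files" tab would be misread. A stricter parser might look like the sketch below; parse_pr_url is a hypothetical helper, and the octocat URL is only an illustrative example.

```python
import re
from urllib.parse import urlparse

# Matches paths like /<owner>/<repo>/pull/<number>, with optional trailing segments
_PR_PATH = re.compile(r"^/([^/]+)/([^/]+)/pull/(\d+)(?:/.*)?$")


def parse_pr_url(pr_url: str) -> tuple[str, str, int]:
    """Return (owner, repo, number) or raise ValueError for anything else."""
    parsed = urlparse(pr_url)
    if parsed.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {pr_url!r}")
    match = _PR_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"not a pull request URL: {pr_url!r}")
    owner, repo, number = match.groups()
    return owner, repo, int(number)


print(parse_pr_url("https://github.com/octocat/hello-world/pull/42/files"))
```

Inside a tool, the ValueError could be caught and re-raised as a ModelRetry so the model gets a chance to supply a well-formed URL.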
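In tests, you pass a mock client as http_client. No monkeypatching, no unittest.mock.patch gymnastics. (The model call itself is a separate concern: Pydantic AI ships a TestModel that can be swapped in with agent.override to keep tests offline.) The test is just:
async def test_review_pr():
    # MockHttpClient and FAKE_DIFF are test helpers you define yourself
    mock_client = MockHttpClient(response=FAKE_DIFF)
    deps = ReviewDeps(http_client=mock_client, github_token="test")
    result = await agent.run("Review this PR: ...", deps=deps)
    assert result.data.score >= 1
    assert isinstance(result.data.issues, list)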
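MockHttpClient and FAKE_DIFF are not defined in the test above. One possible shape for them, assuming the tool only ever calls get() and then .json() on the response, is sketched here with the standard library. The class names and the payload are hypothetical.

```python
import asyncio


class MockResponse:
    """Minimal response object exposing only .json()."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class MockHttpClient:
    """Records requests and returns a canned payload from get()."""

    def __init__(self, response):
        self._response = response
        self.calls = []

    async def get(self, url, headers=None):
        self.calls.append((url, headers))
        return MockResponse(self._response)


# Shaped like a list of changed files; the second entry has no "patch" key
FAKE_DIFF = [
    {"filename": "app.py", "patch": "@@ -1 +1 @@\n-print('hi')\n+print('hello')"},
    {"filename": "logo.png"},
]


async def demo():
    client = MockHttpClient(response=FAKE_DIFF)
    response = await client.get(
        "https://api.github.com/repos/o/r/pulls/1/files",
        headers={"Authorization": "Bearer test"},
    )
    files = response.json()
    return client, "\n".join(f["patch"] for f in files if f.get("patch"))


client, diff = asyncio.run(demo())
print(diff)
```

The mock is duck-typed, so a strict type checker may flag it where ReviewDeps expects an httpx.AsyncClient. At runtime only the get() and json() methods matter.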
ModelRetry: tell the model it got something wrong
When a tool call receives invalid input, raise ModelRetry. The model sees the error message and tries again with corrected parameters.
from pydantic_ai import ModelRetry


@agent.tool
async def search_codebase(ctx: RunContext[ReviewDeps], query: str) -> str:
    """Search the codebase for relevant files matching a query."""
    if len(query) < 3:
        raise ModelRetry("Query too short: provide at least 3 characters for meaningful search")
    # Assumes ReviewDeps has been extended with a search_index field
    results = await ctx.deps.search_index.search(query)
    if not results:
        raise ModelRetry(f"No results for '{query}': try a broader term or a different keyword")
    return "\n".join(r.path for r in results[:10])
This is better than returning an empty list or an error string because the model actively recovers instead of silently moving on.
Multi-turn conversations
agent.run() processes a single message. For multi-turn conversations, pass message_history from the previous result:
# First turn
result1 = await agent.run("Review this function: def foo(x): return x * 2")
print(result1.data)
# Second turn — model remembers the first
result2 = await agent.run(
    "What if I add type annotations?",
    message_history=result1.new_messages(),
)
print(result2.data)
result.new_messages() returns only the messages from that specific run. result.all_messages() returns everything, including the history you passed in. For a chat interface, pass the latest result's all_messages() as message_history on each new turn so the full transcript carries forward.
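The difference between the two accessors is easiest to see in a loop. The sketch below uses a stdlib stand-in rather than a real agent: FakeResult and fake_run are hypothetical, and messages are plain strings instead of Pydantic AI's message objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical stand-in for a run result."""
    history: list = field(default_factory=list)
    new: list = field(default_factory=list)

    def new_messages(self):
        # Only what this run produced
        return list(self.new)

    def all_messages(self):
        # The history passed in, plus what this run produced
        return self.history + self.new


def fake_run(prompt, message_history=None):
    """Pretend agent call: adds one request and one response message."""
    return FakeResult(
        history=list(message_history or []),
        new=[f"user: {prompt}", f"model: reply to {prompt!r}"],
    )


history = []
for prompt in ["Review foo", "Add type hints?", "Anything else?"]:
    result = fake_run(prompt, message_history=history)
    history = result.all_messages()  # carry the full transcript forward

print(len(history), len(result.new_messages()))
```

Passing new_messages() instead would work for the second turn but drop the first turn from the third, because each run would only see its immediate predecessor.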
Streaming
For real-time token delivery to a UI:
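async with agent.run_stream("Review this code: ...") as response:
    # With a structured result_type, stream() yields partially validated results
    async for partial in response.stream():
        print(partial)
    # After the stream, the fully validated result is available
    final = await response.get_data()
    print(final.score)
Streaming works with result_type, but what you iterate over is not raw tokens: stream() re-validates the growing response in partial mode and yields the result so far, and get_data() returns the fully validated instance once the stream ends. Fields without defaults may not validate until they arrive, so a model meant for streaming usually benefits from optional fields. For plain-text agents (result_type=str), stream_text() yields the text as it arrives.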
Sync vs async
agent.run() has a blocking counterpart. agent.run_sync() blocks until completion:
# For scripts, CLIs, or tests that don't use asyncio
result = agent.run_sync("Review this: ...")
print(result.data.issues)
Use run_sync in scripts and tests. Use run (async) in FastAPI endpoints, web apps, or anywhere else already running an event loop, since a blocking call cannot be made from inside a running loop.
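The event-loop constraint can be reproduced with asyncio alone. This is an analogy rather than Pydantic AI's actual implementation: fake_agent_run and fake_run_sync are hypothetical stand-ins, and the real run_sync may be built differently.

```python
import asyncio


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for an async agent call."""
    await asyncio.sleep(0)
    return f"reviewed: {prompt}"


def fake_run_sync(prompt: str) -> str:
    """Blocking wrapper: starts an event loop and waits for the coroutine."""
    coro = fake_agent_run(prompt)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "never awaited" warning before re-raising
        raise


# Fine in a script: no event loop is running yet
script_result = fake_run_sync("def foo(): ...")


async def handler() -> tuple[str, str]:
    """Simulates code that already runs inside an event loop."""
    try:
        fake_run_sync("def foo(): ...")
        error = ""
    except RuntimeError as exc:
        error = str(exc)
    # The async call is the right tool here
    result = await fake_agent_run("def foo(): ...")
    return error, result


error, awaited_result = asyncio.run(handler())
print(script_result)
print(error)
```

The blocking wrapper succeeds at module level and fails inside handler(), which mirrors the script versus web-endpoint split described above.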
Model switching
The model is a string argument. Swapping models is one line:
# Other Agent arguments omitted for brevity
# Claude
agent = Agent("anthropic:claude-sonnet-4-6", result_type=ReviewResult)
# GPT-4o
agent = Agent("openai:gpt-4o", result_type=ReviewResult)
# Gemini
agent = Agent("google-gla:gemini-2.5-pro", result_type=ReviewResult)
# Local with Ollama
agent = Agent("ollama:llama3.2", result_type=ReviewResult)
The tool interface, dependency injection, and streaming API are identical across all models. The framework handles the model-specific API formats internally.
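Because the model is just a string, it can come from configuration instead of code. A minimal sketch, assuming a hypothetical REVIEW_MODEL environment variable and a "provider:model" naming scheme like the strings above:

```python
import os

DEFAULT_MODEL = "openai:gpt-4o"


def pick_model(env=os.environ) -> str:
    """Return the model string from REVIEW_MODEL (hypothetical), else the default."""
    model = env.get("REVIEW_MODEL", DEFAULT_MODEL).strip()
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>:<model>', got {model!r}")
    return model


print(pick_model({}))
print(pick_model({"REVIEW_MODEL": "google-gla:gemini-2.5-pro"}))
```

The returned string would then be passed as the first argument to Agent, so staging and production can run different models without a code change.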
Observability with Logfire
Pydantic AI integrates directly with Logfire (Pydantic's observability platform). A couple of lines of setup give you full traces:
import logfire
logfire.configure()
logfire.instrument_pydantic_ai()
# Now every agent.run() call emits a trace with:
# - model used and tokens consumed
# - each tool call with inputs and outputs
# - validation results
# - total latency
For teams not using Logfire, Pydantic AI also emits OpenTelemetry spans, so any OTel-compatible backend (Datadog, Honeycomb, Jaeger) works.
Pydantic AI vs the alternatives
| | Pydantic AI | LangChain | LangGraph | Raw SDK |
|---|---|---|---|---|
| Type safety | Full (result_type) | Minimal | Minimal | None |
| Testability | Easy (DI) | Hard | Hard | Easy |
| Learning curve | Low | High | Medium | Low |
| Streaming | Built-in | Complex | Complex | Built-in |
| Multi-agent | Basic | Rich | Rich | Manual |
| Observability | Logfire/OTel | LangSmith | LangSmith | Manual |
| When to choose | Single agents, type-safety | Large ecosystem needed | Complex graphs | Full control |
LangGraph wins for genuinely complex multi-agent graphs with conditional routing and persistent state. If you're building something like that, use it. For everything else — a single agent with tools, a structured output pipeline, a chat assistant — Pydantic AI is faster to build and easier to maintain.
The agent components lesson covers the conceptual framework that applies to any of these libraries.
A complete production example
Here's a support ticket classifier that pulls from a customer database, classifies the ticket, and returns structured routing instructions:
from dataclasses import dataclass

import asyncpg
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


class TicketClassification(BaseModel):
    category: str = Field(description="One of: billing, technical, account, refund, other")
    priority: int = Field(ge=1, le=5, description="1=low, 5=critical")
    suggested_team: str = Field(description="Team to route to: support, billing, engineering, vip")
    summary: str = Field(description="One sentence summary of the issue")
    needs_human: bool = Field(description="Whether this requires a human agent")


@dataclass
class ClassifyDeps:
    db: asyncpg.Connection


classifier = Agent(
    "anthropic:claude-haiku-4-5-20251001",  # Haiku: cheap for classification
    result_type=TicketClassification,
    deps_type=ClassifyDeps,
    system_prompt=(
        "Classify incoming support tickets. Be consistent: the same type of request "
        "should always get the same category and priority."
    ),
)


@classifier.tool
async def get_customer_tier(ctx: RunContext[ClassifyDeps], email: str) -> str:
    """Look up a customer's subscription tier to inform priority."""
    row = await ctx.deps.db.fetchrow(
        "SELECT tier FROM customers WHERE email = $1", email
    )
    return row["tier"] if row else "unknown"


async def classify_ticket(email: str, message: str, db: asyncpg.Connection):
    result = await classifier.run(
        f"Customer: {email}\nMessage: {message}",
        deps=ClassifyDeps(db=db),
    )
    return result.data
This runs in a FastAPI endpoint, processes tickets at ~200ms each (Haiku), and the typed output feeds directly into your ticketing system without string parsing.
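To illustrate what "without string parsing" means downstream, here is a sketch of a routing step that branches on typed fields. Ticket is a stdlib stand-in mirroring a few TicketClassification fields, and route plus the queue names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Stdlib stand-in mirroring a few TicketClassification fields."""
    category: str
    priority: int
    suggested_team: str
    needs_human: bool


def route(ticket: Ticket) -> str:
    """Pick a queue name (hypothetical naming scheme) from the classification."""
    if ticket.needs_human or ticket.priority >= 4:
        return f"{ticket.suggested_team}-urgent"
    return f"{ticket.suggested_team}-standard"


print(route(Ticket("billing", 5, "billing", False)))
print(route(Ticket("technical", 2, "engineering", False)))
```

The routing logic reads attributes directly and never inspects free text, so it can be unit-tested without calling a model at all.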
What to watch out for
Don't over-type: if you just want a string or a simple bool back, you don't need a Pydantic model. result_type=str works fine. Start simple.
Token costs for retries: validation failures trigger retries, which cost tokens. If you're seeing many retries, your result_type schema might be too strict or your system prompt needs to explain the expected format better.
The result_type is not the full response: result.data is the structured result. result.all_messages() is the full conversation. For structured output tasks, you usually just need result.data.
Pydantic AI is still maturing: the multi-agent primitives are basic compared to LangGraph. Complex agent graphs with dynamic routing and persistent checkpoints are better handled by LangGraph or the OpenAI Agents SDK for now. Pydantic AI's sweet spot is single-agent systems where type safety and testability matter.
The function calling lesson covers the underlying mechanics that all these frameworks are built on top of.



