Every agent demo looks brilliant. A user sends a message, the agent reasons through it, calls a tool, returns a crisp answer. Everyone in the room nods.
Then you ship it to real users. And the #1 complaint lands within days: "Why does it keep asking me the same questions?" "I already told it my account number." "It doesn't remember anything we talked about last week."
That's not a model problem. That's a memory problem. And it's the most under-engineered part of most agent builds.
Why agents forget everything
Every LLM conversation is stateless by default. When a new session starts, the model has zero recollection of previous interactions. The context window only holds what you explicitly put in it.
For a one-shot task — "summarize this report," "translate this paragraph" — that's fine. But for agents doing real work with real users over time, it means every conversation starts cold. The user has to re-establish context. The agent has to re-ask for information it already collected. Trust erodes.
The fix isn't "make the context window bigger." Bigger windows help, but they don't solve the architectural problem: you need a way to persist, retrieve, and inject relevant information across sessions.
The 4 types of memory your agent actually needs
Before picking a tool, get clear on what kind of memory you're dealing with.
Working memory is the current context window. Everything in the current conversation lives here — messages, tool results, reasoning traces. It's fast and always accurate, but it disappears when the session ends and costs tokens proportional to how much you stuff in.
Episodic memory is a record of past interactions. "Last Tuesday you asked about your order #12345. We resolved it by issuing a replacement." This is what makes a support agent feel like it remembers you, not just your account data. Typically stored as conversation summaries or structured event logs.
Semantic memory holds persistent facts about the user and domain. User preferences, account details, product knowledge, decisions made in past sessions. "User prefers INR pricing." "User is on the Pro plan." "User's tech stack is Python + FastAPI." This is the memory type that makes agents genuinely useful over time.
Procedural memory is how to do things — the agent's capabilities and workflows. This usually lives in the system prompt and rarely changes dynamically. It's still memory; it's just the kind you write once at deploy time.
Most teams only implement working memory. Their agents feel smart in demos and dumb in production.
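Here's a rough sketch of how the four types might map onto a per-user store. The structure and field names are illustrative, not from any particular library:

    # Illustrative only: one way to organize the four memory types per user
    from dataclasses import dataclass, field

    @dataclass
    class AgentMemory:
        # Working memory: the live context window; gone when the session ends
        working: list[dict] = field(default_factory=list)   # chat messages, tool results
        # Episodic memory: records of past interactions
        episodic: list[str] = field(default_factory=list)   # ["Resolved order #12345 via replacement"]
        # Semantic memory: persistent facts about the user and domain
        semantic: list[str] = field(default_factory=list)   # ["Prefers INR pricing", "Plan: Pro"]
        # Procedural memory: capabilities and workflows, usually static
        procedural: str = ""                                 # effectively the system prompt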
Architecture patterns, ranked by complexity
Naive: full history in context
Dump the entire conversation history into every request. This works surprisingly well up to about 20 turns. Past that, you hit context limits, costs compound, and the model starts losing track of early context anyway — the "lost in the middle" effect is real.
Use this for: short-lived sessions with low turn counts. Don't use this for anything with returning users.
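For concreteness, the naive pattern is just an append-and-resend loop. A minimal sketch, using the same OpenAI-compatible client setup as the quickstart later in this lesson:

    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["AICREDITS_API_KEY"],
        base_url="https://api.aicredits.in/v1"
    )
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def chat_naive(message: str) -> str:
        history.append({"role": "user", "content": message})
        response = client.chat.completions.create(
            model="anthropic/claude-sonnet-4-6",
            messages=history  # the entire history, every request: cost grows each turn
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        return reply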
Summary compression
After every N turns (or when approaching the context limit), compress old turns into a rolling summary. Append that summary to new requests instead of the raw history.
Cheap and easy to implement. The problem: compression is lossy. Fine detail gets dropped. "User was frustrated about shipping delay on order #12345 but then satisfied after refund" becomes "user had an order issue that was resolved." You lose the specifics that actually help the agent respond well.
Good for casual assistants. Breaks down for support agents or anything where precise history matters.
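A rolling-summary loop might look like this, reusing the client from the sketch above. The turn thresholds and the summary prompt are arbitrary placeholders; tune them for your traffic:

    MAX_TURNS = 20    # compress once history exceeds this
    KEEP_RECENT = 6   # always keep the most recent turns verbatim

    def compress_history(history: list[dict]) -> list[dict]:
        if len(history) <= MAX_TURNS:
            return history
        old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        summary = client.chat.completions.create(
            model="anthropic/claude-sonnet-4-6",
            messages=[{"role": "user", "content":
                f"Summarize this conversation. Keep names, IDs, and decisions:\n{transcript}"}]
        ).choices[0].message.content
        # Replace old turns with one summary message; the lossy step happens here
        return [{"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}] + recent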
Vector store retrieval
Embed every conversation turn as it happens. At the start of each new session, embed the current user message and retrieve the top-K most semantically similar past turns.
This gives you good recall on relevant history without blowing up your context budget. The trade-off: you're adding 100–300ms of latency per turn for the embedding + retrieval step, and you need to manage an embedding model alongside your LLM.
Check out the RAG lesson for the foundational mechanics — the same similarity-search principles apply here, just to conversation history instead of documents.
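Here's a minimal in-memory sketch of the pattern. A real deployment would use a proper vector database, and the embedding model name here (text-embedding-3-small) is an assumption, not something every gateway necessarily exposes:

    import os
    import numpy as np
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["AICREDITS_API_KEY"], base_url="https://api.aicredits.in/v1")
    store: list[tuple[np.ndarray, str]] = []  # (embedding, turn text)

    def embed(text: str) -> np.ndarray:
        # Model name is a placeholder; use whatever embedding model your stack provides
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def remember_turn(text: str) -> None:
        store.append((embed(text), text))

    def recall(query: str, k: int = 5) -> list[str]:
        q = embed(query)
        def cosine(item: tuple[np.ndarray, str]) -> float:
            v, _ = item
            return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-9))
        return [text for _, text in sorted(store, key=cosine, reverse=True)[:k]]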
Hybrid fact extraction (the mem0 approach)
Instead of storing raw turns, extract discrete "memory facts" from each conversation in real time. "User prefers INR pricing." "User's order #12345 had a shipping delay." "User is building a FastAPI backend."
Store these as structured entries. Retrieve them semantically at the start of each session. This is what mem0 does — it runs a lightweight extraction pass after each turn to pull out meaningful facts, deduplicates them against existing memories, and builds a structured knowledge base per user.
The result: instead of retrieving a blob of old conversation text, you inject a clean list of relevant facts into the system prompt. Focused, low token count, high signal.
mem0 quickstart with full working code
Here's a minimal implementation. This connects to Claude via AICredits.in — INR billing, UPI top-up, no international card needed.
    import os
    from mem0 import Memory
    from openai import OpenAI

    # OpenAI-compatible client pointed at the AICredits gateway
    client = OpenAI(
        api_key=os.environ["AICREDITS_API_KEY"],
        base_url="https://api.aicredits.in/v1"
    )

    m = Memory()

    def chat(user_id: str, message: str) -> str:
        # Semantic search over this user's stored memories
        relevant = m.search(query=message, user_id=user_id, limit=5)
        # Note: some mem0 versions return {"results": [...]} rather than a bare list
        context = "\n".join(r["memory"] for r in relevant)

        response = client.chat.completions.create(
            model="anthropic/claude-sonnet-4-6",
            messages=[
                {"role": "system", "content": f"You are a helpful assistant.\n\nWhat you know about this user:\n{context}"},
                {"role": "user", "content": message}
            ]
        )
        reply = response.choices[0].message.content

        # Extract and store any new facts from this exchange
        m.add(f"User: {message}\nAssistant: {reply}", user_id=user_id)
        return reply
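Calling it across sessions looks like this (user IDs and messages are illustrative):

    print(chat("user-42", "Hi, I'm building a FastAPI backend. My order #12345 is delayed."))

    # Later, in a completely new session with no history carried over:
    print(chat("user-42", "Any update on my order?"))  # the agent already knows which order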
A few things to note: m.search() runs a semantic search against stored memories for this user_id. The results get injected into the system prompt as a compact context block. After getting the reply, m.add() processes the new exchange and extracts any new facts worth storing.
mem0 handles deduplication — if the user mentions their account number again, it won't store a second copy. It updates the existing memory entry.
By default, mem0 uses a local Qdrant instance for the vector store and OpenAI-compatible embeddings. In production you'll want to point it at a managed vector database and configure your own embedding model.
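That production setup is a config change rather than a code change. A sketch using Memory.from_config; the exact config keys vary by mem0 version, so verify against the docs before copying this:

    # Assumed config shape: check your mem0 version's docs for exact keys
    config = {
        "vector_store": {
            "provider": "qdrant",
            "config": {"host": "your-qdrant-host", "port": 6333, "collection_name": "agent_memories"}
        },
        "embedder": {
            "provider": "openai",
            "config": {"model": "text-embedding-3-small"}
        }
    }
    m = Memory.from_config(config)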
Letta (formerly MemGPT) for autonomous memory management
mem0 is great for structured fact extraction, but it still requires you to decide what gets stored. Letta takes a different approach: the agent itself decides what to remember, update, and delete.
In Letta's architecture, the agent has explicit tools for memory management — core_memory_append, core_memory_replace, archival_memory_insert, archival_memory_search. The agent calls these tools during its reasoning loop, the same way it would call any other tool.
This means a long-running autonomous agent can actively maintain its own knowledge base. It can notice "I've been assuming the user is in Mumbai but they just mentioned they're in Bangalore now" and update that memory entry without you wiring up any extraction logic.
The trade-off is complexity. The agent loop is more expensive, and you're trusting the model to make good memory management decisions. For transactional support agents, mem0's extraction approach is simpler and more reliable. For open-ended research agents or assistants that run over weeks, Letta's autonomy is worth it.
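To make the pattern concrete, here's what one of those memory tools looks like conceptually. This is an illustration of the idea, not Letta's actual implementation:

    # Illustrative only, not Letta's code: memory operations exposed as ordinary
    # tools that the model calls during its reasoning loop.
    core_memory = {"location": "Mumbai", "plan": "Pro"}

    def core_memory_replace(key: str, old_value: str, new_value: str) -> str:
        """Tool the agent calls when it notices a stored fact has gone stale."""
        if core_memory.get(key) == old_value:
            core_memory[key] = new_value
            return f"Updated {key}: {old_value} -> {new_value}"
        return f"No match for {key}={old_value}; memory unchanged."

    # Seeing "I'm in Bangalore now" mid-conversation, the model emits a tool call:
    # core_memory_replace(key="location", old_value="Mumbai", new_value="Bangalore")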
Self-hosting with Postgres + pgvector
If you want full SQL control and no external service dependency, pgvector is the self-hosted path. It adds vector similarity search directly to Postgres — no separate vector database to manage.
Schema:
CREATE TABLE agent_memories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON agent_memories USING ivfflat (embedding vector_cosine_ops);
At query time, you embed the incoming message and run:
SELECT content
FROM agent_memories
WHERE user_id = $1
ORDER BY embedding <=> $2
LIMIT 5;
The <=> operator is cosine distance. You get the 5 most semantically similar memories for this user.
You handle extraction yourself — after each turn, run a small extraction prompt to pull out facts worth storing, embed them, and insert into the table. More work than mem0, but you own the entire stack.
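A sketch of that extraction-and-insert step, assuming psycopg 3 and reusing the client from the quickstart. The extraction prompt and embedding model name are placeholders:

    import psycopg

    def store_memories(conn: psycopg.Connection, user_id: str, turn_text: str) -> None:
        # 1. Small extraction pass: pull out durable facts worth remembering
        facts = client.chat.completions.create(
            model="anthropic/claude-sonnet-4-6",
            messages=[{"role": "user", "content":
                f"Extract durable user facts from this exchange, one per line. "
                f"Reply NONE if there are none.\n\n{turn_text}"}]
        ).choices[0].message.content
        if facts.strip() == "NONE":
            return
        # 2. Embed and insert each fact for this user
        for fact in (line.strip() for line in facts.splitlines() if line.strip()):
            emb = client.embeddings.create(model="text-embedding-3-small", input=fact).data[0].embedding
            conn.execute(
                "INSERT INTO agent_memories (user_id, content, embedding) VALUES (%s, %s, %s::vector)",
                (user_id, fact, "[" + ",".join(map(str, emb)) + "]")  # pgvector text format
            )
        conn.commit()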
This is the right choice if you're already running Postgres, if you need to run memory queries alongside other SQL (e.g., joining against your users table), or if compliance requirements mean you can't use external services.
The benchmark that changed how I think about this
I ran a support agent in production for an e-commerce client — same Claude model, same tools, same system prompt. The only variable was memory.
Without memory: 31% of returning-customer queries resolved without re-asking for context.
With mem0: 84%.
That 53-point gap is not the LLM getting smarter. It's the agent knowing that the user already told it their order number, their issue, and their preferred resolution method in a previous session. It's not re-asking. It's not starting cold.
The resolution rate improvement also cut average handle time by ~40%, which directly reduced cost per resolution. Memory has a measurable ROI.
See the agent components lesson for how memory fits into the broader agent architecture alongside tools, planning, and action execution.
When to skip memory entirely
Not every agent needs memory. Adding it to the wrong use cases costs money and adds latency for zero benefit.
Skip memory for one-shot tasks — "summarize this document," "translate this paragraph," "extract these fields." There's no user state to track. Each request is self-contained.
Skip it for anonymous or guest users. You need a stable user_id to store memories against. If users aren't authenticated, you're just accumulating orphaned memory entries.
Skip it for cost-sensitive, high-volume pipelines where the per-request latency and embedding cost matter. A pipeline processing 100,000 documents per day doesn't need to remember what it processed yesterday.
Use memory when: users return across multiple sessions, the agent's usefulness compounds with user-specific context, or you need to avoid re-asking for information the user already provided.
Choosing your architecture
Start with working memory only. Ship it. Instrument what users are re-explaining in follow-up sessions — that's your signal for what memory would actually help.
If you have 10,000+ users and a managed stack, use mem0 or a similar extraction-based approach. It's the fastest path to production-quality episodic and semantic memory.
If you have specific compliance or infrastructure requirements, wire up pgvector and build your own extraction layer. More work, but you own it completely.
If your agent is long-running and autonomous — a research assistant, a project manager, an agent that runs for weeks — look at Letta for self-managed memory.
For the underlying mechanics of retrieval and embedding, the RAG lesson covers everything you need. For how memory connects to the rest of your agent's architecture, start with the agentic RAG deep-dive.
The agents that feel smart aren't using a better model. They remember.