An AI agent without memory is like a colleague with amnesia who forgets your conversation the moment you leave the room. Every new message feels like a fresh start. You explain your project, they help you, you come back the next day — and they have no idea who you are.
Memory is what transforms a one-shot LLM call into an agent that can maintain context, learn from interactions, and provide coherent help across a conversation — and across sessions. Without it, your agent is stateless. With it, your agent becomes genuinely useful over time.
This lesson covers the four types of agent memory, when to use each, and practical strategies for managing memory as conversations grow.
The context window is your agent's working memory
Before diving into memory types, there's one foundational concept you need to understand: the context window.
Everything your agent "knows" at any given moment lives in its context window. The conversation history, the system prompt, retrieved documents, tool call outputs — all of it. The model can only reason about what's currently in that window.
The context window is finite. Claude's context window is 200,000 tokens (roughly 150,000 words); GPT-4o's is 128,000 tokens. These sound enormous, but a long-running agent conversation with tool outputs, retrieved documents, and multi-turn dialogue can fill that space faster than you'd expect.
Understanding the context window is the foundation of all agent memory design. Every memory strategy is essentially an answer to the same question: how do you make sure the right information is in the context window at the right time?
The four types of agent memory
Type 1: In-context memory (working memory)
In-context memory is the simplest form: it's just the conversation history currently in the context window.
Every time the user sends a message, it's appended to the context. Every time the agent responds, that response is also appended. By the time you're 20 messages in, the agent has a running transcript of everything that was said.
How it works: No special infrastructure required. The messages array grows with each turn. The model reads the full history on every call.
The limit: Context window size. Once you exceed it, either the API returns an error or you have to start dropping messages.
The cost problem: LLM APIs charge per token. A 50-message conversation might have 20,000 tokens of history — and you're paying to send all of that on every single call, even if most of it isn't relevant to the current question.
Use for: Short-to-medium conversations, single-session tasks, prototypes. If your agent sessions are typically under 30-40 exchanges, in-context memory is often all you need.
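The whole mechanism fits in a few lines. Here's a minimal sketch, where `call_llm` is a placeholder standing in for your actual LLM API client:

```python
def call_llm(messages):
    # Placeholder: a real implementation would call an LLM API here,
    # passing the full messages array on every request.
    return f"(reply to: {messages[-1]['content']})"

class InContextAgent:
    """In-context memory: the entire conversation history lives in a
    growing messages list that is re-sent with every call."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        # Append the user turn, send the FULL history, append the reply.
        self.messages.append({"role": "user", "content": user_text})
        reply = call_llm(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Note that `self.messages` only ever grows: this is exactly why both the context limit and the per-call token cost eventually bite.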
Type 2: External memory (long-term retrieval)
External memory stores information outside the context window — in a database, a vector store, or a search index — and retrieves it when needed.
How it works: Your agent has a search_memory(query) tool. When it needs to recall something, it calls that tool. The search returns the most relevant chunks, which get injected into the context for that specific call.
Think of it like the difference between trying to hold everything in your head versus having a filing cabinet you can search. You don't carry every file with you — you retrieve what's relevant when you need it.
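The filing-cabinet idea can be sketched as a store with a `search_memory(query)` method. A production system would use embeddings and a vector store; here a toy keyword-overlap score stands in so the store-and-retrieve flow is visible end to end:

```python
def score(query, text):
    # Toy relevance score: count of shared words. A real system would
    # use embedding similarity instead.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

class ExternalMemory:
    """External memory: chunks live outside the context window and are
    retrieved on demand."""

    def __init__(self):
        self.chunks = []

    def store(self, text):
        self.chunks.append(text)

    def search_memory(self, query, k=2):
        # Return the k most relevant chunks for injection into context.
        ranked = sorted(self.chunks, key=lambda c: score(query, c), reverse=True)
        return ranked[:k]
```

Only the top-k retrieved chunks get injected into the context for that call; everything else stays in the cabinet.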
What gets stored here:
- Conversation history from past sessions
- User preferences and past decisions
- Product documentation or knowledge bases
- Notes from previous tasks
The retrieval challenge: External memory is only as good as your retrieval. If the search doesn't surface the right chunk, the agent answers without it — and may hallucinate to fill the gap. Retrieval quality is a first-class concern.
Use for: Multi-session agents, large knowledge bases, personalized assistants, any case where the full conversation history exceeds what fits in context.
Type 3: Episodic memory (session summaries)
Episodic memory is a middle ground between keeping full conversation history and starting fresh each session. At the end of a session (or when the context starts getting full), you summarize the key information into a compact note and store it.
How it works: After a session ends, an LLM call compresses the conversation into the essential facts, decisions, and context. That summary is stored and injected at the start of the next session.
An example summary for an agent helping with a software project:
In our last 3 sessions, we established: the tech stack is Next.js + PostgreSQL, TypeScript is preferred, and the project is a B2B SaaS tool for inventory management. Last session: we debugged an issue where the auth middleware wasn't correctly attaching the user object to the request. Fix was to await the session lookup before calling next().
That's maybe 80 tokens — but it captures the context that would have taken 3,000 tokens to reconstruct from the raw conversation history.
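The compress-and-reinject cycle looks roughly like this. `summarize_llm` is a placeholder for a real LLM summarization call:

```python
def summarize_llm(transcript):
    # Placeholder: a real call would prompt the model, e.g.
    # "Summarize the key facts and decisions in this conversation."
    return "SUMMARY of %d turns" % len(transcript)

def end_session(messages, store):
    # Compress the finished session into one compact note and keep it.
    transcript = [m for m in messages if m["role"] != "system"]
    store.append(summarize_llm(transcript))

def start_session(system_prompt, store):
    # Inject prior session summaries as early context for the new session.
    context = system_prompt + "\n\nNotes from past sessions:\n" + "\n".join(store)
    return [{"role": "system", "content": context}]
```

The summaries accumulate in `store`, so each new session starts with the distilled history of every previous one rather than their raw transcripts.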
The tradeoff: Summaries lose nuance. If a detail wasn't deemed important enough to include in the summary, it's gone. For most use cases this is fine, but if exact recall matters, use external memory instead.
Use for: Long-running projects, personal AI assistants, customer service agents with history, any multi-session use case where full verbatim recall isn't required.
Type 4: Semantic memory (persistent facts)
Semantic memory stores explicit, structured facts about users, entities, or the domain. Unlike episodic memory (which compresses narrative), semantic memory extracts discrete facts.
How it works: The agent (or a separate extraction step) identifies key facts from conversations and stores them in a structured format — a database, a knowledge graph, or even a simple key-value store.
Examples:
user.name = "Priya"
user.preferred_language = "Python"
user.company = "Meridian Analytics"
product.pricing_tier = "Growth"
project.deadline = "2026-04-15"
These facts persist across all sessions. At the start of each conversation, the agent loads the relevant facts for that user and includes them in the system prompt or early context.
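Loading those facts at session start can be as simple as formatting a key-value store into the system prompt. A real store might be a database or a knowledge graph; a plain dict stands in here:

```python
# Persistent facts keyed by dotted paths, as in the examples above.
facts = {
    "user.name": "Priya",
    "user.preferred_language": "Python",
    "project.deadline": "2026-04-15",
}

def build_system_prompt(base, facts):
    # Render each stored fact as a line and append them to the base
    # system prompt so the model sees them on every call.
    lines = [f"{k} = {v}" for k, v in sorted(facts.items())]
    return base + "\n\nKnown facts:\n" + "\n".join(lines)
```

Because the facts are structured, loading them is a direct lookup per user rather than a semantic search, which is what makes this memory type fast and reliable.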
The extraction challenge: You need a reliable way to identify which facts are worth storing. This usually means either a dedicated extraction prompt that runs after each session, or explicit "remember this" functionality you give the user.
Use for: Personalization, customer profiles, domain knowledge, any facts that should persist indefinitely and be accessed quickly without semantic search.
Memory management strategies
Once your agent has been running for a while, you'll hit the context limit and need a strategy for handling it. The three main approaches:
Sliding window: Keep only the last N messages in context. Simple to implement and predictable in its token cost. The problem: early context is lost. If the user set important constraints in message 1, the agent won't remember them by message 30.
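A sliding window is a one-liner in spirit: keep the system prompt, drop the oldest non-system turns once you pass N. A minimal sketch:

```python
def sliding_window(messages, n):
    # Keep the system prompt plus only the last n non-system messages;
    # older turns fall off the front of the window.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-n:]
```

Pinning the system prompt outside the window is what keeps the agent's instructions intact; everything else, including early user constraints, is eventually lost.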
Summarization on overflow: When the context approaches the limit, pass the oldest portion to a summarization call and replace it with the summary. Key facts are retained in compressed form. The downside is you're making an extra LLM call every time the context overflows.
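In code, overflow handling is a check before each call: if the history exceeds a budget, compress the oldest portion into one summary message. `summarize` is a placeholder for the extra LLM call:

```python
def summarize(messages):
    # Placeholder for an LLM summarization call over the oldest turns.
    return "Summary of %d earlier messages" % len(messages)

def compact_if_needed(messages, max_messages):
    # Below the budget: nothing to do.
    if len(messages) <= max_messages:
        return messages
    # Over the budget: replace the oldest half with a single summary
    # message, keeping the recent half verbatim.
    half = len(messages) // 2
    old, recent = messages[:half], messages[half:]
    note = {"role": "system", "content": summarize(old)}
    return [note] + recent
```

A real implementation would budget in tokens rather than message count, but the shape is the same: one compression call each time the history crosses the threshold.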
Importance-based pruning: Score each message or chunk by its relevance to the current query. Keep high-relevance items, drop low-relevance ones. This is the most token-efficient approach but also the most complex to implement correctly.
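A toy version of importance-based pruning scores each message against the current query and keeps the top-k. Word overlap stands in for what would be embedding similarity in a real system:

```python
def relevance(query, message):
    # Toy importance score: shared-word count with the current query.
    q = set(query.lower().split())
    m = set(message["content"].lower().split())
    return len(q & m)

def prune(messages, query, k):
    # Keep the k highest-scoring messages, preserving their original
    # conversation order among the survivors.
    ranked = sorted(messages, key=lambda m: relevance(query, m), reverse=True)
    kept = ranked[:k]
    return [m for m in messages if m in kept]
```

The hard part in practice is the scoring function itself: a naive heuristic like this one will confidently drop messages that matter, which is why this approach is the most complex to get right.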
For most agents, start with summarization on overflow. It's a good balance of simplicity and retention.
When agents "forget" — what it looks like in practice
Understanding memory failure modes helps you debug real problems:
Mid-conversation forgetting: The agent knew something the user said in message 5, but by message 35 that part of the conversation has been pruned or has fallen out of the context window. The agent now acts as if it was never mentioned. The user gets frustrated because they feel like they're repeating themselves.
Session reset: No persistence between sessions. The user comes back the next day, and the agent has no idea who they are or what they've been working on. Common complaint: "I have to explain my whole project every single time."
Retrieval miss: The information exists in external memory, but the search query doesn't surface it. The agent answers without the relevant context, potentially hallucinating. This looks exactly like the agent "forgetting" something it should know — but the root cause is retrieval quality, not storage.
Each failure mode has a different fix. Knowing which type of forgetting you're dealing with saves significant debugging time.
Implementing memory in practice
The right approach depends on your framework and use case:
n8n: Use the Simple Memory node for session-level memory. For persistence across sessions, the PostgreSQL Chat Memory node stores conversation history in a database. For external retrieval, integrate with Pinecone or Supabase Vector.
LangChain/LangGraph: ConversationBufferMemory for in-context history, ConversationSummaryMemory for automatic summarization, and vector store retrievers for external memory. LangGraph's state management gives you explicit control over what information persists across graph steps.
Custom implementations: If you're building directly on an LLM API, you're managing the messages array yourself. Start simple: pass the full conversation history, and only add compression or external retrieval when you actually hit the context limit or cost ceiling.
The key decision: in-context vs. external memory. If your sessions are bounded and your context window is large enough to hold the full conversation, keep it in context — it's simpler and retrieval can't fail. Only move to external memory when you need cross-session continuity or when conversations regularly exceed your context limit.
Key takeaway
Memory type should match your use case. Most agents start with in-context memory and only need external memory when conversations span sessions or the knowledge base is too large to fit in context. Don't over-engineer memory architecture before you know you need it.
The progression for most teams: start with in-context memory (simple, no infrastructure). Add session summaries when users complain about context loss between sessions. Add external vector memory when the knowledge base grows beyond what fits in context. Add semantic memory when personalization and persistent user facts become important.
Next steps: To go deeper on how to optimize what information you put into the context window, see the Context Engineering (Advanced) and Context Engineering (Agents) lessons.