Infrastructure

Context Engineering: Storage, Retrieval, and the New Memory Stack

Balys Kriksciunas · 8 min read
#ai #infrastructure #context-engineering #memory #rag #agents #vector-database


RAG answered the question “how does my LLM know my company’s docs?” with a vector database and a retrieval step. It worked. It still works. But anyone building agents in 2026 has discovered that RAG alone is woefully insufficient.

Production agents need memory stacks — multiple storage systems, each for a different kind of context, wired together. The emerging practice of designing these stacks is called context engineering, and it’s one of the most impactful disciplines in AI infrastructure right now.


Why One Store Isn’t Enough

A production agent needs to answer questions like:

- What did the user just say, and what is the current plan?
- What happened earlier in this session?
- What do we already know about this user?
- Which document answers this question?
- Have we done a similar task before, and how did it go?

Each of these has a different shape: different data models, different latency budgets, different write patterns, different retention requirements. No single store serves all of them well.

Teams that try to stuff everything into one vector database end up with slow retrieval, awkward data shapes, and agents that forget obvious things. Teams that split into a proper memory stack ship measurably better agents.


The Five Memory Tiers

A useful decomposition:

1. Working memory

The current turn. The agent’s “scratchpad.” Short-lived, ephemeral, lives in-process.

Implementation: the agent framework’s state (LangGraph’s GraphState, CrewAI’s task state). In Redis if you need cross-worker resumability.

What lives here: current user input, current plan, pending tool calls, partial results.

Latency budget: single-digit milliseconds.
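As a rough sketch of what this tier looks like in code — the field names here are illustrative, not a framework's actual schema — working memory is just a typed, in-process object:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class WorkingMemory:
    """In-process scratchpad for the current turn (field names illustrative)."""
    user_input: str = ""
    plan: list[str] = field(default_factory=list)
    pending_tool_calls: list[dict[str, Any]] = field(default_factory=list)
    partial_results: dict[str, Any] = field(default_factory=dict)

wm = WorkingMemory(user_input="Summarize Q3 revenue")
wm.plan.append("fetch revenue report")
```

Frameworks give you this object for free; the point is that it lives in process memory and dies with the turn.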

2. Short-term memory

The current session. Lasts minutes to hours.

Implementation: Redis or in-memory cache with TTL; also the workflow-engine state if using Temporal/Inngest.

What lives here: recent message history, last few tool outputs, session-specific context.

Latency budget: tens of milliseconds.
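The essential behavior is "key-value with TTL" — in production this would be Redis (`SETEX`), but the semantics can be sketched with an in-memory stand-in:

```python
import time

class SessionCache:
    """Minimal in-memory stand-in for a Redis-style TTL cache (sketch only)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily expire stale entries
            return None
        return value

cache = SessionCache()
cache.set("session:42:messages", ["hi", "hello"], ttl_seconds=3600)
```

The TTL is the forgetting mechanism: sessions clean themselves up without a pruning job.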

3. Long-term memory

Durable facts about users, tasks, and entities. Lasts indefinitely.

Implementation: Postgres (or equivalent). Structured tables for known fields, JSON for flexible ones.

What lives here: user profile fields, stated preferences, entity records (people, accounts, projects).

Latency budget: <50ms for typical queries.
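The "structured tables for known fields, JSON for flexible ones" split looks like this — sketched with SQLite as a stand-in for Postgres, with an illustrative schema:

```python
import json
import sqlite3

# Known fields get real columns; everything speculative goes in a JSON blob.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_memory (
        user_id  TEXT PRIMARY KEY,
        name     TEXT,
        timezone TEXT,
        prefs    TEXT  -- JSON blob for flexible/evolving fields
    )
""")
conn.execute(
    "INSERT INTO user_memory VALUES (?, ?, ?, ?)",
    ("u1", "Ada", "Europe/Vilnius", json.dumps({"tone": "concise"})),
)
row = conn.execute(
    "SELECT name, prefs FROM user_memory WHERE user_id = ?", ("u1",)
).fetchone()
prefs = json.loads(row[1])
```

In Postgres you would use a `jsonb` column for `prefs` and index the fields you query on.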

4. Semantic memory

Retrievable knowledge. What RAG was originally built for.

Implementation: vector DB + BM25 + reranker. See Hybrid Search in Production.

What lives here: company docs, knowledge-base articles, other chunked reference material.

Latency budget: <200ms for typical queries.
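The vector-plus-BM25 combination needs a merge step. One common choice is reciprocal rank fusion; a minimal sketch (doc IDs and the `k=60` constant are the conventional defaults, not specific to any library):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: merge ranked doc-id lists from multiple
    retrievers (e.g. vector and BM25) into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

A cross-encoder reranker would then rescore the fused top-k before anything enters the prompt.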

5. Episodic memory

Past trajectories. Sessions the agent completed, what worked, what didn’t.

Implementation: Data warehouse (Snowflake, BigQuery, ClickHouse) for bulk storage; a vector index over summaries for retrieval.

What lives here: completed session transcripts (or summaries of them), task outcomes, failure annotations.

Latency budget: <500ms (lower-frequency access).


Reference Memory Stack

A typical 2026 production agent memory stack:

[ Agent workflow ]
      ├── Working memory → in-process state (LangGraph / framework state)
      ├── Short-term     → Redis (session messages, hot cache)
      ├── Long-term      → Postgres (profile, preferences, entities)
      ├── Semantic       → Qdrant / pgvector (vector search + BM25 + rerank)
      ├── Graph          → Neo4j / Memgraph (entity relationships)
      ├── Episodic       → BigQuery / ClickHouse (past sessions) + vector index
      └── Environment    → tool-specific state (calendar, CRM, etc.)

Not every agent needs all tiers. Start with working + short-term + semantic. Add the rest as agent capabilities mature.
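One way to keep the tiers composable is a thin facade that assembles per-turn context from whichever backends exist — a sketch with dict-backed stubs standing in for Redis, Postgres, and the vector DB (all names illustrative):

```python
class MemoryStack:
    """Facade that assembles per-turn context from the memory tiers.
    Backends are injected, so tiers can be added as capabilities mature."""
    def __init__(self, short_term, long_term, semantic):
        self.short_term = short_term  # session cache (Redis in production)
        self.long_term = long_term    # durable store (Postgres in production)
        self.semantic = semantic      # retrieval callable (vector DB + rerank)

    def build_context(self, user_id, session_id, query):
        return {
            "profile": self.long_term.get(user_id, {}),
            "recent_messages": self.short_term.get(session_id, []),
            "knowledge": self.semantic(query),
        }

# Stubs for illustration; swap in real clients behind the same interface.
stack = MemoryStack(
    short_term={"s1": ["hi"]},
    long_term={"u1": {"tz": "UTC"}},
    semantic=lambda q: [f"doc matching {q!r}"],
)
ctx = stack.build_context("u1", "s1", "pricing")
```

Starting with working + short-term + semantic means two of these backends plus the framework's own state object.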


Graph Stores: The Addition That Matters

Where does graph fit? Mostly between long-term and semantic.

Agents reasoning about entities and relationships benefit from a graph. “Who works at ACME?” “Which projects is this user involved in?” “What tools does this user have access to?” These are graph queries.

Options: a dedicated graph database (Neo4j, Memgraph), or plain Postgres with an edge table and recursive CTEs while the graph is small.

For most agents, a graph DB is overkill for v1. Add it when you notice you’re making many small queries across tables to reconstruct relationships — that’s the graph pattern asking for its proper tool.
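To make "the graph pattern" concrete: queries like "who works at ACME?" are one-hop lookups over an edge list. A toy sketch in plain Python (the edge schema is illustrative):

```python
from collections import defaultdict

# (subject, relation, object) triples standing in for a graph store.
edges = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("alice", "member_of", "project_x"),
]

# Index edges by (relation, object) so one-hop queries are a single lookup.
by_relation = defaultdict(list)
for subj, rel, obj in edges:
    by_relation[(rel, obj)].append(subj)

def who(relation, obj):
    """Answer 'who has <relation> to <obj>?' — a one-hop graph query."""
    return by_relation[(relation, obj)]

acme_staff = who("works_at", "acme")
```

When these lookups become multi-hop and cross many tables, that is the signal to move to a real graph store.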


Context Window Management

The agent-facing question: given all this memory, what goes into the prompt?

Context window management is its own discipline. Patterns:

Static + Dynamic Split

Part of the context is fixed (system prompt, user profile). Part is retrieved per-turn (recent messages, semantic hits).

Allocate budget: system prompt gets X tokens, profile Y, recent messages Z, retrieval R. Fail loudly if budget is exceeded — don’t silently truncate.
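A minimal sketch of the fail-loudly budget check — the whitespace tokenizer and the budget numbers are illustrative placeholders for a real tokenizer and tuned limits:

```python
def assemble_context(parts, budgets, count_tokens=lambda s: len(s.split())):
    """Enforce a per-part token budget; raise instead of silently truncating.
    `parts` and `budgets` map part name -> text / max tokens."""
    out = []
    for name, text in parts.items():
        used = count_tokens(text)
        if used > budgets[name]:
            raise ValueError(f"{name} over budget: {used} > {budgets[name]}")
        out.append(text)
    return "\n\n".join(out)

prompt = assemble_context(
    {"system": "You are a helpful agent.", "profile": "Prefers concise answers."},
    {"system": 50, "profile": 50},
)
```

The exception is the feature: an over-budget part is a bug in the retrieval or summarization layer, and truncation would hide it.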

Summarization

Long sessions exceed context window. Summarize old turns into a running recap.

Pattern: keep the last N turns verbatim; summarize turns older than that; occasionally re-summarize the summary itself so it stays compact.

Implementation: a dedicated “summarizer” LLM call runs asynchronously; summary is written back to the session state.
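The keep-recent/summarize-old split can be sketched as a pure function, with the summarizer LLM call stubbed out (the stub and `keep_last` value are illustrative):

```python
def compact_history(turns, keep_last=4, summarize=None):
    """Keep the last `keep_last` turns verbatim; fold older turns into a
    running summary. `summarize` stands in for the summarizer LLM call."""
    summarize = summarize or (lambda ts: f"[summary of {len(ts)} turns]")
    if len(turns) <= keep_last:
        return None, turns  # nothing to compact yet
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return summarize(older), recent

summary, recent = compact_history([f"turn {i}" for i in range(10)], keep_last=4)
```

Because the function is pure, the asynchronous summarizer can run on a snapshot and write its result back to session state without blocking the turn.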

Retrieval

For each turn, retrieve the most relevant context from semantic memory. Don’t just stuff all company docs.

Tune retrieval quality. Low recall = missing info. Low precision = wasted context window.

Tool output pruning

Tool outputs can be enormous (a database query returns 10,000 rows). Don’t put raw tool output in context. Summarize or extract what’s needed.

Pattern: intermediate post-processing step between tool call and LLM. Can be regex, another LLM call, or application logic.
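The application-logic variant of that post-processing step is often just column selection plus a row cap — a sketch, with illustrative limits:

```python
def prune_rows(rows, wanted_cols, max_rows=20):
    """Keep only the columns the model needs and cap the row count before
    a tool result enters the context window."""
    pruned = [{c: r[c] for c in wanted_cols} for r in rows[:max_rows]]
    dropped = max(0, len(rows) - max_rows)
    note = f"({dropped} more rows omitted)" if dropped else ""
    return pruned, note

# A tool returns wide rows with a large payload column the model never needs.
rows = [{"id": i, "name": f"n{i}", "blob": "x" * 1000} for i in range(100)]
pruned, note = prune_rows(rows, ["id", "name"], max_rows=2)
```

The omission note matters: the model should know data was dropped so it can ask for more rather than assume the result is complete.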


Memory Writes: When and How

Reading memory is well-understood. Writing is where teams get sloppy.

When should an agent write to long-term memory?

Patterns:

- Write explicit user statements immediately — the user told you, so it's safe to persist.
- Write inferred facts only after validation, and tag them as inferences.
- Consolidate at session end with an asynchronous pass, not in the hot path.

Write quality matters more than read quality. A bad write poisons future reads. Validate writes: schema-check structured fields; have a “second LLM” confirm inferences before writing.
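A sketch of the validation gate — the allowed-fields schema is illustrative, and `confirm` stands in for the second-LLM check on inferred values:

```python
ALLOWED_FIELDS = {"timezone": str, "preferred_language": str}

def validate_write(field_name, value, confirm=None):
    """Schema-check a structured field before it reaches long-term memory.
    `confirm(field, value)` stands in for a second-model review of inferences."""
    expected = ALLOWED_FIELDS.get(field_name)
    if expected is None:
        raise ValueError(f"unknown field: {field_name}")
    if not isinstance(value, expected):
        raise TypeError(f"{field_name} must be {expected.__name__}")
    if confirm is not None and not confirm(field_name, value):
        raise ValueError(f"inference rejected: {field_name}={value!r}")
    return field_name, value

ok = validate_write("timezone", "Europe/Vilnius")
```

Rejected writes should be logged, not retried blindly — repeated rejections of the same inference are a signal about the extracting prompt.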


Cross-Session Learning

With episodic memory, agents can learn from their own history.

Patterns:

1. Example retrieval at task start

Before a new task, retrieve 3–5 similar past tasks and their trajectories. Inject into context. Agent learns from examples.

2. Failure analysis

When a task fails, analyze why, write to episodic memory with annotations. Future similar tasks retrieve the failure and avoid the pattern.

3. Tool reliability tracking

Track per-tool success rates from episodic data. Agent weighs reliability into tool selection.

4. Prompt-level self-improvement

Mine episodic data for common failure modes. Update system prompts (with human review) to address them.

These are mature patterns. Most teams don’t implement them until agent v2 or v3.
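Of the four, tool reliability tracking is the simplest to sketch — an aggregation over episodic records (the record shape here is illustrative):

```python
from collections import Counter

def tool_success_rates(episodes):
    """Aggregate per-tool success rates from episodic records."""
    calls, wins = Counter(), Counter()
    for ep in episodes:
        for call in ep["tool_calls"]:
            calls[call["tool"]] += 1
            wins[call["tool"]] += call["ok"]  # bool counts as 0/1
    return {tool: wins[tool] / calls[tool] for tool in calls}

episodes = [
    {"tool_calls": [{"tool": "search", "ok": True},
                    {"tool": "calendar", "ok": False}]},
    {"tool_calls": [{"tool": "search", "ok": True}]},
]
rates = tool_success_rates(episodes)
```

In a warehouse this is one GROUP BY; the resulting rates can be injected into the tool-selection prompt or used to deprioritize flaky tools.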


Consistency And Privacy

Consistency

Multiple memory tiers must stay consistent. When a user updates their profile, the long-term store must update, the short-term cache must invalidate, and the graph relationships must update.

This is distributed-systems-flavored work. Use a message bus (Kafka, Redis streams) or direct writes through a single point.
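The single-write-path variant can be sketched as a write-through helper — dicts stand in for Postgres and Redis here, and the key scheme is illustrative:

```python
def update_profile(db, cache, user_id, field_name, value):
    """Write-through update: one code path mutates the store of record,
    then invalidates the cached copy so the next read repopulates it."""
    db.setdefault(user_id, {})[field_name] = value   # long-term store
    cache.pop(f"profile:{user_id}", None)            # short-term cache

db = {}
cache = {"profile:u1": {"timezone": "UTC"}}
update_profile(db, cache, "u1", "timezone", "Europe/Vilnius")
```

Invalidate-on-write is usually safer than update-on-write: a stale cache entry is simply absent, rather than subtly wrong.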

Privacy

Memory often contains PII. Rules:

- Store only what the agent actually needs; don't hoard transcripts by default.
- Scope every read and write to a user (and tenant) ID.
- Support deletion: when a user asks to be forgotten, every tier must honor it, including vector indexes and episodic summaries.
- Encrypt at rest and restrict who and what can read the memory stores.

For regulated industries, add: audit logs on memory reads and writes, explicit retention policies per tier, and data-residency controls.


Emerging Tools

A category of “memory layer” products emerged in 2024–2025 — Mem0, Zep, Letta, and others.

These compress a lot of context-engineering work into managed services. Evaluate for your use case; for many teams, a homegrown Redis + Postgres + vector DB is sufficient and cheaper.


Common Mistakes

1. Everything in vector DB. User profiles, recent chats, knowledge — all stuffed in. Retrieval gets slow and noisy.

2. No forgetting. Memory grows unboundedly. Eventually, retrieval finds stale data. Add decay, TTL, or explicit pruning.

3. Writing raw LLM output to long-term memory. LLMs hallucinate. Writing their outputs to durable memory poisons the system. Always validate.

4. Synchronous writes in the hot path. Writing to episodic memory on every tool call slows interactive responses. Write async.

5. No separation of tenant memory. Multi-tenant agents where tenant A’s memory leaks into tenant B’s queries. Hard-isolate.

6. Forgetting to version. Memory schemas evolve. Old records break new code. Version and migrate.


Getting Started

If you’re starting from a simple RAG app and want to grow into a real agent memory stack:

  1. Add working memory — if you don’t already have a proper agent state object (LangGraph, etc.), adopt one.
  2. Add short-term memory — Redis for recent message history per session.
  3. Add long-term memory for user profile — Postgres table for per-user facts the agent should remember.
  4. Improve semantic retrieval — hybrid search, rerank, proper chunking.
  5. Add episodic memory later — once you have traffic worth learning from.

Don’t do all of this at once. Each tier is a project.


Further Reading

Designing a memory stack for a production agent? Let’s talk — we help teams scope from single-vector-DB to multi-tier memory systems.
