Infrastructure

Context Engineering: Storage, Retrieval, and the New Memory Stack

Balys Kriksciunas · 8 min read
#ai #infrastructure #context-engineering #memory #rag #agents #vector-database


RAG answered the question “how does my LLM know my company’s docs?” with a vector database and a retrieval step. It worked. It still works. But anyone building agents in 2026 has discovered that RAG alone is woefully insufficient.

Production agents need memory stacks — multiple storage systems, each for a different kind of context, wired together. The emerging practice of designing these stacks is called context engineering, and it’s one of the most impactful disciplines in AI infrastructure right now.


Why One Store Isn’t Enough

A production agent needs to answer questions like:

- What did the user just say, and what is the current plan?
- What happened earlier in this session?
- What do we already know about this user?
- Which document answers this question?
- Have we done a similar task before, and how did it go?

Each of these has a different shape: different data models, different latency budgets, different write patterns, different retention requirements. No single store serves all of them well.

Teams that try to stuff everything into one vector database end up with slow retrieval, awkward data shapes, and agents that forget obvious things. Teams that split into a proper memory stack ship measurably better agents.


The Five Memory Tiers

A useful decomposition:

1. Working memory

The current turn. The agent’s “scratchpad.” Short-lived, ephemeral, lives in-process.

Implementation: the agent framework’s state (LangGraph’s GraphState, CrewAI’s task state). In Redis if you need cross-worker resumability.

What lives here: current user input, current plan, pending tool calls, partial results.

Latency budget: single-digit milliseconds.
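As a rough sketch of what this tier looks like in code — the field names here are illustrative, not a framework's actual schema — working memory is just a typed, in-process object:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class WorkingMemory:
    """In-process scratchpad for the current turn (field names illustrative)."""
    user_input: str = ""
    plan: list[str] = field(default_factory=list)
    pending_tool_calls: list[dict[str, Any]] = field(default_factory=list)
    partial_results: dict[str, Any] = field(default_factory=dict)

wm = WorkingMemory(user_input="Summarize Q3 revenue")
wm.plan.append("fetch revenue report")
```

Frameworks give you this object for free; the point is that it lives in process memory and dies with the turn.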

2. Short-term memory

The current session. Lasts minutes to hours.

Implementation: Redis or in-memory cache with TTL; also the workflow-engine state if using Temporal/Inngest.

What lives here: recent message history, last few tool outputs, session-specific context.

Latency budget: tens of milliseconds.
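The essential behavior is "key-value with TTL" — in production this would be Redis (`SETEX`), but the semantics can be sketched with an in-memory stand-in:

```python
import time

class SessionCache:
    """Minimal in-memory stand-in for a Redis-style TTL cache (sketch only)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily expire stale entries
            return None
        return value

cache = SessionCache()
cache.set("session:42:messages", ["hi", "hello"], ttl_seconds=3600)
```

The TTL is the forgetting mechanism: sessions clean themselves up without a pruning job.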

3. Long-term memory

Durable facts about users, tasks, and entities. Lasts indefinitely.

Implementation: Postgres (or equivalent). Structured tables for known fields, JSON for flexible ones.

What lives here: user profile fields, stated preferences, entity records (people, accounts, projects).

Latency budget: <50ms for typical queries.
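The "structured tables for known fields, JSON for flexible ones" split looks like this — sketched with SQLite as a stand-in for Postgres, with an illustrative schema:

```python
import json
import sqlite3

# Known fields get real columns; everything speculative goes in a JSON blob.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_memory (
        user_id  TEXT PRIMARY KEY,
        name     TEXT,
        timezone TEXT,
        prefs    TEXT  -- JSON blob for flexible/evolving fields
    )
""")
conn.execute(
    "INSERT INTO user_memory VALUES (?, ?, ?, ?)",
    ("u1", "Ada", "Europe/Vilnius", json.dumps({"tone": "concise"})),
)
row = conn.execute(
    "SELECT name, prefs FROM user_memory WHERE user_id = ?", ("u1",)
).fetchone()
prefs = json.loads(row[1])
```

In Postgres you would use a `jsonb` column for `prefs` and index the fields you query on.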

4. Semantic memory

Retrievable knowledge. What RAG was originally built for.

Implementation: vector DB + BM25 + reranker. See Hybrid Search in Production.

What lives here: company docs, knowledge-base articles, other chunked reference material.

Latency budget: <200ms for typical queries.
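The vector-plus-BM25 combination needs a merge step. One common choice is reciprocal rank fusion; a minimal sketch (doc IDs and the `k=60` constant are the conventional defaults, not specific to any library):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: merge ranked doc-id lists from multiple
    retrievers (e.g. vector and BM25) into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

A cross-encoder reranker would then rescore the fused top-k before anything enters the prompt.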

5. Episodic memory

Past trajectories. Sessions the agent completed, what worked, what didn’t.

Implementation: Data warehouse (Snowflake, BigQuery, ClickHouse) for bulk storage; a vector index over summaries for retrieval.

What lives here: completed session transcripts (or summaries of them), task outcomes, failure annotations.

Latency budget: <500ms (lower-frequency access).


Reference Memory Stack

A typical 2026 production agent memory stack:

[ Agent workflow ]
      ├── Working memory → in-process state (LangGraph / framework state)
      ├── Short-term     → Redis (session messages, hot cache)
      ├── Long-term      → Postgres (profile, preferences, entities)
      ├── Semantic       → Qdrant / pgvector (vector search + BM25 + rerank)
      ├── Graph          → Neo4j / Memgraph (entity relationships)
      ├── Episodic       → BigQuery / ClickHouse (past sessions) + vector index
      └── Environment    → tool-specific state (calendar, CRM, etc.)

Not every agent needs all tiers. Start with working + short-term + semantic. Add the rest as agent capabilities mature.
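One way to keep the tiers composable is a thin facade that assembles per-turn context from whichever backends exist — a sketch with dict-backed stubs standing in for Redis, Postgres, and the vector DB (all names illustrative):

```python
class MemoryStack:
    """Facade that assembles per-turn context from the memory tiers.
    Backends are injected, so tiers can be added as capabilities mature."""
    def __init__(self, short_term, long_term, semantic):
        self.short_term = short_term  # session cache (Redis in production)
        self.long_term = long_term    # durable store (Postgres in production)
        self.semantic = semantic      # retrieval callable (vector DB + rerank)

    def build_context(self, user_id, session_id, query):
        return {
            "profile": self.long_term.get(user_id, {}),
            "recent_messages": self.short_term.get(session_id, []),
            "knowledge": self.semantic(query),
        }

# Stubs for illustration; swap in real clients behind the same interface.
stack = MemoryStack(
    short_term={"s1": ["hi"]},
    long_term={"u1": {"tz": "UTC"}},
    semantic=lambda q: [f"doc matching {q!r}"],
)
ctx = stack.build_context("u1", "s1", "pricing")
```

Starting with working + short-term + semantic means two of these backends plus the framework's own state object.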


Graph Stores: The Addition That Matters

Where does graph fit? Mostly between long-term and semantic.

Agents reasoning about entities and relationships benefit from a graph. “Who works at ACME?” “Which projects is this user involved in?” “What tools does this user have access to?” These are graph queries.

Options: a dedicated graph database (Neo4j, Memgraph), or plain Postgres with an edge table and recursive CTEs while the graph is small.

For most agents, a graph DB is overkill for v1. Add it when you notice you’re making many small queries across tables to reconstruct relationships — that’s the graph pattern asking for its proper tool.
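To make "the graph pattern" concrete: queries like "who works at ACME?" are one-hop lookups over an edge list. A toy sketch in plain Python (the edge schema is illustrative):

```python
from collections import defaultdict

# (subject, relation, object) triples standing in for a graph store.
edges = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("alice", "member_of", "project_x"),
]

# Index edges by (relation, object) so one-hop queries are a single lookup.
by_relation = defaultdict(list)
for subj, rel, obj in edges:
    by_relation[(rel, obj)].append(subj)

def who(relation, obj):
    """Answer 'who has <relation> to <obj>?' — a one-hop graph query."""
    return by_relation[(relation, obj)]

acme_staff = who("works_at", "acme")
```

When these lookups become multi-hop and cross many tables, that is the signal to move to a real graph store.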


Context Window Management

The agent-facing question: given all this memory, what goes into the prompt?

Context window management is its own discipline. Patterns:

Static + Dynamic Split

Part of the context is fixed (system prompt, user profile). Part is retrieved per-turn (recent messages, semantic hits).

Allocate budget: system prompt gets X tokens, profile Y, recent messages Z, retrieval R. Fail loudly if budget is exceeded — don’t silently truncate.
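A minimal sketch of the fail-loudly budget check — the whitespace tokenizer and the budget numbers are illustrative placeholders for a real tokenizer and tuned limits:

```python
def assemble_context(parts, budgets, count_tokens=lambda s: len(s.split())):
    """Enforce a per-part token budget; raise instead of silently truncating.
    `parts` and `budgets` map part name -> text / max tokens."""
    out = []
    for name, text in parts.items():
        used = count_tokens(text)
        if used > budgets[name]:
            raise ValueError(f"{name} over budget: {used} > {budgets[name]}")
        out.append(text)
    return "\n\n".join(out)

prompt = assemble_context(
    {"system": "You are a helpful agent.", "profile": "Prefers concise answers."},
    {"system": 50, "profile": 50},
)
```

The exception is the feature: an over-budget part is a bug in the retrieval or summarization layer, and truncation would hide it.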

Summarization

Long sessions exceed context window. Summarize old turns into a running recap.

Pattern: keep the last N turns verbatim; summarize turns older than that; occasionally re-summarize the summary itself so it stays compact.

Implementation: a dedicated “summarizer” LLM call runs asynchronously; summary is written back to the session state.
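The keep-recent/summarize-old split can be sketched as a pure function, with the summarizer LLM call stubbed out (the stub and `keep_last` value are illustrative):

```python
def compact_history(turns, keep_last=4, summarize=None):
    """Keep the last `keep_last` turns verbatim; fold older turns into a
    running summary. `summarize` stands in for the summarizer LLM call."""
    summarize = summarize or (lambda ts: f"[summary of {len(ts)} turns]")
    if len(turns) <= keep_last:
        return None, turns  # nothing to compact yet
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return summarize(older), recent

summary, recent = compact_history([f"turn {i}" for i in range(10)], keep_last=4)
```

Because the function is pure, the asynchronous summarizer can run on a snapshot and write its result back to session state without blocking the turn.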

Retrieval

For each turn, retrieve the most relevant context from semantic memory. Don’t just stuff all company docs.

Tune retrieval quality. Low recall = missing info. Low precision = wasted context window.

Tool output pruning

Tool outputs can be enormous (a database query returns 10,000 rows). Don’t put raw tool output in context. Summarize or extract what’s needed.

Pattern: intermediate post-processing step between tool call and LLM. Can be regex, another LLM call, or application logic.
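The application-logic variant of that post-processing step is often just column selection plus a row cap — a sketch, with illustrative limits:

```python
def prune_rows(rows, wanted_cols, max_rows=20):
    """Keep only the columns the model needs and cap the row count before
    a tool result enters the context window."""
    pruned = [{c: r[c] for c in wanted_cols} for r in rows[:max_rows]]
    dropped = max(0, len(rows) - max_rows)
    note = f"({dropped} more rows omitted)" if dropped else ""
    return pruned, note

# A tool returns wide rows with a large payload column the model never needs.
rows = [{"id": i, "name": f"n{i}", "blob": "x" * 1000} for i in range(100)]
pruned, note = prune_rows(rows, ["id", "name"], max_rows=2)
```

The omission note matters: the model should know data was dropped so it can ask for more rather than assume the result is complete.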


Memory Writes: When and How

Reading memory is well-understood. Writing is where teams get sloppy.

When should an agent write to long-term memory?

Patterns:

- Write explicit user statements immediately — the user told you, so it's safe to persist.
- Write inferred facts only after validation, and tag them as inferences.
- Consolidate at session end with an asynchronous pass, not in the hot path.

Write quality matters more than read quality. A bad write poisons future reads. Validate writes: schema-check structured fields; have a “second LLM” confirm inferences before writing.
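A sketch of the validation gate — the allowed-fields schema is illustrative, and `confirm` stands in for the second-LLM check on inferred values:

```python
ALLOWED_FIELDS = {"timezone": str, "preferred_language": str}

def validate_write(field_name, value, confirm=None):
    """Schema-check a structured field before it reaches long-term memory.
    `confirm(field, value)` stands in for a second-model review of inferences."""
    expected = ALLOWED_FIELDS.get(field_name)
    if expected is None:
        raise ValueError(f"unknown field: {field_name}")
    if not isinstance(value, expected):
        raise TypeError(f"{field_name} must be {expected.__name__}")
    if confirm is not None and not confirm(field_name, value):
        raise ValueError(f"inference rejected: {field_name}={value!r}")
    return field_name, value

ok = validate_write("timezone", "Europe/Vilnius")
```

Rejected writes should be logged, not retried blindly — repeated rejections of the same inference are a signal about the extracting prompt.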


Cross-Session Learning

With episodic memory, agents can learn from their own history.

Patterns:

1. Example retrieval at task start

Before a new task, retrieve 3–5 similar past tasks and their trajectories. Inject into context. Agent learns from examples.

2. Failure analysis

When a task fails, analyze why, write to episodic memory with annotations. Future similar tasks retrieve the failure and avoid the pattern.

3. Tool reliability tracking

Track per-tool success rates from episodic data. Agent weighs reliability into tool selection.

4. Prompt-level self-improvement

Mine episodic data for common failure modes. Update system prompts (with human review) to address them.

These are mature patterns. Most teams don’t implement them until agent v2 or v3.
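Of the four, tool reliability tracking is the simplest to sketch — an aggregation over episodic records (the record shape here is illustrative):

```python
from collections import Counter

def tool_success_rates(episodes):
    """Aggregate per-tool success rates from episodic records."""
    calls, wins = Counter(), Counter()
    for ep in episodes:
        for call in ep["tool_calls"]:
            calls[call["tool"]] += 1
            wins[call["tool"]] += call["ok"]  # bool counts as 0/1
    return {tool: wins[tool] / calls[tool] for tool in calls}

episodes = [
    {"tool_calls": [{"tool": "search", "ok": True},
                    {"tool": "calendar", "ok": False}]},
    {"tool_calls": [{"tool": "search", "ok": True}]},
]
rates = tool_success_rates(episodes)
```

In a warehouse this is one GROUP BY; the resulting rates can be injected into the tool-selection prompt or used to deprioritize flaky tools.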


Consistency And Privacy

Consistency

Multiple memory tiers must stay consistent. When a user updates their profile, the long-term store must update, the short-term cache must invalidate, and the graph relationships must update.

This is distributed-systems-flavored work. Use a message bus (Kafka, Redis streams) or direct writes through a single point.
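The single-write-path variant can be sketched as a write-through helper — dicts stand in for Postgres and Redis here, and the key scheme is illustrative:

```python
def update_profile(db, cache, user_id, field_name, value):
    """Write-through update: one code path mutates the store of record,
    then invalidates the cached copy so the next read repopulates it."""
    db.setdefault(user_id, {})[field_name] = value   # long-term store
    cache.pop(f"profile:{user_id}", None)            # short-term cache

db = {}
cache = {"profile:u1": {"timezone": "UTC"}}
update_profile(db, cache, "u1", "timezone", "Europe/Vilnius")
```

Invalidate-on-write is usually safer than update-on-write: a stale cache entry is simply absent, rather than subtly wrong.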

Privacy

Memory often contains PII. Rules:

- Store only what the agent actually needs; don't hoard transcripts by default.
- Scope every read and write to a user (and tenant) ID.
- Support deletion: when a user asks to be forgotten, every tier must honor it, including vector indexes and episodic summaries.
- Encrypt at rest and restrict who and what can read the memory stores.

For regulated industries, add: audit logs on memory reads and writes, explicit retention policies per tier, and data-residency controls.


Emerging Tools

A category of “memory layer” products emerged in 2024–2025 — Mem0, Zep, Letta, and others.

These compress a lot of context-engineering work into managed services. Evaluate for your use case; for many teams, a homegrown Redis + Postgres + vector DB is sufficient and cheaper.


Common Mistakes

1. Everything in vector DB. User profiles, recent chats, knowledge — all stuffed in. Retrieval gets slow and noisy.

2. No forgetting. Memory grows unboundedly. Eventually, retrieval finds stale data. Add decay, TTL, or explicit pruning.

3. Writing raw LLM output to long-term memory. LLMs hallucinate. Writing their outputs to durable memory poisons the system. Always validate.

4. Synchronous writes in the hot path. Writing to episodic memory on every tool call slows interactive responses. Write async.

5. No separation of tenant memory. Multi-tenant agents where tenant A’s memory leaks into tenant B’s queries. Hard-isolate.

6. Forgetting to version. Memory schemas evolve. Old records break new code. Version and migrate.


Getting Started

If you’re starting from a simple RAG app and want to grow into a real agent memory stack:

  1. Add working memory — if you don’t already have a proper agent state object (LangGraph, etc.), adopt one.
  2. Add short-term memory — Redis for recent message history per session.
  3. Add long-term memory for user profile — Postgres table for per-user facts the agent should remember.
  4. Improve semantic retrieval — hybrid search, rerank, proper chunking.
  5. Add episodic memory later — once you have traffic worth learning from.

Don’t do all of this at once. Each tier is a project.


Further Reading

Designing a memory stack for a production agent? Let’s talk — we help teams scope from single-vector-DB to multi-tier memory systems.
