Engineering a Persistent Agent Memory Layer via Elasticsearch

By Noam Schwartz | June 16, 2026

Agent Builder is now GA.

Quick Start: Explore the Elastic Cloud Trial or dive into the Agent Builder Documentation.

The Problem: The "Goldfish" Effect in AI Agents

Consider a scenario where a user, Sarah, has been struggling with a technical issue.

"She tried a specific fix in March, and again last week; neither attempt resolved the problem. The agent, however, remains oblivious to these failures—and completely unaware that a dog recently chewed through her sensor cables."

In this case, the critical history—what failed, who Sarah is, and the context of her environment—evaporates the moment the session ends.

The common fix is to cram all previous history into the context window. This approach is flawed for two reasons:

Cost & Latency: Larger windows increase token spend and slow down response times.
The "Lost in the Middle" Phenomenon: LLMs often ignore crucial data placed in the center of a long prompt, favoring the edges.

While the context window serves as short-term memory (the immediate reasoning space), agents desperately need long-term memory: a scalable, persistent store that allows for retrieval based on time, user identity, and content.

The Solution

We developed a memory architecture on Elasticsearch that achieves an average recall@10 (R@10) of 0.89 across 168 QA evaluations, with zero cross-tenant data leaks.

System Requirements & Design Goals

To build a production-ready memory layer, we had to solve several complex challenges:

Multi-tenancy: Absolute isolation between users via Document Level Security (DLS).
Memory Categorization: Distinguishing between raw events, stable facts, and playbooks.
Consolidation: Turning a "haystack" of fresh events into durable knowledge.
Supersession: Updating contradictory facts while maintaining an audit trail (no hard deletes).
Temporal Relevance: Ensuring new information outranks stale data without losing frequently accessed "evergreen" facts.
Interoperability: Ensuring the layer is accessible to any MCP-speaking client.
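The multi-tenancy goal above can be sketched as a role whose document-level query is templated on the authenticated user. The snippet only builds the role body as a Python dict and does not call a cluster. The index pattern `memory-*`, the function name, and the choice of `_user.username` as the tenant key are assumptions for illustration, not details from this article.

```python
import json


def build_tenant_role(index_pattern: str = "memory-*",
                      tenant_field: str = "user_id") -> dict:
    """Build a role body that limits reads to the caller's own documents.

    The index pattern and tenant field are hypothetical defaults.
    """
    return {
        "indices": [
            {
                "names": [index_pattern],
                "privileges": ["read"],
                # Document Level Security: the template is resolved per request,
                # so each user only matches documents carrying their own id.
                "query": {
                    "template": {
                        "source": {
                            "term": {tenant_field: "{{_user.username}}"}
                        }
                    }
                },
            }
        ]
    }


role = build_tenant_role()
print(json.dumps(role, indent=2))
```

Because the filter lives in the role rather than in each query, an agent that forgets to add a `user_id` clause still cannot read another tenant's memories.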

By building on a single search engine, we avoid the fragility of splitting these needs across four different services (vector store, keyword engine, auth service, and audit log).
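The temporal-relevance goal listed above can be illustrated with a simple scoring weight. This is one possible scheme, not the one used in this system: it assumes an exponential half-life decay on age and a logarithmic boost from an access counter. The function name, the 30-day half-life, and the `access_count` input are all hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone


def recency_weight(last_seen: datetime, now: datetime,
                   half_life_days: float = 30.0,
                   access_count: int = 0) -> float:
    """Return a multiplier that favors fresh or frequently used memories."""
    age_days = max((now - last_seen).total_seconds() / 86400.0, 0.0)
    # The weight halves every `half_life_days` of age.
    decay = 0.5 ** (age_days / half_life_days)
    # Logarithmic boost: repeated access helps, with diminishing returns.
    boost = 1.0 + math.log1p(access_count)
    return decay * boost


now = datetime(2026, 6, 16, tzinfo=timezone.utc)
fresh = recency_weight(now, now)
stale = recency_weight(now - timedelta(days=60), now)
evergreen = recency_weight(now - timedelta(days=60), now, access_count=20)
print(fresh, stale, evergreen)
```

Under these assumptions a 60-day-old memory that is never read falls to a quarter of its original weight, while one that is read often keeps a weight comparable to a fresh event.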

The Cognitive Architecture: Three Indices

Drawing on cognitive psychology and the CoALA framework (Cognitive Architectures for Language Agents), we split memory into three distinct Elasticsearch indices, so each can have its own write rate and aging logic.

Memory Type	Description	Key Characteristics	Logic/Metrics
Episodic	Raw, time-stamped interaction logs.	The "ground truth" of every turn.	Tracks `success_count` & `failure_count`.
Semantic	Curated, deduplicated facts.	Stable knowledge about the user.	Subject to supersession and refinement.
Procedural	Step-by-step playbooks.	How to solve specific problems.	Refined based on outcome feedback.
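The supersession behavior of the semantic layer can be sketched in memory, without a cluster. The example assumes that a fact is identified by a user and a key, and that a newer fact with the same key replaces the older one by marking it rather than deleting it. The class, the field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Fact:
    fact_id: str
    user_id: str
    key: str
    value: str
    created_at: datetime
    superseded_by: Optional[str] = None  # set instead of deleting the fact


class SemanticStore:
    """Toy stand-in for a semantic-memory index with soft supersession."""

    def __init__(self) -> None:
        self._facts: Dict[str, Fact] = {}

    def write(self, fact: Fact) -> None:
        # Mark the currently active fact for the same (user, key) as superseded.
        for old in self._facts.values():
            same_slot = (old.user_id, old.key) == (fact.user_id, fact.key)
            if same_slot and old.superseded_by is None:
                old.superseded_by = fact.fact_id
        self._facts[fact.fact_id] = fact

    def active(self, user_id: str) -> List[Fact]:
        """Facts that recall should see: the user's own, not superseded."""
        return [f for f in self._facts.values()
                if f.user_id == user_id and f.superseded_by is None]

    def history(self, user_id: str, key: str) -> List[Fact]:
        """Full audit trail for one fact slot, oldest first."""
        rows = [f for f in self._facts.values()
                if f.user_id == user_id and f.key == key]
        return sorted(rows, key=lambda f: f.created_at)


store = SemanticStore()
store.write(Fact("f1", "sarah", "hub_model", "Lumio Hub v1",
                 datetime(2026, 3, 1, tzinfo=timezone.utc)))
store.write(Fact("f2", "sarah", "hub_model", "Lumio Hub v2",
                 datetime(2026, 6, 1, tzinfo=timezone.utc)))
current = store.active("sarah")
trail = store.history("sarah", "hub_model")
```

Recall reads only `active`, while `history` keeps the older value available for audits, which is the "no hard deletes" property in miniature.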

Note: A fourth surface—World Data (catalogs, KBs)—is also integrated into the retrieval pipeline, though it isn't "memory" in the biological sense.

The Recall Pipeline: Hybrid Search & Reranking

To ensure high recall, we employ a two-stage retrieval process.

1. Hybrid Retrieval

We combine BM25 (keyword) and Jina v5 (dense vectors). A single write operation populates both via copy_to, keeping the storage footprint lean.

BM25: Essential for literal matches (e.g., Lumio Hub v2, specific error codes).
Dense Vectors: Captures the semantic intent when the user uses different phrasing.

These are fused using Reciprocal Rank Fusion (RRF). The mathematical intuition for RRF can be represented as:

$score(d) = \sum_{r \in R} \frac{1}{k + r(d)}$

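Here R is the set of ranked lists being fused (one per retriever), r(d) is the rank of document d in list r, and k is the rank constant. In our implementation, we set `rank_constant` to 30 (lower than the default of 60), which gives more weight to the top-ranked results.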
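To make the formula concrete, here is a small pure-Python version of Reciprocal Rank Fusion. It is an illustration of the scoring rule only, not the engine's internal implementation, and the document ids and the function name are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 30) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a document missing from a list adds nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_hits = ["a", "b", "c"]    # hypothetical keyword leg
dense_hits = ["b", "d", "a"]   # hypothetical dense-vector leg
fused = rrf_fuse([bm25_hits, dense_hits], k=30)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document `b` comes out first because it sits near the top of both lists. The effect of the smaller constant is visible in the ratios: with k = 30, a rank-1 hit contributes 40/31 ≈ 1.29 times as much as a rank-10 hit, while with k = 60 the ratio is 70/61 ≈ 1.15.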

2. Cross-Encoder Reranking

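Since RRF provides a wide candidate pool (80 candidates per leg), we use a Jina v2 cross-encoder to refine the results. Unlike bi-encoders, which embed the query and the document separately, a cross-encoder applies full attention across the query and document pair together, so it can score how well a specific passage answers a specific question.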
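The two stages can be expressed as one nested retriever request. The sketch below only builds the search body as a Python dict. It assumes the field names `text_content` and `semantic_text`, and it uses a placeholder reranker endpoint id (`jina-reranker-endpoint`) and a result size of 10. The endpoint id, the size, and the function name are assumptions rather than values given in this article.

```python
import json


def build_recall_request(query_text: str,
                         reranker_inference_id: str = "jina-reranker-endpoint",
                         top_k: int = 10) -> dict:
    """Build a hybrid (BM25 + semantic) search body with cross-encoder reranking."""
    hybrid = {
        "rrf": {
            "retrievers": [
                # Keyword leg: literal matches such as product names or error codes.
                {"standard": {"query": {"match": {"text_content": query_text}}}},
                # Dense leg: semantic query against the semantic_text field.
                {"standard": {"query": {"semantic": {
                    "field": "semantic_text", "query": query_text}}}},
            ],
            "rank_constant": 30,
            "rank_window_size": 80,  # candidates taken from each leg
        }
    }
    return {
        "retriever": {
            "text_similarity_reranker": {
                "retriever": hybrid,
                "field": "text_content",
                "inference_id": reranker_inference_id,  # hypothetical endpoint id
                "inference_text": query_text,
                "rank_window_size": 80,
            }
        },
        "size": top_k,
    }


request_body = build_recall_request("sensor keeps dropping offline")
print(json.dumps(request_body, indent=2))
```

Keeping both stages in a single request means the client never handles the intermediate candidate list, and with document-level security in place the same body is safe to issue for any user.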

Implementation Detail: Index Mapping

To achieve this, the index mapping is configured to handle both text and semantic representations automatically:

{
  "mappings": {
    "properties": {
      "text_content": {
        "type": "text",
        "copy_to": "semantic_text"
      },
      "semantic_text": {
        "type": "semantic_text",
        "inference_id": "jina-v5-model"
      },
      "user_id": { "type": "keyword" },
      "timestamp": { "type": "date" }
    }
  }
}

By combining DLS for security, hybrid search for recall, and cognitive layering for organization, we've created a memory system that allows agents to truly "remember" and evolve with their users.