LoCoMo Benchmarks
Comprehensive evaluation on the LoCoMo (Long-term Conversational Memory) benchmark, the industry standard for measuring memory recall in conversational AI systems.
Detailed Results
Performance across four distinct memory recall categories plus overall weighted score.
Single Hop
Direct question answering from a single memory source.
Multi Hop
Questions requiring reasoning across multiple memory entries.
Temporal
Time-sensitive queries about when events occurred or changed.
Open Domain
General knowledge recall without specific memory cues.
Overall Score
Weighted average across all four categories.
Complete Data Table
| Product | Single Hop | Multi Hop | Temporal | Open Domain | Overall |
|---|---|---|---|---|---|
| MemoryLake | 96.79% | 91.84% | 91.28% | 85.42% | 94.03% |
| Benchmark 1 | 96.08% | 91.13% | 89.72% | 70.83% | 92.32% |
| Benchmark 2 | 94.93% | 90.43% | 87.95% | 71.88% | 91.21% |
| Benchmark 3 | 90.84% | 81.91% | 77.26% | 75.00% | 85.22% |
| Benchmark 4 | 85.37% | 79.43% | 75.08% | 64.58% | 80.76% |
| Benchmark 5 | 74.91% | 72.34% | 43.61% | 54.17% | 66.67% |
| Benchmark 6 | 68.97% | 61.70% | 58.26% | 50.00% | 64.20% |
Deep Dive
Understanding the LoCoMo Benchmark
Based on the peer-reviewed paper "Evaluating Very Long-Term Conversational Memory of LLM Agents" by Maharana et al., published at ACL 2024 (62nd Annual Meeting of the Association for Computational Linguistics).
Why This Benchmark Matters
Most existing conversational benchmarks evaluate LLMs on short exchanges (5-10 turns). Real-world AI assistants, however, interact across dozens of sessions spanning weeks or months. LoCoMo is the first benchmark specifically designed to evaluate very long-term conversational memory — testing whether an AI can recall, reason about, and synthesize information scattered across 300+ turns and up to 35 sessions.
Without rigorous long-term memory benchmarks, there is no way to objectively measure whether an AI memory system truly works — or simply appears to work on trivial cases. LoCoMo fills this critical gap.
Dataset Construction & Scale
LoCoMo employs a machine-human collaborative pipeline: two LLM-based virtual agents with distinct personas are assigned temporal event graphs representing realistic life sequences. They converse across multiple sessions with memory and reflection modules. Human annotators then verify and edit the conversations for long-range consistency.
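The generation setup above can be sketched in a few lines. This is an illustrative model only (the class and field names are my own, not the paper's actual code): each virtual agent carries a persona plus a temporal event graph scripting its life events, and sessions are generated in chronological order against that graph.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Event:
    when: date
    description: str
    caused_by: list[str] = field(default_factory=list)  # ids of earlier causal events

@dataclass
class Agent:
    persona: str
    events: dict[str, Event] = field(default_factory=dict)

def session_plan(agent: Agent) -> list[str]:
    """Order the agent's events chronologically so each generated session
    only references events that have already 'happened' in the timeline."""
    return [eid for eid, _ in sorted(agent.events.items(), key=lambda kv: kv[1].when)]

alice = Agent(persona="Alice, a chef who recently changed jobs")
alice.events["job"] = Event(date(2023, 3, 1), "starts a new restaurant job")
alice.events["move"] = Event(date(2023, 5, 10), "relocates for the job", caused_by=["job"])

print(session_plan(alice))  # -> ['job', 'move']
```

The human-annotation pass described above would then verify that no session contradicts this ordering.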
Four Core Evaluation Categories
Single-Hop Reasoning
Tests direct factual retrieval from a single session. The agent must locate and recall a specific piece of information mentioned once during a conversation.
Example Question
"What restaurant did Alice mention she visited last Tuesday?"
Key Challenge: Requires precise retrieval from a specific session among 35+ sessions without confusing similar contexts.
Multi-Hop Reasoning
Requires synthesizing information from two or more separate sessions to arrive at the answer. The agent must chain facts across different conversations.
Example Question
"Based on Alice's job change in session 12 and her relocation in session 24, where does she currently work?"
Key Challenge: Demands cross-session information integration — the hardest retrieval task, as relevant facts may be separated by thousands of tokens.
Temporal Reasoning
Tests the ability to reason about time-ordered events — understanding what happened before, after, or between specific points in the conversational timeline.
Example Question
"Did Bob adopt his dog before or after moving to the new apartment?"
Key Challenge: Requires building and querying a mental timeline across sessions. Most LLMs show a 73% performance gap vs. humans on temporal tasks.
Open-Domain Knowledge
Requires integrating information from the conversation with external world knowledge or commonsense reasoning that was not explicitly stated.
Example Question
"Alice mentioned she's visiting the Eiffel Tower next week. What country is she traveling to?"
Key Challenge: Tests the boundary between memory retrieval and world knowledge integration — the agent must know what it was told vs. what it should already know.
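Putting the four categories side by side, a single benchmark item might be represented as below. The field names and session numbers are hypothetical (this is not the official dataset schema); the questions are the examples quoted above.

```python
# Hypothetical shape of one QA item per category: the category label,
# the session(s) holding the evidence, and the question text.
qa_items = [
    {"category": "single_hop", "evidence_sessions": [7],
     "question": "What restaurant did Alice mention she visited last Tuesday?"},
    {"category": "multi_hop", "evidence_sessions": [12, 24],
     "question": "Based on Alice's job change in session 12 and her relocation "
                 "in session 24, where does she currently work?"},
    {"category": "temporal", "evidence_sessions": [3, 19],
     "question": "Did Bob adopt his dog before or after moving to the new apartment?"},
    {"category": "open_domain", "evidence_sessions": [31],
     "question": "Alice mentioned she's visiting the Eiffel Tower next week. "
                 "What country is she traveling to?"},
]

# Multi-hop and temporal items are the ones spanning more than one session.
multi_session = [q["category"] for q in qa_items if len(q["evidence_sessions"]) > 1]
```

The `evidence_sessions` field is what makes multi-hop and temporal questions structurally harder: the retriever must surface more than one distant span.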
Adversarial Testing (5th Category)
Beyond the four scored categories, LoCoMo includes adversarial questions designed to trick agents into hallucinating answers. These questions are intentionally unanswerable based on the conversation — the correct response is to say "I don't know."
This tests a critical real-world requirement: an AI memory system must know the limits of what it remembers and refuse to fabricate information. Long-context LLMs show "significant hallucinations" on adversarial questions — a major safety concern for production memory systems.
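The abstention behaviour the adversarial split rewards can be sketched with a simple retrieval-confidence gate. This heuristic is my own illustration, not MemoryLake's actual logic: when no stored memory scores above a threshold, the system should say "I don't know" rather than generate from weak evidence.

```python
def answer(hits: list[tuple[str, float]], threshold: float = 0.5) -> str:
    """hits: (memory_text, retrieval_score) pairs, best first.
    Abstain when there is no sufficiently confident supporting memory."""
    if not hits or hits[0][1] < threshold:
        return "I don't know"
    return hits[0][0]

print(answer([]))                                        # no evidence -> abstain
print(answer([("Bob adopted a beagle in May", 0.82)]))   # confident hit -> answer
print(answer([("Bob adopted a beagle in May", 0.21)]))   # weak hit -> abstain
```

The threshold value is the whole trade-off: too low and the system hallucinates on adversarial questions, too high and it refuses answerable ones.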
Evaluation Process & Scoring
Conversation Ingestion
The full multi-session dialogue (~300 turns, ~9K tokens, up to 35 sessions) is provided to the memory system for indexing and storage.
Question Presentation
Roughly 1,500 questions across the four categories (single-hop, multi-hop, temporal, open-domain) are posed. Each question has a ground-truth answer derived from the conversation and verified by human annotators.
Memory Retrieval & Response
The system must retrieve relevant memories and generate an answer. This tests the full pipeline: ingestion → storage → retrieval → reasoning → generation.
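The ingest → store → retrieve stages above can be sketched as a toy pipeline. The bag-of-words overlap score here is a deliberately simple stand-in for a real memory system's embedding index; class and method names are illustrative.

```python
import re
from collections import Counter

def toks(text: str) -> Counter:
    """Lowercase word tokens as a multiset, ignoring punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

class Memory:
    def __init__(self) -> None:
        self.turns: list[str] = []          # storage: every ingested turn

    def ingest(self, turn: str) -> None:    # stage 1: ingestion
        self.turns.append(turn)

    def retrieve(self, question: str, k: int = 1) -> list[str]:
        q = toks(question)                  # stage 2: score stored turns
        ranked = sorted(self.turns,
                        key=lambda t: sum((q & toks(t)).values()),
                        reverse=True)
        return ranked[:k]                   # stage 3: top-k memories for generation

mem = Memory()
mem.ingest("Alice: I visited Chez Panisse last Tuesday.")
mem.ingest("Bob: I finally moved into the new apartment.")
print(mem.retrieve("What restaurant did Alice visit last Tuesday?")[0])
```

In the real benchmark the index holds ~300 turns rather than two, which is exactly why retrieval precision, not generation, tends to be the bottleneck.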
Multi-metric Scoring
Answers are evaluated using F1 score (token overlap with ground truth), BLEU-1 (unigram precision), and LLM-as-a-Judge (GPT-4 evaluates semantic correctness). The overall score is a weighted composite.
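The two token-level metrics named above can be computed as follows. This is a simplified sketch (the official evaluation normalizes answers more carefully, and the LLM-as-a-Judge component is omitted):

```python
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())

def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def bleu1(pred: str, gold: str) -> float:
    """Unigram precision: fraction of predicted tokens found in the gold answer."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    return sum((p & g).values()) / max(sum(p.values()), 1)

print(round(f1("moved to Paris", "She moved to Paris"), 2))   # -> 0.86
print(bleu1("moved to Paris", "She moved to Paris"))          # -> 1.0
```

Token-overlap metrics reward exact phrasing, which is why a semantic LLM-as-a-Judge pass is used alongside them.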
Why This Is Hard: Technical Challenges
Context Window Limits
9K+ tokens exceed many LLMs' effective attention span. Information at the beginning of conversations is often "forgotten" by the time a question is asked.
Temporal Coherence
Events happen across 35 sessions over simulated weeks/months. Maintaining correct temporal ordering without explicit timestamps is extremely challenging.
Cross-Session Synthesis
Multi-hop questions require connecting facts from session 3 with facts from session 28 — information separated by thousands of tokens of unrelated conversation.
Hallucination Resistance
Adversarial questions test whether the system fabricates plausible-sounding answers for things never discussed. Most LLMs fail significantly here.
Semantic Ambiguity
The same topic may be discussed differently across sessions with evolving context, requiring the system to resolve conflicting or updated information.
56% Human Gap
Even the best RAG approaches lag 56% behind human performance on this benchmark, demonstrating the fundamental difficulty of long-term conversational memory.
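Of the challenges above, temporal coherence lends itself to a quick sketch. One simple approach (an assumption for illustration, not the benchmark's or MemoryLake's actual method) is to tag each extracted fact with its (session, turn) position and treat that as a proxy timeline when no explicit timestamps exist:

```python
# Facts extracted from different sessions, indexed by where they appeared.
facts = [
    {"session": 19, "turn": 4,  "text": "Bob moved to the new apartment"},
    {"session": 3,  "turn": 11, "text": "Bob adopted his dog"},
]

# Sort by (session, turn) to recover a before/after ordering.
timeline = sorted(facts, key=lambda f: (f["session"], f["turn"]))
print(timeline[0]["text"])  # -> Bob adopted his dog (the adoption preceded the move)
```

This proxy breaks down as soon as a later session narrates an earlier event, which is why temporal reasoning remains one of the weakest categories for most systems.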
Key Takeaways: MemoryLake on LoCoMo
- MemoryLake achieves 94.03% overall — the highest score ever recorded on the LoCoMo benchmark, surpassing all published memory systems.
- Single-hop recall at 96.79% demonstrates near-perfect factual retrieval across long conversations, approaching human-level performance.
- Multi-hop reasoning at 91.84% shows MemoryLake can effectively chain information across sessions — the hardest category where most systems fail.
- Temporal reasoning at 91.28% validates MemoryLake's calendar-aware indexing and temporal event graph construction.
- Open-domain at 85.42% is the highest in the field, demonstrating strong integration of conversational memory with world knowledge.
- These results are achieved under strict experimental settings with no data leakage, no question-specific tuning, and full reproducibility.
Reference: Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). "Evaluating Very Long-Term Conversational Memory of LLM Agents." In Proceedings of ACL 2024.
View our benchmark results →