LoCoMo Benchmarks
Comprehensive evaluation on the LoCoMo (Long-term Conversational Memory) benchmark, the industry standard for measuring memory recall in conversational AI systems.
Detailed Results
Performance across four distinct memory recall categories plus overall weighted score.
Single Hop
Direct question answering from a single memory source.
Multi Hop
Questions requiring reasoning across multiple memory entries.
Temporal
Time-sensitive queries about when events occurred or changed.
Open Domain
General knowledge recall without specific memory cues.
Overall Score
Weighted average across all four categories.
Complete Data Table
| Product | Single Hop | Multi Hop | Temporal | Open Domain | Overall |
|---|---|---|---|---|---|
| MemoryLake | 96.79% | 91.84% | 91.28% | 85.42% | 94.03% |
| Benchmark 1 | 96.08% | 91.13% | 89.72% | 70.83% | 92.32% |
| Benchmark 2 | 94.93% | 90.43% | 87.95% | 71.88% | 91.21% |
| Benchmark 3 | 90.84% | 81.91% | 77.26% | 75.00% | 85.22% |
| Benchmark 4 | 85.37% | 79.43% | 75.08% | 64.58% | 80.76% |
| Benchmark 5 | 74.91% | 72.34% | 43.61% | 54.17% | 66.67% |
| Benchmark 6 | 68.97% | 61.70% | 58.26% | 50.00% | 64.20% |
Deep Dive
Understanding the LoCoMo Benchmark
Based on the peer-reviewed paper "Evaluating Very Long-Term Conversational Memory of LLM Agents" by Maharana et al., published at ACL 2024 (62nd Annual Meeting of the Association for Computational Linguistics).
Why This Benchmark Matters
Most existing conversational benchmarks evaluate LLMs on short exchanges (5-10 turns). Real-world AI assistants, however, interact across dozens of sessions spanning weeks or months. LoCoMo is the first benchmark specifically designed to evaluate very long-term conversational memory — testing whether an AI can recall, reason about, and synthesize information scattered across 300+ turns and up to 35 sessions.
Without rigorous long-term memory benchmarks, there is no way to objectively measure whether an AI memory system truly works — or simply appears to work on trivial cases. LoCoMo fills this critical gap.
Dataset Construction & Scale
LoCoMo employs a machine-human collaborative pipeline: two LLM-based virtual agents with distinct personas are assigned temporal event graphs representing realistic life sequences. They converse across multiple sessions with memory and reflection modules. Human annotators then verify and edit the conversations for long-range consistency.
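The generation setup above can be sketched in a few lines. This is an illustrative model only (the class and field names are my own, not the paper's actual code): each virtual agent carries a persona plus a temporal event graph scripting its life events, and sessions are generated in chronological order against that graph.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Event:
    when: date
    description: str
    caused_by: list[str] = field(default_factory=list)  # ids of earlier causal events

@dataclass
class Agent:
    persona: str
    events: dict[str, Event] = field(default_factory=dict)

def session_plan(agent: Agent) -> list[str]:
    """Order the agent's events chronologically so each generated session
    only references events that have already 'happened' in the timeline."""
    return [eid for eid, _ in sorted(agent.events.items(), key=lambda kv: kv[1].when)]

alice = Agent(persona="Alice, a chef who recently changed jobs")
alice.events["job"] = Event(date(2023, 3, 1), "starts a new restaurant job")
alice.events["move"] = Event(date(2023, 5, 10), "relocates for the job", caused_by=["job"])

print(session_plan(alice))  # -> ['job', 'move']
```

The human-annotation pass described above would then verify that no session contradicts this ordering.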
Four Core Evaluation Categories
Single-Hop Reasoning
Tests direct factual retrieval from a single session. The agent must locate and recall a specific piece of information mentioned once during a conversation.
Example Question
"What restaurant did Alice mention she visited last Tuesday?"
Key Challenge: Requires precise retrieval from a specific session among 35+ sessions without confusing similar contexts.
Multi-Hop Reasoning
Requires synthesizing information from two or more separate sessions to arrive at the answer. The agent must chain facts across different conversations.
Example Question
"Based on Alice's job change in session 12 and her relocation in session 24, where does she currently work?"
Key Challenge: Demands cross-session information integration — the hardest retrieval task, as relevant facts may be separated by thousands of tokens.
Temporal Reasoning
Tests the ability to reason about time-ordered events — understanding what happened before, after, or between specific points in the conversational timeline.
Example Question
"Did Bob adopt his dog before or after moving to the new apartment?"
Key Challenge: Requires building and querying a mental timeline across sessions. Most LLMs show a 73% performance gap vs. humans on temporal tasks.
Open-Domain Knowledge
Requires integrating information from the conversation with external world knowledge or commonsense reasoning that was not explicitly stated.
Example Question
"Alice mentioned she's visiting the Eiffel Tower next week. What country is she traveling to?"
Key Challenge: Tests the boundary between memory retrieval and world knowledge integration — the agent must know what it was told vs. what it should already know.
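Putting the four categories side by side, a single benchmark item might be represented as below. The field names and session numbers are hypothetical (this is not the official dataset schema); the questions are the examples quoted above.

```python
# Hypothetical shape of one QA item per category: the category label,
# the session(s) holding the evidence, and the question text.
qa_items = [
    {"category": "single_hop", "evidence_sessions": [7],
     "question": "What restaurant did Alice mention she visited last Tuesday?"},
    {"category": "multi_hop", "evidence_sessions": [12, 24],
     "question": "Based on Alice's job change in session 12 and her relocation "
                 "in session 24, where does she currently work?"},
    {"category": "temporal", "evidence_sessions": [3, 19],
     "question": "Did Bob adopt his dog before or after moving to the new apartment?"},
    {"category": "open_domain", "evidence_sessions": [31],
     "question": "Alice mentioned she's visiting the Eiffel Tower next week. "
                 "What country is she traveling to?"},
]

# Multi-hop and temporal items are the ones spanning more than one session.
multi_session = [q["category"] for q in qa_items if len(q["evidence_sessions"]) > 1]
```

The `evidence_sessions` field is what makes multi-hop and temporal questions structurally harder: the retriever must surface more than one distant span.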
Adversarial Testing (5th Category)
Beyond the four scored categories, LoCoMo includes adversarial questions designed to trick agents into hallucinating answers. These questions are intentionally unanswerable based on the conversation — the correct response is to say "I don't know."
This tests a critical real-world requirement: an AI memory system must know the limits of what it remembers and refuse to fabricate information. Long-context LLMs show "significant hallucinations" on adversarial questions — a major safety concern for production memory systems.
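The abstention behaviour the adversarial split rewards can be sketched with a simple retrieval-confidence gate. This heuristic is my own illustration, not MemoryLake's actual logic: when no stored memory scores above a threshold, the system should say "I don't know" rather than generate from weak evidence.

```python
def answer(hits: list[tuple[str, float]], threshold: float = 0.5) -> str:
    """hits: (memory_text, retrieval_score) pairs, best first.
    Abstain when there is no sufficiently confident supporting memory."""
    if not hits or hits[0][1] < threshold:
        return "I don't know"
    return hits[0][0]

print(answer([]))                                        # no evidence -> abstain
print(answer([("Bob adopted a beagle in May", 0.82)]))   # confident hit -> answer
print(answer([("Bob adopted a beagle in May", 0.21)]))   # weak hit -> abstain
```

The threshold value is the whole trade-off: too low and the system hallucinates on adversarial questions, too high and it refuses answerable ones.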
Evaluation Process & Scoring
Conversation Ingestion
The full multi-session dialogue (~300 turns, ~9K tokens, up to 35 sessions) is provided to the memory system for indexing and storage.
Question Presentation
Roughly 1,500 questions across the four categories (single-hop, multi-hop, temporal, open-domain) are posed. Each question has a ground-truth answer derived from the conversation and verified by human annotators.
Memory Retrieval & Response
The system must retrieve relevant memories and generate an answer. This tests the full pipeline: ingestion → storage → retrieval → reasoning → generation.
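The ingest → store → retrieve stages above can be sketched as a toy pipeline. The bag-of-words overlap score here is a deliberately simple stand-in for a real memory system's embedding index; class and method names are illustrative.

```python
import re
from collections import Counter

def toks(text: str) -> Counter:
    """Lowercase word tokens as a multiset, ignoring punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

class Memory:
    def __init__(self) -> None:
        self.turns: list[str] = []          # storage: every ingested turn

    def ingest(self, turn: str) -> None:    # stage 1: ingestion
        self.turns.append(turn)

    def retrieve(self, question: str, k: int = 1) -> list[str]:
        q = toks(question)                  # stage 2: score stored turns
        ranked = sorted(self.turns,
                        key=lambda t: sum((q & toks(t)).values()),
                        reverse=True)
        return ranked[:k]                   # stage 3: top-k memories for generation

mem = Memory()
mem.ingest("Alice: I visited Chez Panisse last Tuesday.")
mem.ingest("Bob: I finally moved into the new apartment.")
print(mem.retrieve("What restaurant did Alice visit last Tuesday?")[0])
```

In the real benchmark the index holds ~300 turns rather than two, which is exactly why retrieval precision, not generation, tends to be the bottleneck.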
Multi-metric Scoring
Answers are evaluated using F1 score (token overlap with ground truth), BLEU-1 (unigram precision), and LLM-as-a-Judge (GPT-4 evaluates semantic correctness). The overall score is a weighted composite.
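The two token-level metrics named above can be computed as follows. This is a simplified sketch (the official evaluation normalizes answers more carefully, and the LLM-as-a-Judge component is omitted):

```python
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())

def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the ground truth."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def bleu1(pred: str, gold: str) -> float:
    """Unigram precision: fraction of predicted tokens found in the gold answer."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    return sum((p & g).values()) / max(sum(p.values()), 1)

print(round(f1("moved to Paris", "She moved to Paris"), 2))   # -> 0.86
print(bleu1("moved to Paris", "She moved to Paris"))          # -> 1.0
```

Token-overlap metrics reward exact phrasing, which is why a semantic LLM-as-a-Judge pass is used alongside them.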
Why This Is Hard: Technical Challenges
Context Window Limits
9K+ tokens exceed many LLMs' effective attention span. Information at the beginning of conversations is often "forgotten" by the time a question is asked.
Temporal Coherence
Events happen across 35 sessions over simulated weeks/months. Maintaining correct temporal ordering without explicit timestamps is extremely challenging.
Cross-Session Synthesis
Multi-hop questions require connecting facts from session 3 with facts from session 28 — information separated by thousands of tokens of unrelated conversation.
Hallucination Resistance
Adversarial questions test whether the system fabricates plausible-sounding answers for things never discussed. Most LLMs fail significantly here.
Semantic Ambiguity
The same topic may be discussed differently across sessions with evolving context, requiring the system to resolve conflicting or updated information.
56% Human Gap
Even the best RAG approaches lag 56% behind human performance on this benchmark, demonstrating the fundamental difficulty of long-term conversational memory.
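Of the challenges above, temporal coherence lends itself to a quick sketch. One simple approach (an assumption for illustration, not the benchmark's or MemoryLake's actual method) is to tag each extracted fact with its (session, turn) position and treat that as a proxy timeline when no explicit timestamps exist:

```python
# Facts extracted from different sessions, indexed by where they appeared.
facts = [
    {"session": 19, "turn": 4,  "text": "Bob moved to the new apartment"},
    {"session": 3,  "turn": 11, "text": "Bob adopted his dog"},
]

# Sort by (session, turn) to recover a before/after ordering.
timeline = sorted(facts, key=lambda f: (f["session"], f["turn"]))
print(timeline[0]["text"])  # -> Bob adopted his dog (the adoption preceded the move)
```

This proxy breaks down as soon as a later session narrates an earlier event, which is why temporal reasoning remains one of the weakest categories for most systems.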
Key Takeaways: MemoryLake on LoCoMo
- MemoryLake achieves 94.03% overall — the highest score ever recorded on the LoCoMo benchmark, surpassing all published memory systems.
- Single-hop recall at 96.79% demonstrates near-perfect factual retrieval across long conversations, approaching human-level performance.
- Multi-hop reasoning at 91.84% shows MemoryLake can effectively chain information across sessions — the hardest category where most systems fail.
- Temporal reasoning at 91.28% validates MemoryLake's calendar-aware indexing and temporal event graph construction.
- Open-domain at 85.42% is the highest in the field, demonstrating strong integration of conversational memory with world knowledge.
- These results are achieved under strict experimental settings with no data leakage, no question-specific tuning, and full reproducibility.
Reference: Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). "Evaluating Very Long-Term Conversational Memory of LLM Agents." In Proceedings of ACL 2024.
View our benchmark results →