Benchmark Evaluation

LoCoMo

LoCoMo is a commonly used benchmark for long-conversation memory. It covers cross-turn, cross-topic, and temporal questions, making it useful for testing whether an Agent can recall stable user facts and events from long histories. GUMem reaches 92.9% accuracy, which is currently SOTA-level performance.

Product	Overall	Single-hop	Multi-hop	Temporal	Open-domain
GUMem	92.90	94.40	91.50	94.40	79.20
Mem0 New April 2026	91.60	92.30	93.30	92.80	76.00
HyperGraphRAG	86.49	90.61	80.85	85.36	70.83
MIRIX	85.38	85.11	83.70	88.39	65.62
HippoRAG 2	81.62	86.44	75.89	78.50	66.67
LightRAG	79.87	86.68	84.04	60.75	71.88
MemOS	75.80	81.09	67.49	75.18	55.90
Membase	72.01	73.12	64.65	81.20	53.12
Mem0g	68.44	65.71	47.19	58.13	75.71
GraphRAG	67.60	79.55	54.96	50.16	58.33
Mem0	66.88	67.13	51.15	55.51	72.93
Zep	65.99	61.70	41.35	49.31	76.60
LangMem	58.10	62.23	47.92	23.43	71.12
MemU	56.55	66.34	63.12	27.10	50.01
OpenAI	52.90	63.79	42.92	21.71	62.29
A-Mem	48.38	39.79	18.85	49.91	54.05

Dataset Characteristics

The LoCoMo paper describes an original dataset with 50 very long conversations. The official repository's current evaluation file, locomo10.json, is a 10-conversation subset selected for longer conversations and higher-quality annotations. This subset contains 272 sessions, 5,882 dialogue turns, and 1,986 QA annotations, averaging about 27.2 sessions and 588.2 turns per conversation.

LoCoMo is not only a test of whether a system can find one exact sentence. Its conversations are built around personas, timelines, event relations, and image references, and require the model to answer questions, summarize events, or continue multimodal dialogue across days and topics. For a Memory system such as GUMem, it mainly tests three capabilities: whether the system can recall correct facts from long histories, combine clues across sessions, and handle temporal order, causal relations, and unanswerable questions.

Question type	Count	Main evaluation point
Single-hop	841	Recall an explicit fact from one context fragment.
Multi-hop	282	Combine multiple dialogue fragments or facts before answering.
Temporal	321	Understand dates, order, intervals, and event timing.
Open-domain	96	Infer an answer from conversation facts and commonsense knowledge.
Adversarial	446	Detect missing premises and avoid fabricating unsupported answers.

This GUMem LoCoMo experiment excludes category=5 Adversarial questions and scores only the first four categories, or 1,540 questions, matching the Mem0 LoCoMo reporting scope.

GUMem vs Mem0

From the perspective of authority and accuracy, Mem0 is one of the most widely used Agent Memory products, so it can serve as a baseline for comparing memory recall quality and context cost. Compared with Mem0 New (2026.4), GUMem uses less than half the average context tokens in this LoCoMo evaluation while reaching a 92.9% overall Judge pass rate, higher than Mem0 New's 91.6%. This shows that GUMem can substantially reduce input context while maintaining higher overall answer correctness.

LoCoMo GUMem vs Mem0 New benchmark

Experiment Details

Next Step

Read Query Memory to understand how recall settings affect latency and result quality.

Benchmark Evaluation ​

LoCoMo ​

Dataset Characteristics ​

GUMem vs Mem0 ​