LongMemEval Benchmark

Benchmark: LongMemEval (ICLR 2025)

Awareness Memory was evaluated on LongMemEval, a widely used benchmark for long-term conversational memory. The benchmark contains 500 human-curated questions spanning five core memory abilities and tests whether an assistant can retrieve and use information from long interaction histories rather than relying only on the current chat.

Benchmark Setup

| Field | Value |
|---|---|
| Benchmark | LongMemEval (ICLR 2025) |
| Dataset | 500 human-curated questions |
| Variant | LongMemEval_S, about 115k tokens per question |
| Method | Hybrid RRF (BM25 plus semantic vector search) |
| Embedding | all-MiniLM-L6-v2, 384 dimensions |
| LLM calls | 0 (pure retrieval, no generation cost) |
| Hardware | Apple M1, 8 GB RAM |
| Runtime | 14 minutes total |

Awareness Memory Results

| Metric | Score | Hits | Note |
|---|---|---|---|
| Recall@1 | 77.6% | 388 / 500 | |
| Recall@3 | 91.8% | 459 / 500 | |
| Recall@5 | 95.6% | 478 / 500 | Primary metric |
| Recall@10 | 97.4% | 487 / 500 | |

R@5 Leaderboard

| System | Score | Metric |
|---|---|---|
| MemPalace (ChromaDB raw) | 96.6% | Recall@5 only |
| Awareness Memory (Hybrid) | 95.6% | Recall@5 (Hybrid RRF) |
| OMEGA | 95.4% | QA Accuracy |
| Mastra (GPT-5-mini) | 94.9% | QA Accuracy |
| Mastra (GPT-4o) | 84.2% | QA Accuracy |
| Supermemory | 81.6% | QA Accuracy |
| Zep / Graphiti | 71.2% | QA Accuracy |
| GPT-4o (full context) | 60.6% | QA Accuracy |

Note: MemPalace's 96.6% is Recall@5 only, not QA accuracy, and its Palace hierarchy was not used in that evaluation.

R@5 by Question Type

| Question type | R@5 |
|---|---|
| knowledge-update | 100.0% |
| multi-session | 98.5% |
| single-session-assistant | 98.2% |
| temporal-reasoning | 94.7% |
| single-session-user | 88.6% |
| single-session-preference | 86.7% |
| Overall | 95.6% |

Ablation Study

| Retrieval method | R@5 | What it shows |
|---|---|---|
| Vector-only | 92.6% | Semantic retrieval alone is strong but misses exact lexical cues. |
| BM25-only | 91.4% | Full-text retrieval alone is strong but misses paraphrases. |
| Hybrid RRF | 95.6% | Fusion improves by about 3 points over either single method alone. |

Method Notes

This benchmark was run with the open-source local version, Awareness-Local. Awareness Memory uses Hybrid RRF (Reciprocal Rank Fusion) retrieval: BM25 full-text search plus semantic vector search, fused without LLM calls on the retrieval path. The evaluation uses all-MiniLM-L6-v2 embeddings, a lightweight 384-dimensional model, and the LongMemEval_S variant with about 115k tokens per question.
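
As a rough sketch of that fusion step, the snippet below merges a BM25 ranking and a vector-search ranking with standard Reciprocal Rank Fusion. The function name, the k = 60 constant, and the toy memory IDs are illustrative assumptions, not the actual Awareness Memory code.

```python
from collections import defaultdict

def rrf_fuse(bm25_ranking, vector_ranking, k=60, top_n=5):
    """Fuse two ranked lists of memory IDs with Reciprocal Rank Fusion.

    Each input is a list of memory IDs ordered best-first. The constant
    k=60 is the value commonly used in the RRF literature; a real system
    may tune it differently.
    """
    scores = defaultdict(float)
    for ranking in (bm25_ranking, vector_ranking):
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] += 1.0 / (k + rank)
    # Highest fused score first; return the top-N memory IDs.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy example: one query's results ranked by BM25 and by vector similarity.
bm25_hits = ["m7", "m2", "m9", "m4", "m1"]
vector_hits = ["m2", "m5", "m7", "m3", "m8"]
print(rrf_fuse(bm25_hits, vector_hits))  # items ranked highly by both win
```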

The lightweight embedding model was chosen deliberately so ordinary laptops can run the product locally. In practice, a stronger embedding model and the full cloud version should have a higher ceiling than this local lightweight setup.
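
For illustration, this is roughly what the local embedding step looks like with the sentence-transformers library and the model from the setup table. The example texts are invented, and the comment about swapping in a larger model (for example all-mpnet-base-v2) only illustrates the "higher ceiling" point.

```python
from sentence_transformers import SentenceTransformer

# Lightweight model from the benchmark setup: 384-dimensional embeddings,
# small enough to run locally on an 8 GB laptop.
model = SentenceTransformer("all-MiniLM-L6-v2")

memories = [
    "User mentioned moving to Berlin in March.",
    "User prefers vegetarian restaurants.",
]
embeddings = model.encode(memories, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)

# A stronger model (e.g. all-mpnet-base-v2) could be dropped in the same way,
# at the cost of more memory and slower local inference.
```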

Recall@5 is a retrieval metric: it means the relevant evidence appears in the top five retrieved memories. It is not the same thing as final QA accuracy. The leaderboard above keeps that distinction explicit because some public numbers are retrieval-only while others are QA accuracy.
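
As a concrete sketch, Recall@5 can be computed from retrieval output roughly as below. The data layout and the "hit if any gold evidence ID appears in the top k" rule are simplifying assumptions for illustration, not the benchmark's exact scoring script; questions with multiple evidence sessions may be scored more strictly.

```python
def recall_at_k(results, k=5):
    """Fraction of questions whose gold evidence appears in the top-k retrieved memories.

    `results` is a list of (retrieved_ids, gold_ids) pairs, one per question:
    retrieved_ids is ordered best-first, gold_ids is the set of relevant memory IDs.
    Here a question counts as a hit if ANY gold ID is in the top k.
    """
    hits = sum(
        1 for retrieved_ids, gold_ids in results
        if set(retrieved_ids[:k]) & set(gold_ids)
    )
    return hits / len(results)

# Toy example: 2 of 3 questions have gold evidence in the top 5.
toy = [
    (["m1", "m4", "m9", "m2", "m7"], {"m4"}),        # hit
    (["m3", "m5", "m6", "m8", "m2"], {"m1"}),        # miss
    (["m2", "m1", "m3", "m4", "m5"], {"m3", "m9"}),  # hit
]
print(recall_at_k(toy, k=5))  # 0.666...
```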

References:

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025).