LongMemEval Benchmark

Benchmark: LongMemEval (ICLR 2025)

Awareness Memory was evaluated on LongMemEval, a widely used benchmark for long-term conversational memory. The benchmark contains 500 human-curated questions spanning five core memory abilities and tests whether an assistant can retrieve and use information from long interaction histories rather than relying only on the current chat.

Benchmark Setup

| Field | Value |
|---|---|
| Benchmark | LongMemEval (ICLR 2025) |
| Dataset | 500 human-curated questions |
| Variant | LongMemEval_S, about 115k tokens per question |
| Method | Hybrid RRF (BM25 plus semantic vector search) |
| Embedding | all-MiniLM-L6-v2, 384 dimensions |
| LLM calls | 0 (pure retrieval, no generation cost) |
| Hardware | Apple M1, 8 GB RAM |
| Runtime | 14 minutes total |

Awareness Memory Results

| Metric | Score | Hits | Note |
|---|---|---|---|
| Recall@1 | 77.6% | 388 / 500 | |
| Recall@3 | 91.8% | 459 / 500 | |
| Recall@5 | 95.6% | 478 / 500 | Primary metric |
| Recall@10 | 97.4% | 487 / 500 | |

R@5 Leaderboard

| System | Score | Metric |
|---|---|---|
| MemPalace (ChromaDB raw) | 96.6% | Recall@5 only |
| Awareness Memory (Hybrid) | 95.6% | Recall@5 (Hybrid RRF) |
| OMEGA | 95.4% | QA Accuracy |
| Mastra (GPT-5-mini) | 94.9% | QA Accuracy |
| Mastra (GPT-4o) | 84.2% | QA Accuracy |
| Supermemory | 81.6% | QA Accuracy |
| Zep / Graphiti | 71.2% | QA Accuracy |
| GPT-4o (full context) | 60.6% | QA Accuracy |

Note: MemPalace's 96.6% is Recall@5 only, not QA accuracy, and its Palace hierarchy was not used in that evaluation.

R@5 by Question Type

| Question type | R@5 |
|---|---|
| knowledge-update | 100.0% |
| multi-session | 98.5% |
| single-session-assistant | 98.2% |
| temporal-reasoning | 94.7% |
| single-session-user | 88.6% |
| single-session-preference | 86.7% |
| Overall | 95.6% |

Ablation Study

| Retrieval method | R@5 | What it shows |
|---|---|---|
| Vector-only | 92.6% | Semantic retrieval alone is strong but misses exact lexical cues. |
| BM25-only | 91.4% | Full-text retrieval alone is strong but misses paraphrases. |
| Hybrid RRF | 95.6% | Fusion improves by about 3 points over either single method alone. |

Method Notes

This benchmark was run with the open-source local version, Awareness-Local. Awareness Memory uses Hybrid RRF (Reciprocal Rank Fusion) retrieval: BM25 full-text search plus semantic vector search, fused without LLM calls on the retrieval path. The evaluation uses all-MiniLM-L6-v2 embeddings, a lightweight 384-dimensional model, and the LongMemEval_S variant with about 115k tokens per question.
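
As a rough sketch of that fusion step, the snippet below merges a BM25 ranking and a vector-search ranking with standard Reciprocal Rank Fusion. The function name, the k = 60 constant, and the toy memory IDs are illustrative assumptions, not the actual Awareness Memory code.

```python
from collections import defaultdict

def rrf_fuse(bm25_ranking, vector_ranking, k=60, top_n=5):
    """Fuse two ranked lists of memory IDs with Reciprocal Rank Fusion.

    Each input is a list of memory IDs ordered best-first. The constant
    k=60 is the value commonly used in the RRF literature; a real system
    may tune it differently.
    """
    scores = defaultdict(float)
    for ranking in (bm25_ranking, vector_ranking):
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] += 1.0 / (k + rank)
    # Highest fused score first; return the top-N memory IDs.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy example: one query's results ranked by BM25 and by vector similarity.
bm25_hits = ["m7", "m2", "m9", "m4", "m1"]
vector_hits = ["m2", "m5", "m7", "m3", "m8"]
print(rrf_fuse(bm25_hits, vector_hits))  # items ranked highly by both win
```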

The lightweight embedding model was chosen deliberately so ordinary laptops can run the product locally. In practice, a stronger embedding model and the full cloud version should have a higher ceiling than this local lightweight setup.
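
For illustration, this is roughly what the local embedding step looks like with the sentence-transformers library and the model from the setup table. The example texts are invented, and the comment about swapping in a larger model (for example all-mpnet-base-v2) only illustrates the "higher ceiling" point.

```python
from sentence_transformers import SentenceTransformer

# Lightweight model from the benchmark setup: 384-dimensional embeddings,
# small enough to run locally on an 8 GB laptop.
model = SentenceTransformer("all-MiniLM-L6-v2")

memories = [
    "User mentioned moving to Berlin in March.",
    "User prefers vegetarian restaurants.",
]
embeddings = model.encode(memories, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)

# A stronger model (e.g. all-mpnet-base-v2) could be dropped in the same way,
# at the cost of more memory and slower local inference.
```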

Recall@5 is a retrieval metric: it means the relevant evidence appears in the top five retrieved memories. It is not the same thing as final QA accuracy. The leaderboard above keeps that distinction explicit because some public numbers are retrieval-only while others are QA accuracy.
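
As a concrete sketch, Recall@5 can be computed from retrieval output roughly as below. The data layout and the "hit if any gold evidence ID appears in the top k" rule are simplifying assumptions for illustration, not the benchmark's exact scoring script; questions with multiple evidence sessions may be scored more strictly.

```python
def recall_at_k(results, k=5):
    """Fraction of questions whose gold evidence appears in the top-k retrieved memories.

    `results` is a list of (retrieved_ids, gold_ids) pairs, one per question:
    retrieved_ids is ordered best-first, gold_ids is the set of relevant memory IDs.
    Here a question counts as a hit if ANY gold ID is in the top k.
    """
    hits = sum(
        1 for retrieved_ids, gold_ids in results
        if set(retrieved_ids[:k]) & set(gold_ids)
    )
    return hits / len(results)

# Toy example: 2 of 3 questions have gold evidence in the top 5.
toy = [
    (["m1", "m4", "m9", "m2", "m7"], {"m4"}),        # hit
    (["m3", "m5", "m6", "m8", "m2"], {"m1"}),        # miss
    (["m2", "m1", "m3", "m4", "m5"], {"m3", "m9"}),  # hit
]
print(recall_at_k(toy, k=5))  # 0.666...
```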

References:

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025).