# Benchmark: LongMemEval (ICLR 2025)
Awareness Memory was evaluated on LongMemEval, the industry-standard benchmark for long-term conversational memory. The benchmark contains 500 human-curated questions across 5 core capabilities, testing whether an assistant can retrieve and use long interaction histories instead of relying only on the current chat.
## Benchmark Setup
| Field | Value |
|---|---|
| Benchmark | LongMemEval (ICLR 2025) |
| Dataset | 500 human-curated questions |
| Variant | LongMemEval_S, about 115k tokens per question |
| Method | Hybrid RRF, BM25 plus Semantic Vector Search |
| Embedding | all-MiniLM-L6-v2, 384 dimensions |
| LLM calls | 0, pure retrieval with no generation cost |
| Hardware | Apple M1, 8GB RAM |
| Runtime | 14 minutes total |
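For context, this is roughly what the embedding step in the setup above looks like with the sentence-transformers library. It is a minimal sketch with illustrative memory strings, not code from the Awareness repository.

```python
# Sketch of the embedding step, assuming the sentence-transformers library.
# The memory strings below are illustrative, not from the actual dataset.
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings and runs
# comfortably on CPU-only machines such as an 8GB Apple M1.
model = SentenceTransformer("all-MiniLM-L6-v2")

memories = [
    "User mentioned they adopted a cat named Miso in March.",
    "User prefers vegetarian restaurant recommendations.",
]
embeddings = model.encode(memories)
print(embeddings.shape)  # (2, 384)
```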
## Awareness Memory Results
| Metric | Score | Hits | Note |
|---|---|---|---|
| Recall@1 | 77.6% | 388 / 500 | |
| Recall@3 | 91.8% | 459 / 500 | |
| Recall@5 | 95.6% | 478 / 500 | Primary metric |
| Recall@10 | 97.4% | 487 / 500 | |
## R@5 Leaderboard
| System | Score | Metric |
|---|---|---|
| MemPalace (ChromaDB raw) | 96.6% | R@5 only |
| Awareness Memory (Hybrid) | 95.6% | R@5 (Hybrid RRF) |
| OMEGA | 95.4% | QA Accuracy |
| Mastra (GPT-5-mini) | 94.9% | QA Accuracy |
| Mastra (GPT-4o) | 84.2% | QA Accuracy |
| Supermemory | 81.6% | QA Accuracy |
| Zep / Graphiti | 71.2% | QA Accuracy |
| GPT-4o (full context) | 60.6% | QA Accuracy |
Note: MemPalace's 96.6% is Recall@5 only, not QA accuracy, and the Palace hierarchy was not used in that evaluation.
## R@5 by Question Type
| Question type | R@5 |
|---|---|
| knowledge-update | 100.0% |
| multi-session | 98.5% |
| single-session-asst | 98.2% |
| temporal-reasoning | 94.7% |
| single-session-user | 88.6% |
| single-session-pref | 86.7% |
| Overall | 95.6% |
## Ablation Study
| Retrieval method | R@5 | What it shows |
|---|---|---|
| Vector-only | 92.6% | Semantic retrieval alone is strong but misses exact lexical cues. |
| BM25-only | 91.4% | Full-text retrieval is strong but misses paraphrases. |
| Hybrid RRF | 95.6% | Hybrid fusion improves by 3.0 points over vector-only and 4.2 points over BM25-only. |
## Method Notes
This benchmark was run with the open-source local version, Awareness-Local. Awareness Memory uses Hybrid RRF retrieval: BM25 full-text search plus semantic vector search, fused without LLM calls on the retrieval path. The evaluation uses all-MiniLM-L6-v2 embeddings, a lightweight 384-dimensional model, and the LongMemEval_S variant with about 115k tokens per question.
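As an illustration of the fusion step, below is a minimal sketch of Reciprocal Rank Fusion over a BM25 ranking and a vector ranking. The constant k=60 comes from the original RRF formulation; the function name and document ids are hypothetical, and Awareness's actual implementation may differ.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF) over two ranked lists,
# e.g. one from BM25 and one from vector search. No LLM calls are needed:
# fusion is pure arithmetic over the ranks. k=60 is the constant from the
# original RRF paper; Awareness may use different settings.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; highest fused score ranks first."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["m7", "m2", "m9"]    # exact lexical matches
vector_top = ["m2", "m4", "m7"]  # semantic / paraphrase matches
print(rrf_fuse([bm25_top, vector_top]))  # ['m2', 'm7', 'm9', 'm4']
```

Documents ranked highly by both retrievers (here, m2 and m7) rise to the top of the fused list, which is why the hybrid recovers lexical cues that vector search misses and paraphrases that BM25 misses.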
The lightweight embedding model was chosen deliberately so ordinary laptops can run the product locally. In practice, a stronger embedding model and the full cloud version should have a higher ceiling than this local lightweight setup.
Recall@5 is a retrieval metric: it means the relevant evidence appears in the top five retrieved memories. It is not the same thing as final QA accuracy. The leaderboard above keeps that distinction explicit because some public numbers are retrieval-only while others are QA accuracy.
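To make that distinction concrete, here is a minimal Recall@k scorer matching the definition above. It is illustrative only, not the official LongMemEval evaluation script, and all names and ids are hypothetical.

```python
# Illustrative Recall@k scorer: a question counts as a hit if any gold
# evidence id appears in the top-k retrieved memories for that question.
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    hits = sum(
        1 for top, answers in zip(retrieved, gold)
        if any(doc_id in answers for doc_id in top[:k])
    )
    return hits / len(gold)

retrieved = [["m2", "m7", "m4"], ["m1", "m3"]]
gold = [{"m7"}, {"m9"}]
print(recall_at_k(retrieved, gold, k=3))  # 0.5: first question hit, second missed
```

Under this definition, 478 hits out of 500 questions yields the 95.6% Recall@5 reported above.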
## References
- LongMemEval paper: https://arxiv.org/abs/2410.10813
- LongMemEval repository: https://github.com/xiaowu0162/LongMemEval
- Awareness-Local: https://github.com/everest-an/Awareness-Market
- Awareness benchmark runner: https://github.com/everest-an/Awareness/tree/main/benchmarks/longmemeval