Memory Benchmarks (LoCoMo)
You can't improve what you can't measure. LoCoMo and LongMemEval provide standardized test suites that reveal exactly where your memory system fails: single-hop recall, multi-hop reasoning, temporal ordering, or open-ended generation.
Imagine you hire a new assistant and give them a quiz after their first week. One section tests whether they remember your coffee order (direct recall). Another checks if they can connect two separate conversations to plan your schedule (multi-hop reasoning). A third tests whether they know you switched from tea to coffee last Wednesday (temporal awareness). Your overall score is useful, but the per-section breakdown tells you exactly what to train.
LoCoMo (Long-Context Conversational Memory) and LongMemEval are standardized benchmarks that work the same way for AI memory systems. They provide multi-session conversations paired with ground-truth questions. Each question targets a specific memory capability. Running your system against these benchmarks produces per-category scores that turn "memory quality" from a vague claim into a concrete number y…
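To make the per-category breakdown concrete, here is a minimal sketch of how such scores could be tallied. It assumes each answer has already been graded as correct or incorrect. The category names, the `results` records, and the `per_category_accuracy` helper are hypothetical illustrations, not data or code from LoCoMo, LongMemEval, or the notebook, and the benchmarks' own scoring may differ.

```python
from collections import defaultdict

# Hypothetical graded results: (question category, was the answer judged correct?).
# These records are invented for illustration; they are not benchmark data.
results = [
    ("single_hop", True),
    ("single_hop", True),
    ("multi_hop", True),
    ("multi_hop", False),
    ("temporal", False),
    ("temporal", False),
    ("open_ended", True),
]


def per_category_accuracy(records):
    """Return {category: fraction correct} for (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


scores = per_category_accuracy(results)
overall = sum(is_correct for _, is_correct in results) / len(results)

# Print the overall number first, then categories from weakest to strongest.
print(f"overall: {overall:.2f}")
for category, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{category}: {score:.2f}")
```

In this made-up run the overall score alone hides the fact that every temporal question failed. That is the kind of gap the per-category view is meant to expose.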
About this tutorial
This hands-on Jupyter notebook is part of Agent Memory Techniques, a free open-source repository by Nir Diamant that pairs runnable code examples with detailed explanations.
