Retrieval is solved. The reader isn't.
We hit 99.4% recall@10 on LongMemEval-S and ~58% end-to-end QA. The gap is the whole story, and we won't paper over it.
Here are two numbers from our LongMemEval-S run, and they look like they're in tension: 99.4% session recall@10 (100% at @50), but ~58% end-to-end QA accuracy. If retrieval almost never misses, why does the system get four in ten questions wrong?
Because those measure two different jobs. Recall@10 asks: did the right evidence land in the top results? Almost always, yes — the relevant session is in the top 10, and effectively always in the top 50. End-to-end QA asks a harder question: given that evidence, did the model read it correctly and produce the right answer? That second step isn't retrieval. It's reading comprehension, and it belongs to the LLM doing the reasoning, not to the memory layer.
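To make the distinction concrete, here is a minimal sketch of how the two metrics can be computed from the same evaluation run. It is illustrative only, not our benchmark harness: the record layout (`ranked`, `gold`, `answer_correct`) and the function names are assumptions made for this example, and it assumes a question counts as a retrieval hit when at least one gold evidence session appears in the top k.

```python
from typing import Dict, List, Sequence, Set


def session_hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """True if at least one gold evidence session is in the top k results.

    Assumption for this sketch: one gold session in the top k is enough.
    """
    return any(session_id in gold for session_id in ranked[:k])


def recall_at_k(runs: List[Dict], k: int) -> float:
    """Fraction of questions whose evidence was retrieved in the top k."""
    return sum(session_hit_at_k(r["ranked"], r["gold"], k) for r in runs) / len(runs)


def qa_accuracy(runs: List[Dict]) -> float:
    """Fraction of questions the reader answered correctly, end to end."""
    return sum(r["answer_correct"] for r in runs) / len(runs)


# Toy data (made up): retrieval finds the evidence for every question,
# but the reader only answers half of them correctly.
runs = [
    {"ranked": ["s3", "s7", "s1"], "gold": {"s7"}, "answer_correct": True},
    {"ranked": ["s2", "s9", "s4"], "gold": {"s9"}, "answer_correct": False},
    {"ranked": ["s5", "s6", "s8"], "gold": {"s6"}, "answer_correct": True},
    {"ranked": ["s1", "s2", "s3"], "gold": {"s2"}, "answer_correct": False},
]

print(recall_at_k(runs, k=3))  # 1.0
print(qa_accuracy(runs))       # 0.5
```

Same run, same questions, two very different numbers. Nothing about the first function knows or cares whether the answer was right.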
We're being precise about this on purpose. It would be easy to quote 99.4% and let you assume it describes accuracy. It doesn't, and we won't claim accuracy superiority. The end-to-end number is reader-limited: it is bounded by whatever model you point at the retrieved context, not by Locamem. Swap in a stronger reader and that ~58% moves; our recall doesn't budge, because recall is measured before the reader ever sees the context, and at 99.4% it has almost no headroom left anyway.
This is why every Locamem result returns a per-facet score breakdown. You see why something was retrieved — lexical match, SimHash similarity, recency, temporal validity — not a single opaque relevance number. When the reader gets an answer wrong, you can confirm the evidence was actually there and the failure was downstream. That's an auditable system, not a black box.
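As a rough picture of what a per-facet breakdown looks like, here is a small sketch. It is not Locamem's actual API or scoring formula: the `FacetScores` type, the weights, and the weighted-sum combination are all hypothetical, chosen only to show how a result can carry its own explanation. The four facet names are the ones listed above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FacetScores:
    """Hypothetical per-facet scores for one retrieved session, each in [0, 1]."""
    lexical: float
    simhash: float
    recency: float
    temporal_validity: float


# Hypothetical weights, for illustration only.
WEIGHTS = {"lexical": 0.4, "simhash": 0.3, "recency": 0.2, "temporal_validity": 0.1}


def combined_score(facets: FacetScores) -> float:
    """Weighted sum of the facet scores (an assumed combination rule)."""
    return sum(WEIGHTS[name] * value for name, value in asdict(facets).items())


def explain(session_id: str, facets: FacetScores) -> str:
    """Render the combined score together with every facet that produced it."""
    parts = ", ".join(f"{name}={value:.2f}" for name, value in asdict(facets).items())
    return f"{session_id}: score={combined_score(facets):.3f} ({parts})"


hit = FacetScores(lexical=0.9, simhash=0.5, recency=0.2, temporal_validity=1.0)
print(explain("session-17", hit))
```

The point of the shape is that the combined number never travels alone. Whoever inspects the result can see that this session ranked on lexical match and temporal validity rather than recency.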
The industry habit of collapsing retrieval and generation into one accuracy figure hides exactly the boundary that matters when you're debugging an agent. We separate them. Locamem's job is to put the right memory in front of the reader, fast, with a receipt every time. We do that job at 99.4% session recall@10 on LongMemEval-S. The reader's job is the reader's job, and we'll tell you which is which.
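In practice, keeping the two jobs separate means attributing each wrong answer to one side of the boundary. A minimal sketch, assuming you log the ranked session IDs, the gold evidence sessions, and whether the final answer was judged correct (the function name and labels are hypothetical, not part of Locamem):

```python
from collections import Counter
from typing import Sequence, Set


def classify(ranked: Sequence[str], gold: Set[str], answer_correct: bool, k: int = 10) -> str:
    """Label one question as correct, a reader failure, or a retrieval failure."""
    if answer_correct:
        return "correct"
    evidence_retrieved = any(session_id in gold for session_id in ranked[:k])
    # Wrong answer with the evidence in the top k: the failure is downstream.
    return "reader_failure" if evidence_retrieved else "retrieval_failure"


# Toy cases (made up): (ranked session IDs, gold sessions, answer judged correct?)
cases = [
    (["a", "b"], {"b"}, True),
    (["a", "b"], {"a"}, False),
    (["a", "b"], {"z"}, False),
    (["c", "d"], {"d"}, False),
]

report = Counter(classify(ranked, gold, ok) for ranked, gold, ok in cases)
print(dict(report))  # {'correct': 1, 'reader_failure': 2, 'retrieval_failure': 1}
```

A report like this tells you where to spend effort: a large `reader_failure` bucket points at the model or the prompt, not at the memory layer.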