Measured on LongMemEval-S · 500 questions

Retrieval is effectively solved.
On-device.

Real numbers, reproducible from the repo. The honest headline: a single local SQLite file surfaces a gold-evidence session in its top 10 results for 99.4% of questions (497/500), with zero network calls and zero per-recall cost.

Session recall@10: 99.4%
Session recall@50: 100%
Recall latency (p50): ~9 ms
Per recall: $0 · 0 API calls
Retrieval quality

Session recall on LongMemEval-S

Did the store surface a gold-evidence session in the top-k? Measured across all 500 questions, embeddings off (the air-gapped path).

| k | Session recall@k | Notes |
|---|---|---|
| @10 | 99.4% (497/500) | Only 3 misses across the whole set. |
| @20 | 99.6% (498/500) | Default reader window. |
| @50 | 100% (500/500) | Every gold session is in the candidate set. |
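
For readers who want to check the arithmetic, here is a minimal sketch of how session recall@k can be computed from ranked retrieval output. The function names and the toy session IDs are illustrative assumptions, not code or data from the repo; the only thing taken from this page is the definition (a question counts as a hit when a gold-evidence session appears in the top-k).

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_session_ids: Sequence[str], gold_session_ids: Set[str], k: int) -> bool:
    """True if any gold-evidence session appears in the top-k retrieved sessions."""
    return any(sid in gold_session_ids for sid in ranked_session_ids[:k])


def session_recall_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """results holds one (ranked_session_ids, gold_session_ids) pair per question."""
    results = list(results)
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in results)
    return hits / len(results)


# Toy data with hypothetical session IDs (not from the benchmark).
toy = [
    (["s3", "s1", "s7"], {"s1"}),  # gold session at rank 2
    (["s2", "s9", "s4"], {"s4"}),  # gold session at rank 3
    (["s5", "s6", "s8"], {"s0"}),  # gold session never retrieved
]

print(session_recall_at_k(toy, k=1))  # 0.0
print(session_recall_at_k(toy, k=3))  # two of three questions hit
print(497 / 500)                      # 0.994, the @10 row above
```

Only the scorer sees the gold session IDs here; the retrieval step that produces the ranked lists does not need them.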
Speed & cost

Sub-10 ms recall, $0 per call

| Metric | Locamem (local) | Cloud memory (typical) |
|---|---|---|
| Recall latency | ~9 ms (SimHash band + FTS5; hybrid) | Network round trip: tens–hundreds of ms + tail latency |
| Write / ingest | ~3.8 ms per memory | API write + async indexing |
| API calls / recall | 0 | ≥1 (vector search), often + embedding call |
| Cost / 1,000 recalls | $0 | Metered API + embedding/model cost |
| Footprint | One SQLite file; runs on a laptop | Managed datastore + vector index, server-side |
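
If you want to sanity-check a p50 latency figure on your own machine, a generic timing harness along these lines is one way to do it. This is a sketch under assumptions: the helper names are ours, the workload is a stand-in lambda and not a Locamem call, and the run and warm-up counts are arbitrary.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls_ms(fn: Callable[[], object], runs: int = 200, warmup: int = 20) -> List[float]:
    """Return per-call wall-clock latencies in milliseconds, after a warm-up."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """p50 is the median; p99 exposes the tail that a median hides."""
    ordered = sorted(samples_ms)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {"p50_ms": statistics.median(ordered), "p99_ms": ordered[p99_index]}


# Stand-in workload; swap in a real recall call to measure a store.
samples = time_calls_ms(lambda: sum(range(1000)))
print(summarize(samples))
```

Reporting p99 next to p50 matters for the comparison in the table, since tail latency is where network-bound systems tend to differ most.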
How it's measured

Methodology & reproducibility

Dataset

LongMemEval-S

500 long-horizon questions over ~48-session haystacks — the standard long-term-memory benchmark. We report session-level retrieval recall (the metric that decides whether the reader even sees the evidence).

Engine

SimHash + FTS5

Candidates come from the union of three sources: 64-bit SimHash (LSH) buckets, FTS5 full-text search, and optional on-device embeddings. The merged candidates are fused and scored with a per-facet breakdown. Embeddings were off for the air-gapped numbers above.
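
To make the SimHash half concrete, here is a textbook 64-bit SimHash with band-based bucketing. It is a sketch of the general technique, not Locamem's implementation: the tokenizer, the BLAKE2b token hash, the unweighted votes, and the choice of four 16-bit bands are all assumptions made for illustration.

```python
import hashlib
import re
from typing import List, Tuple


def simhash64(text: str) -> int:
    """64-bit SimHash over lowercase word tokens (unweighted)."""
    counts = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash: an 8-byte BLAKE2b digest.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def bands(fingerprint: int, n_bands: int = 4) -> List[Tuple[int, int]]:
    """Split the fingerprint into equal-width (band_index, value) keys for bucket lookups."""
    width = 64 // n_bands
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(n_bands)]


def shares_band(a: int, b: int, n_bands: int = 4) -> bool:
    """Two fingerprints are LSH candidates if at least one band matches exactly."""
    return any(x == y for x, y in zip(bands(a, n_bands), bands(b, n_bands)))


a = simhash64("the user prefers dark roast coffee in the morning")
b = simhash64("the user prefers dark roast coffee in the mornings")
print(hamming(a, b), shares_band(a, b))
```

With four bands, any two fingerprints that differ in at most three bits are guaranteed to share a band, because three flipped bits can touch at most three of the four bands. Near-duplicates therefore become exact-match lookups on band keys, which an ordinary SQLite index can serve.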

No overfitting

Firewalled

The QA path never reads the answer or answer_session_ids fields, and a runtime assert enforces this. The numbers are reported on the public set; they are not used as a tuning signal.
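
One way such a guard could look is a read-only view that refuses to hand gold fields to the QA code. This is an illustrative sketch, not the assert used in the repo: the class name and the sample record are hypothetical, and only the two forbidden field names come from this page.

```python
from collections.abc import Mapping

# Gold-label fields named on this page; everything else is passed through.
FORBIDDEN_KEYS = frozenset({"answer", "answer_session_ids"})


class FirewalledQuestion(Mapping):
    """Read-only view of a benchmark record that refuses access to gold labels."""

    def __init__(self, record: dict):
        self._record = record

    def __getitem__(self, key):
        # Raised explicitly so the check still runs under `python -O`,
        # which strips plain `assert` statements.
        if key in FORBIDDEN_KEYS:
            raise AssertionError(f"QA path tried to read gold field {key!r}")
        return self._record[key]

    def __iter__(self):
        return (k for k in self._record if k not in FORBIDDEN_KEYS)

    def __len__(self):
        return sum(1 for _ in self)


# Hypothetical record for illustration only.
record = {
    "question_id": "q1",
    "question": "Which coffee does the user prefer?",
    "answer": "dark roast",
    "answer_session_ids": ["s1"],
}
view = FirewalledQuestion(record)
print(sorted(view))      # gold fields are not even listed
print(view["question"])  # allowed
```

The QA code receives only the view, while the scorer keeps the raw record. Any accidental read of a gold field then fails loudly instead of silently leaking the answer.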

# reproduce, end to end
git clone https://github.com/TeamWilcoe/locamem && cd locamem
python benchmarks/build_failure_dossier.py   # session recall@10/20/50, CPU-only
python benchmarks/bench_longmemeval_qa.py --use-solvers --model claude  # end-to-end
Where we're honest

Retrieval ≠ end-to-end accuracy

We separate the two on purpose, and we don't claim to beat anyone on answer accuracy.

Retrieval (our headline)

~99% recall@10

The store reliably surfaces the right evidence. This is what Locamem owns, on-device, at $0 — and it's genuinely strong.

End-to-end QA (reader-limited)

~58%, and we say so

With recall this high, end-to-end accuracy is limited mainly by the reader model rather than by retrieval. Even an oracle GPT-4o handed perfect evidence tops out near 82%. We report ~58% with the current reader and treat the gap as a reader problem, not a retrieval one. None of this is a claim of accuracy superiority over cloud products.
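
A back-of-the-envelope check shows why retrieval is not the bottleneck. The model below is a deliberate simplification and an assumption of ours: it treats a question as answerable only when its evidence is retrieved, and it multiplies the recall@20 figure from this page by the approximate oracle-reader figure quoted above.

```python
def end_to_end_ceiling(retrieval_recall: float, reader_accuracy_given_evidence: float) -> float:
    """Simplified ceiling: the reader can only be right when the evidence was retrieved."""
    return retrieval_recall * reader_accuracy_given_evidence


recall_at_20 = 498 / 500  # session recall@20 reported on this page
oracle_reader = 0.82      # approximate oracle-reader accuracy quoted above

ceiling = end_to_end_ceiling(recall_at_20, oracle_reader)
print(f"{ceiling:.4f}")                  # 0.8167
print(f"{oracle_reader - ceiling:.4f}")  # 0.0033
```

Under this simplified model, retrieval misses at k=20 cost roughly a third of a percentage point against an ~82% reader. That is small next to the distance between ~58% and ~82%, which sits on the reader side.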

Roadmap: reciprocal rank fusion (RRF) and a local cross-encoder rerank are the next retrieval upgrades; reader-side anti-hedging and aggregation work targets the end-to-end gap. Both are tracked openly in the repo.
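
For context on the first roadmap item, reciprocal rank fusion merges several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below shows the standard formulation with the commonly used constant k = 60; it is not code from the repo, and the two input lists are made-up session IDs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per item, ranks start at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from two retrievers.
simhash_hits = ["s4", "s1", "s9"]
fts5_hits = ["s1", "s7", "s4"]
print(reciprocal_rank_fusion([simhash_hits, fts5_hits]))  # ['s1', 's4', 's7', 's9']
```

RRF uses ranks only, so it needs no score calibration between retrievers, which is convenient when fusing hash-based and full-text results whose raw scores are not comparable.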

See it for yourself

Run the live recall demo, then install in one line.

No account. No keys. One SQLite file and an MCP server, on your machine.

$ curl -fsSL https://locamem.com/install | bash