Benchmarks — Locamem

Capability benchmark

Six tracks. Full marks. Nothing leaked.

An open, reproducible harness scoring what agent memory actually has to do — recall, facts, preferences, contradictions, scoped privacy, and latency — over a three-tenant fixture set, graded by the real engine. No hand-written scores.

Track	Result	Weight	What it proves
Conversational recall	12 / 12	25	Single-hop, multi-hop, temporal, decisions & incidents.
Fact recall	10 / 10	15	User, company, and project facts across sessions.
Preference recall	8 / 8	15	Style, workflow, and format preferences.
Contradiction handling	8 / 8	15	Newer supersedes older; full history stays queryable.
Scoped privacy	12 / 12 · 0 leaks	15	Tenant boundaries enforced on every read.
Latency	p50 ~1ms · p95 <10ms	15	Fast enough to sit in front of every model call.

Score 100/100 across 52 cases over three tenants, fixture 131778f7cae6. The artifact JSON is committed at benchmarks/capability_artifact.json — re-run it yourself with python benchmarks/bench_capability.py. (Latency is machine-dependent; the committed run measured p50 1.15ms / p95 5.1ms.)
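
For readers checking the arithmetic, here is a minimal sketch of how a weighted total could be derived from the per-track results. The aggregation rule (weight times pass fraction) and the treatment of latency as a single pass/fail case are assumptions of this sketch; the committed harness is the authority on how the score is actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str
    passed: int
    total: int
    weight: int  # points this track contributes at a full pass


# Pass counts and weights copied from the table above. Latency is modeled
# as one pass/fail case, which is an assumption of this sketch.
TRACKS = [
    Track("conversational_recall", 12, 12, 25),
    Track("fact_recall", 10, 10, 15),
    Track("preference_recall", 8, 8, 15),
    Track("contradiction_handling", 8, 8, 15),
    Track("scoped_privacy", 12, 12, 15),
    Track("latency", 1, 1, 15),
]


def weighted_score(tracks):
    """Sum each track's weight scaled by its pass fraction."""
    return sum(t.weight * t.passed / t.total for t in tracks)


total_weight = sum(t.weight for t in TRACKS)
score = weighted_score(TRACKS)
print(f"{score:.0f}/{total_weight}")  # prints "100/100"
```

Under this rule a partial pass costs a proportional share of that track's weight, so a recall failure costs more than a preference failure.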

Four tracks a cloud service can't pass

The same harness, with the network cut.

Zero network egress

Verified, not promised

The harness re-runs every recall with the process's sockets blocked. Every recall still passes, which shows the recall path never touches the network.

$0 per recall

0 API calls

Recall is a local index lookup — no vector-search API, no embedding call, no per-recall meter, at any volume.
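
To make "local index lookup" concrete, here is a minimal sketch using SQLite's FTS5 extension from Python's standard library. The table name, column, and sample rows are hypothetical and do not reflect Locamem's real schema; it also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a single FTS5 table holding memory text.
conn.execute("CREATE VIRTUAL TABLE memories USING fts5(text)")
conn.executemany(
    "INSERT INTO memories (text) VALUES (?)",
    [
        ("Alice prefers dark mode in the editor",),
        ("The staging deploy failed on Tuesday",),
        ("Project Atlas ships in March",),
    ],
)


def recall(conn, query, k=10):
    """Return up to k matching rows, best-ranked first (FTS5 BM25 order)."""
    rows = conn.execute(
        "SELECT text FROM memories WHERE memories MATCH ? ORDER BY rank LIMIT ?",
        (query, k),
    )
    return [r[0] for r in rows]


print(recall(conn, "dark mode"))
```

The whole lookup runs in-process against one database, so there is nothing to meter per call. Raw user text can contain characters that FTS5 treats as query syntax, so a real caller would quote or sanitize it first.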

Air-gapped parity

Same 100 / 100

The entire score is produced with embeddings and the network off. Cutting the cord changes nothing.

Auditable results

100% carry a breakdown

Every result returns its per-facet score — content · keyword · salience · temporal — not one opaque relevance number.
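
A result that carries its own breakdown might look like the sketch below. The four facet names come from the text above; the weights, the weighted-sum fusion, and the class names are assumptions made for illustration, not the engine's actual formula.

```python
from dataclasses import dataclass, asdict

# Hypothetical facet weights; the real fusion rule is not stated here.
FACET_WEIGHTS = {"content": 0.4, "keyword": 0.3, "salience": 0.2, "temporal": 0.1}


@dataclass(frozen=True)
class FacetScores:
    content: float
    keyword: float
    salience: float
    temporal: float

    def total(self):
        """Weighted sum of the four facets."""
        return sum(FACET_WEIGHTS[name] * value for name, value in asdict(self).items())


@dataclass(frozen=True)
class RecallResult:
    text: str
    facets: FacetScores

    def explain(self):
        """Render the total alongside every facet that produced it."""
        parts = " · ".join(f"{k}={v:.2f}" for k, v in asdict(self.facets).items())
        return f"{self.facets.total():.2f} ({parts})"


result = RecallResult("prefers dark mode", FacetScores(0.9, 1.0, 0.5, 0.2))
print(result.explain())
```

Keeping the facets on the result lets a reviewer see why an item ranked where it did, for example a strong keyword match with weak temporal relevance.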

Head to head

Locamem vs. the field.

Our numbers come from the open harness above. Competitor numbers are each vendor's own published figures — linked and unedited. LoCoMo, LongMemEval, DMR and BEAM use different rubrics and corpora, so the fair read is platform-by-platform, not one cherry-picked number.

System	Headline	Detail	Source
Locamem	Capability 100/100	recall@10 99.4% (LongMemEval-S) · p95 <10ms · 0 cross-scope leaks · 0 network egress · end-to-end QA ~58% (reader-limited)	Verified · open artifact
Mem0	LoCoMo 91.6–92.5	LongMemEval 93.4–94.4 · BEAM 1M 64.1 / 10M 48.6 — end-to-end QA	Vendor-published ↗
Zep	LongMemEval-S 71.2%	DMR 94.8% (a different benchmark) · with GPT-4o	Vendor-published ↗
Letta	LoCoMo 74.0%	Filesystem agent + GPT-4o-mini — agent runtime, not service recall	Vendor-published ↗
Pinecone Assistant	—	Publishes RAG-evaluation APIs, not persistent-memory benchmarks	Adjacent category

Competitor figures are end-to-end QA on different datasets; we list them as published and have not reproduced them. Locamem leads on retrieval recall, capability, latency, and local-first guarantees — and we report our own end-to-end QA honestly (~58%, reader-limited) rather than claim accuracy superiority.

Retrieval quality

Session recall on LongMemEval-S

Did the store surface a gold-evidence session in the top-k? Measured across all 500 questions, embeddings off (the air-gapped path).

k	Session recall@k	Notes
@10	99.4% (497/500)	Only 3 misses across the whole set.
@20	99.6% (498/500)	Default reader window.
@50	100% (500/500)	Every gold session is in the candidate set.
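
The metric itself is simple to state in code. This sketch follows the definition given above (a question counts as a hit if any gold-evidence session appears in the top-k); the function names and the toy session ids are hypothetical.

```python
def session_recall_at_k(ranked_session_ids, gold_session_ids, k):
    """True if any gold-evidence session appears in the top-k ranked sessions."""
    return any(sid in gold_session_ids for sid in ranked_session_ids[:k])


def recall_rate(runs, k):
    """Return (hits, total, fraction) over (ranked, gold) pairs."""
    hits = sum(session_recall_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits, len(runs), hits / len(runs)


# Toy data with made-up session ids: (ranked retrieval, gold sessions).
runs = [
    (["s3", "s7", "s1"], {"s7"}),  # gold at rank 2
    (["s2", "s4", "s9"], {"s9"}),  # gold at rank 3
    (["s5", "s6", "s8"], {"s0"}),  # gold never retrieved
]

for k in (2, 3):
    hits, total, rate = recall_rate(runs, k)
    print(f"recall@{k}: {hits}/{total} = {rate:.1%}")
```

Applied to the table above, 497 hits out of 500 questions gives 99.4% at k=10.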

How it's measured

Methodology & reproducibility

Dataset

LongMemEval-S

500 long-horizon questions over ~48-session haystacks. We report session-level retrieval recall — the metric that decides whether the reader even sees the evidence.

Engine

SimHash + FTS5

64-bit SimHash (LSH) ∪ FTS5 full-text ∪ optional on-device embeddings, fused and scored with a per-facet breakdown. Embeddings off for the air-gapped numbers.
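
For readers unfamiliar with SimHash, the following is a generic sketch of a 64-bit fingerprint and Hamming-distance comparison. The tokenizer, the choice of BLAKE2b as the per-token hash, and unweighted votes are assumptions of this example rather than Locamem's implementation, and the FTS5 and embedding branches of the union are left out.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash64(text):
    """64-bit SimHash: each token votes +1 or -1 on every bit position."""
    votes = [0] * 64
    for tok in tokens(text):
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # majority of tokens set this bit
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def nearest(query, documents, k=3):
    """Rank documents by Hamming distance to the query fingerprint."""
    q = simhash64(query)
    return sorted(documents, key=lambda d: hamming(q, simhash64(d)))[:k]
```

Texts that share most of their tokens tend to produce fingerprints a few bits apart, while unrelated texts land near 32 bits apart on average, which is what makes the fingerprint usable as a cheap candidate filter without any embedding model.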

No overfitting

Firewalled

The QA path never reads the answer or answer_session_ids — a runtime assert enforces it. Numbers come from the public set as a report, not a tuning signal.
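
A guard of this kind can be sketched as a read-only wrapper that refuses to hand out the gold fields. The two forbidden field names are the ones named above; the wrapper class, the `question` field, and the sample record are hypothetical, and the real harness may enforce the rule differently.

```python
FORBIDDEN_KEYS = frozenset({"answer", "answer_session_ids"})


class FirewalledQuestion:
    """Read-only view of a dataset record that refuses gold-label access."""

    def __init__(self, record):
        self._record = dict(record)

    def __getitem__(self, key):
        # Raised explicitly so the guard also holds under `python -O`,
        # which strips plain `assert` statements.
        if key in FORBIDDEN_KEYS:
            raise AssertionError(f"QA path tried to read {key!r}")
        return self._record[key]

    def keys(self):
        """Only the fields the QA path is allowed to see."""
        return [k for k in self._record if k not in FORBIDDEN_KEYS]


record = {
    "question": "Which editor theme does the user prefer?",
    "answer": "dark mode",
    "answer_session_ids": ["s7"],
}
q = FirewalledQuestion(record)
print(q["question"], q.keys())
```

The grader can still read the original record to score an answer; only the code that produces the answer goes through the wrapper.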

# reproduce, end to end
git clone https://github.com/TeamWilcoe/locamem && cd locamem
python benchmarks/bench_capability.py         # capability 100/100 + p50/p95 + 0 leaks
python benchmarks/build_failure_dossier.py    # session recall@10/20/50, CPU-only
python benchmarks/bench_longmemeval_qa.py --use-solvers --model claude  # end-to-end

Where we're honest

Retrieval ≠ end-to-end accuracy

We separate the two on purpose, and we don't claim to beat anyone on answer accuracy.

Retrieval (our headline)

~99% recall@10

The store reliably surfaces the right evidence. This is what Locamem owns, on-device, at $0 — and it's genuinely strong.

End-to-end QA (reader-limited)

~58%, and we say so

Once the right evidence is retrieved, end-to-end accuracy is limited mainly by the reader model rather than by retrieval. Even an oracle GPT-4o handed perfect evidence tops out near 82%. We report ~58% with the current reader, attribute the gap to the reader, and make no claim of accuracy superiority.

See it for yourself

Run the live recall demo, then install in one line.

No account. No keys. One SQLite file and an MCP server, on your machine.

$ curl -fsSL https://locamem.com/install | bash
