Deterministic by design.
Reproducible recall. Predictable latency. Verified on live cloud.
Positional recall
Hallucinations
Types classified
Recall modes
Save your agent from context rot.
Offload bulky tool output before it eats the context window — without losing access to any of it.
Raw tool output
files · JSON · logs · diffs · PDFs
< 500 chars — pass through
~15% · Small, cheap, keeps flowing.
≥ 500 chars — externalized
~85% · Classified, typed, stored. Agent gets a handle + metadata.
Offloaded before it can bloat the context window. Still reachable through artifact_id the moment the agent needs it.
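The routing rule above can be sketched in a few lines. This is a minimal illustration, not the MemoSift implementation: only the 500-character cutoff and the `artifact_id` handle come from the text. `ArtifactStore`, `ArtifactHandle`, `route_tool_output`, and the `preview` field are hypothetical names, and the classification and typing step is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Union

# From the text: outputs under 500 chars pass through, 500 and over are externalized.
OFFLOAD_THRESHOLD_CHARS = 500


@dataclass
class ArtifactHandle:
    """What the agent keeps in context instead of the bulky payload (hypothetical shape)."""
    artifact_id: str  # key used later to fetch the full payload
    size_chars: int   # illustrative metadata
    preview: str      # illustrative metadata: first characters of the payload


class ArtifactStore:
    """In-memory stand-in for the real artifact store (assumption for this sketch)."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, content: str) -> str:
        # Content-derived ID so the same payload always maps to the same handle.
        artifact_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
        self._blobs[artifact_id] = content
        return artifact_id

    def fetch(self, artifact_id: str) -> str:
        return self._blobs[artifact_id]


def route_tool_output(output: str, store: ArtifactStore) -> Union[str, ArtifactHandle]:
    """Pass small outputs through unchanged; store large ones and return a handle."""
    if len(output) < OFFLOAD_THRESHOLD_CHARS:
        return output
    artifact_id = store.put(output)
    return ArtifactHandle(artifact_id=artifact_id, size_chars=len(output), preview=output[:80])
```

The full payload never enters the context window, but `store.fetch(handle.artifact_id)` returns it byte-for-byte whenever the agent asks.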
Two modes. One call.
Fast
Hybrid SQL + rerank. Skips LLM hop.
~750 ms
p50
Deep
Adds LLM query analysis + entity expansion.
~1.1 s
p50
No recency bias. No primacy bias.
A single fact is planted at a known turn, then the full 170-turn session ingests. A recall probe targeting that fact fires afterward. Each dot is one probe. Its height is the top-hit cosine similarity score from the recall pipeline; its horizontal position is where the needle sat in the session timeline.
Probes
7 / 7
returned target in top three hits
Lowest score
0.54
still 0.04 above the abstention threshold
Spread
0.54 – 0.71
no positional drift — early turns score as well as late
Why a flat curve is the good outcome
Long-context LLMs and naive RAG both suffer from “lost-in-the-middle” — facts placed in the middle of a session are retrieved less reliably than facts near the start or end. A flat line here means the retrieval score doesn't care where in the session the fact was mentioned. The agent is just as likely to find a turn-3 detail as a turn-165 one.
Every probe clears the abstention threshold, so the retrieved top hit is above the “I don't know” floor the system uses to avoid confabulation — it would rather say nothing than return something below that line.
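The abstention rule can be illustrated with a short sketch. It assumes a threshold of 0.50, inferred from the lowest score (0.54) sitting 0.04 above the floor. The `Hit` type and `top_hit_or_abstain` function are hypothetical, and whether the real comparison is inclusive at the boundary is not stated in the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Inferred, not quoted: 0.54 (lowest probe score) minus the stated 0.04 margin.
ABSTENTION_THRESHOLD = 0.50


@dataclass
class Hit:
    text: str
    score: float  # top-hit cosine similarity from the recall pipeline


def top_hit_or_abstain(
    hits: Sequence[Hit], threshold: float = ABSTENTION_THRESHOLD
) -> Optional[Hit]:
    """Return the best hit if it clears the floor, otherwise None (abstain)."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.score)
    # Assumption: a score exactly at the threshold counts as clearing it.
    return best if best.score >= threshold else None
```

Returning `None` rather than the best available hit is the point of the design: a weak match is treated as "I don't know", not as an answer.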
Upstream oil & gas — where verbatim matters.
A digit off on a well API points to a different well. A rounded median loses the comparison. The pipeline kept them intact.
Column names
Schema order preserved byte-for-byte
Well IDs
API numbers retrievable by partial match
Comparison ratios
Median values + exact ratio between plays
Safety certs
Graph traversal → top hit by entity salience
Deliverables
Final artifact filename at hit #1
The same properties matter anywhere technical specificity drives decisions — medical, legal, regulatory, financial.
Predictable envelopes.
Wall-clock p50 and p95 from the client against an idle server. Non-recall tools are bounded and uniform.
list
enumerate · paginated
explore
4-axis graph walk
compress
SQL template · no LLM
fetch
artifact bytes
recall · fast
hybrid + rerank
recall · deep
+ LLM analysis
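For readers who want to reproduce client-side p50 and p95 figures, a minimal sketch follows. It assumes nearest-rank percentiles over wall-clock samples; the harness may use a different percentile definition, and `time_calls` and `percentile` are hypothetical helper names.

```python
import math
import time
from typing import Callable, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(call: Callable[[], object], runs: int) -> list[float]:
    """Invoke `call` repeatedly and return wall-clock durations in milliseconds."""
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms
```

In use, `call` would wrap one client request to a tool such as `fetch` or `recall`, and `percentile(samples, 50)` and `percentile(samples, 95)` give the two reported figures.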
How we measured.
No self-scoring shortcuts. Every probe is hand-written, every ground truth is sourced from the raw session transcript, and an independent judge grades each probe against that ground truth.
47 hand-authored probes · 12 categories
Each probe is a question + ground-truth expected content + grading rubric. No autogenerated probes.
Independent per-probe judgment
Claude Opus 4.7 (1M context) · running in Claude Code
- Ground truth is hand-authored by the human operator directly from the raw session transcript, not by the model.
- The judge sees the probe question, the top-five recall hits, the ground truth, and a grading rubric. It grades PASS / PARTIAL / FAIL, with evidence quoted from the hits.
- PARTIAL is weighted as 0.5, which prevents the scoreboard from hiding near-misses. A strict score (partial = fail) and a lenient score (partial = pass) are published alongside for full transparency.
- No self-scoring shortcut: the judge never grades its own memory extraction. The judge is an independent Claude session; the system under test is the MemoSift cloud.
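The three published scores follow directly from the grading rules above. The sketch below shows the arithmetic; `score_run` is a hypothetical helper, not part of the harness.

```python
from collections import Counter
from typing import Iterable

# Weights from the text: PARTIAL counts as half a pass in the headline score.
GRADE_WEIGHTS = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def score_run(grades: Iterable[str]) -> dict[str, float]:
    """Compute weighted, strict (partial = fail), and lenient (partial = pass) scores."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADE_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades to score")
    weighted = sum(GRADE_WEIGHTS[grade] * n for grade, n in counts.items())
    return {
        "weighted": weighted / total,
        "strict": counts["PASS"] / total,
        "lenient": (counts["PASS"] + counts["PARTIAL"]) / total,
    }
```

With four illustrative grades (two PASS, one PARTIAL, one FAIL; made up for this example, not benchmark results), the weighted score is 0.625, the strict score is 0.5, and the lenient score is 0.75. The strict and lenient scores bracket the weighted one, so a reader can see how much of the headline number depends on near-misses.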
Run conditions
Live cloud · reproducible from the repo
- Every probe hits the same Railway deployment users hit. No local stub, no mocked DB, no replay from cache.
- Sessions are replayed turn-by-turn through the full ingest pipeline (Phase 1 sync + Phase 2 async). Probes fire only after drain_complete: true.
- Session scope: short (10 turns), medium (77 turns), long (170 turns). 11 query categories across 3 sessions, 47 probes in total.
- Harness, ground-truth files, and raw reports live under benchmarking/ in the repo. Anyone can rerun.
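The "probes fire only after drain_complete: true" condition amounts to a polling gate in front of the probe loop. A minimal sketch, assuming the status is available as a mapping with a `drain_complete` key; `get_status` is a hypothetical callable standing in for whatever status call the harness makes, and the timeout and interval defaults are illustrative.

```python
import time
from typing import Callable, Mapping


def wait_for_drain(
    get_status: Callable[[], Mapping[str, object]],
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> None:
    """Block until the async ingest phase reports drain_complete: true."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Only a literal True counts; a missing or falsy value keeps waiting.
        if get_status().get("drain_complete") is True:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("ingest did not report drain_complete within the timeout")
        time.sleep(poll_interval_s)
```

Gating on the flag rather than on a fixed sleep keeps probes from racing the Phase 2 async work, so a miss reflects recall quality and not an unfinished ingest.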
How MemoSift stacks up.
Results from established evaluation frameworks that we've run end-to-end against the deployed cloud. Every card shows our score next to published baselines.
Needle-in-a-Haystack
Gkamradt (adapted for agent memory)
A planted fact is retrieved after the full session ingests.
LongMemEval · category parity
Wu et al. 2024 (probe-category adaptation)
Five long-memory capabilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, abstention.
MTEB retrieval (inherited)
Muennighoff et al.
Embedding retrieval quality across 58 datasets.
Additional benchmark runs (LoCoMo, LongMemEval full harness, RULER, HELMET, LongBench v2, MemGPT bench) are scheduled in the harness plan under docs/superpowers/plans/. Each lands here only after the end-to-end run is reproducible from the repo with a published comparison table.