MNEMOSYNE
A sequence model whose memory is itself a small neural network that learns at inference time.
MNEMOSYNE expands to Memory-Net Optimized by Surprise, Yielding Neural Endurance.
Implementation status
From-scratch reimplementation — runnable & tested on CPU (2026-05)
A from-scratch reference implementation lives in architectures/02-mnemosyne/mnemosyne-lm and passes 121 tests on CPU. The chunked-parallel test-time-training memory matches the sequential reference to within 10⁻¹⁰ in fp64 at chunk_size = 1. This is a small-scale reimplementation of published ideas (test-time training; Titans), not an original architecture and not a trained model. The work so far is correctness-tested (including an overfit smoke train), with no learning-on-data or comparative result yet. The "target marquee result" below is a goal from the spec, not a measured outcome.
The thesis in one paragraph
Any fixed-size recurrent state is fundamentally lossy as context grows — you cannot losslessly compress 1M tokens into a 16K-dim vector. This is the recall ceiling that hybrid architectures (Jamba, CHIMERA) paper over by adding exact attention selectively. A different and more conceptually radical answer: stop pretending the recurrent state is a vector at all. Make it a small neural network whose weights are updated at test time as the model processes context. The state of the world for token t is no longer “the last hidden vector” — it is “the current parameters of a memory MLP that has been trained on tokens 1…t−1.” Reading from memory is a forward pass through that MLP; writing to memory is a gradient step. This generalizes test-time training (TTT, Sun et al. 2024), Titans / MIRAS (Behrouz et al. 2025), and RWKV-7's data-dependent state evolution.
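To make "reading is a forward pass, writing is a gradient step" concrete, here is a minimal NumPy sketch. It is not the code from the reference implementation: the class name `MemoryMLP`, the tanh nonlinearity, the key/value squared-error loss, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

class MemoryMLP:
    """Two-layer MLP whose weights play the role of the recurrent state (illustrative)."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.W2 = rng.normal(0.0, 0.1, (dim, hidden))

    def read(self, k):
        # Reading is a plain forward pass through the current weights.
        return self.W2 @ np.tanh(self.W1 @ k)

    def write(self, k, v, lr=0.05):
        # Writing is one gradient step on 0.5 * ||read(k) - v||^2 (assumed loss).
        h = np.tanh(self.W1 @ k)
        err = self.W2 @ h - v                      # dL/d(output)
        grad_W2 = np.outer(err, h)
        grad_h = self.W2.T @ err                   # backprop through the second layer
        grad_W1 = np.outer(grad_h * (1.0 - h**2), k)
        self.W2 -= lr * grad_W2
        self.W1 -= lr * grad_W1
        return 0.5 * float(err @ err)              # loss measured before the update

rng = np.random.default_rng(1)
k = rng.normal(size=8)
k /= np.linalg.norm(k)                             # unit-norm key keeps the toy step size safe
v = rng.normal(size=8)
mem = MemoryMLP(dim=8, hidden=16)
losses = [mem.write(k, v) for _ in range(200)]
print(f"loss before first write: {losses[0]:.4f}, before last write: {losses[-1]:.4f}")
```

Repeatedly writing the same pair drives the read-out toward the stored value, which is the sense in which the weights act as memory. A real memory layer would see a different token at every step.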
Architecture in one figure
token x_t
│
▼
┌──────────────────┐
│ Read memory MLP │ forward pass:
│ M_φ_{t-1}(x_t) │ retrieve compressed past
└─────────┬────────┘
│
▼
┌─────────────────────┐
│ Surprise gate │ compute prediction error
│ s_t = ||x_t − x̂_t|| │ (signed; learned threshold)
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Inner gradient step │ if s_t large: update φ
│ φ_t = φ_{t-1} │ if s_t small: ~skip (forget gate)
│ − η ∇L(φ; x_t)│
└─────────┬───────────┘
│
▼
┌──────────────────┐
│ Sparse attn │ top-k high-surprise tokens
│ sidecar │ kept verbatim for associative recall
└─────────┬────────┘
│
▼
h_t → FFN → output
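The surprise gate and the sparse sidecar from the figure can be sketched in a few lines of plain Python. Everything named here is hypothetical: `write_gate`, `SurpriseSidecar`, the sigmoid form of the gate, the `sharpness` parameter, and the toy surprise values are assumptions for illustration, and the reference implementation may gate and cache differently.

```python
import heapq
import math

def write_gate(surprise, threshold, sharpness=4.0):
    """Soft gate in (0, 1): near 0 below the threshold, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-sharpness * (surprise - threshold)))

class SurpriseSidecar:
    """Keeps the k highest-surprise tokens verbatim, using a min-heap on surprise."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # entries are (surprise, position, token)

    def offer(self, surprise, position, token):
        entry = (surprise, position, token)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif surprise > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the least surprising entry

    def tokens(self):
        # Return the kept tokens in their original sequence order.
        return [tok for _, pos, tok in sorted(self._heap, key=lambda e: e[1])]

surprises = [0.1, 2.5, 0.3, 1.9, 0.2, 3.1]
sidecar = SurpriseSidecar(k=2)
for pos, s in enumerate(surprises):
    step_scale = write_gate(s, threshold=1.0)  # would multiply the inner learning rate
    sidecar.offer(s, pos, f"tok{pos}")
print(sidecar.tokens())  # ['tok1', 'tok5']
```

The heap keeps the cache size capped at k regardless of sequence length, so the sidecar adds a fixed amount of state on top of the memory MLP.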
The reference implementation in the master prompt's Appendix B is ChunkedTTT, a chunked-parallel TTT that matches the sequential reference to within 10⁻¹⁰ at chunk_size = 1, with bounded approximation error (< 1 % relative) at larger chunk sizes. This is what makes MNEMOSYNE trainable: the inner-loop optimizer is itself amenable to chunked parallel computation, so training cost is expected to be comparable to that of a dense Transformer.
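The parity property can be illustrated on a toy linear memory. The sketch below is not ChunkedTTT itself: it assumes a single weight matrix, a squared-error inner loss, and the common mini-batch scheme in which every gradient inside a chunk is taken at the chunk-start weights. The function names and shapes are hypothetical.

```python
import numpy as np

def sequential_ttt(K, V, Q, lr):
    """Reference: read with the pre-update weights, then one gradient step per token."""
    W = np.zeros((V.shape[1], K.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(K.shape[0]):
        out[t] = W @ Q[t]
        err = W @ K[t] - V[t]             # output gradient of 0.5 * ||W k - v||^2
        W -= lr * np.outer(err, K[t])
    return out

def chunked_ttt(K, V, Q, lr, chunk_size):
    """Chunked variant: gradients within a chunk use the chunk-start weights."""
    W = np.zeros((V.shape[1], K.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for start in range(0, K.shape[0], chunk_size):
        k, v, q = (X[start:start + chunk_size] for X in (K, V, Q))
        err = k @ W.T - v                 # all errors measured at the chunk-start weights
        attn = np.tril(q @ k.T, -1)       # token t only sees writes from positions s < t
        out[start:start + chunk_size] = q @ W.T - lr * attn @ err
        W = W - lr * err.T @ k            # apply the whole chunk's gradients at once
    return out

rng = np.random.default_rng(0)
T, d = 32, 8
K = rng.normal(size=(T, d)) / np.sqrt(d)
V = rng.normal(size=(T, d))
Q = rng.normal(size=(T, d)) / np.sqrt(d)
ref = sequential_ttt(K, V, Q, lr=0.1)
for c in (1, 4):
    diff = np.max(np.abs(chunked_ttt(K, V, Q, lr=0.1, chunk_size=c) - ref))
    print(f"chunk_size={c}: max abs difference {diff:.2e}")
```

At chunk_size = 1 the two functions perform the same arithmetic, so they agree up to floating-point rounding. At larger chunk sizes the chunked version uses stale weights inside each chunk, which is the source of the approximation error.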
Key contributions
- Neural memory module — a 2-layer MLP per memory layer whose parameters are the recurrent state, updated at every token via a one-step inner gradient.
- Surprise-gated write controller — determines how much to update memory based on the prediction error of the current token against current memory.
- Sparse exact-attention fallback — selectively preserves the top-k highest-surprise tokens for verbatim recall. (TTT and Titans results consistently show that pure neural memory still loses to exact attention on associative recall.)
- Chunked-parallel TTT — the marquee technical piece. Amortizes inner-gradient training across chunks; matches the sequential reference to within 10⁻¹⁰ in fp64 at chunk_size = 1.
- Liu-2026 equivalence — TTT-with-KV-binding is mathematically equivalent to a learned linear attention operator. Allows training in a form that admits clean parallelism; interpret as TTT only for understanding and inference.
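The flavor of the equivalence in the last bullet can be seen in its simplest special case, which follows the fast-weight view of Schlag et al. 2021 rather than the construction in Liu et al. 2026. The sketch assumes a linear memory, a dot-product inner loss of the form -vᵀWk, zero initial weights, and a read after each write. The function names are hypothetical.

```python
import numpy as np

def ttt_dot_product_loss(K, V, Q, lr):
    """Inner-loop view: one gradient step on -v^T W k per token, then a read."""
    W = np.zeros((V.shape[1], K.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(K.shape[0]):
        W += lr * np.outer(V[t], K[t])    # the gradient of -v^T W k w.r.t. W is -v k^T
        out[t] = W @ Q[t]
    return out

def causal_linear_attention(K, V, Q, lr):
    """Parallel view: unnormalized causal linear attention over the same inputs."""
    return lr * np.tril(Q @ K.T) @ V

rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(16, 4)) for _ in range(3))
same = np.allclose(ttt_dot_product_loss(K, V, Q, lr=0.5),
                   causal_linear_attention(K, V, Q, lr=0.5))
print("inner-loop and attention forms agree:", same)
```

The loop is the form that reads naturally as test-time training, while the masked matrix product is the form that parallelizes across the sequence. Using a strictly lower-triangular mask instead would correspond to reading before writing.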
Phased plan
| Phase | Deliverable | Status |
|---|---|---|
| 0 — Bootstrap | Repo scaffold; reference SequentialTTT and ChunkedTTT | done |
| 1 — Memory MLP + surprise gate | Single memory layer; surprise gating numerically stable | done |
| 2 — Chunked-parallel TTT | Parity test passing; mini-batch trick verified | done |
| 3 — Sparse attention sidecar | Top-k high-surprise token cache (gradient-truncated, K-capped) | done |
| 4 — Full integration | Memory + attention sidecar compose; trained nano on 1B tokens | partial (block + overfit smoke train; not 1B tokens) |
| 5 — Scaling + 256K needle | ≥ 80% needle-in-haystack at 256K with constant decode memory | not started |
| 6 — Paper | Manuscript with streaming-adaptation demonstration | not started |
Required reading
- Sun et al. 2024 — “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”
- Sun et al. 2025 — TTT-E2E (end-to-end test-time-training)
- Behrouz et al. 2025 — Titans: Learning to Memorize at Test Time
- Behrouz et al. 2025 — MIRAS unifying framework (Google Research)
- Liu et al. 2026 — “Test-Time Training with KV Binding Is Secretly Linear Attention”
- Peng et al. 2025 — RWKV-7 “Goose”
- Schlag et al. 2021 — “Linear Transformers Are Secretly Fast Weight Programmers”
- Ba et al. 2016 — “Using Fast Weights to Attend to the Recent Past”
Target marquee result (goal, not yet measured)
A goal from the spec — not a measured result
256K needle-in-a-haystack ≥ 80 % with constant decode memory; clear streaming-adaptation advantage over a Mamba-2 baseline. This requires GPU-scale training that has not been run.
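To show what "constant decode memory" means in practice, the following back-of-the-envelope sketch compares a growing KV cache with a fixed-size neural memory plus a capped sidecar. Every number in it is hypothetical: the layer count, head sizes, hidden width, sidecar cap, and 2-byte elements are assumptions for illustration, not the configuration of this project, and "256K" is taken to mean 262,144 tokens.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every past token in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def neural_memory_bytes(n_layers, d_model, d_hidden, sidecar_k, bytes_per_elem=2):
    mlp_params = 2 * d_model * d_hidden   # two weight matrices, biases ignored
    sidecar = 2 * sidecar_k * d_model     # one key and one value per kept token
    return n_layers * (mlp_params + sidecar) * bytes_per_elem

MIB = 2 ** 20
for context_len in (4096, 262144):
    kv = kv_cache_bytes(context_len, n_layers=12, n_kv_heads=8, head_dim=64)
    mem = neural_memory_bytes(n_layers=12, d_model=512, d_hidden=1024, sidecar_k=256)
    print(f"context {context_len}: KV cache {kv // MIB} MiB, neural memory {mem // MIB} MiB")
# context 4096: KV cache 96 MiB, neural memory 30 MiB
# context 262144: KV cache 6144 MiB, neural memory 30 MiB
```

The KV cache grows linearly with context length, while the memory-plus-sidecar footprint does not depend on it. Whether that fixed budget is enough to reach the recall target is exactly what remains unmeasured.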