MNEMOSYNE

A sequence model whose memory is itself a small neural network that learns at inference time.

MNEMOSYNE expands to Memory-Net Optimized by Surprise, Yielding Neural Endurance.

Implementation status

From-scratch reimplementation — runnable & tested on CPU (2026-05)

A from-scratch reference implementation lives in architectures/02-mnemosyne/mnemosyne-lm and passes 121 tests on CPU. At chunk_size = 1, the chunked-parallel test-time-training memory matches the sequential reference to within 10⁻¹⁰ in fp64. This is a small-scale reimplementation of published ideas (test-time training; Titans), not an original architecture and not a trained model. The work so far is correctness-tested only (including an overfit smoke train); there is no learning-on-data or comparative result yet. The "target marquee result" below is a goal from the spec, not a measured outcome.

The thesis in one paragraph

Any fixed-size recurrent state is fundamentally lossy as context grows — you cannot losslessly compress 1M tokens into a 16K-dim vector. This is the recall ceiling that hybrid architectures (Jamba, CHIMERA) paper over by adding exact attention selectively. A different and more conceptually radical answer: stop pretending the recurrent state is a vector at all. Make it a small neural network whose weights are updated at test time as the model processes context. The state of the world for token t is no longer “the last hidden vector” — it is “the current parameters of a memory MLP that has been trained on tokens 1…t−1.” Reading from memory is a forward pass through that MLP; writing to memory is a gradient step. This generalizes test-time training (TTT, Sun et al. 2024), Titans / MIRAS (Behrouz et al. 2025), and RWKV-7's data-dependent state evolution.
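
As a rough illustration of "reading is a forward pass, writing is a gradient step", the sketch below uses a single linear map in place of the memory MLP. That simplification, the squared-error loss, the learning rate, and the names read, write, and sq_error are all assumptions made for this example; none of them come from the reference implementation.

```python
import random

def matvec(W, x):
    """Return W @ x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def read(W, key):
    """Reading from memory is a forward pass through the memory network."""
    return matvec(W, key)

def write(W, key, value, lr=0.5):
    """Writing is one gradient step on L(W) = 0.5 * ||W key - value||^2.

    dL/dW = (W key - value) key^T, so each row moves against its own error.
    """
    err = [p - v for p, v in zip(matvec(W, key), value)]
    return [[w - lr * e * k for w, k in zip(row, key)] for row, e in zip(W, err)]

def sq_error(W, key, value):
    """Squared read error for one (key, value) pair."""
    return sum((p - v) ** 2 for p, v in zip(read(W, key), value))

random.seed(0)
dim = 4
W = [[0.0] * dim for _ in range(dim)]            # empty memory (assumed zero init)
key = [1.0, 0.0, 0.0, 0.0]                       # unit-norm key (assumption)
value = [random.gauss(0, 1) for _ in range(dim)]

before = sq_error(W, key, value)
for _ in range(3):                               # three tokens carrying the same pair
    W = write(W, key, value)
after = sq_error(W, key, value)
```

With a unit-norm key and lr = 0.5, each write halves the read error for that key, so three writes leave 1/64 of the initial squared error. The state carried from token to token is W itself, not a hidden vector.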

Architecture in one figure

           token x_t
               │
               ▼
        ┌──────────────────┐
        │ Read memory MLP  │  forward pass:
        │  M_φ_{t-1}(x_t)  │  retrieve compressed past
        └─────────┬────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Surprise gate       │  compute prediction error
        │ s_t = ||x_t − x̂_t|| │  (non-negative norm; learned threshold)
        └─────────┬───────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Inner gradient step │  if s_t large: update φ
        │  φ_t = φ_{t-1}      │  if s_t small: ~skip (forget gate)
        │       − η ∇L(φ; x_t)│
        └─────────┬───────────┘
                  │
                  ▼
        ┌──────────────────┐
        │ Sparse attn      │  top-k high-surprise tokens
        │ sidecar          │  kept verbatim for associative recall
        └─────────┬────────┘
                  │
                  ▼
              h_t → FFN → output
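
The per-token flow in the figure can be sketched as follows. This is a toy under stated assumptions: the memory is a single vector used directly as the prediction x̂_t (a bias-only stand-in for the memory MLP), the gate is a sigmoid with a fixed threshold rather than a learned one, and step, TopKSurpriseCache, threshold, and sharpness are hypothetical names and parameters chosen for illustration.

```python
import heapq
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def step(memory, x, lr=0.5, threshold=1.0, sharpness=4.0):
    """One token through read -> surprise -> gated write.

    `threshold` and `sharpness` are fixed constants here (assumption);
    the figure describes the threshold as learned.
    """
    x_hat = memory                                    # read: the prediction
    surprise = norm([a - b for a, b in zip(x, x_hat)])
    gate = 1.0 / (1.0 + math.exp(-sharpness * (surprise - threshold)))
    # Gradient of 0.5 * ||memory - x||^2 w.r.t. memory is (memory - x);
    # the gate scales the step, so low-surprise tokens barely write.
    new_memory = [m - lr * gate * (m - a) for m, a in zip(memory, x)]
    return new_memory, surprise, gate

class TopKSurpriseCache:
    """Keeps the k most surprising tokens verbatim (min-heap on surprise)."""

    def __init__(self, k):
        self.k = k
        self._heap = []                               # (surprise, position, token)

    def offer(self, surprise, position, token):
        entry = (surprise, position, token)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif surprise > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)      # evict the least surprising

    def positions(self):
        return sorted(pos for _, pos, _ in self._heap)

tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [-4.0, 3.0], [0.1, 0.1]]
memory = [0.0, 0.0]
cache = TopKSurpriseCache(k=2)
gates = []
for t, x in enumerate(tokens):
    memory, s, g = step(memory, x)
    cache.offer(s, t, x)
    gates.append(g)
```

In this toy run the two outlier tokens (positions 2 and 4) end up in the cache, while the near-zero tokens at the start produce gates close to zero and leave the memory almost untouched. How the cached tokens are attended to and mixed into h_t is not shown.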

The reference implementation in the master prompt's Appendix B is ChunkedTTT, a chunked-parallel TTT that matches the sequential reference to within 10⁻¹⁰ (fp64) at chunk_size = 1, with bounded approximation error (< 1 % relative) at larger chunk sizes. This is what makes MNEMOSYNE trainable: the inner-loop optimizer is itself amenable to chunked parallel computation, so training cost should be comparable to that of a dense Transformer, although no comparative measurement has been run yet.
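
To make the parity claim concrete, here is a minimal sketch of the mini-batch idea on a linear memory. It is not the repository's ChunkedTTT: the linear memory, the squared-error loss, and the function names are assumptions, and the inner loop over a chunk is written serially even though its gradients are independent of one another.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def grad(W, k, v):
    """Gradient of 0.5 * ||W k - v||^2 with respect to W."""
    err = [p - t for p, t in zip(matvec(W, k), v)]
    return [[e * ki for ki in k] for e in err]

def sequential_ttt(keys, values, dim, lr):
    """Reference: every token's gradient is taken at the latest weights."""
    W = [[0.0] * dim for _ in range(dim)]
    for k, v in zip(keys, values):
        G = grad(W, k, v)
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
    return W

def chunked_ttt(keys, values, dim, lr, chunk_size):
    """Mini-batch variant: all gradients in a chunk use the chunk-start
    weights, so they do not depend on each other and could run in parallel."""
    W = [[0.0] * dim for _ in range(dim)]
    for start in range(0, len(keys), chunk_size):
        W_start = W
        total = [[0.0] * dim for _ in range(dim)]
        chunk = zip(keys[start:start + chunk_size], values[start:start + chunk_size])
        for k, v in chunk:
            G = grad(W_start, k, v)
            total = [[a + g for a, g in zip(ra, rg)] for ra, rg in zip(total, G)]
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W_start, total)]
    return W

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
dim, n, lr = 4, 32, 0.05
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

W_seq = sequential_ttt(keys, values, dim, lr)
diff_1 = max_abs_diff(W_seq, chunked_ttt(keys, values, dim, lr, chunk_size=1))
diff_8 = max_abs_diff(W_seq, chunked_ttt(keys, values, dim, lr, chunk_size=8))
```

At chunk_size = 1 the two routines perform the same updates, so diff_1 sits at floating-point noise. At chunk_size = 8 the gradients are stale within each chunk and diff_8 is nonzero, which is the approximation error that the larger-chunk bound refers to.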

Key contributions

Neural memory module — a 2-layer MLP per memory layer whose parameters are the recurrent state, updated at every token via a one-step inner gradient.
Surprise-gated write controller — determines how much to update memory based on the prediction error of the current token against current memory.
Sparse exact-attention fallback — selectively preserves the top-k highest-surprise tokens for verbatim recall. (TTT and Titans results consistently show that pure neural memory still loses on associative recall.)
Chunked-parallel TTT — the marquee technical piece. Makes inner-gradient training amortize across chunks; matches sequential reference at chunk_size = 1 to < 10⁻¹⁰ fp64.
Liu-2026 equivalence — TTT with KV binding is mathematically equivalent to a learned linear attention operator (Liu et al. 2026). This allows training in the linear-attention form, which admits clean parallelism; the TTT view is then needed only for interpretation and inference.
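
The flavor of the last item can be checked by hand in a deliberately simple special case. This is an illustrative assumption, not the general statement from Liu et al.: a linear memory, squared-error loss, and every write gradient evaluated at zero-initialized weights (a single chunk). Under those conditions the memory read coincides with unnormalized linear attention. The function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ttt_read(keys, values, query, lr):
    """Write every (key, value) pair with a gradient taken at W0 = 0,
    then read the query with a forward pass through the resulting W."""
    dim_out, dim_in = len(values[0]), len(keys[0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for k, v in zip(keys, values):
        # At W0 = 0 the gradient of 0.5 * ||W0 k - v||^2 is -(v k^T),
        # so the descent step accumulates +lr * v k^T into W.
        W = [[w + lr * vi * kj for w, kj in zip(row, k)] for row, vi in zip(W, v)]
    return [dot(row, query) for row in W]

def linear_attention_read(keys, values, query, lr):
    """Unnormalized linear attention: sum_i (k_i . q) v_i, scaled by lr."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = lr * dot(k, query)
        out = [o + score * vi for o, vi in zip(out, v)]
    return out

random.seed(0)
dim, n, lr = 4, 16, 0.1
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

ttt_out = ttt_read(keys, values, query, lr)
attn_out = linear_attention_read(keys, values, query, lr)
gap = max(abs(a - b) for a, b in zip(ttt_out, attn_out))
```

The two outputs agree up to rounding because W equals lr · Σ v_i k_iᵀ, and applying it to q gives lr · Σ (k_i · q) v_i. Once gradients are taken at updated weights, or the memory is nonlinear, this simple identity no longer holds as written.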

Phased plan

MNEMOSYNE's phased implementation plan.
Phase	Deliverable	Status
0 — Bootstrap	Repo scaffold; reference `SequentialTTT` and `ChunkedTTT`	done
1 — Memory MLP + surprise gate	Single memory layer; surprise gating numerically stable	done
2 — Chunked-parallel TTT	Parity test passing; mini-batch trick verified	done
3 — Sparse attention sidecar	Top-k high-surprise token cache (gradient-truncated, K-capped)	done
4 — Full integration	Memory + attention sidecar compose; trained nano on 1B tokens	partial (block + overfit smoke train; not 1B tokens)
5 — Scaling + 256K needle	≥ 80% needle-in-haystack at 256K with constant decode memory	not started
6 — Paper	Manuscript with streaming-adaptation demonstration	not started

Required reading

Sun et al. 2024 — “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”
Sun et al. 2025 — TTT-E2E (end-to-end test-time-training)
Behrouz et al. 2025 — Titans: Learning to Memorize at Test Time
Behrouz et al. 2025 — MIRAS unifying framework (Google Research)
Liu et al. 2026 — “Test-Time Training with KV Binding Is Secretly Linear Attention”
Peng et al. 2025 — RWKV-7 “Goose” with Expressive Dynamic State Evolution
Schlag et al. 2021 — “Linear Transformers Are Secretly Fast Weight Programmers”
Ba et al. 2016 — “Using Fast Weights to Attend to the Recent Past”

Target marquee result (goal from the spec, not yet measured)

256K needle-in-a-haystack ≥ 80 % with constant decode memory; clear streaming-adaptation advantage over a Mamba-2 baseline. This requires GPU-scale training that has not been run.