MNEMOSYNE

A sequence model whose memory is itself a small neural network that learns at inference time.

MNEMOSYNE expands to “Memory-Net Optimized by Surprise, Yielding Neural Endurance”.

Implementation status

From-scratch reimplementation — runnable & tested on CPU (2026-05)

A from-scratch reference implementation lives in architectures/02-mnemosyne/mnemosyne-lm and passes 121 tests on CPU. The chunked-parallel test-time-training memory matches the sequential reference to < 10⁻¹⁰ in fp64 at chunk_size = 1. This is a small-scale reimplementation of published ideas (test-time training; Titans), not an original architecture and not a trained model. The work so far is correctness-tested (including an overfit smoke train), with no learning-on-data or comparative result yet. The "target marquee result" below is a goal from the spec, not a measured outcome.

The thesis in one paragraph

Any fixed-size recurrent state becomes lossy as context grows: you cannot losslessly compress 1M tokens into a 16K-dim vector. This is the recall ceiling that hybrid architectures (Jamba, CHIMERA) paper over by selectively adding exact attention. A different and more conceptually radical answer is to stop pretending the recurrent state is a vector at all, and to make it a small neural network whose weights are updated at test time as the model processes context. The state of the world for token t is no longer “the last hidden vector”; it is “the current parameters of a memory MLP that has been trained on tokens 1…t−1.” Reading from memory is a forward pass through that MLP; writing to memory is a gradient step on its parameters. This builds on test-time training (TTT, Sun et al. 2024) and Titans / MIRAS (Behrouz et al. 2025), and is closely related to RWKV-7's data-dependent state evolution.
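
The read/write idea in the paragraph above can be sketched in a few lines of plain Python. This is an illustration only, not the code in mnemosyne-lm: the class name MemoryMLP, the key/value split, the squared-error inner loss, the tanh hidden layer, and the sizes and learning rate are all assumptions chosen to keep the example small.

```python
import math
import random


class MemoryMLP:
    """Tiny one-hidden-layer MLP used as an associative memory (illustrative)."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # W1: hidden x dim, W2: dim x hidden. These weights ARE the recurrent state.
        self.W1 = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(hidden)]
        self.W2 = [[rng.gauss(0.0, 0.3) for _ in range(hidden)] for _ in range(dim)]

    def _hidden(self, key):
        return [math.tanh(sum(w * k for w, k in zip(row, key))) for row in self.W1]

    def read(self, key):
        """Reading from memory is a forward pass."""
        h = self._hidden(key)
        return [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]

    def write(self, key, value, lr=0.05):
        """Writing is one SGD step on 0.5 * ||M(key) - value||^2.

        Returns the loss measured before the step.
        """
        h = self._hidden(key)
        out = [sum(w * hj for w, hj in zip(row, h)) for row in self.W2]
        err = [o - v for o, v in zip(out, value)]
        # Backprop through tanh, using W2 before it is modified below.
        dz = [
            sum(err[i] * self.W2[i][j] for i in range(len(err))) * (1.0 - h[j] ** 2)
            for j in range(len(h))
        ]
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.W2[i][j] -= lr * e * hj
        for j, d in enumerate(dz):
            for c, k in enumerate(key):
                self.W1[j][c] -= lr * d * k
        return 0.5 * sum(e * e for e in err)


mem = MemoryMLP(dim=4, hidden=8)
key, value = [1.0, 0.0, -1.0, 0.5], [0.2, -0.3, 0.1, 0.4]
losses = [mem.write(key, value) for _ in range(200)]
recalled = mem.read(key)  # should now be close to `value`
```

Repeated writes of the same pair drive the inner loss down, which is the sense in which the memory "learns at inference time". A real model would write each token once and rely on the outer training loop to shape what a single step stores.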

Architecture in one figure

           token x_t
               │
               ▼
        ┌──────────────────┐
        │ Read memory MLP  │  forward pass:
        │  M_φ_{t-1}(x_t)  │  retrieve compressed past
        └─────────┬────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Surprise gate       │  compute prediction error
        │ s_t = ||x_t − x̂_t|| │  (signed; learned threshold)
        └─────────┬───────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Inner gradient step │  if s_t large: update φ
        │  φ_t = φ_{t-1}      │  if s_t small: ~skip (forget gate)
        │       − η ∇L(φ; x_t)│
        └─────────┬───────────┘
                  │
                  ▼
        ┌──────────────────┐
        │ Sparse attn      │  top-k high-surprise tokens
        │ sidecar          │  kept verbatim for associative recall
        └─────────┬────────┘
                  │
                  ▼
              h_t → FFN → output
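
A minimal sketch of the two bookkeeping pieces in the figure, the surprise gate and the top-k sidecar, assuming a soft sigmoid gate around the threshold and a min-heap for the K-capped cache. The names surprise, gate, TopKSidecar, and the sharpness parameter are hypothetical, and the surprise here is the plain norm drawn in the figure.

```python
import heapq
import math


def surprise(x, x_hat):
    """Prediction error s_t = ||x_t - x_hat_t|| (Euclidean norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


def gate(s, threshold, sharpness=10.0):
    """Soft gate in (0, 1): near 1 well above the threshold, near 0 well below.

    The gate would scale the inner learning rate, so low-surprise tokens
    approximately skip the memory update.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (s - threshold)))


class TopKSidecar:
    """Keeps the k highest-surprise tokens seen so far, verbatim."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of (surprise, position, token)

    def offer(self, s, position, token):
        item = (s, position, token)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif s > self._heap[0][0]:
            # Evict the least surprising cached token.
            heapq.heapreplace(self._heap, item)

    def positions(self):
        return sorted(p for _, p, _ in self._heap)


sidecar = TopKSidecar(k=3)
for pos, s in enumerate([0.1, 2.0, 0.3, 1.5, 0.05, 3.0]):
    sidecar.offer(s, pos, token=f"tok{pos}")
print(sidecar.positions())  # [1, 3, 5]
```

Because the heap never holds more than k entries, the sidecar's size does not grow with sequence length, which is what the "K-capped" wording in the phased plan suggests.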

The reference implementation in the master prompt's Appendix B is ChunkedTTT, a chunked-parallel TTT that matches the sequential reference to < 10⁻¹⁰ at chunk_size = 1, with bounded approximation error (< 1 % relative) at larger chunk sizes. This is what makes MNEMOSYNE trainable: the inner-loop optimizer is itself amenable to chunked parallel computation, so training cost is expected to be comparable to that of a dense Transformer (no comparative measurement has been run yet).
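
The parity property can be illustrated with a simplified stand-in. The sketch below is not the Appendix B ChunkedTTT: it assumes a linear memory W with a squared-error inner loss, and the function names are hypothetical. In the chunked version every gradient in a chunk is taken at the chunk-start weights, so the gradients are mutually independent and could be computed in parallel; at chunk_size = 1 this reduces to the sequential rule.

```python
import random


def grad(W, k, v):
    """Gradient of 0.5 * ||W k - v||^2 w.r.t. W, i.e. outer(W k - v, k)."""
    err = [sum(w * kj for w, kj in zip(row, k)) - vi for row, vi in zip(W, v)]
    return [[e * kj for kj in k] for e in err]


def step(W, G, lr):
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]


def sequential_ttt(W, keys, values, lr):
    """One gradient step per token, each at the latest weights."""
    for k, v in zip(keys, values):
        W = step(W, grad(W, k, v), lr)
    return W


def chunked_ttt(W, keys, values, lr, chunk_size):
    """One summed step per chunk, all gradients at the chunk-start weights."""
    rows, cols = len(W), len(W[0])
    for start in range(0, len(keys), chunk_size):
        ks = keys[start:start + chunk_size]
        vs = values[start:start + chunk_size]
        grads = [grad(W, k, v) for k, v in zip(ks, vs)]  # parallelizable
        total = [[sum(g[i][j] for g in grads) for j in range(cols)] for i in range(rows)]
        W = step(W, total, lr)
    return W


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)
d, T = 3, 8
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
W0 = [[0.0] * d for _ in range(d)]

seq = sequential_ttt(W0, keys, values, lr=0.05)
diff_1 = max_abs_diff(seq, chunked_ttt(W0, keys, values, lr=0.05, chunk_size=1))
diff_4 = max_abs_diff(seq, chunked_ttt(W0, keys, values, lr=0.05, chunk_size=4))
```

Here diff_1 is at floating-point noise level while diff_4 is nonzero, which mirrors the trade-off described above: larger chunks buy parallelism at the cost of a bounded deviation from the sequential reference.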

Key contributions

Phased plan

MNEMOSYNE's phased implementation plan.
Phase | Deliverable | Status
0 — Bootstrap | Repo scaffold; reference SequentialTTT and ChunkedTTT | done
1 — Memory MLP + surprise gate | Single memory layer; surprise gating numerically stable | done
2 — Chunked-parallel TTT | Parity test passing; mini-batch trick verified | done
3 — Sparse attention sidecar | Top-k high-surprise token cache (gradient-truncated, K-capped) | done
4 — Full integration | Memory + attention sidecar compose; trained nano on 1B tokens | partial (block + overfit smoke train; not 1B tokens)
5 — Scaling + 256K needle | ≥ 80% needle-in-haystack at 256K with constant decode memory | not started
6 — Paper | Manuscript with streaming-adaptation demonstration | not started

Required reading

  1. Sun et al. 2024 — “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”
  2. Sun et al. 2025 — TTT-E2E (end-to-end test-time-training)
  3. Behrouz et al. 2025 — Titans: Learning to Memorize at Test Time
  4. Behrouz et al. 2025 — MIRAS unifying framework (Google Research)
  5. Liu et al. 2026 — “Test-Time Training with KV Binding Is Secretly Linear Attention”
  6. Peng et al. 2025 — RWKV-7 “Goose”
  7. Schlag et al. 2021 — “Linear Transformers Are Secretly Fast Weight Programmers”
  8. Ba et al. 2016 — “Using Fast Weights to Attend to the Recent Past”

Target marquee result (a goal from the spec, not a measured result)

256K needle-in-a-haystack ≥ 80 % with constant decode memory; clear streaming-adaptation advantage over a Mamba-2 baseline. This requires GPU-scale training that has not been run.
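
To make "constant decode memory" concrete, the arithmetic below compares a standard KV cache, which grows linearly with context, against a fixed-size state. Every number here is a hypothetical configuration chosen for illustration, not MNEMOSYNE's actual dimensions, and the assumption of one memory MLP plus one K-capped sidecar per layer is likewise illustrative.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a full attention KV cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def fixed_state_bytes(n_layers, memory_params, sidecar_k, n_kv_heads, head_dim,
                      bytes_per_elem=2):
    """Per-layer memory-MLP weights plus a K-capped sidecar cache.

    Takes no sequence length: the state is the same size at any context.
    """
    sidecar = kv_cache_bytes(sidecar_k, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return n_layers * memory_params * bytes_per_elem + sidecar


# Hypothetical small model: 12 layers, 8 KV heads of dim 64, fp16 cache.
full = kv_cache_bytes(256 * 1024, n_layers=12, n_kv_heads=8, head_dim=64)
fixed = fixed_state_bytes(n_layers=12, memory_params=1_000_000, sidecar_k=1024,
                          n_kv_heads=8, head_dim=64)
print(full / 2**30)   # 6.0 (GiB at 256K tokens)
print(fixed / 2**20)  # size in MiB, independent of context length
```

Under these assumed sizes the full cache reaches 6 GiB at 256K tokens, while the fixed state stays in the tens of MiB regardless of how long the context runs.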