CHIMERA

The first sequence model that learns, per-token and per-layer, whether to look back exactly, recurrently, or not at all.

CHIMERA expands to Conditionally Hybrid Mixture of Exact and Recurrent Attention.

Implementation status

From-scratch reimplementation — runnable & tested on CPU (2026-05)

A from-scratch reference implementation lives in architectures/01-chimera/chimera-lm and passes 76 tests on CPU (fp64 parity, including prefill ≡ decode bit-identity for the multi-mode KV cache). This is a small-scale reimplementation of published ideas (Mixture-of-Depths; SSM/attention hybrids), not an original architecture and not a trained model. Honest finding so far: in a nano-scale 3-way MQAR head-to-head, the router learns to route query positions to attention. This is a directional observation, not a quality or comparative claim. The "target marquee result" below is a goal from the spec, not a measured outcome.

The thesis in one paragraph

Transformer attention costs O(T²) compute over a length-T sequence and, at decode, O(T) compute per token against a KV cache that grows as O(T). State-space models (Mamba, Mamba-2) and modern linear RNNs (RWKV-7, RetNet) reduce this to O(T) compute and O(1) state, but they give up exact recall: their fixed-size state fails catastrophically on associative recall, needle-in-haystack at large T, and multi-hop retrieval. The community response has been hybridization: interleave SSM blocks with a small fraction of attention blocks (Jamba, Samba, Zamba, Hymba, Griffin/Hawk). This works empirically, but the hybridization ratio is a hand-tuned hyperparameter, fixed at design time and identical for every token. CHIMERA's thesis: the hybridization ratio should be learned, conditional on the token and the layer. Some tokens (function words, predictable continuations) need only O(1) recurrent compression; others (rare entities, retrieval-cued continuations, code identifiers) need exact attention back to a specific prior location. The model should route each token through the cheapest sequence-mixing primitive that is sufficient for it.

Architecture in one figure

           token x_t
               │
               ▼
        ┌──────────────┐
        │  Router φ(x) │      4 modes:
        └──────┬───────┘        m0: SSM (Mamba-2)
               │                m1: sliding-window attention (W=512)
               │                m2: full attention
               │                m3: identity (skip mixer)
               ▼
        ┌──────────────────┐
        │ Mode m_t = top-1 │
        │ (with capacity   │
        │ factor)          │
        └──────┬───────────┘
               │
               ▼
       ┌────────────────┐
       │ Multi-mode KV  │   per mode, append the token's K/V to
       │ cache (ring    │   that mode's cache slot at this layer
       │ buffer per m)  │
       └────────┬───────┘
                │
                ▼
            mixed h_t
                │
                ▼
              FFN
                │
                ▼
            output
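
The routing step in the figure (top-1 mode selection under a capacity factor) can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name route_top1, the capacity formula ceil(capacity_factor × T / num_modes), and the choice to send overflow tokens to the identity mode m3 are all assumptions made for the example.

```python
import math
from typing import List, Sequence

MODES = ("ssm", "swa", "full", "identity")  # m0..m3, as in the figure
IDENTITY = 3  # index of the "skip mixer" mode


def route_top1(logits: Sequence[Sequence[float]], capacity_factor: float) -> List[int]:
    """Pick one mode per token (top-1), respecting a per-mode capacity.

    Assumptions (hypothetical, not from the spec):
      * capacity = ceil(capacity_factor * num_tokens / num_modes)
      * tokens are visited in sequence order
      * a token whose preferred mode is full falls back to identity,
        and identity itself is never capped
    """
    num_tokens = len(logits)
    num_modes = len(MODES)
    capacity = math.ceil(capacity_factor * num_tokens / num_modes)
    load = [0] * num_modes
    assignment: List[int] = []
    for row in logits:
        # Top-1: index of the largest logit (first index wins ties).
        best = max(range(num_modes), key=lambda m: row[m])
        if best != IDENTITY and load[best] >= capacity:
            best = IDENTITY  # overflow: skip the mixer for this token
        load[best] += 1
        assignment.append(best)
    return assignment


if __name__ == "__main__":
    # Four tokens that all prefer full attention (m2); capacity is 1 per mode.
    logits = [[0.1, 0.2, 2.0, 0.0]] * 4
    print(route_top1(logits, capacity_factor=1.0))  # [2, 3, 3, 3]
```

With a larger capacity factor (for example 4.0 here) every token keeps its preferred mode, which is why the capacity factor acts as a hard ceiling on how much of the sequence can use the expensive primitive.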

The per-layer router φ is a small MLP that maps the hidden state to 4-way logits. Training uses an aux-loss-free balancer (DeepSeek-V3 style) so that routing does not collapse to a single mode such as “always-pick-m2”. The multi-mode KV cache is the technical centerpiece: each token's K/V lives only in the cache for its chosen mode, with a ring-buffer eviction policy for the sliding-window and SSM modes. Causal consistency under cache eviction is proved in Appendix A.2 of the master prompt.
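
To make the cache layout concrete, here is a minimal single-layer sketch of a multi-mode cache with ring-buffer eviction. It only models storage and eviction, not the attention read. The class name MultiModeKVCache, the ssm_window parameter, and the decision to count the window in entries per mode (rather than in absolute token distance) are assumptions for illustration and may differ from the code in chimera-lm.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# One cached entry: (absolute position, key vector, value vector).
Entry = Tuple[int, List[float], List[float]]


class MultiModeKVCache:
    """Per-layer cache with one slot per mode (hypothetical sketch).

    * "swa" and "ssm" slots are ring buffers: appending to a full
      buffer evicts the oldest entry.
    * "full" is unbounded.
    * "identity" tokens skip the mixer, so nothing is stored for them.
    """

    def __init__(self, swa_window: int = 512, ssm_window: int = 4) -> None:
        # ssm_window is an assumed size, not a value from the spec.
        self.slots: Dict[str, Deque[Entry]] = {
            "ssm": deque(maxlen=ssm_window),
            "swa": deque(maxlen=swa_window),
            "full": deque(),
        }
        self.next_pos = 0  # absolute position of the next token

    def append(self, mode: str, key: List[float], value: List[float]) -> int:
        """Store the token's K/V in its chosen mode's slot only."""
        pos = self.next_pos
        self.next_pos += 1
        if mode != "identity":
            self.slots[mode].append((pos, key, value))
        return pos

    def positions(self, mode: str) -> List[int]:
        """Absolute positions currently held for a mode, oldest first."""
        return [pos for pos, _, _ in self.slots[mode]]


if __name__ == "__main__":
    cache = MultiModeKVCache(swa_window=2)
    for mode in ["swa", "full", "swa", "identity", "swa"]:
        cache.append(mode, [0.0], [0.0])
    print(cache.positions("swa"))   # [2, 4]  (position 0 was evicted)
    print(cache.positions("full"))  # [1]
```

Storing the absolute position with each entry is what lets a reader of the cache apply a causal mask after eviction has removed older entries.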

Key contributions

Phased plan

CHIMERA's seven-phase implementation plan.
Phase                         | Deliverable                                                      | Status
0 - Bootstrap                 | Repo scaffold, lit notes, reference multi-mode KV cache          | done
1 - Single-mode baselines     | Pure SSM, pure SWA, pure full-attn at nano scale                 | done
2 - Router + multi-mode cache | Working CHIMERA layer; prefill/decode bit-identity               | done
3 - Balancer training         | Aux-loss-free balancer; mode distribution converges (nano scale) | done (nano)
4 - Recall ablations          | MQAR recall match dense; needle-in-haystack at 32K               | partial (toy MQAR only)
5 - Scaling + throughput      | Pareto plot, KV memory measurement                               | not started
6 - Paper                     | Manuscript with interpretability visualizations                  | not started
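
The aux-loss-free balancer named in phase 3 can be sketched in the DeepSeek-V3 style: a per-mode bias is added to the router logits for selection only, and after each step the bias is nudged down for overloaded modes and up for underloaded ones. The function names, the step size gamma, and the uniform default target are assumptions for this example; the target mode distribution CHIMERA actually uses is not stated here.

```python
from typing import List, Optional, Sequence


def select_modes(logits: Sequence[Sequence[float]], bias: Sequence[float]) -> List[int]:
    """Top-1 selection on (logit + bias); the bias only affects the choice."""
    num_modes = len(bias)
    return [max(range(num_modes), key=lambda m: row[m] + bias[m]) for row in logits]


def update_bias(
    bias: Sequence[float],
    assignment: Sequence[int],
    gamma: float = 0.01,
    target: Optional[Sequence[float]] = None,
) -> List[float]:
    """Move each mode's bias by +/- gamma toward its target load.

    `target` holds the desired fraction of tokens per mode; the uniform
    default is an assumption made for this sketch.
    """
    num_modes = len(bias)
    if target is None:
        target = [1.0 / num_modes] * num_modes
    counts = [0] * num_modes
    for mode in assignment:
        counts[mode] += 1
    new_bias = []
    for b, count, frac in zip(bias, counts, target):
        want = frac * len(assignment)
        if count > want:
            new_bias.append(b - gamma)  # overloaded: make it less attractive
        elif count < want:
            new_bias.append(b + gamma)  # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    # Every token slightly prefers m2; the bias gradually undoes that preference.
    logits = [[0.0, 0.0, 0.05, 0.0]] * 8
    bias = [0.0, 0.0, 0.0, 0.0]
    for step in range(4):
        assignment = select_modes(logits, bias)
        print(step, assignment[0], bias)
        bias = update_bias(bias, assignment)
```

No gradient flows through the bias, so this balancing adds no auxiliary term to the training loss; that is the sense in which it is "aux-loss-free".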

Required reading

  1. Gu & Dao 2023 — Mamba (selective SSM)
  2. Dao & Gu 2024 — Mamba-2 / SSD (Transformers are SSMs)
  3. Lieber et al. 2024 — Jamba (Transformer-Mamba hybrid)
  4. Glorioso et al. 2024 — Zamba; Ren et al. 2024 — Samba
  5. Raposo et al. 2024 — Mixture-of-Depths
  6. Arora et al. 2024 — Zoology / Based (the recall problem in linear models)
  7. Waleffe et al. 2024 — Empirical study of Mamba hybrid scaling laws
  8. Fedus et al. 2022 — Switch Transformer (routing, load balancing)
  9. DeepSeek-AI 2024 — DeepSeek-V3 (fine-grained MoE; aux-loss-free balancing)
  10. Peng et al. 2025 — RWKV-7 (data-dependent state evolution)

Target marquee result (a goal from the spec, not yet measured)

≥ 3× KV memory reduction at ≥ 1.8× decode throughput vs. a dense Transformer at 32K context, matching MQAR recall. This requires GPU-scale training that has not been run.
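
As a back-of-the-envelope check on what the ≥ 3× memory goal implies for routing, the sketch below counts cached K/V entries per layer. It assumes every cached entry has the same size, ignores the fixed-size SSM state, and treats the sliding-window ring buffer as holding at most W = 512 entries. The routing fractions are hypothetical inputs, not measured values.

```python
def cached_entries(context_len: int, full_fraction: float,
                   swa_fraction: float, swa_window: int = 512) -> float:
    """Approximate K/V entries held per layer at the end of the context.

    Full-attention tokens are kept forever; sliding-window tokens are
    capped by the ring buffer. SSM state and identity tokens are ignored.
    """
    full = full_fraction * context_len
    swa = min(swa_window, swa_fraction * context_len)
    return full + swa


def kv_reduction(context_len: int, full_fraction: float,
                 swa_fraction: float, swa_window: int = 512) -> float:
    """Memory reduction relative to a dense model that caches every token."""
    return context_len / cached_entries(context_len, full_fraction,
                                        swa_fraction, swa_window)


def max_full_fraction(context_len: int, target_reduction: float,
                      swa_window: int = 512) -> float:
    """Largest full-attention fraction that still meets the target,
    assuming the sliding-window buffer is completely full."""
    return (context_len / target_reduction - swa_window) / context_len


if __name__ == "__main__":
    T = 32 * 1024
    # Hypothetical split: 25% full attention, 25% sliding window.
    print(round(kv_reduction(T, 0.25, 0.25), 3))   # 3.765
    print(round(max_full_fraction(T, 3.0), 4))     # 0.3177
```

Under these assumptions, the memory goal at 32K context requires the router to send fewer than about a third of tokens to full attention at each layer; the throughput goal depends on kernel-level details that this arithmetic does not capture.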