CHIMERA
The first sequence model that learns, per-token and per-layer, whether to look back exactly, recurrently, or not at all.
CHIMERA stands for Conditionally Hybrid Mixture of Exact and Recurrent Attention.
Implementation status
From-scratch reimplementation — runnable & tested on CPU (2026-05)
A from-scratch reference implementation lives in architectures/01-chimera/chimera-lm and passes 76 tests on CPU (fp64 parity, including prefill ≡ decode bit-identity for the multi-mode KV cache). This is a small-scale reimplementation of published ideas (Mixture-of-Depths; SSM/attention hybrids), not an original architecture and not a trained model. Honest finding so far: a nano 3-way MQAR head-to-head in which the router learns to route query positions to attention — directional, not a quality or comparative claim. The "target marquee result" below is a goal from the spec, not a measured outcome.
The thesis in one paragraph
Transformer attention costs O(T²) compute over a length-T sequence, and at decode every new token attends over a KV cache that grows as O(T). State-space models (Mamba, Mamba-2) and modern linear RNNs (RWKV-7, RetNet) cut this to O(T) total compute with an O(1) state, but they trade away exact recall: their fixed-size state fails catastrophically on associative recall, needle-in-a-haystack retrieval at large T, and multi-hop retrieval. The community response has been hybridization: interleave SSM blocks with a small fraction of attention blocks (Jamba, Samba, Zamba, Hymba, Griffin/Hawk). This works empirically, but the hybridization ratio is a hand-tuned hyperparameter, fixed at design time and identical for every token. CHIMERA's thesis is that the ratio should be learned, conditional on the token and the layer. Some tokens (function words, predictable continuations) need only O(1) recurrent compression; others (rare entities, retrieval-cued continuations, code identifiers) need exact attention back to a specific prior location. The model should route each token through the cheapest sufficient sequence-mixing primitive.
Architecture in one figure
          token x_t
              │
              ▼
    ┌──────────────────┐
    │   Router φ(x)    │    4 modes:
    └─────────┬────────┘      m0: SSM (Mamba-2)
              │               m1: sliding-window attention (W=512)
              │               m2: full attention
              │               m3: identity (skip mixer)
              ▼
    ┌──────────────────┐
    │ Mode m_t = top-1 │
    │ (with capacity   │
    │  factor)         │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Multi-mode KV    │    per mode, append the token's K/V to
    │ cache (ring      │    that mode's cache slot at this layer
    │ buffer per m)    │
    └─────────┬────────┘
              │
              ▼
          mixed h_t
              │
              ▼
             FFN
              │
              ▼
            output
The per-layer router φ is a small MLP that maps the hidden state to 4-way mode logits. Training uses an aux-loss-free balancer (DeepSeek-V3 style) so that routing does not collapse onto a single mode such as “always-pick-m2”. The multi-mode KV cache is the technical centerpiece: each token's K/V lives only in the cache for its chosen mode, with a ring-buffer eviction policy for the sliding-window and SSM modes. Causal consistency under cache eviction is proved in Appendix A.2 of the master prompt.
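The balancing idea can be illustrated with a minimal sketch. It is not the code in chimera-lm: the class `BiasBalancer`, the helper `fake_router_logits`, the step size, the EMA decay, and the uniform load target are all assumptions chosen for illustration, and the router MLP is replaced by synthetic Gaussian logits skewed toward full attention. The sketch adds a per-mode bias to the logits for top-1 selection only, tracks an EMA of each mode's token share, and nudges each bias down when its mode is overloaded and up when it is underloaded. Capacity factors are left out.

```python
import random

MODES = ("ssm", "swa", "full", "identity")  # m0..m3 in the figure


class BiasBalancer:
    """Selection-only bias per mode, nudged toward a uniform load target (assumed)."""

    def __init__(self, n_modes=4, step=0.01, ema_decay=0.9):
        self.bias = [0.0] * n_modes
        self.load = [1.0 / n_modes] * n_modes  # EMA of each mode's token share
        self.step = step
        self.ema_decay = ema_decay

    def select(self, logits):
        """Top-1 over biased scores; the raw logits themselves are untouched."""
        scores = [l + b for l, b in zip(logits, self.bias)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, choices):
        """After a batch: refresh the EMA load, then step each bias by its sign."""
        n = len(self.bias)
        for m in range(n):
            share = choices.count(m) / len(choices)
            self.load[m] = self.ema_decay * self.load[m] + (1 - self.ema_decay) * share
            if self.load[m] > 1.0 / n:
                self.bias[m] -= self.step  # overloaded: make it less attractive
            elif self.load[m] < 1.0 / n:
                self.bias[m] += self.step  # underloaded: make it more attractive


def fake_router_logits(rng):
    """Stand-in for the router MLP: Gaussian logits skewed toward full attention."""
    logits = [rng.gauss(0.0, 1.0) for _ in MODES]
    logits[MODES.index("full")] += 2.0
    return logits


def simulate(n_batches=500, batch_size=256, seed=0):
    """Return (mode shares in the first, unbiased batch; final EMA loads)."""
    rng = random.Random(seed)
    balancer = BiasBalancer()
    first_batch_share = None
    for _ in range(n_batches):
        choices = [balancer.select(fake_router_logits(rng)) for _ in range(batch_size)]
        if first_batch_share is None:
            first_batch_share = [choices.count(m) / batch_size for m in range(len(MODES))]
        balancer.update(choices)
    return first_batch_share, balancer.load


if __name__ == "__main__":
    first, final = simulate()
    print("first batch:", dict(zip(MODES, first)))
    print("final EMA  :", dict(zip(MODES, final)))
```

In this toy setting the first batch is dominated by the full-attention mode, and the EMA loads drift toward an even split as the biases adapt. A uniform target is only a convenient default here; a real run would presumably choose target shares that reflect the relative cost of each mode.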
Key contributions
- Per-token, per-layer routing across four sequence-mixing primitives — generalizes Mixture-of-Depths (skip layers) to mixture-of-mixers.
- Multi-mode KV cache with proved prefill/decode equivalence under ring-buffer eviction. The reference implementation at chimera-lm/cascade/... passes the bit-identity test (< 10⁻¹⁵ in fp64, prefill vs. decode, for every mode and every routing pattern).
- Aux-loss-free balancer dynamics: EMA-based load balancing without an explicit load-balancing loss term (the DeepSeek-V3 trick, adapted to mode-experts).
- Interpretable inspection: the router's mode choices are directly observable per token; researchers can visualize which tokens triggered full attention vs. recurrent compression.
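The prefill/decode equivalence claimed for the multi-mode cache can be illustrated with a toy sketch. This is not the reference implementation: the names `MultiModeCache`, `prefill_reference`, `decode_with_cache`, and `random_tokens` are hypothetical, queries, keys, and values are scalars, and the SSM mode is reduced to a single decayed running sum. It takes one literal reading of "each token's K/V lives in the cache for its chosen mode only": a token attends only over entries previously routed to the same mode.

```python
import math
import random
from collections import deque

SSM, SWA, FULL, IDENTITY = range(4)  # m0..m3 in the figure


def attend(q, entries):
    """Scalar softmax attention over (key, value) pairs, in cache order."""
    scores = [q * k for k, _ in entries]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, (_, v) in zip(weights, entries)) / sum(weights)


class MultiModeCache:
    """One layer's cache: a separate store per mode (window must be >= 1)."""

    def __init__(self, window, decay):
        self.full = []                   # grows with tokens routed to FULL
        self.swa = deque(maxlen=window)  # ring buffer: oldest entry is evicted
        self.state = 0.0                 # O(1) recurrent state for SSM
        self.decay = decay

    def step(self, mode, q, k, v):
        if mode == FULL:
            self.full.append((k, v))
            return attend(q, self.full)
        if mode == SWA:
            self.swa.append((k, v))
            return attend(q, self.swa)
        if mode == SSM:
            self.state = self.decay * self.state + v
            return self.state
        return 0.0  # IDENTITY: mixer skipped, nothing cached


def prefill_reference(tokens, window, decay):
    """Recompute every position from the whole sequence, with no cache."""
    outs = []
    for t, (mode, q, _, _) in enumerate(tokens):
        same = [(k, v) for m, _, k, v in tokens[: t + 1] if m == mode]
        if mode == FULL:
            outs.append(attend(q, same))
        elif mode == SWA:
            outs.append(attend(q, same[-window:]))
        elif mode == SSM:
            s = 0.0
            for _, v in same:
                s = decay * s + v
            outs.append(s)
        else:
            outs.append(0.0)
    return outs


def decode_with_cache(tokens, window, decay):
    """Stream tokens one at a time through the multi-mode cache."""
    cache = MultiModeCache(window, decay)
    return [cache.step(*tok) for tok in tokens]


def random_tokens(n, seed):
    """Random (mode, q, k, v) tuples, i.e. a random routing pattern."""
    rng = random.Random(seed)
    return [(rng.randrange(4), rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
            for _ in range(n)]


if __name__ == "__main__":
    toks = random_tokens(64, seed=0)
    print(prefill_reference(toks, 4, 0.9) == decode_with_cache(toks, 4, 0.9))
```

Both paths perform the same floating-point operations in the same order, so exact equality is the right comparison for this toy. A batched prefill that reorders reductions would need a tolerance instead, which may be why the test above is stated as a bound rather than strict equality.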
Phased plan
| Phase | Deliverable | Status |
|---|---|---|
| 0 — Bootstrap | Repo scaffold, lit notes, reference multi-mode KV cache | done |
| 1 — Single-mode baselines | Pure SSM, pure SWA, pure full-attn at nano scale | done |
| 2 — Router + multi-mode cache | Working CHIMERA layer; prefill/decode bit-identity | done |
| 3 — Balancer training | Aux-loss-free balancer; mode distribution converges (nano scale) | done (nano) |
| 4 — Recall ablations | MQAR recall matches the dense baseline; needle-in-haystack at 32K | partial (toy MQAR only) |
| 5 — Scaling + throughput | Pareto plot, KV memory measurement | not started |
| 6 — Paper | Manuscript with interpretability visualizations | not started |
Required reading
- Gu & Dao 2023 — Mamba (selective SSM)
- Dao & Gu 2024 — Mamba-2 / SSD (Transformers are SSMs)
- Lieber et al. 2024 — Jamba (Transformer-Mamba hybrid)
- Glorioso et al. 2024 — Zamba; Ren et al. 2024 — Samba
- Raposo et al. 2024 — Mixture-of-Depths
- Arora et al. 2024 — Zoology / Based (the recall problem in linear models)
- Waleffe et al. 2024 — Empirical study of Mamba hybrid scaling laws
- Fedus et al. 2022 — Switch Transformer (routing, load balancing)
- DeepSeek-AI 2024 — DeepSeek-V3 (fine-grained MoE; aux-loss-free balancing)
- Peng et al. 2025 — RWKV-7 (data-dependent state evolution)
Target marquee result (a goal from the spec, not yet measured)
≥ 3× KV memory reduction at ≥ 1.8× decode throughput vs. a dense Transformer at 32K context, matching MQAR recall. This requires GPU-scale training that has not been run.
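The memory half of this target can be sanity-checked with back-of-envelope arithmetic. The sketch below is a rough accounting model, not a measurement: it assumes the full-attention cache holds one entry per token routed to it, the sliding-window cache saturates at W entries, the SSM state costs a fixed number of entry-equivalents, and the identity mode stores nothing. The function names, the routing fractions, and the SSM state size of 16 entry-equivalents are hypothetical values picked for illustration.

```python
def cache_entries_per_layer(context_len, frac_full, frac_swa, window, ssm_state_entries):
    """Cache size in K/V-entry equivalents for one layer under the assumed model."""
    full = frac_full * context_len             # grows with context
    swa = min(window, frac_swa * context_len)  # ring buffer caps at the window
    return full + swa + ssm_state_entries      # identity mode stores nothing


def kv_reduction(context_len, frac_full, frac_swa, window, ssm_state_entries):
    """Dense cache size (one entry per token) divided by the routed cache size."""
    routed = cache_entries_per_layer(
        context_len, frac_full, frac_swa, window, ssm_state_entries
    )
    return context_len / routed


def max_full_fraction(context_len, target_reduction, window, ssm_state_entries):
    """Largest full-attention share that still meets the target, with a saturated window."""
    budget = context_len / target_reduction - window - ssm_state_entries
    return max(0.0, budget / context_len)


if __name__ == "__main__":
    T, W, S = 32768, 512, 16  # 32K context, W=512 from the figure, S is assumed
    print(kv_reduction(T, frac_full=0.25, frac_swa=0.20, window=W, ssm_state_entries=S))
    print(max_full_fraction(T, target_reduction=3.0, window=W, ssm_state_entries=S))
```

Under these assumptions, a 3× reduction at 32K context leaves room for a bit under a third of tokens per layer to be routed to full attention. The throughput half of the target depends on kernels and hardware and cannot be estimated this way.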