CHIMERA

The first sequence model that learns, per-token and per-layer, whether to look back exactly, recurrently, or not at all.

CHIMERA stands for Conditionally Hybrid Mixture of Exact and Recurrent Attention.

Implementation status

From-scratch reimplementation — runnable & tested on CPU (2026-05)

A from-scratch reference implementation lives in architectures/01-chimera/chimera-lm and passes 76 tests on CPU (fp64 parity, including prefill ≡ decode bit-identity for the multi-mode KV cache). This is a small-scale reimplementation of published ideas (Mixture-of-Depths; SSM/attention hybrids), not an original architecture and not a trained model. The only finding so far comes from a nano-scale, 3-way MQAR head-to-head, in which the router learns to route query positions to attention. That finding is directional only; it is not a quality or comparative claim. The "target marquee result" below is a goal from the spec, not a measured outcome.

The thesis in one paragraph

Transformer attention costs O(T²) compute over a length-T sequence; at decode, each new token costs O(T) compute against a KV cache that grows as O(T). State-space models (Mamba, Mamba-2) and modern linear RNNs (RWKV-7, RetNet) reduce this to O(T) total compute and O(1) state, but they trade away exact recall: their fixed-size state can fail catastrophically on associative recall, needle-in-haystack at large T, and multi-hop retrieval. The community response has been hybridization: interleave SSM blocks with a small fraction of attention blocks (Jamba, Samba, Zamba, Hymba, Griffin/Hawk). This works empirically, but the hybridization ratio is a hand-tuned hyperparameter, fixed at design time and identical for every token. CHIMERA's thesis: the hybridization ratio should be learned, conditional on the token and the layer. Some tokens (function words, predictable continuations) need only O(1) recurrent compression; others (rare entities, retrieval-cued continuations, code identifiers) need exact attention back to a specific prior location. The model should route each token through the cheapest sufficient sequence-mixing primitive.

Architecture in one figure

           token x_t
               │
               ▼
        ┌──────────────┐
        │  Router φ(x) │      4 modes:
        └──────┬───────┘        m0: SSM (Mamba-2)
               │                m1: sliding-window attention (W=512)
               │                m2: full attention
               │                m3: identity (skip mixer)
               ▼
        ┌──────────────────┐
        │ Mode m_t = top-1 │
        │ (with capacity   │
        │ factor)          │
        └──────┬───────────┘
               │
               ▼
       ┌────────────────┐
       │ Multi-mode KV  │   per mode, append the token's K/V to
       │ cache (ring    │   that mode's cache slot at this layer
       │ buffer per m)  │
       └────────┬───────┘
                │
                ▼
            mixed h_t
                │
                ▼
              FFN
                │
                ▼
            output

The per-layer router φ is a small MLP that maps the hidden state to 4-way logits. Training uses an aux-loss-free balancer (DeepSeek-V3 style) so that routing does not collapse onto a single mode (for example, “always-pick-m2”). The multi-mode KV cache is the technical centerpiece: each token's K/V lives in the cache for its chosen mode only, with a ring-buffer eviction policy for sliding-window and SSM modes. Causal consistency under cache eviction is proved in Appendix A.2 of the master prompt.
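The balancing idea above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the code in chimera-lm: the names `route_top1`, `BiasBalancer`, and `simulate`, the uniform target load, the step size, the EMA decay, and the artificial skew toward m2 are all hypothetical. Random logits stand in for the router MLP, and the capacity factor from the figure is omitted. The sketch follows the general DeepSeek-V3 recipe, in which a per-mode bias is added to the logits for selection only and is nudged after each batch, with no extra loss term.

```python
import random

MODES = ["m0_ssm", "m1_swa", "m2_full", "m3_identity"]


def route_top1(logits, bias):
    """Return the index of the mode with the largest biased logit."""
    scores = [l + b for l, b in zip(logits, bias)]
    return max(range(len(scores)), key=scores.__getitem__)


class BiasBalancer:
    """Aux-loss-free balancing sketch: adjust selection biases from an EMA of load."""

    def __init__(self, n_modes, target=None, step=0.01, ema_decay=0.9):
        self.bias = [0.0] * n_modes
        self.target = target or [1.0 / n_modes] * n_modes  # assumed: uniform target
        self.ema_load = list(self.target)
        self.step = step            # hypothetical bias update size
        self.ema_decay = ema_decay  # hypothetical smoothing factor

    def update(self, choices):
        """Update the load EMA from one batch of mode choices, then nudge biases."""
        n = len(self.bias)
        counts = [0] * n
        for c in choices:
            counts[c] += 1
        for m in range(n):
            load = counts[m] / len(choices)
            self.ema_load[m] = (self.ema_decay * self.ema_load[m]
                                + (1.0 - self.ema_decay) * load)
            # Overloaded modes get a lower bias, underloaded modes a higher one.
            if self.ema_load[m] > self.target[m]:
                self.bias[m] -= self.step
            elif self.ema_load[m] < self.target[m]:
                self.bias[m] += self.step


def simulate(steps=300, batch=128, skew=2.0, seed=0):
    """Route random logits that are artificially skewed toward m2 (full attention)."""
    rng = random.Random(seed)
    balancer = BiasBalancer(len(MODES))
    m2_share = []
    for _ in range(steps):
        choices = []
        for _ in range(batch):
            logits = [rng.gauss(0.0, 1.0) for _ in MODES]
            logits[2] += skew  # stand-in for a router that over-prefers m2
            choices.append(route_top1(logits, balancer.bias))
        balancer.update(choices)
        m2_share.append(choices.count(2) / batch)
    return m2_share, balancer


if __name__ == "__main__":
    history, balancer = simulate()
    print("m2 share, first step:", round(history[0], 2))
    print("m2 share, mean of last 100 steps:", round(sum(history[-100:]) / 100, 2))
    print("learned biases:", [round(b, 2) for b in balancer.bias])
```

In this toy setting the m2 share starts high and drifts toward the uniform target as the m2 bias goes negative. The sign-based update with a lagging EMA oscillates around the target rather than settling exactly on it, so a real balancer would likely need a tuned step size.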

Key contributions

Per-token, per-layer routing across four sequence-mixing primitives — generalizes Mixture-of-Depths (skip layers) to mixture-of-mixers.
Multi-mode KV cache with proved prefill/decode equivalence under ring-buffer eviction. The reference implementation at chimera-lm/cascade/... passes the bit-identity test (prefill vs. decode difference < 10⁻¹⁵ in fp64, for every mode and every routing pattern).
Aux-loss-free balancer dynamics — EMA-based load balancing without an explicit load-balancing loss term (DeepSeek-V3 trick, adapted to mode-experts).
Interpretable inspection: the router's mode choices are directly observable per token; researchers can visualize which tokens triggered full attention vs. recurrent compression.
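The prefill/decode equivalence claimed for the cache can be illustrated for a single mode. The following is a simplified sketch, not the reference implementation: it covers only a sliding-window mode, and the names `RingKVCache`, `softmax_attend`, `prefill`, and `decode` are hypothetical. It also assumes every token is routed to this mode, which the real multi-mode cache does not. Prefill attends over an explicit window slice of the full sequence, while decode appends one token at a time to a ring buffer that overwrites its oldest slot.

```python
import math
import random


def softmax_attend(q, keys, values):
    """Softmax attention of one query vector over lists of key and value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out


class RingKVCache:
    """Fixed-capacity K/V store; once full, each append overwrites the oldest slot."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total number of tokens ever appended

    def append(self, k, v):
        self.slots[self.count % self.window] = (k, v)
        self.count += 1

    def ordered(self):
        """Return the surviving (k, v) pairs from oldest to newest."""
        n = min(self.count, self.window)
        return [self.slots[i % self.window]
                for i in range(self.count - n, self.count)]


def prefill(qs, ks, vs, window):
    """Process the whole sequence at once using explicit window slices."""
    outs = []
    for t, q in enumerate(qs):
        lo = max(0, t - window + 1)
        outs.append(softmax_attend(q, ks[lo:t + 1], vs[lo:t + 1]))
    return outs


def decode(qs, ks, vs, window):
    """Process one token at a time, reading K/V only from the ring buffer."""
    cache = RingKVCache(window)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        cache.append(k, v)
        pairs = cache.ordered()
        outs.append(softmax_attend(q, [p[0] for p in pairs], [p[1] for p in pairs]))
    return outs


if __name__ == "__main__":
    rng = random.Random(0)
    T, d, W = 32, 4, 8
    qs, ks, vs = ([[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
                  for _ in range(3))
    print("prefill == decode:", prefill(qs, ks, vs, W) == decode(qs, ks, vs, W))
```

Because `ordered()` returns the surviving entries in the same oldest-to-newest order as the prefill slice, both paths perform the same floating-point operations in the same order. That ordering is what allows an exact equality check here rather than a tolerance.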

Phased plan

CHIMERA's seven-phase implementation plan.
Phase	Deliverable	Status
0 — Bootstrap	Repo scaffold, lit notes, reference multi-mode KV cache	done
1 — Single-mode baselines	Pure SSM, pure SWA, pure full-attn at nano scale	done
2 — Router + multi-mode cache	Working CHIMERA layer; prefill/decode bit-identity	done
3 — Balancer training	Aux-loss-free balancer; mode distribution converges (nano scale)	done (nano)
4 — Recall ablations	MQAR recall match dense; needle-in-haystack at 32K	partial (toy MQAR only)
5 — Scaling + throughput	Pareto plot, KV memory measurement	not started
6 — Paper	Manuscript with interpretability visualizations	not started

Required reading

Gu & Dao 2023 — Mamba (selective SSM)
Dao & Gu 2024 — Mamba-2 / SSD (Transformers are SSMs)
Lieber et al. 2024 — Jamba (Transformer-Mamba hybrid)
Glorioso et al. 2024 — Zamba; Ren et al. 2024 — Samba
Raposo et al. 2024 — Mixture-of-Depths
Arora et al. 2024 — Zoology / Based (the recall problem in linear models)
Waleffe et al. 2024 — Empirical study of Mamba hybrid scaling laws
Fedus et al. 2022 — Switch Transformer (routing, load balancing)
DeepSeek-AI 2024 — DeepSeek-V3 (fine-grained MoE; aux-loss-free balancing)
Peng et al. 2025 — RWKV-7 (data-dependent state evolution)

Target marquee result (goal from the spec, not yet measured)


≥ 3× KV memory reduction at ≥ 1.8× decode throughput vs. a dense Transformer at 32K context, matching MQAR recall. This requires GPU-scale training that has not been run.
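To get a feel for what the memory half of this goal implies for routing, here is a back-of-envelope sketch. It is arithmetic under assumed accounting, not a measurement. It assumes that, per layer, the full-attention cache holds one entry per token routed to m2, the sliding-window ring buffer holds at most W = 512 entries, identity tokens store nothing, and the SSM state is expressed as a count of K/V-entry equivalents. That count, `ssm_state_equiv`, is a hypothetical parameter that defaults to 0 here, although a real SSM state is not free. The function names are illustrative only, and throughput is not modeled.

```python
def cache_entries_per_layer(T, frac_full, window=512, ssm_state_equiv=0):
    """Entries held per layer after T tokens, under the assumed accounting."""
    full = frac_full * T   # m2: grows with the tokens routed to full attention
    swa = min(window, T)   # m1: ring buffer, capped at the window size (upper bound)
    return full + swa + ssm_state_equiv  # m3 (identity) stores nothing


def kv_reduction(T, frac_full, window=512, ssm_state_equiv=0):
    """Reduction factor relative to a dense cache that stores all T tokens."""
    return T / cache_entries_per_layer(T, frac_full, window, ssm_state_equiv)


def max_full_fraction(T, target_reduction, window=512, ssm_state_equiv=0):
    """Largest fraction of tokens routable to m2 while meeting the target reduction."""
    budget = T / target_reduction - min(window, T) - ssm_state_equiv
    return max(0.0, budget / T)


if __name__ == "__main__":
    T = 32 * 1024
    f = max_full_fraction(T, target_reduction=3.0)
    print("max m2 fraction for a 3x reduction:", round(f, 4))
    print("check, reduction at that fraction:", round(kv_reduction(T, f), 4))
```

Under these assumptions the m2 budget at 32K context comes out a little below one third of tokens per layer. A nonzero SSM state size or a larger window would tighten it further.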