Five post-Transformer ideas, reimplemented from scratch
Minimal, annotated, runnable reimplementations of recent (2024–2026) language-model architecture ideas, each with honest small-scale CPU benchmarks.
These are educational reimplementations of published work, not original architectures and not trained-at-scale models. Each prototype targets a different ceiling the Transformer is hitting, using a different primitive; in principle they compose — a routed (CHIMERA), byte-patched (HELIX), block-diffusion-decoded (CASCADE) model with memory (MNEMOSYNE) layers and a reasoning (NOESIS) head. Every number below is reproduced by a script in the code; goals that have not been measured are labelled as such.
The orthogonality argument
Each candidate addresses a different kind of bottleneck and uses a different kind of primitive. They were deliberately chosen so they do not collapse onto each other — five different directions, not five versions of the same idea.
- Sequence-mixing primitive: CHIMERA (route across mixers)
- Memory model: MNEMOSYNE (state = trainable function)
- Tokenization layer: HELIX (bytes → learned hierarchical patches)
- Decoding paradigm: CASCADE (block masked diffusion + adaptive K)
- Reasoning protocol: NOESIS (latent continuous thought + adaptive budget + RL)
The five projects
CASCADE Phases 0–3 complete (CPU)
Pain attacked: sequential AR decoding throughput bottleneck.
Primitive: block masked diffusion with entropy-adaptive denoising step count and full KV-cache reuse across and within blocks; AR-to-diffusion distillation path.
Marquee target: ≥ 4× decode throughput vs. an AR teacher at within-2 pp quality, with streaming preserved.
Current state: from-scratch reimplementation; 55 tests pass on CPU (2 Phase-4 skips). Multi-layer KV-cache equivalence reproduces to 1.57 × 10⁻¹⁵ in fp64; the adaptive-K head reaches Spearman ρ = 1.000 between difficulty and chosen step count on a synthetic task; an end-to-end run converges 9.52 → 1.22 on a deliberately memorizable batch (a convergence sanity check, not a benchmark). No trained-at-scale or comparative result yet.
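To make the entropy-adaptive step count concrete, here is a minimal sketch. It is not CASCADE's learned adaptive-K head: `choose_steps`, `k_min`, `k_max`, and the linear entropy-to-steps rule are assumptions made for illustration. The idea it shows is only that a block the model is unsure about (high predictive entropy) gets more denoising steps than one it is confident about.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def choose_steps(probs, k_min=1, k_max=8):
    """Hypothetical rule: scale the step count linearly with normalized entropy.

    A one-hot distribution maps to k_min, a uniform one to k_max.
    """
    max_entropy = math.log2(len(probs))
    frac = entropy_bits(probs) / max_entropy if max_entropy > 0 else 0.0
    return k_min + round(frac * (k_max - k_min))


confident = [0.97, 0.01, 0.01, 0.01]  # model is nearly sure of the token
unsure = [0.25, 0.25, 0.25, 0.25]     # model has no preference
print("steps (confident):", choose_steps(confident))
print("steps (unsure):   ", choose_steps(unsure))
```

A learned head can replace the fixed rule while keeping the same interface: a distribution (or hidden state) in, an integer step budget out.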
HELIX Phase 4 — pre-scaling
Pain attacked: tokenization brittleness, multilingual inequity, static compute allocation.
Primitive: tokenizer-free byte-level model with learned differentiable hierarchical patching and cross-scale inference routing.
Marquee target: ≥ 20 % perplexity reduction on FLORES low-resource; ≥ 10 pp on CUTE character-level; robust to 10 % byte noise.
Current state: from-scratch reimplementation; 94 tests pass on CPU. The end-to-end hierarchical model learns: bits-per-byte drops 8.24 → 4.89 over 80 CPU steps on ~50 KB of English. No trained-at-scale or comparative result yet.
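For readers unfamiliar with the bits-per-byte metric quoted above, the sketch below shows the standard conversion from a summed negative log-likelihood in nats. The function name `bits_per_byte` and its arguments are illustrative and are not taken from the HELIX code.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (num_bytes * math.log(2))


# Reference point: a model that guesses uniformly over the 256 byte values
# pays ln(256) nats per byte, which is about 8 bits per byte.
num_bytes = 1000
uniform_nll = num_bytes * math.log(256)
print(bits_per_byte(uniform_nll, num_bytes))
```

Against that reference point, a starting value slightly above 8 is what one would expect from an untrained byte-level model, and anything below 8 means the model has learned some structure.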
CHIMERA reference impl · 76 tests
Pain attacked: long-context recall ceiling of pure SSMs.
Primitive: per-token learned routing across SSM, sliding-window attention, full attention, and identity.
Marquee target (a goal, not measured): ≥ 3× KV-memory reduction at ≥ 1.8× decode throughput vs. a dense Transformer at 32K, matching MQAR recall.
Current state: from-scratch reimplementation; 76 tests pass on CPU (fp64 parity, incl. prefill ≡ decode bit-identity for the multi-mode KV cache). Honest nano finding: a 3-way MQAR head-to-head where the router learns to send query positions to attention (directional, not a quality claim). No trained-at-scale result yet.
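The per-token routing primitive can be sketched in a few lines. This is an assumption-laden toy, not CHIMERA's router: the two-dimensional token features, the hand-set `router_weights`, and the hard top-1 choice are all hypothetical, picked so that a "query-like" token is sent to full attention and a "filler-like" token to the SSM.

```python
import math

MODES = ("ssm", "sliding_window", "full_attention", "identity")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_features, router_weights):
    """Top-1 routing: return (mode, probabilities) for each token."""
    decisions = []
    for x in token_features:
        # One logit per mode: a linear projection of the token features.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
        probs = softmax(logits)
        best = max(range(len(MODES)), key=probs.__getitem__)
        decisions.append((MODES[best], probs))
    return decisions


# Hypothetical features per token: [is_query, is_filler].
router_weights = [
    [0.0, 1.0],  # ssm: favours filler tokens
    [0.0, 0.5],  # sliding_window
    [2.0, 0.0],  # full_attention: favours query tokens
    [0.0, 0.0],  # identity
]
tokens = [[1.0, 0.0], [0.0, 1.0]]
for mode, probs in route(tokens, router_weights):
    print(mode, [round(p, 3) for p in probs])
```

In a trained model the weights are learned and the choice is typically made differentiable (for example by mixing mixer outputs with the probabilities); the hard argmax here only keeps the example short.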
MNEMOSYNE reference impl · 121 tests
Pain attacked: any fixed recurrent state is fundamentally lossy at long context.
Primitive: a neural memory module (MLP) whose weights update at inference via surprise-gated gradient steps, plus a sparse exact-recall sidecar.
Marquee target (a goal, not measured): 256K needle-in-a-haystack ≥ 80 % with constant decode memory; clear streaming-adaptation advantage over a Mamba-2 baseline.
Current state: from-scratch reimplementation; 121 tests pass on CPU. The chunked-parallel TTT memory matches the sequential reference to < 10⁻¹⁰ in fp64 at chunk_size = 1. Correctness-tested end to end (overfit smoke train); no learning-on-data or comparative result yet.
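The surprise-gated update can be illustrated with a deliberately simplified memory. The sketch assumes a single linear map instead of the MLP described above, a squared-error recall loss, and a hard threshold as the gate; `memory_step`, `lr`, and `threshold` are hypothetical names. Surprise is taken to be the norm of the recall error, and the memory is only written when that error is large enough.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def memory_step(memory, key, value, lr=0.5, threshold=0.1):
    """One inference-time write: gradient step on 0.5 * ||memory @ key - value||^2.

    Returns (memory, surprise). If surprise <= threshold the gate is closed
    and the same memory object is returned unchanged.
    """
    error = [p - v for p, v in zip(matvec(memory, key), value)]
    surprise = math.sqrt(sum(e * e for e in error))
    if surprise <= threshold:
        return memory, surprise
    # The gradient of the loss with respect to the memory is outer(error, key).
    updated = [[m - lr * e * k for m, k in zip(row, key)]
               for row, e in zip(memory, error)]
    return updated, surprise


memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [2.0, -1.0]
surprises = []
for _ in range(10):
    memory, surprise = memory_step(memory, key, value)
    surprises.append(surprise)
print([round(s, 3) for s in surprises])
```

Repeated exposure to the same key/value pair shrinks the surprise until the gate closes, after which the memory stops changing. The sparse exact-recall sidecar is not modelled here.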
NOESIS primitives · 87 tests
Pain attacked: token-space CoT wastes compute on linguistic glue.
Primitive: continuous-thought reasoning with stochastic latent (enabling clean RL), adaptive think-budget policy, and latent verifier.
Marquee target (a goal, not measured): ≥ 3× fewer total inference tokens than a discrete CoT-RL baseline at matched MATH accuracy.
Current state: from-scratch reimplementation of the primitives; 87 tests pass on CPU. The stochastic latent loop's REINFORCE score-function gradient is verified against the analytical form and finite-difference. No training loop, task, or comparative result yet.
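The kind of gradient check described above can be reproduced on a toy problem. The sketch below is a stand-in, not the NOESIS test: it assumes a small categorical policy with softmax logits `theta` and fixed `rewards` (both hypothetical), so the expectation can be summed exactly and the comparison is deterministic. It compares the score-function (REINFORCE) form, the closed-form gradient, and a central finite difference.

```python
import math


def softmax(theta):
    peak = max(theta)
    exps = [math.exp(t - peak) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_k p_k(theta) * r_k."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_grad(theta, rewards):
    """E_k[ r_k * d log p_k / d theta ], with the expectation summed exactly."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for k, (p_k, r_k) in enumerate(zip(probs, rewards)):
        for j in range(len(theta)):
            dlogp = (1.0 if j == k else 0.0) - probs[j]  # d log p_k / d theta_j
            grad[j] += p_k * r_k * dlogp
    return grad


def analytic_grad(theta, rewards):
    """Closed form for a softmax policy: dJ/dtheta_j = p_j * (r_j - J)."""
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]


def finite_diff_grad(theta, rewards, eps=1e-6):
    """Central finite difference of J with respect to each logit."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((expected_reward(up, rewards)
                     - expected_reward(down, rewards)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 3.0]
for name, fn in [("score", score_function_grad),
                 ("analytic", analytic_grad),
                 ("finite-diff", finite_diff_grad)]:
    print(name, [round(g, 6) for g in fn(theta, rewards)])
```

With sampled latents instead of an exact sum, the same check needs a fixed seed and a statistical tolerance, since the score-function estimate is then only unbiased, not exact.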
Aggregate status
Per-project status snapshot.
| Project | Layer | Status | Tests | Headline result |
| --- | --- | --- | --- | --- |
| CASCADE | Decoding | CPU phases complete | 55 ✓ / 2 skip | Cache equiv 1.57e-15; adaptive-K Spearman 1.000 (synthetic) |
| HELIX | Tokenization | Pre-scaling | 94 ✓ | BPB 8.24 → 4.89 in 80 CPU steps |
| CHIMERA | Sequence-mixer | Reference impl (CPU) | 76 ✓ | Nano MQAR routing head-to-head (directional) |
| MNEMOSYNE | Memory | Reference impl (CPU) | 121 ✓ | Chunked-TTT ≡ sequential < 1e-10 (fp64) |
| NOESIS | Reasoning | Primitives (CPU) | 87 ✓ | REINFORCE gradient verified vs. finite-diff |
Why these five
Every candidate had to meet all of the following criteria, which is why these five were chosen and not, say, neural Turing machines or quantum-attention proposals:
- Hardware-friendly: trains and serves on existing CUDA / Triton stack; no exotic accelerators required.
- Composable: plugs into MoE, RoPE, FlashAttention; doesn't break tokenizer-or-architecture pipelines (with the deliberate exception of HELIX, which makes the case for breaking tokenizers).
- Demonstrable on H100 budgets: the proposed phased plan completes at sub-2B scale in weeks, not months.
- Real-world pain point: addresses a bottleneck production users have complained about — KV memory, decode latency, multilingual cost, reasoning overhead, long-context recall.
- Has a proof point in the style of Mercury or LLaDA: at least one paper from 2024–2026 has shown the direction works at non-trivial scale.
The operating loop
Each project follows the same think-design-code-iterate loop, with explicit per-phase exit gates:
- THINK.md per phase: ≥ 3 alternative designs, the chosen design with explicit tradeoffs, failure modes ranked by likelihood, evidence-of-success plan.
- DESIGN.md with dataflow, tensor shapes, mermaid diagrams, public API contracts.
- CODE with type hints, shape-asserting docstrings, unit tests written before import.
- ITERATE against an explicit baseline. Anomalies get a POSTMORTEM.md, never silent fixes.
- BLOCKERS.md if blocked. No fabricated results.
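As an illustration of what "type hints, shape-asserting docstrings" can look like in the CODE step, here is a small sketch. The function `mean_pool` is hypothetical and does not come from any of the five projects; it only shows the convention of stating shapes in the docstring and enforcing them with assertions.

```python
from typing import List


def mean_pool(hidden: List[List[float]]) -> List[float]:
    """Average token vectors over the sequence axis.

    Shapes:
        hidden: [seq_len][d_model], with seq_len >= 1
        returns: [d_model]
    """
    assert len(hidden) > 0, "seq_len must be >= 1"
    d_model = len(hidden[0])
    assert all(len(row) == d_model for row in hidden), "rows must share d_model"
    pooled = [sum(column) / len(hidden) for column in zip(*hidden)]
    assert len(pooled) == d_model
    return pooled


print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

A test written first would pin both the happy path and the failure on ragged input, so a shape bug surfaces as an assertion and not as a silent wrong number.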