Five post-Transformer ideas, reimplemented from scratch

Minimal, annotated, runnable reimplementations of recent (2024–2026) language-model architecture ideas, each with honest small-scale CPU benchmarks.

These are educational reimplementations of published work, not original architectures and not trained-at-scale models. Each prototype targets a different ceiling the Transformer is hitting and uses a different primitive. In principle they compose into a single model: routed (CHIMERA), byte-patched (HELIX), block-diffusion-decoded (CASCADE), with memory (MNEMOSYNE) layers and a reasoning (NOESIS) head. Every number below is reproduced by a script in the code; goals that have not been measured are labelled as such.

The orthogonality argument

Each candidate addresses a different kind of bottleneck and uses a different kind of primitive. They were deliberately chosen so they do not collapse onto each other — five different directions, not five versions of the same idea.

                          Sequence-mixing primitive
            (CHIMERA: route across mixers)
                          │
                          │
   Memory model ◄─────────┼─────────► Tokenization layer
   (MNEMOSYNE:            │           (HELIX: bytes → learned
    state = trainable     │            hierarchical patches)
    function)             │
                          │
              Decoding paradigm           Reasoning protocol
              (CASCADE: block             (NOESIS: latent
               masked diffusion           continuous thought
               + adaptive K)               + adaptive budget + RL)

The five projects

CASCADE  Phases 0–3 complete (CPU)

Pain attacked: the throughput bottleneck of sequential autoregressive (AR) decoding.

Primitive: block masked diffusion with entropy-adaptive denoising step count and full KV-cache reuse across and within blocks; AR-to-diffusion distillation path.
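As a rough illustration of what "entropy-adaptive denoising step count" can mean, the sketch below maps the mean predictive entropy of a block to a step count K. The linear mapping, the function names, and the `k_min` / `k_max` defaults are assumptions made for this example; the project describes a learned adaptive-K head, which this hand-written rule does not reproduce.

```python
import numpy as np

def block_entropy(probs: np.ndarray) -> float:
    """Mean Shannon entropy (nats) over the positions of one block.

    probs: (block_len, vocab), each row a probability distribution.
    """
    safe = np.clip(probs, 1e-12, 1.0)          # avoid log(0)
    return float(-(probs * np.log(safe)).sum(axis=-1).mean())

def adaptive_steps(probs: np.ndarray, k_min: int = 1, k_max: int = 8) -> int:
    """Hypothetical rule: scale K linearly with normalised entropy."""
    vocab = probs.shape[-1]
    h_norm = block_entropy(probs) / np.log(vocab)   # 0 = certain, 1 = uniform
    h_norm = min(max(h_norm, 0.0), 1.0)
    return int(round(k_min + h_norm * (k_max - k_min)))

uniform = np.full((4, 16), 1.0 / 16)   # maximally uncertain block
one_hot = np.eye(16)[:4]               # fully confident block
steps_hard = adaptive_steps(uniform)
steps_easy = adaptive_steps(one_hot)
```

Under this rule a fully confident block gets the minimum number of denoising steps and a maximally uncertain block gets the maximum.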

Marquee target: ≥ 4× decode throughput vs. an AR teacher at within-2 pp quality, with streaming preserved.

Current state: from-scratch reimplementation; 55 tests pass on CPU (2 Phase-4 skips). Multi-layer KV-cache equivalence reproduces to 1.57 × 10⁻¹⁵ in fp64; the adaptive-K head reaches Spearman ρ = 1.000 between difficulty and chosen step count on a synthetic task; an end-to-end run converges 9.52 → 1.22 on a deliberately memorizable batch (a convergence sanity check, not a benchmark). No trained-at-scale or comparative result yet.
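A KV-cache equivalence check of the kind reported above can be sketched as follows: compute causal attention once over the full sequence, then again token by token with a growing cache, and compare in fp64. This is a single-layer, single-head NumPy toy with hypothetical function names, not the project's multi-layer test.

```python
import numpy as np

def causal_attention(q, k, v):
    """Full-sequence causal attention. q, k, v: (T, d), float64."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # positions j > i
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def cached_attention(q, k, v):
    """Same computation, one token at a time with a growing K/V cache."""
    T, d = q.shape
    k_cache, v_cache, out = [], [], []
    for t in range(T):
        k_cache.append(k[t])
        v_cache.append(v[t])
        K, V = np.stack(k_cache), np.stack(v_cache)
        s = K @ q[t] / np.sqrt(d)
        s = s - s.max()
        w = np.exp(s)
        w = w / w.sum()
        out.append(w @ V)
    return np.stack(out)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 8)) for _ in range(3))
max_abs_diff = float(np.abs(causal_attention(q, k, v) - cached_attention(q, k, v)).max())
```

The two paths should agree up to floating-point rounding, since they evaluate the same sums in a slightly different order.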

HELIX  Phase 4 — pre-scaling

Pain attacked: tokenization brittleness, multilingual inequity, static compute allocation.

Primitive: tokenizer-free byte-level model with learned differentiable hierarchical patching and cross-scale inference routing.

Marquee target: ≥ 20 % perplexity reduction on FLORES low-resource; ≥ 10 pp on CUTE character-level; robust to 10 % byte noise.

Current state: from-scratch reimplementation; 94 tests pass on CPU. The end-to-end hierarchical model learns: bits-per-byte drops 8.24 → 4.89 over 80 CPU steps on ~50 KB of English. No trained-at-scale or comparative result yet.
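For readers unfamiliar with the metric, bits-per-byte is the mean next-byte negative log-likelihood converted from nats to bits. The helper below is a minimal sketch with a hypothetical name and signature, not the project's evaluation code.

```python
import math
import numpy as np

def bits_per_byte(probs: np.ndarray, targets: np.ndarray) -> float:
    """probs: (N, 256) predicted next-byte distributions; targets: (N,) byte values."""
    nll_nats = -np.log(probs[np.arange(len(targets)), targets])
    return float(nll_nats.mean() / math.log(2))   # nats -> bits

data = np.frombuffer(b"hello, helix", dtype=np.uint8)
uniform = np.full((len(data), 256), 1.0 / 256)    # a predictor that knows nothing
bpb_uniform = bits_per_byte(uniform, data)
```

A uniform predictor over 256 byte values scores 8 bits per byte, which gives a reference point for reading the 8.24 → 4.89 trajectory: the starting value is slightly worse than uniform, and the final value is well below it.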

CHIMERA  reference impl · 76 tests

Pain attacked: long-context recall ceiling of pure SSMs.

Primitive: per-token learned routing across SSM, sliding-window attention, full attention, and identity.

Marquee target (a goal, not measured): ≥ 3× KV-memory reduction at ≥ 1.8× decode throughput vs. a dense Transformer at 32K, matching MQAR recall.

Current state: from-scratch reimplementation; 76 tests pass on CPU (fp64 parity, incl. prefill ≡ decode bit-identity for the multi-mode KV cache). Honest nano finding: a 3-way MQAR head-to-head where the router learns to send query positions to attention. This is directional, not a quality claim. No trained-at-scale result yet.
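The routing primitive can be pictured as a per-token softmax over four candidate mixers. The sketch below makes several simplifying assumptions that are not taken from the project: the mixture is soft (a weighted sum, not a hard top-1 choice), attention uses q = k = v = x with no learned projections, and a fixed exponential moving average stands in for the SSM. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, window=None):
    """Causal self-attention with q = k = v = x; optional sliding window."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    i, j = np.indices((T, T))
    blocked = j > i                               # no attending to the future
    if window is not None:
        blocked = blocked | ((i - j) >= window)   # nor beyond the window
    return softmax(np.where(blocked, -np.inf, scores)) @ x

def ema_scan(x, decay=0.9):
    """SSM stand-in: linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = decay * h + (1 - decay) * x[t]
        out[t] = h
    return out

def routed_mix(x, w_router):
    """x: (T, d); w_router: (d, 4). Returns (output (T, d), gates (T, 4))."""
    gates = softmax(x @ w_router)                 # per-token weights over mixers
    branches = np.stack(
        [ema_scan(x), attention(x, window=4), attention(x), x], axis=1
    )                                             # (T, 4, d)
    return (gates[:, :, None] * branches).sum(axis=1), gates

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
y, gates = routed_mix(x, rng.standard_normal((8, 4)) * 0.1)
```

Inspecting `gates` per position is one way to see which mixer each token is sent to, which is the kind of observation the nano MQAR finding describes.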

MNEMOSYNE  reference impl · 121 tests

Pain attacked: any fixed recurrent state is fundamentally lossy at long context.

Primitive: a neural memory module (MLP) whose weights update at inference via surprise-gated gradient steps, plus a sparse exact-recall sidecar.

Marquee target (a goal, not measured): 256K needle-in-a-haystack ≥ 80 % with constant decode memory; clear streaming-adaptation advantage over a Mamba-2 baseline.

Current state: from-scratch reimplementation; 121 tests pass on CPU. The chunked-parallel TTT memory matches the sequential reference to < 10⁻¹⁰ in fp64 at chunk_size = 1. Correctness-tested end to end (overfit smoke train); no learning-on-data or comparative result yet.
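To make "weights update at inference via surprise-gated gradient steps" concrete, here is a toy version. It swaps the MLP for a single linear map and uses a hard threshold as the gate; both are simplifications chosen for this example, and the class name, learning rate, and threshold are hypothetical.

```python
import numpy as np

class LinearMemory:
    """Toy associative memory: v_hat = M @ k, with M updated at inference time."""

    def __init__(self, dim, lr=0.5, surprise_threshold=1e-3):
        self.M = np.zeros((dim, dim))
        self.lr = lr
        self.surprise_threshold = surprise_threshold

    def read(self, k):
        return self.M @ k

    def write(self, k, v):
        """One gradient step on 0.5 * ||M k - v||^2, skipped when surprise is low.

        Returns the surprise (loss) measured before the update.
        """
        err = self.read(k) - v
        surprise = 0.5 * float(err @ err)
        if surprise > self.surprise_threshold:          # hard gate (assumption)
            self.M -= self.lr * np.outer(err, k)        # dL/dM = err k^T
        return surprise

rng = np.random.default_rng(0)
k = rng.standard_normal(8)
k /= np.linalg.norm(k)                                  # unit-norm key
v = rng.standard_normal(8)
mem = LinearMemory(dim=8)
surprises = [mem.write(k, v) for _ in range(20)]
recall_error = float(np.linalg.norm(mem.read(k) - v))
```

With a unit-norm key and `lr = 0.5`, each accepted write halves the recall error, so the gate closes after a handful of repetitions and later writes of the same pair leave the memory untouched.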

NOESIS  primitives · 87 tests

Pain attacked: token-space CoT wastes compute on linguistic glue.

Primitive: continuous-thought reasoning with stochastic latent (enabling clean RL), adaptive think-budget policy, and latent verifier.

Marquee target (a goal, not measured): ≥ 3× fewer total inference tokens than a discrete CoT-RL baseline at matched MATH accuracy.

Current state: from-scratch reimplementation of the primitives; 87 tests pass on CPU. The stochastic latent loop's REINFORCE score-function gradient is verified against the analytical form and finite-difference. No training loop, task, or comparative result yet.
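A gradient check of this kind can be sketched on a one-dimensional Gaussian latent. The reward z², the parameter values, and the sample count are choices made for this example, not the project's test: with z ~ N(μ, σ²), E[z²] = μ² + σ², so the exact gradient with respect to μ is 2μ.

```python
import numpy as np

def reward(z):
    return z ** 2

def score_function_grad(mu, sigma, eps):
    """REINFORCE estimate of d/dmu E[reward(z)], with z = mu + sigma * eps."""
    z = mu + sigma * eps
    score = (z - mu) / sigma ** 2          # d/dmu log N(z; mu, sigma^2)
    return float(np.mean(reward(z) * score))

def finite_diff_grad(mu, sigma, eps, h=1e-4):
    """Central difference on the Monte Carlo objective, reusing the same noise."""
    def objective(m):
        return np.mean(reward(m + sigma * eps))
    return float((objective(mu + h) - objective(mu - h)) / (2 * h))

mu, sigma = 0.5, 1.0
eps = np.random.default_rng(0).standard_normal(400_000)
analytic = 2 * mu                          # d/dmu (mu^2 + sigma^2)
g_sf = score_function_grad(mu, sigma, eps)
g_fd = finite_diff_grad(mu, sigma, eps)
```

The score-function estimate is noisy (no baseline is subtracted here), so it should be compared with a tolerance that reflects Monte Carlo error instead of an exact match.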

Aggregate status

Per-project status snapshot.
Project    Layer           Status                Tests          Headline result
---------  --------------  --------------------  -------------  -----------------------------------------------------------
CASCADE    Decoding        CPU phases complete   55 ✓ / 2 skip  Cache equiv 1.57e-15; adaptive-K Spearman 1.000 (synthetic)
HELIX      Tokenization    Pre-scaling           94 ✓           BPB 8.24 → 4.89 in 80 CPU steps
CHIMERA    Sequence-mixer  Reference impl (CPU)  76 ✓           Nano MQAR routing head-to-head (directional)
MNEMOSYNE  Memory          Reference impl (CPU)  121 ✓          Chunked-TTT ≡ sequential < 1e-10 (fp64)
NOESIS     Reasoning       Primitives (CPU)      87 ✓           REINFORCE gradient verified vs. finite-diff

Why these five

Each candidate must satisfy all of the following. These criteria are why these five were chosen over, say, neural Turing machines or quantum-attention proposals:

  1. Hardware-friendly: trains and serves on existing CUDA / Triton stack; no exotic accelerators required.
  2. Composable: plugs into MoE, RoPE, FlashAttention; doesn't break existing tokenizer or architecture pipelines (with the deliberate exception of HELIX, which makes the case for breaking tokenizers).
  3. Demonstrable on H100 budgets: the proposed phased plan completes at sub-2B scale in weeks, not months.
  4. Real-world pain point: addresses a bottleneck production users have complained about — KV memory, decode latency, multilingual cost, reasoning overhead, long-context recall.
  5. Has a proof point comparable to Mercury or LLaDA: at least one paper in 2024–2026 has shown the direction works at non-trivial scale.

The operating loop

Each project follows the same think-design-code-iterate loop, with explicit per-phase exit gates:

  1. THINK.md per phase: ≥ 3 alternative designs, the chosen design with explicit tradeoffs, failure modes ranked by likelihood, evidence-of-success plan.
  2. DESIGN.md with dataflow, tensor shapes, mermaid diagrams, public API contracts.
  3. CODE with type hints, shape-asserting docstrings, unit tests written before import.
  4. ITERATE against an explicit baseline. Anomalies get a POSTMORTEM.md, never silent fixes.
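As an illustration of step 3, a function written in this style might look like the sketch below: type hints, a docstring that states tensor shapes, assertions that enforce them, and a test written alongside. The function and test names are hypothetical and are not taken from any of the five codebases.

```python
import numpy as np

def mean_pool_chunks(x: np.ndarray, chunk: int) -> np.ndarray:
    """Mean-pool a sequence of vectors into fixed-size chunks.

    Shapes:
        x:      (T, d), with T divisible by `chunk`
        return: (T // chunk, d)
    """
    assert x.ndim == 2, f"expected (T, d), got {x.shape}"
    T, d = x.shape
    assert T % chunk == 0, f"T={T} is not divisible by chunk={chunk}"
    out = x.reshape(T // chunk, chunk, d).mean(axis=1)
    assert out.shape == (T // chunk, d)
    return out

def test_mean_pool_chunks_shape_and_values() -> None:
    x = np.arange(12, dtype=np.float64).reshape(6, 2)
    out = mean_pool_chunks(x, chunk=3)
    assert out.shape == (2, 2)
    # First chunk covers rows [0, 1], [2, 3], [4, 5]; its mean is [2, 3].
    assert np.allclose(out[0], [2.0, 3.0])

test_mean_pool_chunks_shape_and_values()
```

Stating the shapes in the docstring and asserting them in the body means a shape mismatch fails at the call site with a readable message.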

If a phase is blocked, the blocker is recorded in BLOCKERS.md. No fabricated results.