Hierarchical Entropy-Linked Information eXchange

A tokenizer-free byte-level LM that spends FLOPs where it matters.

HELIX is a research prototype that pools raw bytes into learned variable-length patches, processes the patches through a multi-level hierarchical transformer, and is designed to route easy regions through a cheaper path (the cross-scale router is still a stub; see Component status). The architecture is end-to-end differentiable and provably causal at the byte level.

94 / 94 tests passing
BPB 8.24 → 4.89 in 80 CPU steps
CPU-friendly dev surface
Phase 4 — pre-scaling

What it is, in three bullets

No tokenizer

HELIX consumes raw UTF-8 bytes. No BPE, no SentencePiece. Multilingual text, code, and binary data go through the same input pipeline. Character-level tasks like "how many r's in strawberry" aren't artificially hard.
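
As a small illustration of the byte-level view (plain Python, not HELIX code), the snippet below turns text into the kind of UTF-8 byte ids a tokenizer-free model would consume. Counting a letter becomes a direct lookup on the byte sequence, and non-ASCII text simply expands to more bytes from the same 256-value alphabet. The example strings are arbitrary.

```python
# Illustration only: what "raw UTF-8 bytes" look like as model input ids.
text = "strawberry"
byte_ids = list(text.encode("utf-8"))  # one integer in [0, 255] per byte

# Counting a letter is a direct lookup on the byte sequence.
r_count = byte_ids.count(ord("r"))

# Non-ASCII characters expand to several bytes but stay in the same alphabet.
multilingual = "naïve 日本語"
multilingual_ids = list(multilingual.encode("utf-8"))

print(byte_ids)
print(r_count)
print(len(multilingual), len(multilingual_ids))
```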

Dynamic compute

A learned, differentiable segmenter pools bytes into variable-length patches based on entropy. High-entropy regions get small patches (more compute). Low-entropy regions get large patches (less compute). The bulk of the model runs at the patch level.
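
To make the entropy-to-patch-size relationship concrete, here is a minimal sketch that uses a hard threshold as a stand-in for the learned, differentiable segmenter. It is not the HELIX segmenter: the function name `entropy_patches`, the `threshold` and `max_patch_len` parameters, and the entropy values are all hypothetical.

```python
def entropy_patches(entropies, threshold=2.0, max_patch_len=8):
    """Split positions 0..n-1 into contiguous half-open patches [start, end).

    A new patch starts at position t when entropies[t] exceeds `threshold`
    or when the current patch has already reached `max_patch_len`.
    """
    patches = []
    start = 0
    for t in range(1, len(entropies)):
        if entropies[t] > threshold or t - start >= max_patch_len:
            patches.append((start, t))
            start = t
    if entropies:
        patches.append((start, len(entropies)))
    return patches


# Hypothetical per-byte entropies in bits: a predictable run, then a noisy one.
entropies = [0.3, 0.2, 0.1, 0.2, 0.1, 3.5, 3.1, 2.8, 0.4, 0.2]
patches = entropy_patches(entropies)
print(patches)
```

The predictable run collapses into one long patch, while each high-entropy byte opens a new, short patch, which is where the patch-level model would spend more steps per byte.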

Hierarchical

Two levels of patching: bytes → patches → super-patches. Each level has its own segmenter and latent transformer. Causality is preserved end-to-end via a strict cross-attention pattern with a learned BOS group.
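
The two-level grouping can be pictured as two index maps that compose. The sketch below uses made-up assignments (they are not output from HELIX) to show how a byte-to-patch map and a patch-to-super-patch map together determine which super-patch each byte falls into.

```python
# Hypothetical level-1 assignment: byte index -> patch index.
byte_to_patch = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
# Hypothetical level-2 assignment: patch index -> super-patch index.
patch_to_super = [0, 0, 1, 1]

# Composing the two maps gives the super-patch each byte ends up in.
byte_to_super = [patch_to_super[p] for p in byte_to_patch]


def group_sizes(assignment):
    """Count how many items fall into each group of an assignment list."""
    sizes = [0] * (max(assignment) + 1)
    for g in assignment:
        sizes[g] += 1
    return sizes


print(byte_to_super)
print(group_sizes(byte_to_patch), group_sizes(byte_to_super))
```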

Architecture

This section covers the full forward pass. The level-2 hierarchical refinement is opt-in via HelixLMConfig.hierarchy.

Figure: End-to-end forward pass through HelixLM. The dashed line is the optional byte_repr residual into the decoder.

The most subtle correctness point: in hard pooling mode, patches that straddle the prefix/suffix boundary would otherwise leak future-byte information into earlier logits. HELIX fixes this by attending only to closed groups in every cross-attention (byte→patch and patch→super-patch), plus a learned BOS group so the first-group case has a valid key. The same primitive is shared across both levels via helix/utils/causal_cross_attention.py.
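
The closed-group rule can be sketched as a boolean attention mask. This is one reading of the paragraph above, not the code in helix/utils/causal_cross_attention.py: it assumes a group counts as closed for byte t only if it ended before t's own group began, since whether the current group continues past t depends on bytes that have not arrived yet. The function name `closed_group_mask` and the layout (key 0 is BOS, key g + 1 is group g) are assumptions.

```python
def closed_group_mask(group_ids):
    """Return mask[t][k]: may the query at byte t attend to key k?

    Key 0 is the BOS group; key g + 1 is group g. Under the assumed rule,
    byte t sees only groups with an index strictly below its own group.
    """
    n_groups = max(group_ids) + 1
    mask = []
    for g_t in group_ids:
        row = [True]  # BOS is always visible, so every row has a valid key
        row += [g < g_t for g in range(n_groups)]
        mask.append(row)
    return mask


group_ids = [0, 0, 0, 1, 1, 2]  # hypothetical hard patch assignment per byte
mask = closed_group_mask(group_ids)
for t, row in enumerate(mask):
    print(t, row)
```

Bytes in the first group see only BOS, which is why a BOS key is needed at all. Truncating `group_ids` to a prefix leaves the earlier rows unchanged, which is the property the causal-consistency check relies on.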

Results

Smoke training run

Trained on the HELIX spec markdown (~50 KB of real English) for 80 steps on CPU with the default tiny_config().

BPB before training: 8.241
BPB after 80 steps: 4.894
Relative reduction: 40.6%
Wall time (CPU): ~4 s

Test: helix-lm/tests/test_training.py
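
For readers unfamiliar with the metric, bits per byte (BPB) is the mean per-byte cross-entropy expressed in bits. The sketch below assumes the loss is measured in nats (as with a natural-log cross-entropy); the helper name `bits_per_byte` is hypothetical and this is not the project's evaluation code. It also reproduces the relative reduction from the two reported numbers.

```python
import math


def bits_per_byte(mean_nll_nats):
    """Convert a mean per-byte cross-entropy from nats to bits."""
    return mean_nll_nats / math.log(2)


# A uniform guess over 256 byte values costs 8 bits per byte, so a
# starting BPB slightly above 8 is in the range of an untrained model.
uniform_bpb = bits_per_byte(math.log(256))

# Relative reduction for the reported before/after values.
bpb_before, bpb_after = 8.241, 4.894
relative_reduction = (bpb_before - bpb_after) / bpb_before

print(round(uniform_bpb, 3))
print(f"{relative_reduction:.1%}")
```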

FLOP comparison

Analytic per-byte FLOPs at the spec's small configuration, reproducing the headline in Appendix A.2.

HELIX: 246.7 MFLOPs / byte
BPE Transformer: 299.4 MFLOPs / byte
HELIX is cheaper by: 1.21×

Source: helix-lm/helix/utils/flop_counter.py
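
The headline ratio follows directly from the two per-byte figures. The snippet below only redoes that arithmetic and also expresses it as a fractional saving; it does not reproduce the analytic counter in flop_counter.py.

```python
# Per-byte FLOP figures as reported above (in MFLOPs).
helix_mflops = 246.7
bpe_mflops = 299.4

speedup = bpe_mflops / helix_mflops      # how many times cheaper HELIX is
savings = 1 - helix_mflops / bpe_mflops  # fraction of FLOPs saved per byte

print(f"{speedup:.2f}x", f"{savings:.1%}")
```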

Model sizes

Tiny configs used in tests run end-to-end on CPU in under a second.

single-level: 152,801 params
hierarchical (+L2): 174,434 params
L2 overhead: ~14%
test wall time (CPU): ~10 s for 94 tests

Causal consistency

The strongest correctness check we know how to express without a real KV cache.

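For a model instance model = HelixLM(cfg) and a batch-first byte tensor, model(bytes[:, :T_short])["logits"] is bit-identical to model(bytes[:, :T_long])["logits"][:, :T_short], with or without the level-2 hierarchy, in hard pooling mode. In other words, appending more bytes never changes the logits already produced for the prefix.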

Test: helix-lm/tests/test_causal_consistency.py
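
The shape of this check can be shown with toy functions in place of HelixLM. The sketch below is not the project's test: `causal_toy`, `leaky_toy`, and `is_prefix_consistent` are hypothetical names, and integer arithmetic stands in for logits so that exact equality is meaningful.

```python
def causal_toy(xs):
    """Running sum: output t depends only on xs[0..t]."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out


def leaky_toy(xs):
    """Every output adds the sequence total, so it sees future inputs."""
    total = sum(xs)
    return [x + total for x in xs]


def is_prefix_consistent(model, xs, t_short):
    """Check that model(xs[:t_short]) equals model(xs)[:t_short] exactly."""
    return model(xs[:t_short]) == model(xs)[:t_short]


data = [72, 69, 76, 73, 88]
print(is_prefix_consistent(causal_toy, data, 3))
print(is_prefix_consistent(leaky_toy, data, 3))
```

A causal function passes for every prefix length, while any dependence on later inputs shows up as a mismatch on the shared prefix.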

Quick start

Install

pip install -e ".[dev]"
# optional, training extras: pip install -e ".[dev,train]"
# optional, Linux + CUDA:     pip install -e ".[dev,train,cuda]"

Train a tiny model and watch the loss drop

import torch
from helix import HelixLM, tiny_config
from train.data.byte_stream import ByteFileDataset
from train.train import TrainConfig, train

cfg = tiny_config()
model = HelixLM(cfg)
data = ByteFileDataset("04_HELIX.md", seq_len=cfg.seq_len)
train(model, data, TrainConfig(n_steps=80, batch_size=4, lr=5e-3))
# BPB drops ~8.2 -> ~4.9 within the 80 steps (a few seconds to ~30 s on CPU, depending on the machine).

Hierarchical model

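import torch
from helix import HelixLM, tiny_hierarchical_config
cfg = tiny_hierarchical_config()
model = HelixLM(cfg)
# Dummy batch of raw byte ids, shape (batch, seq_len); assumes cfg.seq_len exists as it does for tiny_config().
bytes_in = torch.randint(0, 256, (1, cfg.seq_len), dtype=torch.int64)
out = model(bytes_in)
# out["L2_boundary_probs"], out["patch_to_super"] expose the level-2 segmentation.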

Decode

prompt = torch.randint(0, 256, (1, 8), dtype=torch.int64)
generated = model.generate(prompt, max_new_bytes=16, greedy=True)
# step-by-step generation is provably equivalent to prefill.
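
For intuition about what greedy decoding without a KV cache does, here is a toy loop in plain Python. It is not HelixLM.generate(): `toy_logits` is a hypothetical stand-in for a model forward pass, and `greedy_generate` only loosely mirrors the call above.

```python
def toy_logits(seq):
    """Stand-in for a model forward: one row of 256 scores per position.

    The toy rule scores (byte + 1) % 256 highest at each position.
    """
    rows = []
    for b in seq:
        row = [0.0] * 256
        row[(b + 1) % 256] = 1.0
        rows.append(row)
    return rows


def greedy_generate(logits_fn, prompt, max_new_bytes):
    """Append the argmax byte one step at a time, re-running the full prefix."""
    seq = list(prompt)
    for _ in range(max_new_bytes):
        last = logits_fn(seq)[-1]  # no KV cache: the whole prefix is recomputed
        seq.append(max(range(256), key=last.__getitem__))
    return seq


generated_toy = greedy_generate(toy_logits, [65, 66], 4)
print(generated_toy)
```

Because each step only reads the last row of a full forward pass, the loop matches prefill exactly whenever the model is prefix-consistent.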

Component status

Implementation status by HELIX phase
Phase	Component	Status
0	Bootstrap (pyproject, scaffold)	done
0	14 lit notes	not started — refused to fabricate
1	N-gram hash embeddings	tested
1	Local encoder (windowed causal)	tested
2	Differentiable soft pool + segmenter	tested
2	Entropy heuristic ByteLM	tested
3	Local decoder + BOS-causal cross-attn	tested
3	Latent transformer (single-level)	tested
3	Level-2 hierarchy	tested
3	Byte-level causal consistency	passing
3	HelixLM.generate()	tested
3	Cross-scale router	stub
3	KV cache	stub
4	Trainer + BPB eval	tested, learning verified
5–6	Scaling, ablations, paper	not started

What I'd build next

Pretrain a calibration ByteLM and hook up the entropy warmup loss. This should make the segmenter much less prone to collapse mode #1 at the start of real training.
Switch the data source to FineWeb-Edu via a streaming reader. The current smoke test uses the spec markdown as a stand-in.
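Implement the KV cache so generate() stops re-running the full prefix at every step, cutting the number of positions forwarded over a T-byte generation from O(T²) to O(T). Correctness invariants are already proven, so this is purely a performance concern.
Implement the cross-scale router once training surfaces a path imbalance for it to balance against.
Write the lit notes after actually reading the 14 papers, not before.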
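
The KV-cache item can be quantified with a simple count of positions pushed through the model during generation. This is a back-of-the-envelope sketch with a hypothetical helper, `positions_forwarded`. It counts forwarded positions rather than FLOPs, and attention over cached keys still grows with sequence length even with a cache.

```python
def positions_forwarded(prompt_len, new_bytes, use_cache):
    """Count byte positions run through the model while generating.

    Without a cache, step i re-forwards the prompt plus the i bytes
    generated so far. With a cache, the prompt is forwarded once and each
    later step forwards only the newest byte.
    """
    if new_bytes == 0:
        return 0
    if use_cache:
        return prompt_len + (new_bytes - 1)
    return sum(prompt_len + i for i in range(new_bytes))


# Same shape as the decode example: 8 prompt bytes, 16 new bytes.
no_cache = positions_forwarded(8, 16, use_cache=False)
with_cache = positions_forwarded(8, 16, use_cache=True)
print(no_cache, with_cache)
```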

Caveats

Not paper-ready. This is a research scaffold. The trainer is single-process, CPU-friendly, and intentionally small. Real Phase 4–6 work requires GPUs and FineWeb-Edu data.
14 lit notes are unwritten. The spec required summaries of 14 papers, including BLT, MegaByte, ByT5, MambaByte, and Dynamic Token Pooling. I refused to fabricate them; the placeholder explains exactly which papers are needed and how to write the notes once you have read them.
Strict-causal cross-attention is a deliberate deviation from the naive spec reading. The naive reading ("cross-attend to all patches with index ≤ current") leaks future bytes into earlier logits. The fix attends only to closed groups plus a learned BOS. Documented in the project README's Spec deviations section.
flash-attn is gated behind the Linux + CUDA extra. The default install path uses portable SDPA, which performs the same FLOPs but runs slower and uses more memory.