Hierarchical Entropy-Linked Information eXchange

A tokenizer-free byte-level LM that spends FLOPs where it matters.

HELIX is a research prototype that pools raw bytes into learned variable-length patches, processes patches through a multi-level hierarchical transformer, and routes easy regions through a cheap path. The architecture is end-to-end differentiable and provably causal at the byte level.

Quick start See the architecture

What it is, in three bullets

Architecture

The full forward pass, with the level-2 hierarchical refinement opt-in via HelixLMConfig.hierarchy.

HELIX architecture data flow Bytes flow into the local encoder (windowed causal byte Transformer with n-gram embeddings), producing byte representations. The segmenter emits per-byte boundary probabilities, which combined with the byte representations are soft-pooled into level-1 patches. A level-1 latent transformer processes patches causally. Optionally, a level-2 stack segments level-1 patches into super-patches, runs a level-2 latent transformer, and feeds back via strict-causal cross-attention. Finally the local decoder produces next-byte logits, taking both the refined patches and an optional byte representation residual. bytes (B, T) int LocalEncoder windowed causal + n-gram hash embed byte_repr Segmenter causal MLP head boundary_probs soft_pool (hard) cumsum + BOS-causal patch_repr (B, P, d) Latent L1 causal across patches patch_out Level-2 stack (optional) L2 segmenter + soft_pool + Latent L2 + causal xattn Local Decoder BOS-causal cross-attn logits (B, T, 256) bytes patches refined byte_repr residual (optional)
End-to-end forward through HelixLM. The dashed line is the optional byte_repr residual into the decoder.

The most subtle correctness point: in hard pooling mode, patches that straddle the prefix/suffix boundary would otherwise leak future-byte information into earlier logits. HELIX fixes this by attending only to closed groups in every cross-attention (byte→patch and patch→super-patch), plus a learned BOS group so the first-group case has a valid key. The same primitive is shared across both levels via helix/utils/causal_cross_attention.py.

Results

Smoke training run

Trained on the HELIX spec markdown (~50 KB of real English) for 80 CPU steps with the default tiny_config().

BPB before training
8.241
BPB after 80 steps
4.894
Relative reduction
40.6%
Wall time (CPU)
~4 s

Test: helix-lm/tests/test_training.py

FLOP comparison

Analytic per-byte FLOPs at the spec's small configuration, reproducing the headline in Appendix A.2.

HELIX
246.7 MFLOPs / byte
BPE Transformer
299.4 MFLOPs / byte
HELIX is cheaper by
1.21×

Source: helix-lm/helix/utils/flop_counter.py

Model sizes

Tiny configs used in tests run end-to-end on CPU in under a second.

single-level
152,801 params
hierarchical (+L2)
174,434 params
L2 overhead
~14%
test wall time (CPU)
~10 s for 94 tests

Causal consistency

The strongest correctness check we know how to express without a real KV cache.

HelixLM(bytes[:T_short])["logits"][:, :T_short] is bit-identical to HelixLM(bytes[:T_long])["logits"][:, :T_short], with or without the level-2 hierarchy, in hard pooling mode.

Test: helix-lm/tests/test_causal_consistency.py

Quick start

Install

pip install -e ".[dev]"
# optional, training extras: pip install -e ".[dev,train]"
# optional, Linux + CUDA:     pip install -e ".[dev,train,cuda]"

Train a tiny model and watch the loss drop

import torch
from helix import HelixLM, tiny_config
from train.data.byte_stream import ByteFileDataset
from train.train import TrainConfig, train

cfg = tiny_config()
model = HelixLM(cfg)
data = ByteFileDataset("04_HELIX.md", seq_len=cfg.seq_len)
train(model, data, TrainConfig(n_steps=80, batch_size=4, lr=5e-3))
# BPB drops ~8.2 -> ~4.9 in ~30 s on CPU.

Hierarchical model

from helix import HelixLM, tiny_hierarchical_config
model = HelixLM(tiny_hierarchical_config())
out = model(bytes_in)
# out["L2_boundary_probs"], out["patch_to_super"] expose the level-2 segmentation.

Decode

prompt = torch.randint(0, 256, (1, 8), dtype=torch.int64)
generated = model.generate(prompt, max_new_bytes=16, greedy=True)
# step-by-step generation is provably equivalent to prefill.

Component status

Implementation status by HELIX phase
Phase Component Status
0Bootstrap (pyproject, scaffold)done
014 lit notesnot started — refused to fabricate
1N-gram hash embeddingstested
1Local encoder (windowed causal)tested
2Differentiable soft pool + segmentertested
2Entropy heuristic ByteLMtested
3Local decoder + BOS-causal cross-attntested
3Latent transformer (single-level)tested
3Level-2 hierarchytested
3Byte-level causal consistencypassing
3HelixLM.generate()tested
3Cross-scale routerstub
3KV cachestub
4Trainer + BPB evaltested, learning verified
5–6Scaling, ablations, papernot started

What I'd build next

  1. Pretrain a calibration ByteLM and hook up the entropy warmup loss. Should make the segmenter much less prone to collapse mode #1 at the start of real training.
  2. Switch the data source to FineWeb-Edu via a streaming reader. The current smoke test uses the spec markdown as a stand-in.
  3. Implement the KV cache so generate() drops from O(T2) to O(T). Correctness invariants are already proven, so this is purely a performance concern.
  4. Cross-scale router after training surfaces a path imbalance to balance against.
  5. Lit notes by actually reading the 14 papers, not before.

Caveats