Smoke training run
Trained on the HELIX spec markdown (~50 KB of real English) for 80 steps on CPU with the default tiny_config().
| Metric | Value |
|---|---|
| BPB before training | 8.241 |
| BPB after 80 steps | 4.894 |
| Relative reduction | 40.6% |
| Wall time (CPU) | ~4 s |
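The smoke-run figures can be checked with a few lines of arithmetic. The sketch below assumes BPB means bits per byte, computed as the mean next-byte cross-entropy in nats divided by ln 2. That convention and the helper names `bits_per_byte` and `relative_reduction` are illustrative assumptions, not taken from the HELIX trainer.

```python
import math

def bits_per_byte(mean_nll_nats: float) -> float:
    """Convert a mean per-byte cross-entropy in nats to bits per byte."""
    return mean_nll_nats / math.log(2)

def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

# Reference point: a uniform guess over 256 byte values costs exactly 8 bits.
uniform_bpb = bits_per_byte(math.log(256))
# Reported smoke-run numbers.
reduction = relative_reduction(8.241, 4.894)

print(f"{uniform_bpb:.3f}")  # 8.000
print(f"{reduction:.1%}")    # 40.6%
```

Under this convention, a starting value slightly above 8 is what one would expect from an untrained model whose predictions are a little worse than uniform.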
Hierarchical Entropy-Linked Information eXchange
HELIX is a research prototype that pools raw bytes into learned variable-length patches, processes patches through a multi-level hierarchical transformer, and routes easy regions through a cheap path. The architecture is end-to-end differentiable and provably causal at the byte level.
HELIX consumes raw UTF-8 bytes. No BPE, no SentencePiece. Multilingual text, code, and binary data go through the same input pipeline. Character-level tasks like "how many r's in strawberry" aren't artificially hard.
A learned, differentiable segmenter pools bytes into variable-length patches based on entropy. High-entropy regions get small patches (more compute). Low-entropy regions get large patches (less compute). The bulk of the model runs at the patch level.
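To make the entropy-to-patch-size relationship concrete, here is a minimal sketch of a hard, threshold-based segmenter. It is not HELIX's learned differentiable segmenter: the function name, the threshold rule, and the `max_len` cap are all assumptions chosen for illustration, and the per-byte entropies are taken as given rather than predicted by a model.

```python
def segment_by_entropy(entropies, threshold=2.0, max_len=8):
    """Split positions 0..len-1 into half-open patches (start, end).

    A new patch starts at position t when the entropy estimate at t exceeds
    `threshold`, or when the current patch already holds `max_len` bytes.
    """
    patches = []
    start = 0
    for t in range(1, len(entropies)):
        if entropies[t] > threshold or t - start >= max_len:
            patches.append((start, t))
            start = t
    if entropies:
        patches.append((start, len(entropies)))
    return patches

# Six predictable bytes, three surprising ones, then four predictable ones.
entropies = [0.1] * 6 + [3.0] * 3 + [0.2] * 4
patches = segment_by_entropy(entropies)
print(patches)  # [(0, 6), (6, 7), (7, 8), (8, 13)]
```

The low-entropy runs collapse into patches of six and five bytes, while each high-entropy byte opens a new patch, so the patch-level model spends more steps where the data is hard to predict.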
Two levels of patching: bytes → patches → super-patches. Each level has its own segmenter and latent transformer. Causality is preserved end-to-end via a strict cross-attention pattern with a learned BOS group.
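The two-level grouping can be pictured as two index maps that compose. The sketch below assumes each map is a plain list of group indices (byte to patch, patch to super-patch); the actual tensor layout HELIX uses is not specified here, so treat the representation and the `compose` helper as hypothetical.

```python
def compose(byte_to_patch, patch_to_super):
    """Map every byte position to the super-patch that contains its patch."""
    return [patch_to_super[p] for p in byte_to_patch]

# Eight bytes grouped into four patches, grouped into two super-patches.
byte_to_patch = [0, 0, 1, 1, 1, 2, 3, 3]
patch_to_super = [0, 0, 1, 1]
byte_to_super = compose(byte_to_patch, patch_to_super)
print(byte_to_super)  # [0, 0, 0, 0, 0, 1, 1, 1]
```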
The full forward pass, with the level-2 hierarchical refinement opt-in via HelixLMConfig.hierarchy.
[Forward-pass diagram omitted. Surviving label: byte_repr residual into the decoder.]
The most subtle correctness point: in hard pooling mode, a patch that straddles the prefix/suffix boundary would otherwise leak future-byte information into earlier logits. HELIX fixes this by attending only to closed groups in every cross-attention (byte→patch and patch→super-patch), plus a learned BOS group so the first-group case has a valid key. Both levels share the same primitive, helix/utils/causal_cross_attention.py.
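The closed-group rule can be illustrated with a boolean attention mask. This is a sketch under stated assumptions, not the contents of causal_cross_attention.py: it treats a group as closed at position t only when its last byte lies strictly before t, puts an always-visible BOS group in column 0, and uses a hypothetical helper name `closed_group_mask`.

```python
def closed_group_mask(group_ids):
    """Build mask[t][k]: may byte position t attend to key k?

    group_ids[t] is the index of the group containing byte t (non-decreasing,
    starting at 0). Key 0 is the BOS group; key g + 1 is group g.
    """
    n_groups = group_ids[-1] + 1 if group_ids else 0
    last = [0] * n_groups
    for t, g in enumerate(group_ids):
        last[g] = t  # final value is the last byte index of group g
    mask = []
    for t in range(len(group_ids)):
        row = [True]  # BOS is always a valid key
        row += [last[g] < t for g in range(n_groups)]  # closed groups only
        mask.append(row)
    return mask

full_ids = [0, 0, 1, 1, 1, 2, 2, 3]
cut = 4  # truncate in the middle of group 1
full_mask = closed_group_mask(full_ids)
prefix_mask = closed_group_mask(full_ids[:cut])

# Rows of the prefix mask match the full mask on the shared columns,
# and no prefix position sees any group that only exists in the longer input.
width = len(prefix_mask[0])
same_on_prefix = all(full_mask[t][:width] == prefix_mask[t] for t in range(cut))
no_future_groups = all(not any(full_mask[t][width:]) for t in range(cut))
print(same_on_prefix, no_future_groups)  # True True
```

The strict inequality matters: with `<=` instead, a group truncated by the prefix would look closed at its last prefix byte even though the longer input shows it is still open, and the two masks would disagree.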
Analytic per-byte FLOPs at the spec's small configuration, reproducing the headline in Appendix A.2.
Tiny configs used in tests run end-to-end on CPU in under a second.
The strongest correctness check we know how to express without a real KV cache.
HelixLM(bytes[:T_short])["logits"][:, :T_short] is bit-identical to HelixLM(bytes[:T_long])["logits"][:, :T_short], with or without the level-2 hierarchy, in hard pooling mode.
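The shape of this check can be shown with toy functions standing in for the model. The sketch below is not the HELIX test suite: `prefix_consistent`, `running_sum`, and `mean_centered` are hypothetical names, and plain Python lists replace logits tensors.

```python
def prefix_consistent(fn, seq, t_short):
    """True if fn's outputs on the first t_short positions ignore later inputs."""
    from_long = fn(seq)[:t_short]
    from_short = fn(seq[:t_short])
    return from_long == from_short

def running_sum(seq):
    """Causal: output t depends only on inputs 0..t."""
    out, total = [], 0
    for x in seq:
        total += x
        out.append(total)
    return out

def mean_centered(seq):
    """Not causal: every output depends on the whole sequence."""
    mean = sum(seq) / len(seq)
    return [x - mean for x in seq]

seq = [1, 2, 3, 4, 5]
print(prefix_consistent(running_sum, seq, 3))    # True
print(prefix_consistent(mean_centered, seq, 3))  # False
```

Comparing for exact equality rather than closeness mirrors the bit-identical requirement: any dependence on future positions, however small, fails the check.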
```bash
pip install -e ".[dev]"
# optional, training extras: pip install -e ".[dev,train]"
# optional, Linux + CUDA: pip install -e ".[dev,train,cuda]"
```
```python
import torch
from helix import HelixLM, tiny_config
from train.data.byte_stream import ByteFileDataset
from train.train import TrainConfig, train

cfg = tiny_config()
model = HelixLM(cfg)
data = ByteFileDataset("04_HELIX.md", seq_len=cfg.seq_len)
train(model, data, TrainConfig(n_steps=80, batch_size=4, lr=5e-3))
# BPB drops ~8.2 -> ~4.9 in ~30 s on CPU.
```
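```python
import torch
from helix import HelixLM, tiny_hierarchical_config
cfg = tiny_hierarchical_config()
model = HelixLM(cfg)
bytes_in = torch.randint(0, 256, (1, cfg.seq_len), dtype=torch.int64)  # placeholder input; shape assumed
out = model(bytes_in)
# out["L2_boundary_probs"], out["patch_to_super"] expose the level-2 segmentation.
```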
```python
prompt = torch.randint(0, 256, (1, 8), dtype=torch.int64)
generated = model.generate(prompt, max_new_bytes=16, greedy=True)
# Step-by-step generation is provably equivalent to prefill.
```
| Phase | Component | Status |
|---|---|---|
| 0 | Bootstrap (pyproject, scaffold) | done |
| 0 | 14 lit notes | not started — refused to fabricate |
| 1 | N-gram hash embeddings | tested |
| 1 | Local encoder (windowed causal) | tested |
| 2 | Differentiable soft pool + segmenter | tested |
| 2 | Entropy heuristic ByteLM | tested |
| 3 | Local decoder + BOS-causal cross-attn | tested |
| 3 | Latent transformer (single-level) | tested |
| 3 | Level-2 hierarchy | tested |
| 3 | Byte-level causal consistency | passing |
| 3 | HelixLM.generate() | tested |
| 3 | Cross-scale router | stub |
| 3 | KV cache | stub |
| 4 | Trainer + BPB eval | tested, learning verified |
| 5–6 | Scaling, ablations, paper | not started |
With a real KV cache in place of the current stub, generate() drops from O(T²) to O(T). Correctness invariants are already proven, so this is purely a performance concern.
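The quadratic-versus-linear gap can be seen by counting how many byte positions the model has to process during generation. This is a deliberately simplified cost model, not a measurement of HELIX: it assumes that without a cache every step re-runs the forward pass over the whole sequence so far, and that with a cache only the prefill and one new position per step are processed. The function name is hypothetical.

```python
def positions_processed(prompt_len, new_bytes, cached):
    """Count byte positions run through the model while generating new_bytes."""
    total = 0
    length = prompt_len
    for step in range(new_bytes):
        if cached and step > 0:
            total += 1       # only the newest position is computed
        else:
            total += length  # full pass over everything seen so far
        length += 1
    return total

# Same sizes as the generate() example: 8-byte prompt, 16 new bytes.
uncached = positions_processed(8, 16, cached=False)
with_cache = positions_processed(8, 16, cached=True)
print(uncached, with_cache)  # 248 23
```

The uncached count grows with the square of the sequence length, while the cached count grows linearly, which is the gap a working KV cache would close.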