software
Open-source research code, agent infrastructure, and curated maps.
Open source is where the second half of the work lives: the system that meets each proven bound, the infrastructure the group runs on, and the maps I wish had existed when I started. Below is a curated selection across seven themes — the full set is on GitHub.
realm-retrieve
ReaLM-Retrieve. When to retrieve during reasoning, decided by an information-theoretic stopping rule rather than heuristics — the adaptive-RAG policy for large reasoning models.
Paper code 7
One repository per publication — the theory and the system that meets it, in the same artifact.
- deterministic-horizon
Tight upper and lower bounds on how far chain-of-thought carries a transformer before tool delegation becomes necessary — with explicit constants.
- SAGA
Workflow-atomic GPU-cluster scheduler for AI agents — within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated kernels and LangChain / AutoGen / CrewAI bridges.
- ke-bounds
Computable bounds on knowledge-editing side effects, plus the impossibility result ruling out perfect locality and generalization at once.
- FinGround
A three-stage verify-then-ground pipeline for financial document QA that detects and grounds hallucinated claims.
- RouteNLP
Conformal-coverage router for cost-aware LLM cascade serving.
- AgentEval
DAG-structured, step-level evaluation harness for multi-step agents, with error-propagation tracking.
- ComplianceNLP
Knowledge-graph-augmented retrieval for multi-framework regulatory gap detection.
Post-Transformer architectures 6
An exploratory program testing five candidate sequence architectures that go beyond attention. The five are designed to compose with one another.
- research-prototypes
The program site: five post-Transformer candidates evaluated head to head, with the orthogonality argument for why each targets a different ceiling.
- chimera-lm
Per-token learned routing across SSM, sliding-window, and full-attention mixers.
- helix-lm
Tokenizer-free, byte-level language model — Hierarchical Entropy-Linked Information eXchange.
- cascade-lm
Block-diffusion language model — parallel multi-token decoding within each block.
- mnemosyne-lm
Test-time neural memory — a memory module that learns and updates during inference, not just training.
- noesis-lm
Continuous-thought reasoning — thinks in latent space and allocates its own thinking budget per token.
Agent & MCP infrastructure 8
Local-first when possible, verifiable when not.
- Vannevar
An agentic harness with citation-grade memory: every fact carries a source URI, a temporal validity window, and an append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable.
- agent-memory
Verifiable memory for LLM agents: every recalled claim carries an HMAC signature that ties it back to its originating trajectory span.
- mcp-gateway
Turns any OpenAPI 3.x spec into a Model Context Protocol server, with auth, rate-limiting, and OpenTelemetry built in.
- mcp-postgres
A Postgres MCP server for agents with layered safety (role grants, a pglast AST guard, per-transaction envelopes, audit logging) and pgvector support. Supports PostgreSQL 13–17.
- paperbase-mcp
A research MCP server composing arXiv, Semantic Scholar, and OpenAlex — related work, citation graphs, and BibTeX in the chat.
- mcp-jupyter
An MCP server that hands coding agents live Jupyter kernel state — variables, dataframe summaries, plots, tracebacks — not just the notebook JSON.
- agent-tracer-2
OpenTelemetry-native, local-first observability for AI agents — DuckDB on disk, a localhost viewer, and adapters for Anthropic, OpenAI, LangGraph, AutoGen, and CrewAI.
- browser-skills
Fifteen reusable, agent-agnostic browser recipes plus an MCP server, so browser-using agents stop re-discovering cookie banners and infinite scroll.
Trustworthiness & verification 4
Guarantees that survive an audit, not just a benchmark.
- TrustKGRAG
Probabilistic certified robustness and anomaly detection against knowledge-graph poisoning in retrieval-augmented generation.
- conformalized-neural-operators
Distribution-free, spatially adaptive uncertainty quantification for neural-operator PDE surrogates via physics-informed conformal prediction.
- VerBPM
A temporal-logic framework for formal verification and repair of LLM-generated business-process models.
- SafeAnchor
Safety-preserving continual domain adaptation of LLMs via Fisher-based subspace identification and orthogonal gradient projection.
Evaluation & auditing 4
If a number can be gamed, assume it has been. Probes that check the benchmark before you trust the score.
- bench_audit
A library of probes for agent benchmarks — contamination, gold-answer leaks, harness-injection, and reward hacking, with confidence intervals on every result.
- benchprobe
Audits AI-agent benchmarks for the eight exploit families that quietly inflate reported scores.
- rag-bench
A small, reproducible benchmark for RAG pipelines.
- agent_eval
An open-source benchmark for Claude Code skill bundles and CLAUDE.md configs — pass@k, cost, and reliability.
Research maps & atlases 4
What I had to learn the hard way, verified and written down for the next person.
- awesome-llm-reasoning-foundations
A rigorously verified map of the theoretical foundations of LLM reasoning — transformer expressivity, chain-of-thought error bounds, circuit complexity, logical characterizations, learnability.
- llm-impossibility-results
An assumption-explicit catalog of published impossibility and lower-bound results for LLMs and agents — circuit-complexity ceilings, hallucination bounds, watermarking, alignment.
- awesome-reasoning-models-theory
A theory-first map of why reasoning models (o1/o3, DeepSeek-R1, Claude-thinking, QwQ) actually work — chapters, annotated papers, model comparisons, and reproduction notebooks.
- awesome-llm-circuits-atlas
An interactive atlas of discovered circuits and sparse-autoencoder features in LLMs, with Colab reproductions on open-weights models.
Interpretability & developer tools 4
Make model internals visible; keep the agent stack honest.
- see-the-ai-think
Watch an LLM think — sparse-autoencoder features firing live across every token, on a laptop, no GPU required.
- promptlock
A production prompt workflow — semantic diff, eval-on-PR, lockfile, drift detection, and rollback for markdown prompts in a repo.
- llm-fossils
A reproducible catalog of LLM behaviors that vanished as models scaled.
- semantic-grep
Local semantic code search — a CLI and MCP server that run entirely on your machine.