software | Dongxin Guo

Open source is where the second half of the work lives: the system that meets each proven bound, the infrastructure the group runs on, and the maps I wish had existed when I started. Below is a curated selection across seven themes; the full set is on GitHub.

bettyguo on GitHub

38 curated repositories

8 peer-reviewed artifacts

7 research themes

Spotlight SIGIR '26

realm-retrieve

ReaLM-Retrieve. When to retrieve during reasoning, decided by an information-theoretic stopping rule rather than heuristics: the adaptive-RAG policy for large reasoning models.


117 stars on GitHub

01 Paper code | 02 Post-Transformer architectures | 03 Agent & MCP infrastructure | 04 Trustworthy & verification | 05 Evaluation & auditing | 06 Research maps & atlases | 07 Interpretability & developer tools

Paper code (7)

One repository per publication — the theory and the system that meets it, in the same artifact.

deterministic-horizon

ICML '26 | Python

Tight upper and lower bounds on how far chain-of-thought carries a transformer before tool delegation becomes necessary, with explicit constants.

SAGA

HPDC '26 | C++

Workflow-atomic GPU-cluster scheduler for AI agents: within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated kernels and LangChain / AutoGen / CrewAI bridges.

ke-bounds

TMLR '26 | Python

Computable bounds on knowledge-editing side effects, plus the impossibility result ruling out perfect locality and generalization at once.

FinGround

ACL '26 Industry | Python

A three-stage verify-then-ground pipeline for financial document QA that detects and grounds hallucinated claims.

RouteNLP

ACL '26 Industry | Python

Conformal-coverage router for cost-aware LLM cascade serving.

AgentEval

ACL '26 Industry | Python

DAG-structured, step-level evaluation harness for multi-step agents, with error-propagation tracking.

ComplianceNLP

ACL '26 Industry | Python

Knowledge-graph-augmented retrieval for multi-framework regulatory gap detection.

Post-Transformer architectures (6)

An exploratory program testing five candidate sequence architectures beyond attention. They are designed to compose.

research-prototypes

Python

The program site: five post-Transformer candidates evaluated head to head, with the orthogonality argument for why each targets a different ceiling.

chimera-lm

Python

Per-token learned routing across SSM, sliding-window, and full-attention mixers.

helix-lm

Python

Tokenizer-free, byte-level language model (Hierarchical Entropy-Linked Information eXchange).

cascade-lm

Python

Block-diffusion language model with parallel multi-token decoding within each block.

mnemosyne-lm

Python

Test-time neural memory: a memory module that learns and updates during inference, not just training.

noesis-lm

Python

Continuous-thought reasoning: thinks in latent space and allocates its own thinking budget per token.

Agent & MCP infrastructure (8)

Local-first when possible, verifiable when not.

Vannevar

Rust

An agentic harness with citation-grade memory: every fact carries a source URI, a temporal validity window, and an append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable.

agent-memory

Python

Verifiable memory for LLM agents: every recalled claim is HMAC-signed back to its originating trajectory span.

mcp-gateway

Go

Turns any OpenAPI 3.x spec into a Model Context Protocol server, with auth, rate-limiting, and OpenTelemetry built in.

mcp-postgres

Python

A Postgres MCP server for agents with layered safety: role grants, a pglast AST guard, per-transaction envelopes, audit logging, and pgvector. Supports PostgreSQL 13–17.

paperbase-mcp

Python

A research MCP server composing arXiv, Semantic Scholar, and OpenAlex: related work, citation graphs, and BibTeX in the chat.

mcp-jupyter

Python

An MCP server that hands coding agents live Jupyter kernel state (variables, dataframe summaries, plots, tracebacks), not just the notebook JSON.

agent-tracer-2

Python

OpenTelemetry-native, local-first observability for AI agents: DuckDB on disk, a localhost viewer, and adapters for Anthropic, OpenAI, LangGraph, AutoGen, and CrewAI.

browser-skills

Python

Fifteen reusable, agent-agnostic browser recipes plus an MCP server, so browser-using agents stop re-discovering cookie banners and infinite scroll.

Trustworthy & verification (4)

Guarantees that survive an audit, not just a benchmark.

TrustKGRAG

Python

Probabilistic certified robustness and anomaly detection against knowledge-graph poisoning in retrieval-augmented generation.

conformalized-neural-operators

Python

Distribution-free, spatially adaptive uncertainty quantification for neural-operator PDE surrogates via physics-informed conformal prediction.

VerBPM

Python

A temporal-logic framework for formal verification and repair of LLM-generated business-process models.

SafeAnchor

Python

Safety-preserving continual domain adaptation of LLMs via Fisher-based subspace identification and orthogonal gradient projection.

Evaluation & auditing (4)

If a number can be gamed, assume it has been. Probes that check the benchmark before you trust the score.

bench_audit

Python

A library of probes for agent benchmarks: contamination, gold-answer leaks, harness injection, and reward hacking, with confidence intervals on every result.

benchprobe

Python

Audits AI-agent benchmarks for the eight exploit families that quietly inflate reported scores.

rag-bench

Python

A small, reproducible benchmark for RAG pipelines.

agent_eval

Python

An open-source benchmark for Claude Code skill bundles and CLAUDE.md configs, reporting pass@k, cost, and reliability.
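For readers unfamiliar with the pass@k metric named above, here is a minimal sketch of the commonly used unbiased estimator: given n sampled attempts of which c passed, it estimates the chance that at least one of k attempts passes. The function name `pass_at_k` and its signature are illustrative assumptions; how agent_eval itself computes the metric is not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed.

    Illustrative sketch only (hypothetical helper, not agent_eval's API).
    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n attempts contains no passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 attempts per task, 3 of them passed.
    for k in (1, 5, 10):
        print(k, pass_at_k(n=10, c=3, k=k))
```

Drawing more samples than k (n > k) and applying this formula gives a lower-variance estimate than running exactly k attempts and checking whether any passed.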

Research maps & atlases (4)

What I had to learn the hard way, verified and written down for the next person.

awesome-llm-reasoning-foundations

A rigorously verified map of the theoretical foundations of LLM reasoning: transformer expressivity, chain-of-thought error bounds, circuit complexity, logical characterizations, learnability.

llm-impossibility-results

An assumption-explicit catalog of published impossibility and lower-bound results for LLMs and agents: circuit-complexity ceilings, hallucination bounds, watermarking, alignment.

awesome-reasoning-models-theory

Jupyter Notebook

A theory-first map of why reasoning models (o1/o3, DeepSeek-R1, Claude-thinking, QwQ) actually work: chapters, annotated papers, model comparisons, and reproduction notebooks.

awesome-llm-circuits-atlas

An interactive atlas of discovered circuits and sparse-autoencoder features in LLMs, with Colab reproductions on open-weights models.

Interpretability & developer tools (4)

Make model internals visible; keep the agent stack honest.

see-the-ai-think

Python

Watch an LLM think: sparse-autoencoder features firing live across every token, on a laptop, no GPU required.

promptlock

Go

A production prompt workflow: semantic diff, eval-on-PR, lockfile, drift detection, and rollback for markdown prompts in a repo.

llm-fossils

Jupyter Notebook

A reproducible catalog of LLM behaviors that vanished as models scaled.

semantic-grep

Python

Local semantic code search: a CLI and MCP server that run entirely on your machine.