Dongxin Guo
PhD Candidate @ HKU Computer Science · COO @ Stellaris AI · AI Researcher @ Brain Investing
I am a final-year PhD candidate in the Department of Computer Science at The University of Hong Kong, advised by Prof. Siu-Ming Yiu.
My research establishes the fundamental limits of modern AI systems, and builds the infrastructure that lives correctly inside those limits. I work on what large language models can and cannot do, what guarantees their reasoning can carry, and how multiple AI agents can coordinate without manipulation. Then I ship the systems that turn those theorems into products you can actually deploy. Recent work appears at top venues across machine learning (ICML), NLP (ACL), information retrieval (SIGIR), and high-performance systems (HPDC).
How I work
A consistent methodological signature runs through every paper, whatever the topic. It is the recipe I keep using, whether the next problem is a transformer expressivity bound or a multi-tenant GPU scheduler.
- Tight bounds, not loose ones. I aim for matching upper and lower bounds with explicit constants. The kind of proof that says “this is the limit, and nothing in this class of systems can beat it.” The Deterministic Horizon (ICML 2026) does this for how far a transformer can carry a chain of thought before its internal bookkeeping breaks down. The bound is not “somewhere around here.” It is a specific architectural ceiling, with both directions proved.
- Impossibility, paired with what to do instead. When something cannot be done, the theorem becomes a design constraint rather than a wall. My ICML 2026 Position paper proves that today’s “explainable AI” methods cannot satisfy the regulatory mandates of major financial authorities, and the same proof characterizes what a compliant explanation system would have to look like. FinGround (ACL 2026) is the constructive counterpart for hallucinations: an impossibility-style argument tells you which claims need verifying, and the system verifies them.
- Guarantees that survive the real world. Distribution-free coverage, conformal prediction, fair scheduling. The guarantees in my work do not assume the test data resembles the training data, the workload stays still, or the user is honest. RouteNLP (ACL 2026) wraps cost-aware model routing in a conformal cascade, so the quality guarantee holds under the distribution shift that production traffic actually creates (a minimal sketch of the calibration step appears after this list). SAGA (HPDC 2026) gives provable per-tenant fairness when many users share the same GPUs.
- Sharp thresholds that explain emergent behavior. When a model suddenly starts reasoning at a certain training step, when a human-AI team stops outperforming individuals, when retrieval starts hurting accuracy instead of helping, there is usually a closed-form quantity that predicts the transition. When Can Human-AI Teams Outperform Individuals? (CogSci 2026) gives one such threshold and rules out a collaboration advantage above it. ReaLM-Retrieve (SIGIR 2026) uses uncertainty thresholds to decide when a reasoning model should pause and look something up.
- Theory paired with the system that respects it. The same paper that proves the bound contains the algorithm that meets it. OT-Route recasts Mixture-of-Experts routing as an optimal-transport problem, proves a load-balancing guarantee, and ships the Sinkhorn-based algorithm that achieves it. It replaces the hand-tuned auxiliary losses the field has relied on for years.
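To make the conformal cascade in the RouteNLP bullet concrete, here is a minimal sketch of split-conformal calibration for a two-model cascade. The model names, the risk score, and the escalation rule below are illustrative assumptions of mine, not RouteNLP's actual construction, and this vanilla recipe assumes exchangeable data; surviving real distribution shift takes more than this (for example, weighted or adaptive conformal variants).

```python
# Minimal sketch of split-conformal calibration for a cheap-vs-strong cascade.
# Everything here (model names, risk score, escalation rule) is an assumption
# for illustration, not RouteNLP's actual method.
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal quantile of calibration nonconformity (risk) scores.

    Under exchangeability, a fresh example's risk exceeds this threshold
    with probability at most alpha, whatever the underlying distribution.
    """
    n = len(cal_scores)
    # Finite-sample-corrected quantile level, clipped to 1.0 for small n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(cal_scores, level, method="higher"))

def route(query, cheap_model, strong_model, risk_score, threshold):
    """Answer with the cheap model unless its calibrated risk is too high."""
    answer = cheap_model(query)
    if risk_score(query, answer) > threshold:
        # Risk above the calibrated quantile: escalate to the stronger model.
        return strong_model(query)
    return answer
```

The value of the calibration step is that the threshold carries a finite-sample guarantee that does not depend on the risk score being well calibrated or the data following any particular distribution.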
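The OT-Route bullet names a Sinkhorn-based algorithm; the sketch below shows the basic shape of that step under my own simplifying assumptions (uniform expert capacities, an entropic regularizer, a final top-1 assignment). The actual objective, constraints, and guarantees are those in the paper, not this toy.

```python
# Minimal sketch of Sinkhorn-style balanced token-to-expert assignment.
# Uniform capacities, the regularizer eps, and top-1 rounding are illustrative
# choices; OT-Route's actual formulation may differ.
import numpy as np

def sinkhorn_route(affinity: np.ndarray, eps: float = 0.05, iters: int = 50) -> np.ndarray:
    """Compute an entropic-OT plan between tokens and experts with equal
    expected load per expert, then assign each token to its top expert.

    affinity: (num_tokens, num_experts) router logits, assumed roughly O(1);
    a log-domain implementation is needed for stability at very small eps.
    """
    n_tok, n_exp = affinity.shape
    K = np.exp(affinity / eps)           # Gibbs kernel (cost = -affinity)
    r = np.full(n_tok, 1.0 / n_tok)      # token marginal: each token routed once
    c = np.full(n_exp, 1.0 / n_exp)      # expert marginal: balanced load
    u, v = np.ones(n_tok), np.ones(n_exp)
    for _ in range(iters):               # alternating marginal projections
        u = r / (K @ v)
        v = c / (K.T @ u)
    plan = u[:, None] * K * v[None, :]   # transport plan, rows ~ r, cols ~ c
    return plan.argmax(axis=1)           # hard top-1 expert per token
```

The balancing comes from the column marginal: every expert is forced to receive the same total transport mass, which is the role the hand-tuned auxiliary losses have been playing.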
What I have built
The published portfolio falls into three interconnected threads. Each one is anchored by a limit (an architectural ceiling, an impossibility result, or a fairness guarantee) and a deployable system that lives correctly inside it.
- When should an LLM stop thinking and start using tools? Language models can extend their reasoning by generating long internal chains of thought, but doing more is not always better. My ICML paper The Deterministic Horizon shows that a transformer can only keep track of so many evolving facts on its own before its reasoning quietly degrades, and that the right move past that ceiling is not “think harder” but “delegate to a tool.” My SIGIR paper ReaLM-Retrieve carries the same idea into retrieval-augmented generation. Rather than dumping documents into the context once at the start, it watches the model’s reasoning step by step and pulls in evidence only at the moments where the model is genuinely uncertain (a sketch of this trigger appears after this list). Together, the two papers argue that when a model reaches outside itself (for a calculator, a search engine, or a knowledge base) is itself a research question with a principled answer.
- Trustworthy LLMs in regulated, production settings. Four ACL Industry Track papers, paired with an ICML Position paper, form a deployment-grade stack for the parts of industry that cannot tolerate hallucinations or opaque decisions. They address four of the hardest practical questions in turn: which model in a portfolio should answer this query (cost-aware routing with statistical quality guarantees); did a multi-step agent actually do its job (an evaluation framework that opens up the agent’s workflow as a structured graph rather than collapsing it into a single end-to-end score); can a financial institution trust an LLM to read tens of thousands of regulatory updates a year (a knowledge-graph-grounded compliance pipeline); and how do we stop a finance system from inventing numbers (a hallucination-grounding system that decomposes every answer into atomic claims and verifies each against source documents; this decompose-and-verify loop is sketched after the list). The accompanying ICML Position paper takes the harder line. Today’s “explainable AI” methods, tested against the actual mandates of major financial regulators, simply cannot satisfy them. It is a deliberate impossibility argument, intended to reframe what compliance-grade AI has to look like.
- The infrastructure that makes LLM serving work. Once language models start running real workloads, the bottleneck stops being the model and becomes the system around it. SAGA, my HPDC paper, addresses the case of AI agents that fire off dozens of LLM calls per task. Today’s GPU schedulers were built for one-shot inference and treat each call as independent, throwing away the shared context between steps. SAGA instead schedules the entire agent workflow as a single unit, with provable fairness across tenants sharing the same cluster. A companion line of work, OT-Route, attacks a different bottleneck inside the model itself: how Mixture-of-Experts routing decides which expert sees which token. It reframes that problem as one of optimal transport, with formal load-balancing guarantees.
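For the first bullet, here is a minimal sketch of what an uncertainty-triggered retrieval loop can look like. The helper names (generate_step, is_final, retrieve) and the fixed confidence threshold are placeholders I am assuming for illustration; they are not ReaLM-Retrieve's actual interface or stopping rule.

```python
# Minimal sketch of uncertainty-triggered retrieval during step-by-step
# reasoning. All helper names and the fixed threshold are hypothetical.
def reason_with_on_demand_retrieval(question, model, retriever,
                                    max_steps=10, conf_threshold=0.7):
    context = []   # evidence fetched so far
    trace = []     # reasoning steps produced so far
    for _ in range(max_steps):
        step, confidence = model.generate_step(question, context, trace)
        if confidence < conf_threshold:
            # The model is uncertain at this step: pause, fetch evidence for
            # the partial reasoning state, and regenerate the step with it.
            context.extend(retriever.retrieve(question, trace, step))
            step, confidence = model.generate_step(question, context, trace)
        trace.append(step)
        if model.is_final(step):
            break
    return trace
```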
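For the hallucination-grounding system in the second bullet, a minimal sketch of the decompose-and-verify loop: split the answer into atomic claims, find candidate support for each, and accept only what the sources entail. The helper names (extract_claims, find_support, entails) are hypothetical placeholders, not FinGround's actual pipeline.

```python
# Minimal sketch of decompose-then-verify grounding. Helper functions are
# passed in and hypothetical; FinGround's actual pipeline and scoring differ.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Verdict:
    claim: str                 # one atomic claim (e.g. a single figure or fact)
    supported: bool            # did any source passage entail it?
    evidence: Optional[str]    # the best-matching passage, if one was found

def ground_answer(answer: str, sources: List[str],
                  extract_claims: Callable, find_support: Callable,
                  entails: Callable) -> List[Verdict]:
    """Split an answer into atomic claims and check each against the sources."""
    verdicts = []
    for claim in extract_claims(answer):
        passage = find_support(claim, sources)       # retrieve candidate support
        ok = passage is not None and entails(passage, claim)
        verdicts.append(Verdict(claim, ok, passage))
    return verdicts
```

In a setup like this, an answer would ship only if every claim comes back supported; anything unsupported gets flagged or regenerated rather than silently returned.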
The through-line is intentional. The theorems I prove tell you what cannot be done. The systems I build make precise what can.
From research to deployment
The same guarantees that anchor the papers anchor the products. As COO of Stellaris AI, I lead the LLM infrastructure stack. That is the production layer where the bounds proved on paper turn into latency budgets, fairness constraints, and trust-grade evaluation pipelines. As AI Researcher at Brain Investing (an HKU FinTech spin-out), I build AI-driven quantitative trading systems where statistical guarantees translate directly into risk-managed P&L. The cycle runs both ways. Deployment surfaces the limits worth proving, and the proofs become the constraints that keep deployment honest.
Current projects
I am actively recruiting collaborators on the following projects, all of which are in development beyond the published record above. If any align with your interests, please reach out.
- Logical characterization of transformer expressivity. Connecting softmax attention’s descriptive complexity to formal logic hierarchies, with the goal of unifying planning-depth bounds, compositional-reasoning thresholds, and chain-of-thought length theory under one framework.
- Crash recovery for multi-agent LLM workflows. A fault-tolerant runtime with formally verified saga-based compensation semantics, so multi-agent pipelines survive agent crashes, API outages, and silent confabulations without losing intermediate reasoning state.
- Strategy-proof coordination of LLM agents. Auction, voting, and bargaining protocols for agents with prompt-dependent preferences and bounded contingent reasoning, where standard VCG provably fails. The goal is to characterize the exact frontier between manipulable and strategy-proof regimes.
- Conflict detection and verification for retrieval-augmented generation. A verification layer that distinguishes shallow RAG conflicts (resolvable by latent refinement) from deep ones (requiring explicit verification), routing accordingly under tight token-overhead budgets.
- Query-adaptive retrieval for reasoning tasks. Extending ReaLM-Retrieve with information-theoretic stopping rules and intervention policies that decide when, what, and how much to retrieve mid-reasoning.
- Multi-tenant LLM serving under fairness constraints. Speculation-aware schedulers, KV-cache eviction policies, and workflow-atomic batching with provable per-tenant SLO and fairness bounds at production cluster scale.
- Mechanistic auditing of evaluation-awareness in reasoning models. Locating, decomposing, and erasing the residual-stream directions that gate context-dependent behavioral divergence in open-weight reasoning models, with pre-registered behavioral kill-switches.
