Why You Can't Fix Every LLM Error, But Can Fix the Ones That Matter

You cannot build a finite library of fixes that makes an LLM reliable on every possible input. But inside a single bounded deployment (a legal-review pipeline, a medical RAG system, a code-repair agent) you can, because the failures that actually occur cluster into a small, slow-growing catalogue. That is the core claim of “The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability,” by Mikhail L. Arbuzov and five co-authors (independent researchers and Palo Alto Networks).

The paper reframes reliability from an asymptotic scaling problem (“can our fixes keep up as sequences get longer?”) into a finite discovery problem (“have we catalogued enough failure modes inside this specific deployment yet?”). The second question has an answer.

Only 5 to 10% of tokens are “key” decisions where errors concentrate; the rest become nearly deterministic given enough context
Failure modes in a fixed domain grow logarithmically with observed failures, at a calibrated rate sigma between 0.87 and 1.85
The required per-hard-decision intervention budget grows polylogarithmically in sequence length, then plateaus
Practical implication: domain libraries on the order of tens of interventions, not infinite dictionaries
Calibrated against three taxonomies: ErrorAtlas (83 models), HumanEval code errors, and MWPES-300K (304,865 math errors)

The Two Propositions

The whole argument rests on a tension between two results.

Proposition 1, universal impossibility. No finite intervention dictionary can guarantee residual error below any target for every distinguishable failure mode across an unbounded domain. Chasing exhaustive, universal reliability is a losing game by construction.

Proposition 2, patch-local sufficiency. Fix the deployment context, and the required library size grows polylogarithmically with sequence length, then becomes a domain constant once the local failure catalogue saturates. The impossibility is real but global; locally, the problem is finite.

The bridge between them is an empirical claim about how fast new failure modes show up.

Why Failure Modes Grow Slowly

The framework decomposes reliability into three layers. First, sparsity: only 5 to 10% of tokens are genuine long-range decisions; the rest are nearly determined once context is present. Second, stratification: within those key tokens, only a fraction produce actual hard failures. These are the “manifold-transition” decisions, where the model has no stable representation. Third, the mode catalogue: within a bounded domain, those hard failures repeat, falling into a small set of recurring, distinguishable modes.
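To get a feel for the first two layers, here is a minimal sketch of the arithmetic. The 5 to 10% key-token range comes from the paper as summarized here; the function name and the `hard_fraction` value are hypothetical, chosen only for illustration, since no empirical range is given for the share of key decisions that turn into hard failures.

```python
def decision_budget(n_tokens: int, key_fraction: float, hard_fraction: float) -> dict:
    """Split a sequence into key decisions and hard decisions.

    key_fraction:  share of tokens that are key decisions (0.05 to 0.10 in the article).
    hard_fraction: share of key decisions that are hard. HYPOTHETICAL: the
                   article reports no empirical range for this quantity.
    """
    key = round(n_tokens * key_fraction)
    hard = round(key * hard_fraction)
    return {"tokens": n_tokens, "key": key, "hard": hard}


# Illustrative only: a 10,000-token sequence at both ends of the key-token range,
# with an assumed (not sourced) hard_fraction of 0.2.
for kf in (0.05, 0.10):
    print(decision_budget(10_000, kf, hard_fraction=0.2))
```

The point of the exercise is the shape, not the numbers: the population you need to protect is a small slice of a small slice of the sequence.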

That last layer is where the leverage is. The authors model catalogue discovery as logarithmic in the number of observed failures, with rate sigma calibrated to between 0.87 and 1.85 across their three anchor taxonomies (they use 1.85 as a conservative planning value). Logarithmic growth is brutal in your favor: at sigma 1.85, finding five more tail modes takes roughly 15x more observed failures, and ten more takes about 220x. The catalogue effectively saturates.
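The 15x and 220x figures can be reproduced with a few lines, assuming the catalogue follows M(n) = sigma * ln(n), where n is the number of observed failures. The natural-log form is an assumption here (the article only says "logarithmic"), but it is the form that matches the quoted numbers. The function name is hypothetical.

```python
import math


def failure_multiplier(extra_modes: float, sigma: float) -> float:
    """Factor by which observed failures must grow to reveal `extra_modes`
    additional modes, ASSUMING the catalogue grows as M(n) = sigma * ln(n).

    From sigma * ln(n2) - sigma * ln(n1) = extra_modes it follows that
    n2 / n1 = exp(extra_modes / sigma).
    """
    return math.exp(extra_modes / sigma)


# At the conservative planning value sigma = 1.85:
print(round(failure_multiplier(5, 1.85)))   # roughly 15
print(round(failure_multiplier(10, 1.85)))  # roughly 220
```

Under the same assumption, a smaller sigma makes the multiplier larger, so the low end of the calibrated range saturates even faster.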

Cluster-selective interventions, before to after. Each fix targets one capability axis. From the paper’s 28-citation harvest.

Arithmetic, Python execution (GSM-Hard): 20.1% → 61.5%
Code logic, execution feedback (HumanEval): 80% → 96.3%
Reasoning, process reward models (Math-Shepherd): 28.6% → 43.5%
Tool calls, structured uncertainty (SAGE-Agent): 36.5% → 65.2%

The authors back this with a harvest of 28 quantitatively-anchored citations spanning six independent capability axes: arithmetic, code logic, format/structure, reasoning steps, hallucinations, and tool calls. The pattern across all of them is that each intervention is cluster-selective. A Python interpreter sweeps arithmetic errors; constrained decoding zeroes out invalid-format tokens; RAG handles a band of hallucinations. Crucially, the residual errors after each fix belong to structurally different failure classes, not the one you just targeted. That is what makes a small library viable: you provision one capability per cluster, and capabilities are coarser than error categories, so a handful covers a lot.
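One way to picture "one capability per cluster" is a lookup from capability axis to intervention, driven by a tally of logged failures. The sketch below is not from the paper: the axis keys, the `provisioning_plan` function, and the `min_count` threshold are all hypothetical, and it assumes each logged failure has already been labelled with its axis. The interventions themselves are the ones named in this article.

```python
from collections import Counter

# Interventions named in the article, keyed by hypothetical axis labels.
INTERVENTIONS = {
    "arithmetic": "Python execution",
    "code_logic": "execution feedback",
    "format": "constrained decoding",
    "reasoning": "process reward model",
    "hallucination": "retrieval (RAG)",
    "tool_call": "structured uncertainty",
}


def provisioning_plan(failure_log, min_count=2):
    """Return (axis, count, intervention) for every axis that recurs at least
    `min_count` times, most frequent first. `min_count` is an arbitrary
    illustrative threshold, not a value from the paper."""
    plan = []
    for axis, count in Counter(failure_log).most_common():
        if count >= min_count:
            plan.append((axis, count, INTERVENTIONS.get(axis, "unmapped: needs triage")))
    return plan


# Illustrative, made-up log of labelled failures.
log = ["arithmetic"] * 3 + ["format"] * 2 + ["tool_call"]
for axis, count, fix in provisioning_plan(log):
    print(f"{axis}: {count} failures -> {fix}")
```

Because the table is keyed by capability rather than by individual error category, it stays short even as the list of distinct error types grows.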

Caveats

The load-bearing claim, logarithmic mode discovery, is calibrated against published taxonomies, not proven. The authors are upfront that a domain with genuinely power-law (Heaps-law) discovery would keep the qualitative polylog story but blow up the quantitative rates. The hard-token fraction beta is latent; they give no empirical range for it.

Coverage is also narrow. The three anchor taxonomies are general, code, and math; nothing tests long-running agentic workflows or multi-turn tool use over millions of tokens, exactly the regimes where the failure catalogue might grow faster than logarithmically. And the framework relocates long-context difficulty rather than dissolving it: where the number of hard decisions itself grows with task length (deep compositional chains, long agent horizons), reliability stays hard. The contribution is telling you which intervention to provision along the real decay axis, not removing the axis.

Still, the engineering takeaway is clean and usable. Stop budgeting for a universal fix-everything library. Pick your bounded domain, log the failures that actually occur, watch the discovery curve flatten, and provision one capability per recurring cluster. The errors you can’t fix in general are mostly errors you’ll never see in your deployment.
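Watching the discovery curve flatten can be as simple as counting distinct failure modes over time. This is a minimal sketch under the assumption that every logged failure already carries a mode label; the function names and the window size are hypothetical, and the paper may track saturation differently.

```python
def discovery_curve(mode_labels):
    """Cumulative number of distinct failure modes after each observed failure."""
    seen = set()
    curve = []
    for label in mode_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def new_modes_in_last(curve, window):
    """How many previously unseen modes appeared in the most recent `window` failures."""
    if not curve:
        return 0
    if len(curve) <= window:
        return curve[-1]
    return curve[-1] - curve[-1 - window]


# Made-up log: each letter stands for one failure's mode label, in order of arrival.
labels = list("aabacabdaaab")
curve = discovery_curve(labels)
print(curve)
print(new_modes_in_last(curve, window=4))
```

A long run of recent failures that adds no new modes is the practical signal that the local catalogue is close to saturated, at least for the traffic seen so far.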