the problem

End the junk drawer.

When an agent makes a mistake, the instinct is to add more instructions. More rules. More examples. More context. This is backwards.

[Diagram of the degradation loop: agent errs → human adds → focus dilutes → trust erodes]
the degradation loop

The default trajectory of every project.

01

Agent makes a mistake

The output is wrong. A convention is missed. A file gets clobbered. Something breaks.

02

Human adds instructions

A new rule appears in the instruction file. "Never do X." "Always check Y." The motivation is reasonable.

03

Context dilutes focus

Every token of instruction is a token unavailable for actual work. The agent has less room to think.

04

A different mistake happens

The agent makes a new error. The human adds more rules. The file grows. Both sides work harder and achieve less.

Each iteration makes the instruction file larger, the agent less focused, and the output worse. This loop is the default trajectory of every project that manages context through accumulation rather than curation.

The research confirms what practitioners are discovering.

A 2026 ETH Zurich study evaluated 138 task instances across 12 repositories. LLM-generated instruction files reduced agent success rates by 3% and increased costs by over 20%.

Developer-written instructions fared better — but still carried a significant cost penalty. The pattern is consistent: untargeted upfront context is expensive at best and harmful at worst.

Even well-written context hurts when it is loaded indiscriminately.

[Chart: LLM-generated instructions −3% success, +20% cost; human-written +4% · ETH Zurich · 2026 · 138 tasks, 12 repos]

Bigger windows do not fix this.

AI labs are pushing two-million-token windows and near-instant prompt caching. The technical constraint is relaxing. The human constraint is not.

A human cannot effectively govern, debug, or curate a two-million-token instruction file. When the agent hallucinates inside that file, the human has no way to determine which of the thousands of loaded instructions contributed.

Bigger windows without better structure just make the failure harder to diagnose.

[Diagram: 4K → 128K → 2M token windows; same problem, harder to debug]

The reason is mechanical.

AI agents reason within a finite context window. Every token of instruction is a token unavailable for the actual work — reading code, understanding state, solving the problem.

Overload the window with instructions and the agent has less room to think. The capability is still there. It is being diluted — not by reducing what the agent can do, but by forcing it to spend its capacity navigating irrelevant context instead of reasoning about the problem.
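The arithmetic behind that dilution can be sketched directly. A minimal illustration, with hypothetical numbers (the window size and token counts here are illustrative, not figures from the study):

```python
# Every token of instruction comes out of a fixed budget, so the
# space left for reasoning shrinks one-for-one as rules accumulate.

CONTEXT_WINDOW = 128_000  # illustrative window size, in tokens

def reasoning_budget(instruction_tokens: int, work_tokens: int) -> int:
    """Tokens left for actual reasoning after instructions and the
    working material (code, state, problem description) are loaded."""
    return CONTEXT_WINDOW - instruction_tokens - work_tokens

# Lean context: a few hundred tokens of curated instructions.
lean = reasoning_budget(instruction_tokens=500, work_tokens=60_000)

# Junk drawer: tens of thousands of tokens of accumulated rules.
bloated = reasoning_budget(instruction_tokens=40_000, work_tokens=60_000)

print(lean)     # 67500 tokens of room to think
print(bloated)  # 28000 — the same task, run at a deficit
```

The point is not the particular numbers but the direction: the instruction file competes with the work itself for the same finite budget.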

[Diagram: a lean context leaves reasoning space (room to think); typical instructions leave the work compromised; bloated, accumulated rules run at a deficit]

Most projects are running their agents at a deficit before the first line of code is read.

There is a way out.

The lean approach reverses the degradation loop. Structure replaces accumulation. Curation replaces fear.

The Framework →