
Where the Tokens Go: 59% of Agentic Coding Cost Is Code Review

A Concordia study traces token spend across a multi-agent coding system and finds 59.4% of it goes to code review, not writing code. Input tokens are the hidden tax: more than half of all consumption is the model re-reading context.

When a multi-agent system writes software, very little of the cost goes to writing the software. A new study traces every token through a full agentic development run and finds that Code Review alone eats 59.4% of consumption, while the initial coding takes 8.6% and design takes 2.4%. The expensive part of AI coding is the part that re-reads, critiques, and rewrites.

The numbers come from “Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering,” by a team at Concordia University. They ran ChatDev, a role-playing multi-agent framework, on 30 tasks using the GPT-5 reasoning model, then mapped ChatDev’s internal phases onto six general development stages and added up the tokens.

  • Code Review consumed 59.4% of all tokens; initial Coding just 8.6%, and Design only 2.4%
  • Code Completion, when it ran (6 of 30 tasks), took 26.8%, making refinement the dominant cost overall
  • Of all tokens, 53.9% were input, 24.4% output, and 21.6% reasoning, roughly a 2:1 input-to-output ratio
  • The Coding phase inverts that pattern: 58% output tokens, the only stage that generates more than it consumes

Refinement Is the Cost Center

The intuition most engineers carry is that generating code is the expensive AI step. The data says the opposite. Writing the first draft is cheap. What costs money is the loop that comes after: reviewing, completing, and fixing it.

Share of total tokens by development stage

ChatDev + GPT-5 reasoning, 30 tasks. Code Completion ran in only 6 tasks.

Code Review 59.4%
Code Completion 26.8%
Coding 8.6%
Design 2.4%

This reframes how to think about agent budgets. The cost of an agentic build does not scale with how much code it produces; it scales with how many review-and-revise cycles it takes to get there. A greenfield script that compiles on the first pass is cheap. A refactor that triggers round after round of critique is where the bill comes from.

Input Tokens Are the Communication Tax

The second finding is about composition. Across the whole pipeline, 53.9% of tokens were input, not output. The agents spend more of their budget reading context (prior code, instructions, each other’s messages) than producing anything new. The authors call this the communication tax of multi-agent systems: every handoff re-feeds the accumulated state back into the model.
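A toy model makes the handoff effect concrete. The sketch below is not the paper's method: it assumes every turn re-sends the full history, and the context and reply sizes are made-up numbers chosen only for illustration.

```python
def simulate_handoffs(turns, base_context, reply_tokens):
    """Return (total_input, total_output) for a chat in which every turn
    re-sends the whole accumulated history to the model.

    All sizes are hypothetical token counts, not measurements.
    """
    total_input = 0
    total_output = 0
    history = base_context  # tokens already in the shared context
    for _ in range(turns):
        total_input += history    # the model re-reads everything so far
        total_output += reply_tokens
        history += reply_tokens   # the reply joins the context for the next turn
    return total_input, total_output


for turns in (1, 5, 10):
    inp, out = simulate_handoffs(turns, base_context=2_000, reply_tokens=500)
    share = inp / (inp + out)
    print(f"{turns:>2} turns: input={inp:>6} output={out:>5} input share={share:.1%}")
```

Output grows linearly with the number of turns, while input grows faster because each turn pays again for all earlier replies. Real systems may trim or summarize history, which this sketch ignores.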

The per-stage breakdown shows just how lopsided this gets in the verification-heavy phases.

Token composition by stage

Percent of each stage’s tokens, split into input, output, and reasoning.

Documentation 80.2% input / 8.3% output / 11.5% reasoning
Code Review 51.4% input / 24.7% output / 23.9% reasoning
Coding 6.9% input / 58.0% output / 35.1% reasoning

Documentation is the clearest case: it is 80.2% input, because the agent ingests the entire finished codebase just to describe it. Coding is the lone outlier going the other way, 58% output, since its job is to emit new source. Everywhere else, the model is mostly paying to re-read what already exists. Since input and output tokens are often priced differently, this composition matters as much as the raw count for anyone estimating a bill.
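To see how composition changes a bill, here is a minimal sketch that prices one million tokens of each stage. The stage mixes are the figures quoted above; the two prices are hypothetical placeholders, and billing reasoning tokens at the output rate is an assumption that should be checked against a provider's actual pricing.

```python
# Hypothetical prices in USD per million tokens; substitute real rates.
PRICE_INPUT = 1.00
PRICE_OUTPUT = 8.00  # assumption: reasoning tokens are billed at this rate too

# Stage compositions quoted in the article (fractions of that stage's tokens).
STAGES = {
    "Documentation": {"input": 0.802, "output": 0.083, "reasoning": 0.115},
    "Code Review": {"input": 0.514, "output": 0.247, "reasoning": 0.239},
    "Coding": {"input": 0.069, "output": 0.580, "reasoning": 0.351},
}


def cost_per_million(mix):
    """Blended USD cost of one million tokens with the given composition."""
    generated = mix["output"] + mix["reasoning"]
    return mix["input"] * PRICE_INPUT + generated * PRICE_OUTPUT


for stage, mix in STAGES.items():
    print(f"{stage:<14} ${cost_per_million(mix):.2f} per million tokens")
```

Under these placeholder prices, a million Coding tokens costs roughly three times as much as a million Documentation tokens, even though the raw counts are identical. The ratio depends entirely on the assumed rates, and discounts such as cached input are not modeled.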

How the Map Was Built

ChatDev simulates a software company with agents playing CEO, CTO, programmer, reviewer, and tester roles that pass work to one another through structured chats. The researchers ran 30 tasks from the ProgramDev dataset through it on gpt-5-2025-08-07 (400K context window, 128K max output, temperature fixed at 1.0), logged token counts at every step, and collapsed ChatDev’s native phases into six general stages so the results could generalize beyond one framework’s naming. Tasks spanned a wide complexity range, from roughly 17,000 to 40,000 reasoning tokens each.
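The log-then-collapse step can be sketched in a few lines. This is an illustration of the idea only: the phase names, the phase-to-stage grouping, and the sample records below are all assumptions, not the paper's actual mapping or data.

```python
from collections import defaultdict

# Illustrative mapping from framework phase names to general stages.
# Both the names and the grouping are assumptions for this sketch.
PHASE_TO_STAGE = {
    "DemandAnalysis": "Design",
    "LanguageChoose": "Design",
    "Coding": "Coding",
    "CodeComplete": "Code Completion",
    "CodeReviewComment": "Code Review",
    "CodeReviewModification": "Code Review",
    "TestErrorSummary": "Testing",
    "TestModification": "Testing",
    "EnvironmentDoc": "Documentation",
    "Manual": "Documentation",
}


def stage_shares(log):
    """Sum tokens per stage and return each stage's share of the total."""
    totals = defaultdict(int)
    for rec in log:
        stage = PHASE_TO_STAGE[rec["phase"]]
        totals[stage] += rec["input"] + rec["output"] + rec["reasoning"]
    grand_total = sum(totals.values())
    return {stage: tokens / grand_total for stage, tokens in totals.items()}


# Made-up log records, one per model call.
sample_log = [
    {"phase": "DemandAnalysis", "input": 300, "output": 100, "reasoning": 100},
    {"phase": "Coding", "input": 200, "output": 1200, "reasoning": 600},
    {"phase": "CodeReviewComment", "input": 2500, "output": 500, "reasoning": 500},
    {"phase": "CodeReviewModification", "input": 2500, "output": 1000, "reasoning": 500},
]

for stage, share in stage_shares(sample_log).items():
    print(f"{stage:<12} {share:.1%}")
```

Keeping the mapping in a single table makes it easy to redraw the stage boundaries and see how sensitive the shares are to that choice.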

Caveats

This is one framework, one model, 30 tasks. ChatDev’s role-play architecture is unusually conversation-heavy, so its communication tax may run higher than leaner agent designs; a different orchestration could shift the curve. GPT-5’s reasoning tokens are also specific to this model family. And the coverage is uneven: Code Completion fired in only 6 of 30 tasks and Testing in 12, so those stage-level numbers rest on thin samples. The phase-to-stage mapping is, as the authors note, one of several defensible ways to draw the lines.

Even with those qualifications, the headline is a useful budgeting rule. If you are budgeting for agentic coding, price the review loop, not the code. The authors suggest a human checkpoint before the review phase kicks off, so that a bad design is caught before the system spends roughly 60% of its budget iterating on it.
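One way such a checkpoint could look in an orchestration loop is sketched below. The function name, the stage list, and the approval callback are all hypothetical; the study proposes the checkpoint but this is not its implementation.

```python
def run_with_checkpoint(stages, approve, gate_before="Code Review"):
    """Run stages in order, asking `approve` before the gated stage.

    `stages` is a list of (name, fn) pairs; each fn takes the previous
    artifact and returns the next one. `approve(name, artifact)` stands in
    for a human decision. Returns (artifact, names_of_stages_run).
    """
    artifact = ""
    ran = []
    for name, fn in stages:
        if name == gate_before and not approve(name, artifact):
            break  # stop before spending tokens on the review loop
        artifact = fn(artifact)
        ran.append(name)
    return artifact, ran


# Stand-in stages that only append a label; real stages would call a model.
demo_stages = [
    ("Design", lambda a: a + "design;"),
    ("Coding", lambda a: a + "code;"),
    ("Code Review", lambda a: a + "reviewed;"),
]

artifact, ran = run_with_checkpoint(demo_stages, approve=lambda name, a: False)
print(ran)  # the run halts before Code Review when approval is refused
```

The gate sits in front of the most expensive stage, so a rejected design costs only the comparatively small Design and Coding spend.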

#research #agents #llms #cost
