When NOT to Use Composer 2.5: 6 Cases Where Opus 4.7 Still Wins

The single most-quoted line in every Composer 2.5 launch piece is that it is "ten times cheaper than the frontier models." That number is accurate at the per-token level. It is also misleading at the per-completed-task level, because a model that needs three attempts to land a task is not ten times cheaper than a model that lands it in one. In token spend it is closer to three times cheaper, and once you account for engineer time and the retry tax it can be the more expensive option.

So the right way to read the cost story is not "Composer 2.5 is ten times cheaper than Opus 4.7." It is "Composer 2.5 is ten times cheaper per token, and somewhere between ten times cheaper and slightly more expensive per completed task, depending on which task type you are running it on." The six cases below are the categories where the per-task math turns against Composer hardest, where the retry tax eats the saving, and where Opus 4.7 (or, in some cases, GPT-5.5) is the rational choice despite the cost differential.

I have been running both models on production traffic since Composer 2.5 shipped on May 18th. The categories below come from the patterns where I have consistently seen Composer either fail outright or succeed only after enough retries that the cost advantage evaporated. This is the practitioner's case against the default, framed around real workloads rather than benchmark categories.

The Economics Lie: 10× Cheaper Isn't 10× Cheaper

Before the cases, the math that motivates them. Composer 2.5 standard tier lists at $0.50 input / $2.50 output per million tokens. Opus 4.7 lists at roughly $5.00 / $25.00. The per-token ratio is therefore roughly 10×.

On a workload where Composer lands the task in one pass and Opus also lands it in one pass, the cost saving is genuine: Composer wins by 10×, end of story. The interesting question is what fraction of your real workloads looks like that, and what fraction looks like the alternative: Composer takes three attempts, Opus takes one. On that second shape, taking $0.80 per Composer attempt and $7.00 per Opus attempt, the math becomes 3 × $0.80 vs 1 × $7.00, or $2.40 against $7.00, which is still cheaper, but only by a factor of about three. On the worst shape, where Composer never reliably lands the task at all and the human ends up doing the work, the comparison is meaningless: there is no Composer cost that beats a model that actually completes the work.
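
The per-task arithmetic above can be sketched in a few lines. This is a minimal illustration, not a billing model: the per-attempt prices are the ones quoted in the text, while the retry_overhead parameter and its $2.50 value are assumptions added here to stand in for the engineer time each failed attempt consumes.

```python
def cost_per_completed_task(cost_per_attempt: float,
                            attempts: int,
                            retry_overhead: float = 0.0) -> float:
    """Dollar cost to land one task.

    retry_overhead is a hypothetical dollar value for the engineer time
    spent on each failed attempt (reviewing, re-prompting). Every attempt
    after the first is counted as one retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return attempts * cost_per_attempt + (attempts - 1) * retry_overhead


# Token spend only, using the per-attempt figures from the text.
composer_tokens = cost_per_completed_task(0.80, attempts=3)
opus_tokens = cost_per_completed_task(7.00, attempts=1)
print(f"tokens only:    ${composer_tokens:.2f} vs ${opus_tokens:.2f}")

# Same comparison with an assumed $2.50 of engineer time per retry.
composer_total = cost_per_completed_task(0.80, attempts=3, retry_overhead=2.50)
print(f"with retry tax: ${composer_total:.2f} vs ${opus_tokens:.2f}")
```

On token spend alone the three-attempt Composer run stays well under the single Opus pass. With the assumed overhead per retry it ends up slightly above it, which is the "slightly more expensive" end of the range.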

The right mental model is that the per-token ratio is a ceiling on the saving, not a floor. The realised saving depends entirely on what fraction of your tasks Composer can complete in one or two passes versus the fraction where it loops indefinitely or needs human intervention. The cases below are where that fraction is poor. The detailed comparison of the two models lives in the Composer 2.5 vs Claude Opus 4.7 head-to-head; this piece is about the categories where that comparison tilts conclusively away from the cheaper model.
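
To see why the per-token ratio acts as a ceiling, here is a rough sketch of the realised saving as a function of how often Composer lands a task first time. The simplifying assumptions are not from the article: every attempt uses the same number of tokens on either model, Opus always lands in one pass, and Composer eventually lands every task after a fixed number of attempts.

```python
def realised_saving(one_pass_fraction: float,
                    attempts_when_retrying: int,
                    token_ratio: float = 10.0) -> float:
    """Opus spend divided by Composer spend across a workload.

    Assumes equal tokens per attempt on both models, Opus landing every
    task in one pass, and Composer eventually landing every task.
    """
    if not 0.0 <= one_pass_fraction <= 1.0:
        raise ValueError("one_pass_fraction must be between 0 and 1")
    if attempts_when_retrying < 1:
        raise ValueError("attempts_when_retrying must be at least 1")
    # Average number of Composer attempts needed per task.
    avg_composer_attempts = (one_pass_fraction
                             + (1.0 - one_pass_fraction) * attempts_when_retrying)
    # One Opus attempt costs as much as token_ratio Composer attempts.
    return token_ratio / avg_composer_attempts


for fraction in (1.0, 0.8, 0.5, 0.0):
    saving = realised_saving(fraction, attempts_when_retrying=3)
    print(f"{fraction:.0%} one-pass -> {saving:.1f}x saving")
```

The saving only reaches the full token ratio when every task lands in one pass, and it falls as the one-pass fraction drops. Tasks that never land are outside this model entirely, since no finite Composer spend completes them.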

Case 1 — Large Architectural Reviews

The first failure mode is the one that hits hardest because it is the one teams most often expect Composer to handle.

A real workload: "Read this codebase and tell me what is structurally wrong with the auth module before I refactor it." Composer 2.5 is trained as a tool-using coding agent. Its training distribution is weighted toward file reads, terminal commands, test runs, and edits in tight loops. It is not weighted toward "load 40K lines of code into your head and reason about coupling, cohesion, and design assumptions." Concretely, on the audits I have run: Composer's architectural reviews tend to raise surface-level observations (this file is large, this function is long, this name is unclear) and miss the structural ones (this module assumes single-tenant when the product is multi-tenant, this lock is held across an async boundary, this state machine has an unreachable terminal state).

Opus 4.7's broader training and longer-context recall make it materially better at this category. The 1M-token context window I covered in the Opus 4.7 context piece lets the model load the whole module and reason about it as one unit. Composer, capped at 200K tokens, has to load fragments and assemble its understanding piece by piece; the structural picture goes missing in the assembly.

The right call: route architectural reviews to Opus 4.7. The cost differential is real but small in absolute terms — a deep architectural review is a $20 task on Opus and a $4 task on Composer — and the quality difference compounds, because the architectural decisions that come out of the review govern weeks of subsequent code.

Case 2 — Security-Critical Code Paths

Auth boundaries. Crypto. Input parsing on untrusted data. Permission checks. Anything where a wrong answer ships a vulnerability into production.

The case here is not about Composer being worse at security in some intrinsic sense; both models will write competent security-aware code most of the time. The case is about the asymmetric cost of error. A missed vulnerability is a multi-month incident at best and a company-ending event at worst. The years of token savings from running Composer over Opus on every other workload are dwarfed by the cost of one production vulnerability that the cheaper model missed. The expected-value math doesn't even need precise numbers to come out the same way.
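
The expected-value argument can be made concrete with a small sketch. Every dollar figure below is a hypothetical placeholder chosen for illustration, not a measurement. The useful output is the break-even threshold: how much extra incident probability the cheaper model would need to add before its token saving is wiped out.

```python
def break_even_extra_risk(annual_token_saving: float,
                          incident_cost: float) -> float:
    """Added annual incident probability at which the saving is cancelled.

    If routing security work to the cheaper model raises the yearly chance
    of a serious incident by more than this value, the expected loss
    exceeds the token saving.
    """
    if incident_cost <= 0:
        raise ValueError("incident_cost must be positive")
    return annual_token_saving / incident_cost


# Hypothetical figures, for illustration only.
saving = 50_000.0        # assumed yearly token saving on security-path work
incident = 5_000_000.0   # assumed all-in cost of one production vulnerability

threshold = break_even_extra_risk(saving, incident)
print(f"break-even added incident probability: {threshold:.1%} per year")
```

Because the incident cost sits in the denominator, any plausible estimate for a serious vulnerability drives the threshold very low, which is why the conclusion does not hinge on the exact inputs.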

The discipline: route any code touching auth, crypto, untrusted input parsing, permission checks, secrets handling, or anything that ends up in a security review to Opus 4.7. The cost premium is the price of the lower error rate. The framework that decides this kind of thing — when the marginal capability is worth the marginal cost — is the same one I covered in the Sonnet-vs-Opus piece; the answer there and the answer here are aligned.

Case 3 — Ambiguous Specs and Requirements Elicitation

The most subtle of the six cases, and the one that costs teams the most quietly because the failure mode looks like progress.

A real workload: "Add a feature that lets admins suspend users." A genuinely ambiguous request: does "suspended" mean the user cannot log in, or that the user is hidden from search? Is the user's data preserved or deleted? Is the suspension reversible? Who is allowed to do the suspending? And so on. The right next step is not to write the code. The right next step is to surface the questions and ask the human which interpretation they meant.

Opus 4.7 does this well — the model is trained to flag ambiguity and ask before committing to an interpretation. Composer 2.5 is trained to plough on. Its training distribution rewards completing tool-using loops, not pausing them for clarification. Concretely, in the workloads I have shipped: Composer will pick the most likely interpretation of an ambiguous spec, write the code, and present it as done. The code works, by some definition of works, but it is solving the wrong problem. The retry tax in this case is not "the model wrote bad code" — it is "the model wrote code for the wrong feature" and now a human has to either accept it, rewrite it, or start the conversation that should have happened before any code was written.

The right call: route requirements-elicitation work — particularly anything from a non-engineering stakeholder, anything where the acceptance criteria are not yet pinned down — to Opus. Use Composer for the implementation phase, once the spec is concrete. The two-stage workflow captures the saving where it is real (implementation) and avoids the failure mode where it is most expensive (elicitation).

Case 4 — Deep Legacy-Code Spelunking

Reading 10-year-old code, written in a paradigm that has since fallen out of fashion, and figuring out what it does and why.

Composer 2.5's training distribution skews modern. The synthetic training tasks Cursor describes in their launch post are weighted toward contemporary patterns — modern TypeScript, current Python idioms, today's framework conventions. The model is excellent on this distribution and weaker outside it. Legacy COBOL, old-style Perl, jQuery-era JavaScript, pre-ES6 patterns, FORTRAN bindings — Composer can read them, but its inferences about why the code is shaped the way it is are weaker than Opus 4.7's, which has broader training across older paradigms.

The specific failure mode is not "Composer cannot read the code." It is "Composer reads the code, infers a modern interpretation of what the code is trying to do, and proposes a refactor that breaks an invariant the original code was holding for a reason that no longer fits any modern paradigm." The refactor compiles. The tests pass. The production incident happens two weeks later when a downstream consumer of the old behaviour fails.

The right call: legacy-code work, particularly anything from before roughly 2015, gets routed to Opus. The cost differential is small relative to the cost of breaking a load-bearing legacy invariant by mistake.

Case 5 — Cross-Repo Refactors With Implicit Contracts

The fifth failure mode is structural and not about model capability per se — it is about the harness.

Composer 2.5's agent loop is bounded to what is in front of it: the files in the current workspace, the tools that have been configured, the context that has been loaded. That works well for a refactor that lives inside one repository. It breaks when the refactor reasons about implicit contracts that cross repositories — a change to a shared library that consumers depend on without explicit version pinning, a change to an API shape that downstream services have come to depend on, a change to a database schema that an unrelated service silently reads from.

Opus 4.7, deployed in a harness that gives it access to the broader system (multi-repo search, cross-codebase reasoning), can flag these implicit dependencies. Composer 2.5, in Cursor's harness, sees only the current repo and the explicit dependencies. The implicit-contract reasoning that crosses repo boundaries goes missing.

The right call: any refactor that touches a shared library, a public API surface, a database schema consumed by other services, or anything else with implicit cross-repo dependencies, gets routed to Opus 4.7 — ideally in a harness that has access to the full system context, not just the local repo. The cost differential is again small relative to the cost of breaking a downstream consumer the agent did not know existed.

Case 6 — When a Wrong Answer Is Worse Than No Answer

The catch-all that covers the long tail. Medical software, legal-compliance contexts, financial systems where a wrong calculation ships an audit failure, anything safety-critical, anything where the cost of a confident-but-wrong answer is asymmetrically worse than the cost of "I do not know, escalate to a human."

The case here is the calibration story. Opus 4.7 — and Anthropic's training broadly — has been heavily tuned for calibrated uncertainty. The model is trained to flag what it does not know and to refuse confidently-wrong answers in domains it is not equipped for. Composer 2.5 is trained to complete coding tasks; its calibration on non-coding decisions (regulatory compliance, medical implications, legal interpretation) is much weaker. The model will produce a confident-sounding answer in those domains because that is the shape of output its training rewarded.

The discipline: if a wrong answer in your workload is materially worse than no answer, the model choice is not Composer. It is Opus, or it is a human-in-the-loop pattern that puts a checkpoint between the model and production. The cost differential is irrelevant because token spend is not what you are optimising for; you are optimising for catastrophic-failure avoidance.

The Routing Decision Tree

A clean version of the rule that comes out of the six cases:

Is the task a security-critical code path? Use Opus 4.7.
Is the task a large architectural review or cross-repo refactor with implicit contracts? Use Opus 4.7.
Is the task ambiguous-spec elicitation, or work involving non-engineering stakeholders with un-pinned acceptance criteria? Use Opus 4.7 for the elicitation phase, then Composer for the implementation.
Is the task deep legacy-code work in a paradigm from before ~2015? Use Opus 4.7.
Is the task one where a wrong answer is asymmetrically worse than no answer? Use Opus 4.7, or use a human-in-the-loop checkpoint.
Anything else. Use Composer 2.5. This is the bulk — refactoring, multi-file edits, agent loops on routine work, test scaffolding, documentation, the long tail of in-IDE coding tasks. The cost saving is real here, and the capability gap is small enough not to matter.
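
The rules above can be sketched as a small routing function. This is one possible encoding, not a reference implementation: the Task flags, the route function, and the model label strings are hypothetical names invented for the example, and how a task gets its flags (manual tagging, a classifier, path rules) is left open. One deliberate deviation from the list order: the Opus-only checks run before the ambiguous-spec check, so that a task which is both ambiguous and high-stakes never hands its implementation phase to Composer.

```python
from dataclasses import dataclass

# Hypothetical labels for the routing layer, not real API model IDs.
OPUS = "opus-4.7"
COMPOSER = "composer-2.5"


@dataclass(frozen=True)
class Task:
    security_critical: bool = False
    architectural_review: bool = False
    cross_repo_implicit_contracts: bool = False
    ambiguous_spec: bool = False
    legacy_pre_2015: bool = False
    wrong_answer_worse_than_none: bool = False


def route(task: Task) -> list[str]:
    """Return the model for each phase of the task, in order."""
    opus_only = (
        task.security_critical
        or task.architectural_review
        or task.cross_repo_implicit_contracts
        or task.legacy_pre_2015
        or task.wrong_answer_worse_than_none  # or a human checkpoint
    )
    if opus_only:
        return [OPUS]
    if task.ambiguous_spec:
        # Elicitation on Opus, then implementation on Composer.
        return [OPUS, COMPOSER]
    return [COMPOSER]


print(route(Task()))
print(route(Task(ambiguous_spec=True)))
print(route(Task(security_critical=True, ambiguous_spec=True)))
```

Returning a list of phases rather than a single model keeps the two-stage elicitation-then-implementation workflow expressible in the same interface as the single-model routes.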

The honest split across the workloads I have audited is roughly 80% Composer, 20% Opus. That ratio holds up across the teams I have worked with, regardless of the team's specific domain. The 80% is where the per-token saving is real and compounds; the 20% is where the retry tax or the catastrophic-failure cost dominates and the cheaper model is not the cheaper choice.
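
As a back-of-the-envelope check on what an 80/20 split means for the bill, the sketch below compares routed spend with sending everything to Opus. It assumes equal token volume per task on both models and one pass per task, neither of which comes from the article, so read the result as an illustration of the shape rather than a forecast.

```python
def blended_cost_fraction(composer_share: float,
                          token_ratio: float = 10.0) -> float:
    """Routed spend as a fraction of an all-Opus bill.

    Assumes every task uses the same token volume on either model and
    lands in one pass. composer_share is the fraction of tasks routed
    to the cheaper model.
    """
    if not 0.0 <= composer_share <= 1.0:
        raise ValueError("composer_share must be between 0 and 1")
    if token_ratio <= 0:
        raise ValueError("token_ratio must be positive")
    # Composer tasks cost 1/token_ratio each; Opus tasks cost 1 each.
    return composer_share / token_ratio + (1.0 - composer_share)


routed = blended_cost_fraction(0.80)
print(f"80/20 routing costs {routed:.0%} of an all-Opus bill")
```

Under these assumptions most of the remaining spend comes from the Opus-routed 20%, so the exception list, not the default, is what dominates the bill once routing is in place.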

What Hasn't Changed

The unchanging caveats:

Composer 2.5 is still the right default. This piece reads like a case against Composer; it is not. Eighty percent of coding work is where Composer wins on cost and the capability gap is small. The point is to recognise the 20% where the math flips, not to pin everything to Opus.
Both models still hallucinate. Neither model is reliable enough to ship code without review, eval suites, schema validation, and the usual production discipline. The cases above are about where the wrong-rate is too high for the cost saving; they are not about the wrong-rate being zero on the more expensive model.
The boundary will move. Six months from now, some of these cases will fall to Composer 3 or Composer 2.6 or whatever ships next. The routing layer should be refreshed every quarter, not pinned to today's numbers. The frame in the cost-engineering piece applies — orchestration tuned for the old curve becomes a liability when the curve moves.
Multi-provider risk. Don't bet the company on Composer or Opus. Keep your routing layer ready to fall back from either one to GPT-5.5 or Gemini Flash.
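
The multi-provider point can be sketched as an ordered fallback chain. This is a minimal illustration under stated assumptions: complete_with_fallback, AllProvidersFailed, and the stub functions are hypothetical names, and the stubs stand in for whatever real client calls your stack uses. No vendor SDK is shown.

```python
from typing import Callable, Sequence, Tuple

# A provider is a (label, callable) pair; the callable takes a prompt
# and returns a completion. Both are placeholders for real clients.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (label, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure falls through
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub clients standing in for real provider calls.
def primary_down(prompt: str) -> str:
    raise TimeoutError("provider unavailable")


def fallback_ok(prompt: str) -> str:
    return f"stub completion for: {prompt}"


used, text = complete_with_fallback(
    "rename this function",
    [("composer-2.5", primary_down), ("gpt-5.5", fallback_ok)],
)
print(used)
```

Keeping the chain as plain data means the quarterly refresh of the routing layer is a change to a list, not a change to control flow.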

The Practitioner's Take

The honest summary is that "use Composer 2.5 for everything" is the same kind of mistake as "use Opus 4.7 for everything" — both pin the routing decision to one model on one set of trade-offs, and both leave value on the table for the workloads where the other model is the rational choice.

The interesting engineering work is not picking a winner between the two. It is the routing layer that decides per task which model handles it, based on the task's category and the task's cost-of-error profile. For most teams, the right architecture is Composer 2.5 as the default with a clear list of exceptions routed to Opus 4.7, where the exception list is the six cases above. That captures the cost saving where it is real, avoids the retry tax where it dominates, and avoids the catastrophic-failure mode where the cheaper model is the wrong choice regardless of cost.

The frame from the broader builder's guide on Composer 2.5 applies: the model shifts the economics underneath the routing layer; it does not eliminate the need for one. The teams that ship the routing layer capture the saving. The teams that pin to one model and hope discover at month-end either that their bill is higher than they expected or that their incident rate is.

The discipline is not picking the cheap model. It is recognising the 20% where the cheap model is not actually the cheap choice, and routing accordingly.