Cost-Engineering Composer 2.5 at Production Scale: The Real 10x Playbook

Most teams capture only 2-3x of Composer 2.5's potential 10x cost savings. This is the production playbook for token economics, caching, batching, and tier choice.

Domain: Software
Format: essay
Published: 22 May 2026
Tags: composer-2-5 · cursor · cost-optimization

The headline number on Composer 2.5 is "ten times cheaper than the frontier models." That's accurate as a list-price comparison. It is not, in my experience working with teams migrating coding workloads at production scale, what most of them actually capture: they realise about a quarter to a third of the available savings. The other two thirds get left on the floor in orchestration choices, prompt design habits inherited from the frontier era, and tier decisions made on autopilot.

This is the playbook for capturing the rest. It is intentionally a senior-engineering piece — not a survey of Composer 2.5 as a product, but a deep treatment of the production engineering required to actually realise the cost advantage. For the underlying release detail, Cursor Composer 2.5: Frontier-Class Coding is the reference. For the broader tier-routing framework that sits above this work, my tier-economics piece is the predecessor. This article is what happens inside the box once you've decided to use Composer 2.5.

The 10× Headline vs the 2–3× Reality

A pattern from the migration audits I've run over the last week: a team switches their production coding workload from a frontier model to Composer 2.5, expects the bill to drop by an order of magnitude, and watches it drop by a factor of two or three. The disappointment is real, the explanation is consistent, and the fix is the rest of this article.

The 10× headline assumes:

  • Average input/output token mix shifts toward Composer 2.5's relatively cheap input pricing.
  • Prompt caching captures the prefix-stable portion of each request at the $0.15/M cached input rate.
  • The cost-aware tier choice (standard vs fast) is made deliberately based on workload latency requirements.
  • The orchestration layer is tuned for the new cost curve — batching where viable, async where appropriate, escalation reserved for the cases that actually need it.

The 2–3× reality reflects:

  • Workloads inherited from frontier-era orchestration, where the prompt-design assumption was "tokens are expensive but we can pay" and so every prompt is verbose and every output is asked to be comprehensive.
  • Prompt caching either not implemented at all, or implemented but reading low cache-hit rates because invalidation is too aggressive.
  • Default routing to the fast tier (6× more expensive than standard) because the previous frontier model was always interactive-latency.
  • Batching ignored because the frontier model's per-call cost meant batching mostly didn't matter.

Every one of these is fixable. The remaining sections are how.

Output-Token Economics: Why 5× Input Cost Reshapes Prompt Design

The single most useful number to internalise about Composer 2.5's pricing is the ratio between input and output costs. Standard tier: $0.50 per million input tokens, $2.50 per million output tokens. Output is five times more expensive than input.

The ratio itself is in line with the frontier models'. What changes in practice is the absolute numbers. On Claude Opus 4.7 at $5/$25, output is also 5× more expensive than input, but the absolute price of both is high enough that prompt-design choices tend to be optimised for quality first and cost second. On Composer 2.5 at $0.50/$2.50, the absolute costs are low enough that teams stop thinking about cost altogether, until the volume scales up.

Why coding agents are output-heavy

A typical coding-agent trajectory looks like this: model reads the codebase (input), produces a plan (output), reads relevant files (input), produces an implementation (output), runs tests (input), produces a fix (output), repeats. The input phase is mostly the context window prefix and tool-call results; the output phase is plans, code, explanations, and tool-call arguments.

Across a thirty-step trajectory, output tokens typically run 1.5–3× the input token count. Combined with the 5× per-token cost ratio, that puts output cost at 7.5–15× input cost, or roughly 88–94% of the bill on a typical coding agent run (before any caching discount on input). This number is bigger than people expect, and it shifts where the optimisation lever actually is.

The "verbose plan" anti-pattern

The single biggest output-token sink I see across migrations is plans that are too verbose. A coding agent asked to "produce an implementation plan" on a frontier model trained on the verbose plans of the last two years typically produces 800–1,500 tokens of plan output, much of it rehashing context the model already has, padding the answer with thoroughness, and including caveats that aren't actionable.

On a frontier model at $25/M output, a verbose plan costs about three cents. Nobody cares. On Composer 2.5 at $2.50/M, the same verbose plan costs a third of a cent. Still small per request. At 10,000 requests a day, that's $30 daily, $900 monthly, all spent on plan verbosity that doesn't change outcomes.
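The arithmetic is easy to reproduce. A minimal sketch, assuming a 1,200-token plan (a value inside the 800–1,500 range above) and a 30-day month; the function and constant names are illustrative only.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of generating `output_tokens` at a given output price."""
    return output_tokens * price_per_million_usd / 1_000_000


PLAN_TOKENS = 1_200          # assumed: a verbose plan within the 800-1,500 range
REQUESTS_PER_DAY = 10_000    # volume used in the text
DAYS_PER_MONTH = 30          # assumed billing month

frontier_per_plan = output_cost_usd(PLAN_TOKENS, 25.00)  # frontier output price
composer_per_plan = output_cost_usd(PLAN_TOKENS, 2.50)   # Composer 2.5 standard tier

daily = composer_per_plan * REQUESTS_PER_DAY
monthly = daily * DAYS_PER_MONTH

print(f"frontier plan: ${frontier_per_plan:.4f}")
print(f"composer plan: ${composer_per_plan:.4f}")
print(f"daily: ${daily:.2f}, monthly: ${monthly:.2f}")
```

With these assumptions the per-plan cost is three cents on the frontier model and three tenths of a cent on Composer 2.5, which is where the $30 daily and $900 monthly figures come from.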

The fix is prompt-level. Explicit output budgets in the system prompt — "produce a plan in no more than 200 tokens, focused on the specific steps; do not restate context" — typically cut plan length by 60–70% without measurable quality loss on standard refactor tasks. The same logic applies to explanations, post-action summaries, and verbal scratchpad output.

Output-length budgeting as a first-class concern

The general principle: in the Composer 2.5 era, output length is a configurable concern, not a free variable. Every prompt that asks the model to produce output should specify how much output is appropriate. Every orchestration layer should track output tokens per task and surface anomalies. Every cost-aware team should set an output-token budget per task type and treat exceeding it as a tunable signal — like p99 latency or error rate — rather than as invisible bill growth.
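One way to make the budget a first-class value is to keep it in a single table that both renders the prompt instruction and checks observed usage. This is a sketch under stated assumptions: the task types, budget numbers, and helper names are hypothetical, and the token count would come from whatever usage data your orchestration layer already records.

```python
from dataclasses import dataclass

# Hypothetical per-task-type output budgets, in tokens. Tune these with evals.
OUTPUT_BUDGETS = {"plan": 200, "code_review": 600, "refactor": 2_000}


def budget_instruction(task_type: str) -> str:
    """Render the budget as a line to append to the system prompt."""
    limit = OUTPUT_BUDGETS[task_type]
    return f"Respond in no more than {limit} tokens. Do not restate context."


@dataclass
class BudgetCheck:
    task_type: str
    output_tokens: int
    budget: int

    @property
    def over_budget(self) -> bool:
        return self.output_tokens > self.budget

    @property
    def utilisation(self) -> float:
        """Output tokens as a fraction of budget; above 1.0 means an overrun."""
        return self.output_tokens / self.budget


def check_output(task_type: str, output_tokens: int) -> BudgetCheck:
    """Compare observed output tokens for one task against its budget."""
    return BudgetCheck(task_type, output_tokens, OUTPUT_BUDGETS[task_type])


result = check_output("plan", 800)
if result.over_budget:
    print(f"{result.task_type}: {result.utilisation:.1f}x budget")
```

Emitting `utilisation` as a metric per task type gives you the same kind of signal as p99 latency: a distribution you can alert on, not a hard failure.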

This is the same discipline that the broader cost-vs-capability framework for Flash-tier models applies, made concrete for one specific model.

Prompt Caching at $0.15/M: Where the Next 30% Comes From

Composer 2.5's cached input rate is $0.15 per million tokens — 70% cheaper than the standard $0.50/M input rate. For prefix-stable workloads, this is a meaningful additional saving on top of the base pricing. Most teams I've audited are either not using caching at all or are using it badly enough to capture only a fraction of the available benefit.
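How much of that discount you actually see depends on the hit rate. A rough sketch of the blended input price, assuming a token-weighted hit rate and no separate charge for writing to the cache (the text does not mention one, so treat that as an assumption):

```python
UNCACHED_INPUT_USD_PER_M = 0.50   # standard-tier input price from the text
CACHED_INPUT_USD_PER_M = 0.15     # cached input price from the text


def blended_input_price(cache_hit_rate: float) -> float:
    """Effective USD per million input tokens at a token-weighted hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (cache_hit_rate * CACHED_INPUT_USD_PER_M
            + (1.0 - cache_hit_rate) * UNCACHED_INPUT_USD_PER_M)


for rate in (0.0, 0.3, 0.9):
    price = blended_input_price(rate)
    saving = 1.0 - price / UNCACHED_INPUT_USD_PER_M
    print(f"hit rate {rate:.0%}: ${price:.3f}/M input ({saving:.0%} cheaper)")
```

At a 30% hit rate the effective input price is about $0.395/M; at 90% it is about $0.185/M. The full 70% discount only applies to the tokens that actually hit the cache.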

Prefix-stable workloads

The caching primitive only helps when the request prefix is stable across calls. Some examples that are prefix-stable:

  • A coding agent that runs the same long system prompt + the same codebase context + a variable per-task instruction. The prefix (system + context) caches; the per-task instruction does not.
  • Batch document processing where each document is preceded by the same instruction template.
  • Agent loops where each iteration includes the same long retrieval context but a varying step prompt.

Some examples that are not prefix-stable without restructuring:

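  • Conversational chat where earlier turns are rewritten as the history grows (sliding-window truncation, rolling summaries). An append-only history keeps its prefix; rewriting earlier turns invalidates the cache from that point on.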
  • Random-access tool calling where the tool list and ordering varies per request.
  • Per-user personalisation where the system prompt includes user-specific information.

The right pattern is to separate prefix-stable from prefix-unstable parts of the prompt at the architecture level. The stable parts go into a cached prefix; the unstable parts go after the cache boundary. This is the same architectural decision pattern from Claude's prompt caching playbook, applied to Composer 2.5's specific pricing.
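The separation can be enforced in the function that assembles the prompt. A minimal sketch, not tied to any particular API: `build_prompt`, `SplitPrompt`, and the fingerprint are hypothetical names, and how the prefix is actually marked as cacheable depends on the provider.

```python
import hashlib
from typing import NamedTuple


class SplitPrompt(NamedTuple):
    prefix: str       # stable: system prompt + codebase context
    suffix: str       # variable: per-task instruction, request metadata
    prefix_key: str   # fingerprint of the prefix, useful for logging


def build_prompt(system: str, context: str, task: str, request_id: str) -> SplitPrompt:
    """Keep everything that varies per request out of the prefix."""
    prefix = f"{system}\n\n{context}"
    # Request-specific values go after the cache boundary, never before it.
    suffix = f"Task: {task}\n(request {request_id})"
    key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]
    return SplitPrompt(prefix, suffix, key)


a = build_prompt("You are a coding agent.", "<repo summary>", "Rename foo to bar", "r-001")
b = build_prompt("You are a coding agent.", "<repo summary>", "Add a unit test", "r-002")
print(a.prefix_key == b.prefix_key)  # same prefix, so the fingerprint matches
```

Logging `prefix_key` with every request makes accidental prefix drift visible: if the number of distinct keys per hour climbs, something variable has leaked into the prefix.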

Cache invalidation patterns

The single most common mistake I see in caching implementations: invalidating the cache too aggressively. Teams worry about staleness ("what if the cached context is out of date") and rebuild the cache on every request. The result is paying the full input price every time while believing they're using the cache.

The right model: treat cache validity as a deliberate, monitored choice, not an implicit invariant. For most coding agent workloads, the cached prefix (system prompt, codebase context, instruction template) is stable for hours to days. Aggressive invalidation makes the cache useless. Lazy invalidation — invalidate on the schedule that actually matches when the underlying context changes — captures the full saving.
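Lazy invalidation can be as simple as keying the prefix on a version of the underlying context and rebuilding only when that version changes. A sketch, assuming the version is something like a commit SHA; `LazyPrefix` and its `rebuilds` counter are illustrative names, not part of any library.

```python
from typing import Callable, Optional


class LazyPrefix:
    """Rebuild the cached prefix only when the underlying context version changes."""

    def __init__(self, build: Callable[[str], str]) -> None:
        self._build = build                      # builds prefix text for a version
        self._version: Optional[str] = None
        self._prefix: str = ""
        self.rebuilds = 0                        # monitored: should stay low

    def get(self, current_version: str) -> str:
        if current_version != self._version:
            self._prefix = self._build(current_version)
            self._version = current_version
            self.rebuilds += 1
        return self._prefix


prefix = LazyPrefix(lambda sha: f"SYSTEM PROMPT\n\nrepo snapshot @ {sha}")
for sha in ["abc123", "abc123", "abc123", "def456"]:
    prefix.get(sha)
print(prefix.rebuilds)
```

Across the four requests above the prefix is rebuilt twice: once on first use and once when the version changes. A rebuild count that tracks the request count is the signature of over-aggressive invalidation.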

The cache-hit measurement gap

A surprisingly common failure mode: teams have caching implemented, believe it's working, and discover during audit that the cache-hit rate is in the 20–40% range when it should be 80–95%. The reason: the orchestration layer is generating slightly different prompts (timestamp embedded, random ID prepended, request ID in the prefix) that cause cache misses without anyone noticing.

The discipline: measure cache-hit rate explicitly. The Composer 2.5 API surface returns cache-hit metadata; log it, alert on dips, and treat sub-80% cache hit rate as a real incident on workloads where the prefix should be stable. Teams that don't measure this typically discover the gap when they audit the bill three months in.
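A rolling monitor is enough to turn this into an alertable signal. The sketch below assumes each response's usage metadata can be reduced to two integers, cached input tokens and total input tokens; the exact field names in the API response are not given here, so the inputs to `record` are an assumption.

```python
from collections import deque


class CacheHitMonitor:
    """Token-weighted cache-hit rate over the most recent `window` requests."""

    def __init__(self, window: int = 500, threshold: float = 0.80) -> None:
        # Each sample is (cached_tokens, total_input_tokens).
        self._samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, cached_tokens: int, total_input_tokens: int) -> None:
        self._samples.append((cached_tokens, total_input_tokens))

    def hit_rate(self) -> float:
        total = sum(t for _, t in self._samples)
        if total == 0:
            return 0.0
        return sum(c for c, _ in self._samples) / total

    def should_alert(self) -> bool:
        """Alert only once there is data and the rate is below threshold."""
        return bool(self._samples) and self.hit_rate() < self.threshold


monitor = CacheHitMonitor(window=3)
monitor.record(9_000, 10_000)
monitor.record(0, 10_000)      # a miss, e.g. a timestamp leaked into the prefix
monitor.record(9_000, 10_000)
print(f"{monitor.hit_rate():.0%}", monitor.should_alert())
```

Weighting by tokens instead of by request count matters because a miss on a long prefix costs far more than a miss on a short one.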

Batching and Async Queue Architecture

The frontier-model era didn't reward batching aggressively. Per-call costs were high enough that the overhead of building async batching infrastructure rarely paid back. Composer 2.5 inverts the calculation.

The right pattern at scale: separate workloads by latency tolerance and route accordingly.

  • Interactive (sub-second response required): in-editor completion, chat-mode coding, anything with a human waiting. Use the fast tier ($3/$15). Roughly 6× more expensive than standard but justified by the wall-clock improvement.
  • Near-real-time (sub-minute response acceptable): PR analysis, code review on commit, lint-fix suggestions. Use the standard tier ($0.50/$2.50). Async queue; results flow into the UI when ready.
  • Batch (sub-hour response acceptable): nightly test generation, documentation refresh, dependency audit, bulk refactoring. Use the standard tier with concurrency tuned to the per-second cost ceiling. Often suitable for cloud-agent deployment where per-second billing matters.
  • Background (sub-day response acceptable): large-scale codebase analysis, cross-repository reporting, periodic synthetic-data generation. Standard tier, low concurrency, no urgency.

The 6× premium on the fast tier, which the Composer 2.5 release article referenced above also notes, only makes sense for the first category. Routing all four categories to the fast tier (the default in many migrations) is one of the largest single cost mistakes I see, and one of the easiest to fix. Workload routing by latency tier typically captures 30–50% of the available savings on its own.
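The four categories above map naturally onto a small routing table. This is a sketch of the idea only: the task-type names and the `choose_tier` helper are hypothetical, and the tier strings stand in for however your client selects a tier.

```python
from enum import Enum


class LatencyClass(Enum):
    INTERACTIVE = "interactive"        # human waiting
    NEAR_REAL_TIME = "near_real_time"  # sub-minute is acceptable
    BATCH = "batch"                    # sub-hour is acceptable
    BACKGROUND = "background"          # sub-day is acceptable


# Only interactive work defaults to the fast tier.
TIER_FOR_CLASS = {
    LatencyClass.INTERACTIVE: "fast",
    LatencyClass.NEAR_REAL_TIME: "standard",
    LatencyClass.BATCH: "standard",
    LatencyClass.BACKGROUND: "standard",
}

# Hypothetical mapping from task type to latency class.
TASK_CLASS = {
    "inline_completion": LatencyClass.INTERACTIVE,
    "pr_review": LatencyClass.NEAR_REAL_TIME,
    "nightly_test_generation": LatencyClass.BATCH,
    "cross_repo_report": LatencyClass.BACKGROUND,
}


def choose_tier(task_type: str) -> str:
    """Unknown task types fall back to standard, so fast is always opt-in."""
    latency_class = TASK_CLASS.get(task_type, LatencyClass.BACKGROUND)
    return TIER_FOR_CLASS[latency_class]


print(choose_tier("inline_completion"), choose_tier("pr_review"))
```

The design choice worth copying is the fallback: a task type nobody has classified yet lands on the standard tier, so the expensive path has to be chosen deliberately.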

Standard vs Fast Tier: When 6× More Is Worth It

A sensitivity analysis worth doing explicitly: when does the fast tier's 6× premium actually pay back?

It pays back when:

  • A human is waiting for the response, and the wall-clock improvement materially affects their throughput.
  • The task is short (under 1K output tokens), so the absolute premium per request is small and the latency gain is what the user notices.
  • The agent is in an interactive loop where slower response causes the user to disengage.

It doesn't pay back when:

  • The task is async (no human waiting for a specific response).
  • The task is long (multi-thousand-output-token), where the per-token cost difference adds up faster than the wall-clock improvement matters.
  • The orchestration layer can run other useful work while waiting.

A concrete example. A typical refactor agent runs for 2–5 minutes on standard tier, 30–90 seconds on fast tier. If the human will not look at the output for five minutes anyway (because they're doing other work), the standard tier delivers the same result at one sixth of the cost. If the human is staring at the screen waiting, the fast tier earns its premium.
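To put a dollar figure on that trade, it helps to price one run on each tier. A minimal sketch using the list prices quoted in this article; the token counts for the refactor run are assumed for illustration and should be replaced with measured values.

```python
# USD per million tokens as (input, output), as quoted in the text.
PRICES = {"standard": (0.50, 2.50), "fast": (3.00, 15.00)}


def task_cost_usd(input_tokens: int, output_tokens: int, tier: str) -> float:
    """List-price cost of one task on the given tier, ignoring caching."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed token counts for one refactor run; substitute measured values.
refactor = {"input_tokens": 40_000, "output_tokens": 60_000}

standard = task_cost_usd(tier="standard", **refactor)
fast = task_cost_usd(tier="fast", **refactor)
print(f"standard ${standard:.3f}, fast ${fast:.3f}, premium ${fast - standard:.3f} per run")
```

Under these assumptions the run costs $0.17 on standard and $1.02 on fast. Whether the $0.85 premium is worth it is then a question about what the waiting human's time costs, not about the model.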

The orchestration layer should treat tier selection as a first-class parameter, not a default. For most production workloads, the right split is 60–80% standard tier with the remainder routed to fast — not the other way around.

Monitoring Per-Task Cost: The SRE-Grade Observability You Need

The cost-engineering work above is invisible without observability. The instrumentation I run at the audits:

  • Per-task token attribution. Each task in production emits input tokens, output tokens, cached vs uncached split, tier used, and total cost. These get logged to the analytics pipeline alongside latency, success, and outcome.
  • Per-task-type dashboards. Costs are grouped by task type (refactor, code review, test generation, etc.) and plotted over time. A regression — an upgrade that suddenly increases output verbosity, a prompt change that drops cache-hit rate — is visible within hours, not at month-end audit.
  • Cost-per-outcome metrics. The most useful single metric: cost-per-successful-task, segmented by task type. This catches the case where a "cheaper" change increases retry rate and ends up more expensive overall.
  • Anomaly alerting on cost. SLO-style alerts on cost per task. If a deploy moves the cost-per-task metric by more than 20% in either direction, the on-call engineer gets paged. (Direction matters: a sudden drop is also worth investigating — it might be a regression that's producing shorter, lower-quality output.)
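The cost-per-outcome metric and the ±20% alert in the list above can be sketched in a few lines. The record shape and function names are hypothetical; in practice the records would come from the analytics pipeline.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRecord:
    task_type: str
    cost_usd: float
    succeeded: bool


def cost_per_successful_task(records: Iterable[TaskRecord], task_type: str) -> float:
    """Total spend on a task type (failures and retries included) per success."""
    matching = [r for r in records if r.task_type == task_type]
    successes = sum(1 for r in matching if r.succeeded)
    if successes == 0:
        return float("inf")
    return sum(r.cost_usd for r in matching) / successes


def cost_alert(baseline: float, current: float, tolerance: float = 0.20) -> bool:
    """True when cost per task moved more than `tolerance` in either direction."""
    return abs(current - baseline) / baseline > tolerance


before = [TaskRecord("refactor", 0.10, True) for _ in range(4)]
# A "cheaper" change: each attempt costs less, but half of them now fail.
after = [TaskRecord("refactor", 0.08, ok) for ok in (True, False, True, False)]

baseline = cost_per_successful_task(before, "refactor")
current = cost_per_successful_task(after, "refactor")
print(f"${baseline:.2f} -> ${current:.2f}, alert={cost_alert(baseline, current)}")
```

In this example the per-attempt cost fell by 20%, yet cost per successful task rose from $0.10 to $0.16, which is exactly the regression a per-attempt metric would hide.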

This is SRE-grade observability applied to inference economics, not the "we'll check the bill at month-end" pattern that worked when bills were predictable. At Composer 2.5's scale, the bills are not predictable — small orchestration changes can move them by 30–50% in a day — and observability that matches the production engineering norm of the rest of your stack is the only way to stay on top of them.

The Migration Trap: Orchestration Tuned For The Old Curve

The single most consistent failure mode in Composer 2.5 migrations: the model gets swapped, the orchestration layer doesn't. Teams move from a frontier model to Composer 2.5 and run it on infrastructure tuned for the previous cost curve. The bill drops, but only by a factor of 2–3 instead of the headline 10×.

The orchestration choices that need to be revisited after migration:

  • Prompt verbosity: trim outputs aggressively, set explicit output budgets per task type.
  • Cache discipline: add explicit caching primitives, measure hit rate, fix invalidation.
  • Tier routing: segment workloads by latency tolerance, default async work to standard tier.
  • Batching architecture: introduce async queues where they weren't justified before.
  • Escalation policy: the previous architecture probably had no escalation because everything ran on the same frontier model. Composer 2.5 + frontier escalation for the hard steps is now the right architecture.
  • Observability: instrument per-task cost the way you instrument per-task latency.

Without these changes, the migration captures the easy savings (the per-token cost delta) and leaves the structural savings (the orchestration improvements) on the table. With them, the 10× headline becomes the real number.
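Of the items in the list above, the escalation policy is the least familiar to teams that have only ever run one model. A minimal sketch of the shape, assuming the models are wrapped as plain callables and that some acceptance check exists (tests passing, a linter, an eval); all names here are hypothetical.

```python
from typing import Callable, Tuple


def run_with_escalation(
    task: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    accept: Callable[[str], bool],
    max_primary_attempts: int = 2,
) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its output is rejected."""
    for _ in range(max_primary_attempts):
        output = primary(task)
        if accept(output):
            return "primary", output
    return "fallback", fallback(task)


# Stub models for illustration: the cheap one fails this particular task.
def cheap_model(task: str) -> str:
    return "patch that fails tests"


def frontier_model(task: str) -> str:
    return "patch that passes tests"


route, output = run_with_escalation(
    "fix flaky test", cheap_model, frontier_model,
    accept=lambda out: "passes" in out,
)
print(route, "->", output)
```

Logging which route each task took gives you the escalation rate per task type, which belongs on the same dashboard as cost per successful task.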

What's Not Changed

The unchanging caveats:

  • Capability still matters. Cost optimisation that drops your task success rate by 30% to save 30% is a bad trade. Track success rate alongside cost-per-task; don't optimise one without the other.
  • Output-budget caps can backfire. Setting too-aggressive output budgets on tasks that genuinely need more output produces truncated, lower-quality results. Tune per task type with evals, not by guess.
  • Multi-provider risk still applies. A cost-engineered Composer 2.5 deployment is still a single-vendor bet. Keep your routing layer ready to fall back. The self-hosted-vs-API methodology I've written about for the broader cost-analysis framing still applies.
  • Vendor pricing isn't static. Composer 2.5's pricing was set at launch; it will move. Build your cost model as a function of current prices and re-validate quarterly.
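For the last point, one option is to keep prices in a dated structure so that the cost model is a function of them and staleness is checkable. A sketch, assuming the prices quoted in this article, the publication date as the last-verified date, and a 90-day re-validation window; `PriceSheet` is an illustrative name.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PriceSheet:
    input_usd_per_m: float
    cached_input_usd_per_m: float
    output_usd_per_m: float
    as_of: date   # when these prices were last checked against the vendor

    def is_stale(self, today: date, max_age: timedelta = timedelta(days=90)) -> bool:
        return today - self.as_of > max_age

    def cost_usd(self, uncached_in: int, cached_in: int, out: int) -> float:
        """Cost of one task given its token counts, at this sheet's prices."""
        return (uncached_in * self.input_usd_per_m
                + cached_in * self.cached_input_usd_per_m
                + out * self.output_usd_per_m) / 1_000_000


# Standard-tier prices from the text; as_of is assumed to be the article date.
standard_prices = PriceSheet(0.50, 0.15, 2.50, as_of=date(2026, 5, 22))
print(standard_prices.is_stale(today=date(2026, 9, 1)))
```

A staleness check like this can run in CI or in the cost dashboard, so a quarterly re-validation becomes a failing check instead of a calendar reminder.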

The Expert's Take

The Composer 2.5 era is the first one where cost-engineering matters at the production-engineering level rather than the procurement level. Previously, "cost optimisation" was mostly a negotiation problem — get the vendor down on per-million prices, commit to volume, eat the difference. The per-token margins were big enough that prompt and orchestration choices barely moved the bill.

Today the per-token costs are low enough that small orchestration changes — a 30% drop in average output length, a cache-hit rate going from 30% to 90%, a workload moving from fast tier to standard — can move the bill by an order of magnitude in either direction. Cost engineering is now a real production-engineering competency, and the teams that build it earlier capture compounding advantages over the teams that don't.

The right investment over the next quarter is not "switch to Composer 2.5 and check the bill." It is "build the observability, the cache discipline, the tier-routing logic, and the orchestration changes that turn the 10× headline into the real number." The teams that do this capture the savings. The teams that don't will discover at month-end that their bills dropped by half when they expected them to drop by 90%, and they will not understand why.

The work is unglamorous. So is the cost line of any senior engineering organisation. Both correlate.