Cursor Composer 2.5 Builder's Guide: 10× Cheaper Than Opus 4.7 (2026)

Composer 2.5 in production: benchmarks, training improvements, and the routing rules that decide when it beats Claude Opus 4.7 or GPT-5.5 for agentic coding.

Domain: Software
Format: essay
Published: 19 May 2026
Tags: cursor · composer · cursor-composer

Composer 2.5 shipped yesterday. The headline number, accurate as far as I can tell, is that it sits within striking distance of Claude Opus 4.7 and GPT-5.5 on coding benchmarks while costing roughly a tenth as much per task. If that holds up in your workloads, the per-task economics of agentic coding just changed shape, and a lot of the routing logic I have been running for the last six months needs a refresh.

This is what I have learned in a day of using it, and what I think it means for production builders.

The Benchmark Picture

The numbers Cursor published, with the prior-version baseline for context:

| Benchmark | Composer 2 | Composer 2.5 | Frontier reference |
| --- | --- | --- | --- |
| SWE-Bench Multilingual | 73.7% | 79.8% | Opus 4.7 / GPT-5.5 range |
| CursorBench v3.1 | 61.3% | 63.2% | Parity with Opus 4.7, GPT-5.5 |
| Terminal-Bench 2.0 | 61.7% | mid-60s | GPT-5.5 leads by ~13 points |

Two things matter in that table. The SWE-Bench Multilingual jump (six points in a minor release) is large for this stage of the curve, and the Terminal-Bench gap is the one number Cursor did not lead with. If your agent's daily work is heavy on shell commands, log parsing, or terminal-mediated debugging, GPT-5.5 is still meaningfully ahead.

The standard caveat applies more loudly than usual here: CursorBench is Cursor's own benchmark, and the model was trained in the same harness it's being scored in. That doesn't mean the numbers are wrong; it means you should validate on your own evals before you re-route production traffic. The model genuinely is closer to frontier than it was six months ago; the exact size of "closer" depends on your task distribution.

The Cost Delta That Actually Matters

The pricing is where the strategic question changes:

| Model | Standard input | Standard output | Notes |
| --- | --- | --- | --- |
| Composer 2.5 | $0.50/M | $2.50/M | Default tier |
| Composer 2.5 (fast) | $3.00/M | $15.00/M | Faster variant, identical intelligence |
| Claude Opus 4.7 | ~$15/M | ~$75/M | Frontier reference |
| GPT-5.5 | comparable | comparable | Frontier reference |

Cursor's published per-task cost on CursorBench is under a dollar; the frontier comparators land somewhere around eleven dollars on the same suite. That ratio is what gets quoted in headlines, and the headline isn't wrong. For a team running a hundred agentic tasks a day in CI or as part of a developer-tooling product, the monthly delta runs into five figures.
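The five-figure claim is easy to check with back-of-envelope arithmetic. In the sketch below, the per-task costs are rounded stand-ins for "under a dollar" and "around eleven dollars", and the 30-day month is my own assumption; substitute your own volumes.

```python
# Rough monthly cost delta between two per-task prices.
# All four constants are illustrative assumptions, not published figures.
COMPOSER_COST_PER_TASK = 1.00   # upper bound for "under a dollar"
FRONTIER_COST_PER_TASK = 11.00  # "around eleven dollars"
TASKS_PER_DAY = 100
DAYS_PER_MONTH = 30


def monthly_delta(tasks_per_day: int, days: int, cheap: float, expensive: float) -> float:
    """Return the monthly spend difference between two per-task costs."""
    return tasks_per_day * days * (expensive - cheap)


delta = monthly_delta(
    TASKS_PER_DAY, DAYS_PER_MONTH, COMPOSER_COST_PER_TASK, FRONTIER_COST_PER_TASK
)
print(f"Monthly delta: ${delta:,.0f}")  # Monthly delta: $30,000
```

At these assumed figures the gap is about $30,000 a month, and it scales linearly with task volume.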

Two things I'd flag before celebrating the savings:

  • Output-token cost is the one that bites. Coding agents are output-heavy. The $2.50/M output figure is the number to watch, not the $0.50/M input.
  • The faster tier is six times more expensive. It is the default for a reason: responsiveness matters in interactive coding. But on long, async, headless runs the standard tier is the better economic choice. If you batch your agentic workloads, you should be on standard.
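Both points can be seen in a per-task cost breakdown. The prices below come from the pricing table; the tier keys and the token counts are hypothetical, chosen only to illustrate the split.

```python
# USD per million tokens, from the pricing table. Tier keys are
# illustrative labels, not official model identifiers.
PRICES = {
    "composer-2.5": {"input": 0.50, "output": 2.50},
    "composer-2.5-fast": {"input": 3.00, "output": 15.00},
}


def task_cost(tier: str, input_tokens: int, output_tokens: int) -> dict:
    """Split one task's cost into input and output components (USD)."""
    price = PRICES[tier]
    input_cost = input_tokens / 1_000_000 * price["input"]
    output_cost = output_tokens / 1_000_000 * price["output"]
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


# Hypothetical token counts for a single agentic task.
standard = task_cost("composer-2.5", input_tokens=200_000, output_tokens=150_000)
fast = task_cost("composer-2.5-fast", input_tokens=200_000, output_tokens=150_000)

print(f"standard: input ${standard['input']:.2f}, output ${standard['output']:.2f}")
print(f"fast/standard ratio: {fast['total'] / standard['total']:.1f}x")
```

Because output is priced at five times input on both tiers, output cost dominates unless a task reads more than five times as many tokens as it writes.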

What's Improved Over Composer 2

Three training changes drive the lift. I'll explain each plainly because the technical-report version is dense.

25× more synthetic training tasks

Composer 2 was already trained on synthetic coding scenarios; 2.5 expands that pool by 25x and adds new task families. The one Cursor calls out is "feature deletion" puzzles: the model is shown a codebase with a feature removed and asked to rebuild it. This forces it to learn from the inverse of the usual task, and the team credits it with much of the long-horizon stability gain.
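To make the task shape concrete, here is a toy illustration of a feature-deletion puzzle. It is not Cursor's pipeline, which I have not seen described in detail; the `make_deletion_task` helper, the sample source, and the dictionary keys are all my own invention.

```python
import ast

# A tiny stand-in "codebase" with one feature that another function uses.
SOURCE = '''
def slugify(title):
    return title.lower().replace(" ", "-")

def make_url(title):
    return "/posts/" + slugify(title)
'''


def make_deletion_task(source: str, feature: str) -> dict:
    """Remove one top-level function and keep the original as the reference."""
    tree = ast.parse(source)
    kept = [
        node for node in tree.body
        if not (isinstance(node, ast.FunctionDef) and node.name == feature)
    ]
    if len(kept) == len(tree.body):
        raise ValueError(f"no top-level function named {feature!r}")
    tree.body = kept
    return {
        "prompt": f"Re-implement the missing function `{feature}`.",
        "codebase": ast.unparse(tree),  # what the model would be shown
        "reference": source,            # the original, held out for scoring
    }


task = make_deletion_task(SOURCE, "slugify")
print(task["codebase"])
```

The remaining code still calls `slugify`, so the surviving call site gives the model the context it needs to reconstruct the feature.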

Textual feedback reinforcement learning

Composer 2 used end-of-run rewards: solve the task, get a positive signal; fail, get nothing. The credit-assignment problem on long agent runs is well-known: when something goes wrong on step 14 of a 30-step task, the model has no idea which earlier decision caused it. Composer 2.5 injects localized text feedback at the point of failure, for example "this tool call returned a 404; the path was wrong because X." That precision turns a sparse reward signal into a dense one, and it shows up in the model's behaviour on tasks that require sustained focus across many actions.
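The difference between the two signals is easier to see side by side. The sketch below does no training at all; it only contrasts the shape of an end-of-run reward with feedback attached to a specific step. The `Step` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str
    feedback: Optional[str] = None  # localized note, present only where something failed


def end_of_run_signal(steps: list[Step], solved: bool) -> list[float]:
    """Sparse: every step receives the same outcome-based reward."""
    return [1.0 if solved else 0.0] * len(steps)


def localized_signal(steps: list[Step]) -> dict[int, str]:
    """Dense: feedback is keyed to the index of the step it explains."""
    return {i: s.feedback for i, s in enumerate(steps) if s.feedback is not None}


run = [
    Step("read config"),
    Step("fetch report endpoint", feedback="tool call returned a 404; the path was wrong"),
    Step("write summary"),
]

print(end_of_run_signal(run, solved=False))  # [0.0, 0.0, 0.0]
print(localized_signal(run))
```

With the end-of-run signal, all three steps look equally bad; with the localized one, only step 1 is implicated and the reason is spelled out.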

MoE-scale training infrastructure

Cursor moved the training pipeline to sharded Muon optimizers and dual-mesh HSDP. This is infrastructure plumbing, not a capability story, but it's why they could afford the 25x synthetic-task expansion at a price point they were willing to absorb. The base model remains Kimi K2.5, with these changes applied on top.

Net result: better at sustained work on long-running tasks, more reliable instruction adherence, and noticeably less drift in agent loops past iteration 15 or so.

When Composer 2.5 Is the Right Call

Three categories where I would route to it without thinking twice:

  • Sustained agent loops. Long-horizon refactors, multi-file feature work, codebase-wide search-and-modify. The textual-feedback-RL improvement is most visible here.
  • High-volume agentic workloads. CI jobs that run agents in the loop, internal developer tools, automated code review at scale. The cost delta compounds.
  • Workflows tightly coupled to Cursor's harness. Composer 2.5 was trained inside Cursor's tool harness, and the integration is tighter than swapping a frontier model in via API. Tool-call accuracy in particular benefits from the model knowing the harness it lives in — and the new Cursor 3 agents-window interface leans hard into this with parallel agents and /best-of-n.

When to Stay on Frontier Models

Three categories where I am not switching:

  • Terminal-heavy work. The 13-point Terminal-Bench gap is real. If your agent spends most of its time in a shell, GPT-5.5 stays primary for now.
  • High-stakes reasoning over long context. Opus 4.7 with its 1M context window still wins on tasks that need to hold a large codebase or a long set of constraints in memory simultaneously. I've written about the production patterns for 1M-context Opus 4.7 — those use cases still belong on Opus. Composer 2.5 is good; Opus 4.7 is still better at the very top of that distribution.
  • Tasks already validated and stable on a frontier model. Don't migrate a working production agent just because a cheaper model exists. Run shadow traffic, compare, and switch only when the eval gap is small enough to justify the change-management cost.
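These rules, together with the Composer 2.5 default from the previous section, can be sketched as a small routing function. The `Task` fields, the model labels, and the long-context threshold are all hypothetical; tune them against your own evals.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for "needs a very large context"; not a published number.
LONG_CONTEXT_TOKENS = 200_000


@dataclass
class Task:
    terminal_heavy: bool = False
    high_stakes: bool = False
    context_tokens: int = 0
    validated_on: Optional[str] = None  # model this task is already stable on


def route(task: Task) -> str:
    """Return an illustrative model label for a task (not a real API model ID)."""
    if task.validated_on is not None:
        return task.validated_on  # don't migrate a working production agent
    if task.terminal_heavy:
        return "gpt-5.5"
    if task.high_stakes and task.context_tokens > LONG_CONTEXT_TOKENS:
        return "claude-opus-4.7"
    return "composer-2.5"  # cheap default for everything else


print(route(Task()))                     # composer-2.5
print(route(Task(terminal_heavy=True)))  # gpt-5.5
```

The order of the checks encodes priority: an already-validated task stays put even if it would otherwise qualify for the cheaper default.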

The Bigger Picture

Buried in the Composer 2.5 release is a more consequential disclosure: Cursor is training a much larger model from scratch in partnership with SpaceXAI, on Colossus 2 hardware, using roughly 10x the compute of Composer 2.5. This is a separate effort, not the next 2.x. It targets a capability jump rather than a cost-efficiency one, and it tells you where Cursor thinks the category is going.

Composer 2.5 isn't the ceiling. It is the new floor. Cursor is signalling that they intend to compete on the frontier itself within a year or two, not just on cost-efficiency below it.

What Hasn't Changed

The unchanging caveats:

  • Hallucinations still happen. Validate critical outputs. The cost savings disappear the first time you ship a hallucinated SQL migration to production.
  • Benchmarks are not your workload. Run your own evals before re-routing real traffic. The published numbers are directional, not predictive.
  • Multi-provider risk is unchanged. A cheaper model from a single vendor does not justify removing your fallback path. Keep the routing layer in place.
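Keeping the fallback path can be as simple as trying providers in order. This is a minimal sketch: the provider callables are stubs standing in for real client calls, and every name here is hypothetical.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers for illustration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


def stub_fallback(prompt: str) -> str:
    return f"patch for: {prompt}"


name, response = complete_with_fallback(
    "fix the failing test",
    [("primary-stub", flaky_primary), ("fallback-stub", stub_fallback)],
)
print(name, "->", response)  # fallback-stub -> patch for: fix the failing test
```

Returning the provider name alongside the response lets you log how often the fallback fires, which is itself a useful reliability signal.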

Migration Planning

A pragmatic rollout for moving agentic workloads to Composer 2.5:

  1. Pick the highest-volume, lowest-stakes agent loop you have. Internal tooling, batch jobs, CI-mediated agents. Switch it first.
  2. Run shadow traffic for a week. Log both responses. Compare quality, tool-call accuracy, iteration count. Don't trust your gut; trust the diff.
  3. Move to standard tier on async loads, fast tier on interactive ones. The interactive-vs-async decision is the lever with the biggest cost impact.
  4. Hold off on customer-facing high-stakes loops until the eval delta is small. Migrate them last, once your confidence is calibrated.
  5. Refresh your routing layer. Whatever logic decides "use cheap model here, frontier model there" probably needs an update; Composer 2.5 has changed where the boundary sits. If you don't have a routing layer at all yet, my piece on Sonnet vs Opus lays out the decision framework — extend it with Composer 2.5 as the new cheap-and-good default.
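The shadow-traffic comparison in step 2 boils down to a gap calculation and a threshold. In this sketch the record type, the 2-point gap, and the 200-sample minimum are placeholder assumptions; pick values that match your own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class ShadowResult:
    task_id: str
    primary_passed: bool    # current production model
    candidate_passed: bool  # shadow model under evaluation


def pass_rate_gap(results: list[ShadowResult]) -> float:
    """Primary pass rate minus candidate pass rate (positive means candidate is worse)."""
    n = len(results)
    primary = sum(r.primary_passed for r in results) / n
    candidate = sum(r.candidate_passed for r in results) / n
    return primary - candidate


def should_switch(
    results: list[ShadowResult], max_gap: float = 0.02, min_samples: int = 200
) -> bool:
    """Switch only with enough samples and a small enough quality gap."""
    if len(results) < min_samples:
        return False
    return pass_rate_gap(results) <= max_gap


# Synthetic log: primary passes 190 of 200 tasks, candidate passes 188 of 200.
log = [ShadowResult(f"t{i}", i < 190, i < 188) for i in range(200)]
print(f"gap: {pass_rate_gap(log):.3f}, switch: {should_switch(log)}")
```

Pass rate is only one axis; the same pattern extends to tool-call accuracy and iteration count by adding fields to the record and a gap function per metric.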

The Practitioner's Take

The honest read on Composer 2.5 is that it doesn't change the rules of agentic coding so much as it shifts the economics underneath them. The frontier still leads on the hardest tasks, but the gap between "cheap enough for everything" and "expensive enough that you have to think about routing" just got narrower.

For a lot of teams, the rational move now is to default to Composer 2.5 and escalate to a frontier model only on tasks where evals show it matters. That is roughly the opposite of where the rational default was six months ago, and that is the change worth thinking about, not the benchmark numbers themselves.