Composer 2.5 Hands-On: What Eleven Real Tasks Actually Revealed

The benchmark numbers on Composer 2.5 are now well-covered ground. I've written about them at length in the original builder's guide, and I've written about the head-to-head comparisons in the vs Opus 4.7 piece. What I hadn't done — and what most of the coverage I've read also hasn't done — is sit down with the model and actually drive it across a varied set of real coding tasks to see where the benchmark numbers translate and where they don't.

This piece is the result. Eleven distinct tasks across a multi-hour session, all on the fast tier, ranging from single-file HTML game generation to multi-step OS-level orchestration. The point is not to validate the headline benchmarks. It's to surface the things you only find out by holding the steering wheel — what shifts the production loop, what disappoints, and what genuinely surprised me.

The Setup

I ran everything in Cursor 3's Agents Window, on the fast tier throughout (a 6× premium over the standard tier, knowingly chosen for testing throughput rather than cost efficiency). The model is Cursor's own Composer 2.5, built on Kimi K2.5 with substantial post-training, reinforcement learning, and harness-specific tuning on top.

Tasks spanned single-file HTML games (FPS, GTA-style embedded scenes, a beat-em-up), 3D scene generation (Seinfeld apartment 5A, a drum kit simulator), frontend design (two product sites), a C++ skateboard demo, and a stretch task that required setting up the Godot game engine on a fresh Ubuntu system from a downloaded zip before building a soft-body-physics demolition derby inside it. Some tasks ran in plan mode; some were intentional zero-shot prompts to test raw single-pass behaviour.

What follows is not the full play-by-play. It's the findings worth taking back to production.

Where Composer 2.5 Surprised Me

Image-based debugging is the standout finding

The single most consequential pattern I uncovered across the session: send the model a screenshot of a broken UI with no accompanying text. It identifies what's wrong and fixes it on the next iteration.

This worked repeatedly across very different tasks. On the browser-OS GTA embed, the mini-map was being rendered over the entire scene instead of as a corner overlay; I sent a screenshot with no description, and Composer 2.5 diagnosed the rendering hierarchy issue and corrected it. On the C++ skateboard scene, the in-game text was sized large enough to dominate the frame: same pattern, screenshot only, fix on the next iteration. On the demolition derby, when the cars were deforming into unrecognisable shapes mid-fall, an image was enough to anchor the conversation to the actual visible problem.

The production implication is bigger than the trick suggests. The iteration loop on UI work has historically been "describe the problem in prose; argue with the model about what you meant." By my rough estimate from this session, that loop is about a third as fast as "screenshot the problem; let the model see." Composer 2.5 isn't unique in supporting image inputs; what's unique is how reliably it converts an image into a correct diagnosis without a paragraph of accompanying explanation. For anyone shipping UI work iteratively, this single change in workflow saves more time than the benchmark improvements do.

Sub-agent and OS-level orchestration actually works

The Godot test was deliberately designed to push past the comfort zone. The prompt: "Build a soft-body-physics demolition derby in Godot. Godot isn't installed on this system — the zip file is in the working directory; you'll need to extract it, set it up, and use it for the build."

That's two distinct multi-step problems: an OS-level install-and-configure chain, plus a non-trivial creative coding task. The model spun up a sub-agent for the install work, handled the extraction and the system configuration, returned to the main thread, and proceeded with the build. The Godot editor opened. The project scaffolded correctly. The first build had bugs (more on those below), but the orchestration held up, and orchestration is exactly where this category of multi-step task collapses on lesser models.

This matters because it suggests that Composer 2.5's harness integration is doing real work beyond raw model capability. The same model without Cursor's sandboxing would not have had the same reliable access to OS-level operations. Inside the harness, those operations were treated as normal tool calls, with all the credential masking and Git-policy protection that the harness provides by default.

The agreeability pushback

A small test that yielded a disproportionately informative result. After working through a frontend prompt, I floated a logically incorrect premise — "reverse quantization, where you just make the model larger" — to see whether Composer 2.5 would validate it as a "great idea" and start designing around the nonsense, or push back.

It pushed back. Called out the logical flaw. Explained why making a model larger doesn't recover information that was lost in quantization. Suggested what I was probably actually thinking of. This is the behaviour I want from a production coding agent and the behaviour many earlier models did not consistently produce.

The reason this matters: sycophantic models that validate bad ideas waste more engineering time than any other single failure mode I see in production AI workflows. A model that builds elaborate scaffolding around a flawed premise is worse than no model at all. Composer 2.5's tendency to push back — at least in this one test — is a strong signal for production use where the user's input is going to be imperfect more often than not.

Game Generation: Capable but Inconsistent

Where it worked

Single-file HTML game generation produced genuinely strong results in several cases. The GTA-style browser-OS embed had collision detection, a working mini-map, and car physics calibrated to a believable speed, not the cartoonishly fast vehicles that consistently come out of weaker models. The first-person shooter had appropriate weapon behaviour, shell-casing ejection, and a workable station environment, though it tripped on the standard sliders-locked-during-pointer-lock bug that trips up almost every model in this category.

The standout single result was the PC-repair beat-em-up. The first-pass result was solid; one iteration with the simple feedback "you can do better" produced a meaningfully improved output: punching and kicking mechanics, NPC enemies with consistent animation, an environment that could be interacted with, working scoring, and even a Windows-8-style blue screen that the player can break and "fix." This is the kind of result where Composer 2.5 stacks up favourably against frontier models on the same task: comparable creative output at the cost ratio described in the cost-engineering piece.

Where it didn't

The Godot demolition derby was an intentional stretch test based on a reference image of a soft-body-physics scene that no current model can replicate from a single screenshot. Through five or six iterations Composer 2.5 improved the output meaningfully — cars went from box-shapes-falling-from-the-sky to actual driveable vehicles with rough soft-body deformation — but never fully landed. The car physics never quite worked. Vehicles ended up stuck together in the final iteration.

The C++ skateboard scene had similar issues. The scene rendered, but the in-game text was oversized, fonts were inconsistent, and the overall polish never reached "shippable." The model knew what it was trying to draw; the execution was uneven. Both of these are tasks I'd file under "future model bucket regardless of provider." Composer 2.5 wasn't worse than the alternatives; it just isn't the right tool for stretch creative-coding work yet.

Frontend Design Was the Consistent Weak Point

The two frontend tests in the session both produced cookie-cutter outputs on first pass. A satirical "reverse quantization" sales page came out as a generic gradient-and-card layout. The ravioli-sauce product page had a decent background effect but lacked the polish of a model genuinely tuned for visual design — no signature product image where one would obviously have helped, an oddly placed spinning element, generally middle-of-the-pack output.

Iteration improved both modestly. Neither got to the level of output I've seen from purpose-built design tools or from frontier models specifically prompted for visual design. The pattern is straightforward and consistent with Composer 2.5's intended positioning: it's a model tuned for agentic coding tasks, not for visual judgement. The benchmark numbers reflect the same shape — it leads on SWE-Bench Multilingual and CursorBench, neither of which measures aesthetic quality.

For production purposes, the read is: don't route visual design work to Composer 2.5 as your default. Route it to a model whose training distribution prioritised that work, then come back to Composer 2.5 for the implementation.

A Convergence Finding Worth Flagging

The most curious moment in the session: on the Seinfeld-apartment-5A scene generation, Composer 2.5 produced a "paradox" callout with a red wireframe around the architecturally impossible hallway that the show's original set features. This is the same anomalous output that Gemini 3.5 Flash produced on the same prompt about twelve hours earlier, almost identical down to the visual treatment.

This is probably nothing — likely an artifact of similar training data on the same niche prompt that's been making the rounds in model evaluation circles. But it's the kind of cross-model behavioural convergence that's becoming more common as labs train on overlapping data and as similar prompt patterns propagate through evaluation corpora. Worth flagging because it complicates the "this output is uniquely the model's voice" framing that benchmark posts sometimes lean on.

Cost Observations From a Multi-Hour Session

Across the full session on the fast tier (multiple hours of testing, eleven distinct tasks, several with multiple iteration cycles) I consumed approximately 6% of my usage limit. The fast tier costs roughly 6× as much as the standard tier; extrapolating linearly, the same session on the standard tier would have consumed about 1% of the limit.

This is consistent with the published per-task cost numbers from the original builder's guide, but it's a useful real-world validation point: an entire afternoon of intensive testing, including expensive iteration loops on stretch tasks, costs roughly what a single complex run on a frontier model would cost. For teams running daily testing or experimentation budgets, the math is forgiving in a way it has not been for the last two years.

This is the kind of usage pattern that pairs naturally with the four-track parallel workflow I've written about — when iteration cost collapses, the right move is to iterate more, not less.
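The extrapolation above is simple enough to write down. This is a minimal sketch of the arithmetic, assuming usage scales linearly with the tier multiplier (my assumption, not something Cursor documents here); the function and variable names are illustrative only.

```python
# Figures come from the session described above; names are illustrative.
FAST_TIER_MULTIPLIER = 6.0    # fast tier costs roughly 6x the standard tier
FAST_TIER_USAGE_PCT = 6.0     # share of the usage limit consumed on the fast tier


def standard_tier_equivalent(fast_usage_pct: float, multiplier: float) -> float:
    """Estimate usage for the same session on the standard tier.

    Assumes usage scales linearly with the tier's price multiplier.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return fast_usage_pct / multiplier


def sessions_per_limit(usage_pct_per_session: float) -> float:
    """How many comparable sessions fit inside one full usage limit."""
    if usage_pct_per_session <= 0:
        raise ValueError("usage per session must be positive")
    return 100.0 / usage_pct_per_session


standard_usage_pct = standard_tier_equivalent(FAST_TIER_USAGE_PCT, FAST_TIER_MULTIPLIER)
print(f"standard tier: ~{standard_usage_pct:.1f}% of the limit per session")
print(f"fast tier: ~{sessions_per_limit(FAST_TIER_USAGE_PCT):.0f} sessions per limit")
print(f"standard tier: ~{sessions_per_limit(standard_usage_pct):.0f} sessions per limit")
```

On these numbers, roughly sixteen or seventeen comparable sessions fit inside one limit on the fast tier, and about a hundred on the standard tier, if the linear assumption holds.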

The Pattern Across All Eleven Tests

Where Composer 2.5 reliably holds up:

Structured, well-defined coding tasks with clear acceptance criteria
Iterative improvement loops where the model gets to see its own output via screenshots
Multi-step orchestration (sub-agents, OS-level operations, tool chains)
Mid-complexity creative coding (single-file games, scene generation, 3D demos)
Pushback on bad ideas rather than sycophantic validation

Where Composer 2.5 underperforms:

Visual design and aesthetic judgement
Single-shot creative output where iteration isn't available
Stretch tasks requiring novel synthesis (the demolition derby, novel algorithmic work)
Anything where the value depends on top-of-distribution reasoning rather than bulk-of-distribution competence

This matches the broader head-to-head pattern vs Claude Opus 4.7. Composer 2.5 wins decisively on the bulk of the production coding workload distribution. Frontier models still win on the tail. The interesting engineering work is in the routing layer that picks per task.
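To make the routing idea concrete, here is a minimal sketch of a per-task router built from the two lists above. The task categories mirror my own groupings from this session; the route labels are hypothetical placeholders, not real model identifiers or API names, and a production router would need evidence from your own evals rather than a hard-coded table.

```python
from enum import Enum


class TaskKind(Enum):
    STRUCTURED_CODING = "structured_coding"        # clear acceptance criteria
    ITERATIVE_UI_FIX = "iterative_ui_fix"          # screenshot-driven loops
    ORCHESTRATION = "orchestration"                # sub-agents, OS-level steps
    MID_CREATIVE_CODING = "mid_creative_coding"    # single-file games, 3D demos
    VISUAL_DESIGN = "visual_design"                # aesthetic judgement
    STRETCH_SYNTHESIS = "stretch_synthesis"        # novel or tail-of-distribution work


# Hypothetical route labels for illustration only.
COMPOSER = "composer-2.5"
FRONTIER = "frontier-model"
DESIGN = "design-tuned-model"

DEFAULT_ROUTES = {
    TaskKind.STRUCTURED_CODING: COMPOSER,
    TaskKind.ITERATIVE_UI_FIX: COMPOSER,
    TaskKind.ORCHESTRATION: COMPOSER,
    TaskKind.MID_CREATIVE_CODING: COMPOSER,
    TaskKind.VISUAL_DESIGN: DESIGN,
    TaskKind.STRETCH_SYNTHESIS: FRONTIER,
}


def route(kind: TaskKind, iteration_available: bool = True) -> str:
    """Pick a route label for a task.

    Single-shot creative work is sent to the frontier route, because
    creative output without an iteration loop was a weak spot in this session.
    """
    if kind is TaskKind.MID_CREATIVE_CODING and not iteration_available:
        return FRONTIER
    return DEFAULT_ROUTES[kind]


if __name__ == "__main__":
    print(route(TaskKind.ORCHESTRATION))
    print(route(TaskKind.MID_CREATIVE_CODING, iteration_available=False))
    print(route(TaskKind.VISUAL_DESIGN))
```

The only conditional rule is the single-shot one; everything else is a lookup, which keeps the table easy to revise as your own results come in.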

What Hasn't Changed

The caveats that remain after eleven tasks:

The model still hallucinates on stretch tasks. The demolition derby and the skateboard scene both showed the model going down implementation paths that needed real correction. Validation discipline doesn't change.
Iteration count matters more than first-shot quality. The PC-repair beat-em-up was good on first pass and excellent on iteration two. Plan for at least one feedback loop; don't expect first-shot perfection.
Frontend design is a real weak spot. Don't migrate visual-design workloads on the strength of the coding-benchmark numbers.
Harness lock-in remains the strategic question. Composer 2.5 only runs inside Cursor. For multi-IDE or provider-neutral architectures, this is a forcing constraint.
Validate on your own workloads. My eleven-task session is one engineer's distribution. Yours will be different. Run your own evals before re-routing real traffic.
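For the last point, a small tally is often enough to start with. The sketch below is one hypothetical way to record your own runs per task category and summarise them; the field names and the sample values are placeholders for illustration, not measurements from my session.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    category: str      # e.g. "single-file game", "frontend design"
    iterations: int    # feedback loops used before you stopped
    acceptable: bool   # did the final output meet your own bar?


def summarise(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Group results by category and report count, pass rate, mean iterations."""
    by_category: dict[str, list[TaskResult]] = defaultdict(list)
    for result in results:
        by_category[result.category].append(result)

    summary: dict[str, dict[str, float]] = {}
    for category, items in by_category.items():
        summary[category] = {
            "tasks": len(items),
            "acceptable_rate": sum(r.acceptable for r in items) / len(items),
            "mean_iterations": mean([r.iterations for r in items]),
        }
    return summary


if __name__ == "__main__":
    # Placeholder values only; replace with runs from your own workloads.
    sample = [
        TaskResult("single-file game", iterations=2, acceptable=True),
        TaskResult("single-file game", iterations=6, acceptable=False),
        TaskResult("frontend design", iterations=1, acceptable=False),
    ]
    for category, stats in summarise(sample).items():
        print(category, stats)
```

Tracking iterations alongside pass rate matters because, as noted above, iteration count told me more than first-shot quality did.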

The Practitioner's Take

The honest summary after eleven tasks: Composer 2.5 lives up to its reputation for the kinds of work I'd actually route to it in production, and predictably underperforms on the kinds of work I wouldn't. The benchmark numbers are directionally right; the production-relevant patterns are slightly different from what the benchmarks suggest.

The single most consequential finding from the session, by a wide margin, is the image-based debugging behaviour. It shifts the iteration loop on UI work more than any of the benchmark improvements do. If you're shipping anything that has a visual surface, that pattern alone is worth restructuring your feedback loop around.

The right move for anyone evaluating Composer 2.5 isn't to read more reviews, including this one. It's to spend an afternoon running it on real workloads from your specific production distribution. The patterns become obvious quickly, and the patterns are what matter. Benchmark numbers are signposts; production behaviour is the territory.