Textual-Feedback RL: How Composer 2.5 Solved Credit Assignment in Coding Agents

The most consequential technical detail in Cursor's Composer 2.5 release is one sentence in the announcement: the model uses textual-feedback reinforcement learning — localised hints at the point of failure injected into the training trajectory, instead of only end-of-run rewards. That sentence is doing a lot of work. The headline benchmark improvements (the +6 points on SWE-Bench Multilingual over Composer 2, the +14 points on Finance Agent v2 between Gemini 3.1 Pro and 3.5 Flash, the across-the-board jumps in long-horizon coding evals over the last six months) all trace back to a paradigm shift in how coding agents are trained, and textual-feedback RL is the cleanest implementation of that shift to ship in a public model so far.

This is a deep technical piece. The premise is that anyone building, training, or evaluating coding agents in production needs to understand what credit assignment is, why sparse end-of-run rewards fail catastrophically on long horizons, and what dense localised feedback actually changes. If you've been wondering why Composer 2.5 holds up over thirty-step traces in a way the prior Flash-tier models never did, this article is the why.

For the release context, the practitioner overview of Composer 2.5 is the reference. This piece is the technical follow-up.

The Credit-Assignment Problem, Explained From Scratch

Reinforcement learning is, at heart, the problem of training a policy to produce good actions by giving it a reward signal at some point in the action sequence. The simple version: model takes an action, environment returns a reward, gradient update nudges the policy toward actions that produced positive rewards and away from ones that produced negative rewards. Iterate millions of times. The policy gets better.

The complication: in any non-trivial environment, the reward arrives well after the action that caused it. A coding agent at step 14 of a 30-step task makes a tool call. The eventual outcome (task solved or not) is determined sixteen steps later by a chain of decisions that depend on what step 14 produced. When the reward arrives at step 30, the model has to figure out which earlier actions actually mattered: which decisions contributed to the final outcome and which ones were essentially neutral.

That figuring-out problem is credit assignment. It is the foundational hard problem in long-horizon RL, and the difficulty of it scales with horizon length.

Two specific failure modes worth understanding:

Credit dilution. When the reward is a single positive or negative scalar at the end of a 30-step trajectory, a basic policy-gradient update weights every step in the trajectory by that same scalar. Step 14's good decision and step 22's bad decision both get the same signal, proportional to the final reward. The model is asked to reverse-engineer which steps mattered from a signal that, by construction, doesn't distinguish between them.
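
To make the dilution concrete, here is a minimal sketch of how an undiscounted REINFORCE-style update weights each step when the only reward is a terminal scalar. The function name per_step_weights and the no-baseline, no-discount setup are simplifying assumptions for illustration, not a description of how any production trainer works.

```python
from typing import List


def per_step_weights(num_steps: int, final_reward: float) -> List[float]:
    """Weight applied to each step's log-probability gradient in an
    undiscounted REINFORCE update with a single terminal reward.

    With no intermediate rewards, the return-to-go from every step is
    just the final reward, so every step ends up with the same weight.
    """
    rewards = [0.0] * num_steps
    rewards[-1] = final_reward

    weights: List[float] = []
    running = 0.0
    # Accumulate the return-to-go from the last step backward.
    for r in reversed(rewards):
        running += r
        weights.append(running)
    weights.reverse()
    return weights


weights = per_step_weights(num_steps=30, final_reward=-1.0)
# Step 14 (index 13) and step 22 (index 21) receive identical weights.
print(weights[13], weights[21])  # prints: -1.0 -1.0
```

Baselines and discounting change the numbers but not the basic picture: without intermediate signal, nothing in the update separates the step that caused the failure from the steps that did not.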

Reward sparsity. When most trajectories fail (which is true of any non-trivial agentic task at the start of training), the model receives almost no positive signal at all. Training collapses to the policy slowly learning what not to do, with very little signal for what to do instead. Sample efficiency goes through the floor; training time goes up by orders of magnitude.

Both of these failure modes get worse as the horizon grows, because the space of possible trajectories grows exponentially with horizon length (and polynomially with branching factor). A 30-step task with five viable next-actions at each step has 5^30 (roughly 9.3 × 10^20) possible trajectories. A sparse, binary end-of-run reward provides at most one bit of information per rollout about that entire space. The model is asked to learn the right policy one bit at a time.
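
The size of that trajectory space is easy to check directly. The short calculation below uses the same illustrative numbers (five actions per step, thirty steps) and also reports how many bits it would take to single out one trajectory, for comparison with the single bit a binary outcome reward carries.

```python
import math

branching_factor = 5   # viable next-actions per step (illustrative)
horizon = 30           # steps in the task (illustrative)

num_trajectories = branching_factor ** horizon
bits_to_identify_one = horizon * math.log2(branching_factor)

print(f"{num_trajectories:.3e} possible trajectories")
print(f"{bits_to_identify_one:.1f} bits to single out one of them")
```

Under these assumptions the count is about 9.3 × 10^20 trajectories, and identifying one of them takes just under 70 bits.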

Why Sparse End-of-Run Rewards Collapse on Long Horizons

The classic RL responses to credit assignment are well-known and mostly insufficient for long-horizon agentic coding.

Temporal-difference (TD) learning distributes the final reward backward through the trajectory, weighted by some discount factor. This works well when the environment is approximately Markovian and the value function is learnable. Coding agent trajectories aren't Markovian — the relevant state at step 14 depends on the entire conversation history, the tool-call outputs from steps 1–13, and the implicit context of the codebase. The value function is correspondingly hard to estimate.
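
For reference, this is what TD learning looks like in its simplest tabular form. The sketch assumes a toy chain with one state per step, which is exactly the compact, Markovian setting that coding-agent trajectories do not provide; the function name td0_episode and the parameter values are illustrative.

```python
from typing import List


def td0_episode(values: List[float], rewards: List[float],
                alpha: float = 0.5, gamma: float = 1.0) -> List[float]:
    """One pass of tabular TD(0) over a fixed chain of states 0..T-1.

    `values` has T + 1 entries; the last one is the terminal state,
    whose value stays at 0. `rewards[t]` is the reward for leaving state t.
    """
    for t, r in enumerate(rewards):
        td_target = r + gamma * values[t + 1]
        values[t] += alpha * (td_target - values[t])
    return values


T = 30
rewards = [0.0] * (T - 1) + [1.0]   # reward only on the final step
values = [0.0] * (T + 1)

td0_episode(values, rewards)
nonzero = [t for t, v in enumerate(values) if v != 0.0]
print(nonzero)  # prints: [29]
```

After one episode only the last state has a non-zero estimate. In this setup the terminal reward reaches one additional state per episode, so even the idealised case needs many passes before step 14 sees any signal.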

Reward shaping injects intermediate rewards based on hand-crafted heuristics — "give partial credit when the agent calls the right tool category," "subtract a small reward for syntactically invalid output." This works in narrow domains but doesn't scale. The set of relevant intermediate signals for "did this coding agent make a good decision" is too large and too task-specific to hand-craft, and getting the shaping wrong introduces bias that the policy learns to exploit.
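
A toy example shows how easily a shaping heuristic can be exploited. The sketch below assumes a hypothetical shaped reward that adds a small bonus for every syntactically valid Python output (checked with the standard-library ast.parse) on top of a task-completion reward; the function name and bonus size are made up for illustration.

```python
import ast
from typing import List


def shaped_reward(outputs: List[str], solved: bool, bonus: float = 0.1) -> float:
    """Task reward plus a hand-crafted bonus per syntactically valid output."""
    total = 1.0 if solved else 0.0
    for source in outputs:
        try:
            ast.parse(source)
        except SyntaxError:
            continue
        total += bonus
    return total


# An agent that solves the task in two attempts, one of them malformed.
honest = shaped_reward(["def f(:", "def f(): return 1"], solved=True)

# An agent that never solves the task but emits thirty valid no-ops.
gaming = shaped_reward(["pass"] * 30, solved=False)

print(gaming > honest)  # prints: True
```

The proxy rewards the no-op policy more than the one that actually solved the task, which is the kind of bias a policy will find and exploit.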

Reward modelling (RLHF and its descendants) trains a separate reward model, typically from human preferences over completed outputs, and uses its scores as the reward signal during RL. This is genuinely useful and underpins much of the recent progress on chat models. It scales poorly to long-horizon agentic work because, when queried mid-trajectory, the reward model is asked to predict trajectory-level outcomes from intermediate states that don't strongly signal them. "Was this tool call good" is genuinely difficult to answer without seeing how the trajectory unfolds, and reward-model accuracy degrades as horizon length grows.

The net effect: for long-horizon agentic coding, the standard RL toolkit produces models that plateau at modest performance well below the capability ceiling of the underlying base model. Composer 2 and the earlier Flash-tier coding agents all hit this plateau. The architecture wasn't the bottleneck; the credit-assignment problem was.

What "Textual Feedback" Actually Injects

Composer 2.5's training paradigm is the cleanest production answer to this problem I've seen described. Instead of relying on end-of-run scalar rewards, the training loop injects localised text feedback at the exact step where something went wrong. The trajectory still ends with a final reward, but each failure step now carries an explicit, localised, dense signal about what went wrong and why.

Concretely, when a tool call returns an error or produces unexpected output, the training infrastructure synthesises a textual annotation at that step — something like "this tool call returned a 404 because the path you specified does not exist; the correct path was X based on the file structure visible at step Y." That annotation gets folded back into the trajectory for the gradient update. The model sees not just the wrong action but the wrong action plus a description of why it was wrong, anchored to the exact step where it happened.
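
As a rough illustration of what such an annotation step could look like, the sketch below attaches a hint to a failed file-read step. Everything here is hypothetical: the ToolStep record, the read_file tool name, and the filename-matching heuristic are assumptions chosen for the example, not details from Cursor's announcement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolStep:
    index: int                      # 1-based step number in the trajectory
    tool: str                       # hypothetical tool name, e.g. "read_file"
    argument: str                   # the argument the agent passed
    error: Optional[str] = None     # error text returned by the tool, if any
    feedback: Optional[str] = None  # textual annotation, filled in below


def annotate_missing_path(step: ToolStep, known_paths: List[str]) -> ToolStep:
    """Attach a localised hint to a step whose tool call returned an error."""
    if step.error is None:
        return step
    hint = (f"Step {step.index}: {step.tool}({step.argument!r}) "
            f"failed with {step.error!r}.")
    # Suggest a known path that has the same file name, if one exists.
    filename = step.argument.rsplit("/", 1)[-1]
    matches = [p for p in known_paths if p.rsplit("/", 1)[-1] == filename]
    if matches:
        hint += f" A file with that name exists at {matches[0]!r}."
    step.feedback = hint
    return step


step = ToolStep(14, "read_file", "src/util/parse.py", error="404: path not found")
annotate_missing_path(step, ["src/utils/parse.py", "src/main.py"])
print(step.feedback)
```

The point of the structure is that the hint lives on the step record itself, so whatever consumes the trajectory later sees the explanation at the position where the mistake happened.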

Localised hints at the failure point

The mechanical structure: the training infrastructure has access to ground truth (the correct trajectory or at least a known-good outcome). When the agent's trajectory diverges from a known-good path at some step, the infrastructure generates a textual hint describing what went wrong at that step, in terms the model can interpret. The hint is anchored to the step, not to the trajectory as a whole.
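
One simple way to picture the divergence check is a step-by-step comparison against a single reference trajectory. This is a deliberately naive sketch: exact string matching against one known-good path is an assumption made for brevity, and real tasks usually admit many valid paths. The function names are hypothetical.

```python
from typing import List, Optional


def first_divergence(agent: List[str], reference: List[str]) -> Optional[int]:
    """Return the 1-based step where the agent first departs from the
    reference trajectory, or None if the two are identical."""
    for i, (taken, expected) in enumerate(zip(agent, reference), start=1):
        if taken != expected:
            return i
    if len(agent) != len(reference):
        # One trajectory is a strict prefix of the other.
        return min(len(agent), len(reference)) + 1
    return None


def hint_for(agent: List[str], reference: List[str]) -> Optional[str]:
    """Describe the first divergence in text, anchored to its step."""
    step = first_divergence(agent, reference)
    if step is None:
        return None
    taken = agent[step - 1] if step <= len(agent) else "<no action>"
    expected = reference[step - 1] if step <= len(reference) else "<stop>"
    return f"Step {step}: took {taken!r}; the known-good path took {expected!r}."


agent = ["ls src", "read src/util/parse.py", "run tests"]
reference = ["ls src", "read src/utils/parse.py", "run tests"]
print(hint_for(agent, reference))
```

The output names step 2 and both actions, which is the step-anchored form of feedback the paragraph above describes.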

This is what "localised" means in practice. The model isn't told "the trajectory failed because of something somewhere in steps 1–14." It's told "step 14's action was wrong for these specific reasons." That precision is the difference between a signal the model can learn from and a signal that gets diluted across thirty steps.

The dense-signal transformation

In information terms, this turns a sparse signal into a dense one. Where a typical 30-step trajectory previously carried at most one bit of outcome information at the end (success or failure), it now carries localised signal at every step where something went wrong, which on early training trajectories is most steps. The information density of the training signal per trajectory goes up substantially; as a rough back-of-envelope estimate, by something like two orders of magnitude.

That higher information density compounds in two ways. First, sample efficiency improves dramatically: fewer trajectories are needed to learn a given policy improvement. Second, the policy learns what to do much earlier in training, because correct actions also become identifiable step by step instead of only being inferred from averages over many trajectories.

The trajectory rewrite

A subtle but important detail: the localised feedback is folded back into the trajectory before the gradient update. The model isn't just receiving a richer signal; it's receiving training data in which the wrong action at step 14 is paired with the textual explanation of why it was wrong. The next time the model encounters a similar situation, it has not just the policy update but the embedded reasoning trace about what to avoid.
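
The folding step can be pictured as a serialisation problem: the hint has to appear in the training text next to the step it refers to. The sketch below shows only that serialisation. The line format and the render_with_feedback name are assumptions, and how the resulting text enters the loss is not something the announcement specifies.

```python
from typing import List, Optional, Tuple

# Each step is (action_text, observation_text, optional_hint).
Step = Tuple[str, str, Optional[str]]


def render_with_feedback(steps: List[Step]) -> str:
    """Serialise a trajectory, placing each hint directly after its step."""
    lines: List[str] = []
    for i, (action, observation, hint) in enumerate(steps, start=1):
        lines.append(f"[step {i}] action: {action}")
        lines.append(f"[step {i}] observation: {observation}")
        if hint is not None:
            lines.append(f"[step {i}] feedback: {hint}")
    return "\n".join(lines)


trajectory: List[Step] = [
    ("read_file('a.py')", "ok", None),
    ("read_file('b.py')", "404", "b.py does not exist; the file is at lib/b.py"),
]
print(render_with_feedback(trajectory))
```

Only the failing step gains a feedback line, so the explanation stays attached to the action it criticises.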

This is closer to "training the model on annotated mistakes" than to "training the model on rewards." It also explains why the long-horizon improvement is so much larger than the short-horizon improvement — the localised feedback compounds with the number of decision points, so its leverage is greatest exactly where credit assignment hurts most.

How This Compares to RLHF and Standard PPO

A grounded comparison for ML practitioners:

vs RLHF. RLHF trains a reward model from human preferences over completed outputs, then uses that reward model during PPO-style RL. The signal density is somewhat better than end-of-run scalar rewards because the reward model can be queried at intermediate states, but the reward model itself has to be trained on completed outputs and tends to be inaccurate at intermediate states. Textual-feedback RL sidesteps this entirely — it doesn't train a reward model at all. It generates explicit textual annotations from ground truth and uses those directly.

vs PPO with reward shaping. Hand-crafted intermediate rewards (e.g. "give partial credit for syntactically valid output") are the closest classical analogue, but they suffer from the bias problem — the model learns to game whatever proxy you give it. Textual feedback is generated from the actual deviation between agent trajectory and known-good trajectory, so the signal is grounded in real outcome differences rather than in heuristic proxies.

vs process-reward modelling (PRM). PRM trains a separate model to score intermediate steps in reasoning traces (originally for math reasoning). It's conceptually similar to textual-feedback RL but requires a separate trained reward model. Textual-feedback RL uses ground-truth-grounded textual annotations directly, which removes the reward-model-accuracy bottleneck.

The net read: textual-feedback RL sits in the same lineage as RLHF and PRM but solves a specific weakness of both — it provides dense, locally anchored, ground-truth-grounded feedback without requiring a separately trained reward model. For tasks where ground truth or near-ground-truth is available (which is true of most coding tasks: tests passing, type checks succeeding, expected outputs matching), this is a strictly better training signal.

Why It Matters Disproportionately for Coding Agents

Three structural properties of coding workloads make textual-feedback RL especially well-suited to them.

Long-horizon by nature

Coding tasks routinely involve fifteen to fifty tool calls — read files, edit files, run tests, interpret output, repeat. Credit assignment over horizons this long is exactly where sparse end-of-run rewards collapse hardest. The training paradigm shines most where the problem is worst.

Deterministic failure signals

Coding failures are unusually amenable to textual annotation because they come with deterministic, machine-readable failure modes. A failed test produces a stack trace. A 404 response includes the path. A type error includes the type mismatch. The "what went wrong and why" annotation can be generated automatically from the deterministic failure signal, with high fidelity. This isn't true of most other agentic domains (negotiation, planning, creative work) where failure is fuzzier.
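
A small example of how mechanical this can be: the sketch below runs a function, catches the exception, and turns the innermost stack frame into a one-line description using Python's standard traceback module. The describe_failure helper is a hypothetical name, and a real pipeline would presumably handle far more than in-process exceptions.

```python
import traceback
from typing import Callable, Optional


def describe_failure(func: Callable[..., object], *args: object) -> Optional[str]:
    """Run func(*args); if it raises, summarise the failure in one line."""
    try:
        func(*args)
    except Exception as exc:
        # The last entry in the extracted traceback is the innermost frame.
        frame = traceback.extract_tb(exc.__traceback__)[-1]
        return (f"{type(exc).__name__} in {frame.name}() "
                f"at line {frame.lineno}: {exc}")
    return None


def add(a, b):
    return a + b


print(describe_failure(add, 1, "2"))
```

The message names the exception type, the function where it was raised, and the interpreter's own explanation, all without a human in the loop.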

Tool-call traces as ground truth

Every tool call the agent makes leaves a structured trace — inputs, outputs, errors, side effects. Those traces are themselves a near-ground-truth representation of the agent's decisions. A training infrastructure that has access to the tool-call traces has the data it needs to generate localised textual feedback automatically, without humans-in-the-loop. This makes the paradigm operationally scalable in a way most RL improvements aren't.

The conjunction of these three properties — long horizons, deterministic failures, structured traces — is roughly unique to coding-agent training. It explains why Cursor (a company whose entire product is structured tool-call traces from coding agents) was positioned to ship this paradigm first.

What Builders Training Their Own Agents Can Take From This

Even if you aren't training a frontier model, the framework applies to anyone building agents that depend on RL fine-tuning, supervised fine-tuning over agent traces, or eval-driven iterative improvement.

If you're collecting agent traces in production, you're sitting on the raw material. The structured tool-call traces your existing agents produce contain everything needed to generate localised textual feedback for fine-tuning. The bottleneck is annotation infrastructure, not data.

The "annotated mistakes" framing extends to in-context learning, not just RL. Even without fine-tuning, you can include localised textual feedback from prior failed trajectories in the prompt for new runs. This is closer to retrieval-augmented few-shot than to RL, but the underlying logic (locally anchored feedback beats global feedback) translates directly.

Eval design matters more than ever. Textual-feedback RL works because the training infrastructure can compute ground truth at each step. If your evals only measure end-to-end outcomes, you're losing the same information density gains the training paradigm relies on. Build evals that produce step-by-step ground truth where possible.
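
The in-context variant needs very little machinery. As a minimal sketch, assuming you already store step-level hints from earlier failed runs as plain strings, a prompt builder could look like the following; the build_prompt name, the wording of the header, and the max_hints cap are all illustrative choices.

```python
from typing import List


def build_prompt(task: str, prior_hints: List[str], max_hints: int = 3) -> str:
    """Prepend the most recent step-level hints from earlier failed runs."""
    parts: List[str] = []
    # Guard against max_hints == 0, since a [-0:] slice would return everything.
    recent = prior_hints[-max_hints:] if max_hints > 0 else []
    if recent:
        parts.append("Notes from earlier failed attempts:")
        parts.extend(f"- {hint}" for hint in recent)
        parts.append("")
    parts.append(f"Task: {task}")
    return "\n".join(parts)


hints = [
    "Step 3: read_file('src/util/parse.py') failed; the file is at src/utils/parse.py.",
    "Step 9: tests were run before the edit was saved.",
]
print(build_prompt("Fix the failing parser test", hints))
```

Capping the number of hints keeps the prompt short; which hints to retrieve for a given task is a separate design question this sketch does not address.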

The tool-use architectures I've written about for SMB workflows are an applied example of this thinking: deterministic tools at each step, structured outputs, fail-loud behaviour. Models trained with textual-feedback RL specifically reward this kind of architecture, because it's the architecture that produces the structured traces the training paradigm needs.

What's Not Solved

Three honest open problems:

Ground truth isn't always available

Textual-feedback RL relies on the training infrastructure being able to compute "the right answer" at intermediate steps. For coding tasks with tests, type checks, and expected outputs, this is usually feasible. For coding tasks without those signals (open-ended exploration, design decisions, novel algorithm development), ground truth at intermediate steps is harder to come by. In those cases the paradigm falls back to something like standard RLHF, with a learned reward model standing in for ground truth.

Annotation quality bounds gain

The textual feedback is only as good as the synthesis process that generates it. If the annotations are vague, contradictory, or themselves wrong, the model learns from noise. Practical implementations need a quality bar on the annotation pipeline — which becomes its own engineering investment.

Multi-agent and tool-mediated interactions

When the agent's decisions interact with other agents, with environments that have hidden state, or with tools whose behaviour is non-deterministic, the "ground truth at this step" question gets harder. The cleanest application of textual-feedback RL is to single-agent, deterministic-tool, observable-environment coding tasks. Beyond that, the paradigm needs adaptation.

These open problems are why the production failure modes for AI agents I've written about still apply: the training paradigm reduces the failure surface, it doesn't eliminate it. They are also where the next paradigm shift will probably come from, when someone finds a way to generate dense localised feedback in settings where ground truth is harder to access. My guess is that it comes from an organisation already running production agents at scale on the Claude Agent SDK or an equivalent platform; the people with the most structured trace data win the next round.

The Expert's Take

Textual-feedback RL is the most important training-side advance in agentic coding since the 2024 wave of long-context models, and it has barely been discussed outside Cursor's own technical report. The reason it matters is structural: it converts a problem (credit assignment over long horizons) that has bottlenecked agentic training for a decade into something tractable for any organisation that has structured trace data and reasonable annotation infrastructure.

The follow-on implications take a while to land. Within the year, every serious coding-agent training effort will use some variant of locally anchored, ground-truth-grounded feedback rather than end-of-run scalar rewards. Within two years, the same approach extends beyond code-writing into adjacent domains wherever structured traces and approximate ground truth are available: software engineering tasks broadly, devops, data engineering, possibly customer-support agents where conversation outcomes are measurable.

Composer 2.5 is the first publicly deployed model trained this way at scale. It will not be the last. The signal to watch over the next few quarters is which other labs adopt the paradigm and what specific variations they introduce. The shape of the field for the next few years gets defined by what happens there.