Claude Sonnet vs Opus: A Practitioner's Guide to Choosing the Right Model

The Claude 4 family (Sonnet 4.6, Opus 4.6, and Opus 4.7) gives builders three meaningful tiers to choose from. Pick the wrong one and you're either burning money on a task that didn't need the firepower, or shipping a feature that almost works. The right choice usually isn't "the smartest model"; it's "the right tier for the job, plus a routing layer that escalates when needed."

This is the framework I use.

The Fundamental Trade-Off

Sonnet and Opus aren't different models doing the same thing; they're tuned for different operating points:

Sonnet is optimised for the ratio of capability to cost. It handles 90% of production tasks well, runs fast, and is priced for high volume.
Opus is optimised for capability ceiling. It does deeper reasoning, longer planning, and more nuanced synthesis, at higher cost and modestly higher latency.

The decision isn't "which is better." It's "where does this specific task sit on the difficulty axis, and is the marginal capability worth the marginal cost?"

When Sonnet Is Enough

Sonnet handles the following well, with no measurable advantage from Opus:

Classification and extraction. Categorising tickets, extracting structured data from documents, tagging content, identifying intent. Sonnet hits 95%+ accuracy on these and costs roughly a fifth of what Opus does.
Standard summarisation. News articles, meeting notes, customer feedback. Opus produces marginally smoother prose; the information content is the same.
Real-time chat assistants. Customer support, tutoring, coding helpers. Latency matters for these and Sonnet is faster.
Translation and rephrasing. Sonnet handles all major languages competently.
Structured generation with clear schemas. Filling in forms, generating SQL from natural language, producing JSON in known shapes. The constrained output reduces the value of more sophisticated reasoning.

If you can write a tight system prompt and an output schema for the task, Sonnet is almost always the right choice.
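As a rough illustration of what "an output schema" can mean in practice, here's a minimal sketch for a ticket-triage task. The category names, the `TicketLabel` shape, and `parseTicketLabel` are all hypothetical, made up for this example; the point is only that the model's reply is checked against a known shape before anything downstream trusts it.

```typescript
// Hypothetical output schema for a ticket-triage task (names are illustrative).
const CATEGORIES = ["billing", "bug", "feature_request", "other"] as const;
type Category = (typeof CATEGORIES)[number];

interface TicketLabel {
  category: Category;
  urgent: boolean;
}

// Validate the model's raw text against the schema. Returns null on any
// mismatch so the caller can retry or escalate instead of trusting bad output.
function parseTicketLabel(raw: string): TicketLabel | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const record = data as Record<string, unknown>;
  const category = record.category;
  const urgent = record.urgent;
  if (typeof category !== "string" || typeof urgent !== "boolean") return null;
  if ((CATEGORIES as readonly string[]).indexOf(category) === -1) return null;

  return { category: category as Category, urgent };
}
```

When a task fits this mould (small closed label set, machine-checkable reply), a failed parse is a cheap signal to retry, which is part of why the extra reasoning depth of a bigger model buys so little here.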

When Opus Earns the Premium

Opus is genuinely better when the task requires:

Multi-step reasoning over an entire problem. Legal contract analysis, complex code review, debugging a multi-file issue, architectural critique. The ability to hold a complete picture and reason about second-order effects is where Opus pulls ahead.

Synthesis across many sources. Reading a 200K-token research dump and producing a coherent thesis. Sonnet can do it; Opus produces noticeably better synthesis.

Tasks where partial correctness is worse than failure. A legal opinion that's 80% right is dangerous. A code change that compiles but breaks an invariant is dangerous. For high-stakes outputs, the smaller error rate of Opus has real value.

Open-ended planning and agent work. When the model needs to decompose a complex task into sub-tasks, manage state across many tool calls, and recover from intermediate failures, Opus's planning ability is meaningfully stronger.

Deep code work. Refactoring a system module, reasoning about a performance regression, implementing a feature that touches 30 files. Sonnet is excellent at writing isolated functions; Opus is excellent at reasoning about systems.

Opus 4.6 vs Opus 4.7

Within the Opus tier, 4.7 is the current flagship. Two material differences from 4.6:

A 1M-token context window (vs 200K for 4.6). For very large documents, full codebases, or long agent sessions, this is a step-change. The pattern of "stuff your whole codebase in the prompt and ask questions" becomes practical.
Improved long-context recall. Performance on needle-in-a-haystack tasks at the deep end of the context window is meaningfully better.

If your use case routinely exceeds 200K tokens, 4.7 is the right choice. If not, 4.6 is more than capable.
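One way to encode that rule is a small size check before the call. This is a sketch under loud assumptions: the 4-characters-per-token estimate is a crude heuristic for English text (use a real token counter for anything that matters), `pickOpusModel` and `estimateTokens` are hypothetical helpers, and the `"claude-opus-4-6"` identifier is my guess at the naming pattern, so check it against the current model list.

```typescript
// Context limit for Opus 4.6 as described above; beyond it, use 4.7.
const OPUS_4_6_CONTEXT_TOKENS = 200000;

// Crude heuristic (an assumption): roughly 4 characters per token for English.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pick the Opus version by estimated prompt size plus room for the reply.
// "claude-opus-4-6" is an assumed identifier; verify it before use.
function pickOpusModel(prompt: string, reservedOutputTokens = 4096): string {
  const needed = estimateTokens(prompt) + reservedOutputTokens;
  return needed > OPUS_4_6_CONTEXT_TOKENS ? "claude-opus-4-7" : "claude-opus-4-6";
}
```

Because the estimate is rough, it's worth leaving a safety margin below the 200K line rather than routing right at the boundary.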

The Hybrid Architecture

The production shape of mature Claude integrations: route by difficulty.

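import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// classifyDifficulty and buildPrompt are application-specific helpers defined elsewhere.
async function answerQuestion(question: string, context: string) {
  // Step 1: cheap triage on Sonnet.
  // Returns "simple" | "moderate" | "complex".
  const triage = await classifyDifficulty(question, context);

  // Step 2: only "complex" escalates to Opus; "simple" and "moderate" stay on Sonnet.
  if (triage === "complex") {
    return await client.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 4096,
      messages: [{ role: "user", content: buildPrompt(question, context) }],
    });
  }

  return await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: buildPrompt(question, context) }],
  });
}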
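The router leans on `classifyDifficulty`, so here's one way it might look. This is a sketch, not the canonical implementation: the prompt wording is mine, and `complete` is a hypothetical stand-in for whatever function wraps your Sonnet call and returns the reply text. I've added it as a third argument (the router above calls the helper with two) purely so the parsing logic can be tested without hitting the API.

```typescript
type Difficulty = "simple" | "moderate" | "complex";

// Hypothetical stand-in for the Sonnet call: prompt in, reply text out.
type Complete = (prompt: string) => Promise<string>;

// Map the model's free-text reply onto a label. Anything unrecognised is
// treated as "complex", so a confused classifier escalates rather than
// quietly under-serving a hard question.
function parseDifficulty(reply: string): Difficulty {
  const word = reply.trim().toLowerCase();
  if (word.startsWith("simple")) return "simple";
  if (word.startsWith("moderate")) return "moderate";
  return "complex";
}

async function classifyDifficulty(
  question: string,
  context: string,
  complete: Complete,
): Promise<Difficulty> {
  const prompt = [
    "Rate how hard the question below is to answer well, given the context.",
    "Reply with exactly one word: simple, moderate, or complex.",
    "",
    `Context:\n${context}`,
    "",
    `Question:\n${question}`,
  ].join("\n");
  return parseDifficulty(await complete(prompt));
}
```

Falling back to "complex" on an unparseable reply is a design choice that trades money for safety. If your traffic is overwhelmingly simple, falling back to Sonnet may be the better default.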

For most production systems, the routing distribution lands around:

80-90% of calls go to Sonnet
10-20% escalate to Opus

The triage step itself runs on Sonnet (it's a simple classification task) and adds ~$0.001 per request, negligible compared to the savings on the 80-90% of calls that don't need Opus.
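To put rough numbers on that, here's a back-of-envelope sketch. It assumes the 5× price ratio used in this guide and, less realistically, that Sonnet and Opus calls are the same size in tokens. `relativeCost` and its `triageOverhead` parameter (triage cost expressed as a fraction of one Sonnet call) are hypothetical names for illustration.

```typescript
// Cost of a routed system relative to sending every call to Opus (1.0 = same cost).
// opusShare: fraction of calls escalated to Opus, between 0 and 1.
// priceRatio: Opus price divided by Sonnet price for a comparable call.
// triageOverhead: triage cost per request, as a fraction of one Sonnet call (assumed).
function relativeCost(opusShare: number, priceRatio = 5, triageOverhead = 0): number {
  // Work in "Sonnet-call units": one Sonnet call costs 1, one Opus call costs priceRatio.
  const sonnetUnits = (1 - opusShare) + triageOverhead; // triage runs on every request
  const opusUnits = opusShare * priceRatio;
  return (sonnetUnits + opusUnits) / priceRatio;
}
```

Under those assumptions, escalating 20% of calls comes out at about 0.36 of the all-Opus bill, and escalating 10% at about 0.28. Real savings depend on how much longer your Opus prompts and replies are, so treat this as a sanity check rather than a forecast.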

How to Decide for Your Use Case

The empirical method I trust:

Take 100 representative real inputs from your production traffic (or simulated equivalents).
Run them through Sonnet. Score the outputs.
Run the same 100 through Opus. Score those.
For inputs where Opus is meaningfully better, identify the pattern. Can you build a triage classifier that detects this pattern in advance?

Three outcomes:

Sonnet's outputs are equivalent to Opus's → ship on Sonnet, save the money.
Opus is universally better → either ship on Opus (rare; usually too expensive for production) or accept Sonnet quality and budget for prompt optimisation.
Opus is better on a definable subset → build the hybrid router. This is the most common outcome.
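The comparison and the three-way decision can be sketched in a few lines once the outputs are scored. Everything here is illustrative: `ScoredPair`, `decide`, and the three thresholds are my own assumptions to tune, not fixed parts of the method, and the scoring itself (human rubric, automated grader, whatever you trust) happens before this code runs.

```typescript
type Outcome = "ship-sonnet" | "opus-everywhere" | "hybrid-router";

interface ScoredPair {
  id: string;
  sonnet: number; // score for Sonnet's output, higher is better
  opus: number; // score for Opus's output on the same input
}

// margin: how much higher Opus must score to count as "meaningfully better".
// negligibleShare: at or below this share of Opus wins, treat the models as equivalent.
// universalShare: at or above this share, treat Opus as universally better.
// All three defaults are assumptions, not recommendations.
function decide(
  pairs: ScoredPair[],
  margin = 1,
  negligibleShare = 0.05,
  universalShare = 0.9,
) {
  const opusWins = pairs.filter((p) => p.opus - p.sonnet >= margin);
  const share = pairs.length === 0 ? 0 : opusWins.length / pairs.length;

  let outcome: Outcome;
  if (share <= negligibleShare) outcome = "ship-sonnet";
  else if (share >= universalShare) outcome = "opus-everywhere";
  else outcome = "hybrid-router";

  // The winning ids are the inputs to read through when looking for a
  // pattern that a triage classifier could detect in advance.
  return { outcome, share, opusWinIds: opusWins.map((p) => p.id) };
}
```

The useful output is usually `opusWinIds` rather than the label: reading those inputs side by side is how you find out whether the subset is definable at all.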

The Anti-Patterns

Three mistakes I see repeatedly:

Defaulting to Opus because "more capable is better." It costs 5× more for tasks where the difference is invisible to the user.
Defaulting to Sonnet for genuinely complex tasks. Then patching the prompt for weeks to coax good output instead of moving up a tier.
Routing on user tier instead of task difficulty. "Enterprise customers get Opus, everyone else Sonnet" is a UX choice masquerading as an engineering one. Route on what the task actually needs.

The One-Line Summary

Sonnet 4.6 for the volume. Opus 4.7 for the depth. A triage classifier between them. That's the architecture that ships.