Claude Sonnet 4.6 is the model most production AI features should be built on. It's the workhorse of the Claude 4 family: strong enough to handle complex reasoning, fast enough to drive real-time features, and priced for the volume that production usage actually generates. Opus is what you reach for when Sonnet isn't enough; Sonnet is what you ship.
This article is the production playbook: when to choose it, how to prompt it, and the cost discipline that keeps your bill from outpacing your revenue.
When Sonnet Is the Right Choice
The decision matrix I use in code review:
- High-volume, latency-sensitive features (chat, search, autocomplete, summarisation). Sonnet.
- Standard analysis and extraction (parsing structured data from documents, classifying tickets, drafting responses). Sonnet.
- Multi-step workflows where each step is bounded (a customer support agent that follows a defined playbook). Sonnet.
- Anything that runs millions of times a day: Sonnet, almost certainly.
Reach for Opus only when:
- The reasoning depth genuinely matters (legal analysis, deep code review, research synthesis)
- The task is rare enough that the per-call cost is irrelevant
- You've measured Sonnet's output and it's not good enough on a specific subtask
Most production AI features end up using Sonnet for 90%+ of calls and Opus for the small remainder that genuinely needs it. A two-model architecture is normal.
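One way to make that split concrete is a small routing helper. A sketch: the task taxonomy and the Opus escalation list are illustrative placeholders, and the Opus model id here is an assumption you should replace with whatever your account targets.

```typescript
// Hypothetical task taxonomy; replace with your own feature names.
type TaskKind =
  | "chat"
  | "summarise"
  | "extract"
  | "classify"
  | "legal_analysis"
  | "deep_code_review";

// Only list tasks where Opus has *measurably* beaten Sonnet in your evals.
const OPUS_TASKS: ReadonlySet<TaskKind> = new Set([
  "legal_analysis",
  "deep_code_review",
]);

const SONNET = "claude-sonnet-4-6";
const OPUS = "claude-opus-4-1"; // placeholder id; substitute your Opus model

// Route the rare reasoning-heavy tasks up; everything else stays on Sonnet.
export function pickModel(task: TaskKind): string {
  return OPUS_TASKS.has(task) ? OPUS : SONNET;
}
```

The point is that the routing decision lives in one place, so "upgrade this one subtask to Opus" is a one-line change backed by measurement, not a rewrite.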
The Anthropic SDK in Production
The minimum viable production setup, in TypeScript:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  maxRetries: 3,
  timeout: 30_000,
});

export async function summarise(text: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    system: "You produce concise, factual summaries. Never speculate beyond the source.",
    messages: [{ role: "user", content: `Summarise:\n\n${text}` }],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
```
Three production defaults to set on day one:
- `max_tokens` always specified. Letting it default invites surprise costs. Pick a value tied to your use case (256 for short answers, 2048 for analysis, 8192 for long-form).
- The `system` prompt is where the constraints live. "Never speculate", "Always reply in JSON", "Output in en-GB". Put them in `system`, not `user`.
- Retries with exponential backoff. The SDK handles this; just configure `maxRetries`.
Prompt Caching: The Single Biggest Cost Lever
If you remember one thing from this article: use prompt caching for any prompt with a stable prefix you'll reuse. It cuts the cost of the cached portion by ~90% and reduces latency meaningfully on cache hits.
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt, // 4,000 tokens of guidance, examples
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});
```
The system prompt is cached for about five minutes, and each cache hit refreshes that window. Every subsequent call within the window pays ~10% of the original cost for the cached portion, plus the full price for the uncached user message.
For a chatbot getting 200 requests per minute, this is the difference between a $4,000/month bill and a $600/month bill. For a RAG application with a 10K-token system prompt of retrieved context per request, caching the retrieved context (when it doesn't change between turns) is worth thousands.
The breakpoints to know:
- Cache writes cost ~25% more than the equivalent non-cached input.
- Cache reads cost ~10% of input pricing.
- Break-even: caching pays off after ~2 reads per write.
- For high-traffic endpoints, you'll easily get 50-500 reads per cache window; the savings compound.
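The break-even arithmetic is worth sanity-checking against your own traffic. A sketch that models one cache window as one write request plus some number of reads, using the ~25% write premium and ~10% read rate quoted above (ratios only, not dollar pricing):

```typescript
// Cost of a cache window (1 write request + `reads` read requests) for the
// cached prefix, relative to serving every request uncached.
// Ratios mirror the bullets above: a write costs ~1.25x a normal input
// token, a read ~0.10x. A return value below 1 means caching is cheaper.
export function cachedCostRatio(reads: number): number {
  const uncached = 1 + reads; // every request pays full input price
  const cached = 1.25 + reads * 0.1; // one premium write, then cheap reads
  return cached / uncached;
}
```

At zero reads you eat the 25% write premium for nothing; by a couple of reads per write the ratio has already dropped below half, and at 50-500 reads it approaches the floor of ~0.10.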
Structured Outputs: Tool Use Pattern
For features that need reliable JSON, the tool use pattern beats "please respond in JSON" prompting by a wide margin:
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [
    {
      name: "extract_invoice",
      description: "Extract structured invoice data from text.",
      input_schema: {
        type: "object",
        properties: {
          vendor: { type: "string" },
          date: { type: "string", format: "date" },
          line_items: {
            type: "array",
            items: {
              type: "object",
              properties: {
                description: { type: "string" },
                amount: { type: "number" },
              },
              required: ["description", "amount"],
            },
          },
          total: { type: "number" },
        },
        required: ["vendor", "date", "line_items", "total"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "extract_invoice" },
  messages: [{ role: "user", content: rawInvoiceText }],
});

const toolUse = response.content.find((c) => c.type === "tool_use");
const data = toolUse?.input as InvoiceData;
```
Schema validation happens server-side: the model's output conforms to the schema or the API returns an error. No `JSON.parse` try/catch ceremony.
This pattern is the right answer for: data extraction, classification with confidence scores, generating structured records, function-call style features.
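Even so, it's worth narrowing the tool input at the boundary with a runtime guard rather than a bare cast. A minimal guard for the `InvoiceData` shape the schema above implies (the interface names are assumptions introduced here):

```typescript
// The shape implied by the extract_invoice schema.
interface LineItem {
  description: string;
  amount: number;
}

export interface InvoiceData {
  vendor: string;
  date: string;
  line_items: LineItem[];
  total: number;
}

// Runtime guard: trust the server-side validation, but verify at the edge.
export function isInvoiceData(value: unknown): value is InvoiceData {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.vendor === "string" &&
    typeof v.date === "string" &&
    typeof v.total === "number" &&
    Array.isArray(v.line_items) &&
    v.line_items.every(
      (item) =>
        typeof item === "object" &&
        item !== null &&
        typeof (item as Record<string, unknown>).description === "string" &&
        typeof (item as Record<string, unknown>).amount === "number",
    )
  );
}
```

Replace the `as InvoiceData` cast with `if (isInvoiceData(toolUse?.input)) { … }` and a malformed response becomes an explicit failure instead of a latent one.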
Streaming for Perceived Latency
For chat UIs and any feature where the user sees the response as it generates, streaming is mandatory. Time-to-first-token on Sonnet is typically a few hundred milliseconds; total response time scales with output length. Streaming makes the perceived experience instant.
```typescript
const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [{ role: "user", content: userMessage }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}
```
In a Next.js / Edge runtime / SSE setup, pipe these events to the client and render them as they arrive. The first character appears in under 500ms even for long responses.
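Piping deltas to the browser usually means framing each one as a server-sent event. A sketch of the framing; the JSON payload shape is a choice of this example, not an SSE requirement:

```typescript
// Frame one text delta as a server-sent event. SSE messages are
// "data: <payload>" lines terminated by a blank line; JSON-encoding the
// payload escapes embedded newlines so they can't split the event.
export function toSSE(delta: string): string {
  return `data: ${JSON.stringify({ text: delta })}\n\n`;
}
```

In a route handler, enqueue `toSSE(event.delta.text)` for each delta into a `ReadableStream` served with `Content-Type: text/event-stream`, and the browser's `EventSource` (or a fetch reader) renders tokens as they arrive.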
Cost Discipline
Three patterns that prevent the cost surprise:
- Set `max_tokens` aggressively. Many features generate 2× more text than they need to. Force conciseness via the prompt and cap with `max_tokens`.
- Cache aggressively. Any prompt prefix used more than twice in a 5-minute window should be cached.
- Log every call's input and output token counts. Build a dashboard. The first time someone changes a prompt and doubles the cost, you want to see it that day, not in next month's bill.
```typescript
console.log("[claude]", {
  model: response.model,
  input_tokens: response.usage.input_tokens,
  output_tokens: response.usage.output_tokens,
  cache_read_tokens: response.usage.cache_read_input_tokens,
  cache_write_tokens: response.usage.cache_creation_input_tokens,
});
```
Pipe to your logging stack. Build the dashboard in week one.
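Token counts become dollars with one multiplication per bucket. A sketch; the per-million-token rates here are placeholder assumptions, so substitute the current numbers from the pricing page before relying on the output:

```typescript
// Placeholder per-million-token rates (USD) -- assumptions, not published
// pricing. The cache rates follow the ~10% read / ~125% write ratios above.
const RATES = {
  input: 3.0,
  output: 15.0,
  cacheRead: 0.3,
  cacheWrite: 3.75,
};

interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Estimate the dollar cost of one call from its usage block.
export function estimateCostUSD(u: Usage): number {
  const M = 1_000_000;
  return (
    (u.input_tokens * RATES.input +
      u.output_tokens * RATES.output +
      (u.cache_read_input_tokens ?? 0) * RATES.cacheRead +
      (u.cache_creation_input_tokens ?? 0) * RATES.cacheWrite) /
    M
  );
}
```

Sum this per request, tag it with the feature name, and the dashboard writes itself.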
When to Move From Sonnet to Opus
The rule: ship on Sonnet, measure, upgrade specific calls if quality demands it.
Don't pre-optimise for Opus. Don't assume "more capable model = better feature." Most features that work on Opus also work on Sonnet at 1/5 the cost. The path that produces the best ROI is: launch on Sonnet, instrument the outputs, identify the specific subtasks where Opus measurably beats Sonnet, route only those calls upward. Everything else stays on Sonnet.
That two-model architecture (Sonnet for the volume, Opus for the depth) is the production shape of mature Claude integrations in 2026.