Claude Sonnet 4.6 is the model most production AI features should be built on. It's the workhorse of the Claude 4 family: strong enough to handle complex reasoning, fast enough to drive real-time features, and priced for the volume that production usage actually generates. Opus is what you reach for when Sonnet isn't enough; Sonnet is what you ship.
This article is the production playbook: when to choose it, how to prompt it, and the cost discipline that keeps your bill from outpacing your revenue.
When Sonnet Is the Right Choice
The decision matrix I use in code review:
- High-volume, latency-sensitive features (chat, search, autocomplete, summarisation). Sonnet.
- Standard analysis and extraction (parsing structured data from documents, classifying tickets, drafting responses). Sonnet.
- Multi-step workflows where each step is bounded (a customer support agent that follows a defined playbook). Sonnet.
- Anything that runs millions of times a day: Sonnet, almost certainly.
Reach for Opus only when:
- The reasoning depth genuinely matters (legal analysis, deep code review, research synthesis)
- The task is rare enough that the per-call cost is irrelevant
- You've measured Sonnet's output and it's not good enough on a specific subtask
Most production AI features end up using Sonnet for 90%+ of calls and Opus for the small remainder that genuinely needs it. A two-model architecture is normal.
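One way to make that split concrete is a small routing helper. A sketch: the task taxonomy and the Opus escalation list are illustrative placeholders, and the Opus model id here is an assumption you should replace with whatever your account targets.

```typescript
// Hypothetical task taxonomy; replace with your own feature names.
type TaskKind =
  | "chat"
  | "summarise"
  | "extract"
  | "classify"
  | "legal_analysis"
  | "deep_code_review";

// Only list tasks where Opus has *measurably* beaten Sonnet in your evals.
const OPUS_TASKS: ReadonlySet<TaskKind> = new Set([
  "legal_analysis",
  "deep_code_review",
]);

const SONNET = "claude-sonnet-4-6";
const OPUS = "claude-opus-4-1"; // placeholder id; substitute your Opus model

// Route the rare reasoning-heavy tasks up; everything else stays on Sonnet.
export function pickModel(task: TaskKind): string {
  return OPUS_TASKS.has(task) ? OPUS : SONNET;
}
```

The point is that the routing decision lives in one place, so "upgrade this one subtask to Opus" is a one-line change backed by measurement, not a rewrite.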
The Anthropic SDK in Production
The minimum viable production setup, in TypeScript:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  maxRetries: 3,
  timeout: 30_000,
});

export async function summarise(text: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    system: "You produce concise, factual summaries. Never speculate beyond the source.",
    messages: [{ role: "user", content: `Summarise:\n\n${text}` }],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
```
Three production defaults to set on day one:
- `max_tokens` always specified. Letting it default invites surprise costs. Pick a value tied to your use case (256 for short answers, 2048 for analysis, 8192 for long-form).
- The `system` prompt is where the constraints live. "Never speculate", "Always reply in JSON", "Output in en-GB". Put them in `system`, not `user`.
- Retries with exponential backoff. The SDK handles this; just configure `maxRetries`.
Prompt Caching: The Single Biggest Cost Lever
If you remember one thing from this article: use prompt caching for any prompt with a stable prefix you'll reuse. It cuts the cost of the cached portion by ~90% and reduces latency meaningfully on cache hits.
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt, // 4,000 tokens of guidance, examples
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});
```
The system prompt is cached for about five minutes, and each cache hit refreshes that window. Every subsequent call within the window pays ~10% of the original cost for the cached portion, plus the full price for the uncached user message.
For a chatbot getting 200 requests per minute, this is the difference between a $4,000/month bill and a $600/month bill. For a RAG application with a 10K-token system prompt of retrieved context per request, caching the retrieved context (when it doesn't change between turns) is worth thousands.
The breakpoints to know:
- Cache writes cost ~25% more than the equivalent non-cached input.
- Cache reads cost ~10% of input pricing.
- Break-even: caching pays off after ~2 reads per write.
- For high-traffic endpoints, you'll easily get 50-500 reads per cache window; the savings compound.
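The break-even arithmetic is worth sanity-checking against your own traffic. A sketch that models one cache window as one write request plus some number of reads, using the ~25% write premium and ~10% read rate quoted above (ratios only, not dollar pricing):

```typescript
// Cost of a cache window (1 write request + `reads` read requests) for the
// cached prefix, relative to serving every request uncached.
// Ratios mirror the bullets above: a write costs ~1.25x a normal input
// token, a read ~0.10x. A return value below 1 means caching is cheaper.
export function cachedCostRatio(reads: number): number {
  const uncached = 1 + reads; // every request pays full input price
  const cached = 1.25 + reads * 0.1; // one premium write, then cheap reads
  return cached / uncached;
}
```

At zero reads you eat the 25% write premium for nothing; by a couple of reads per write the ratio has already dropped below half, and at 50-500 reads it approaches the floor of ~0.10.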
Structured Outputs: Tool Use Pattern
For features that need reliable JSON, the tool use pattern beats "please respond in JSON" prompting by a wide margin:
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [
    {
      name: "extract_invoice",
      description: "Extract structured invoice data from text.",
      input_schema: {
        type: "object",
        properties: {
          vendor: { type: "string" },
          date: { type: "string", format: "date" },
          line_items: {
            type: "array",
            items: {
              type: "object",
              properties: {
                description: { type: "string" },
                amount: { type: "number" },
              },
              required: ["description", "amount"],
            },
          },
          total: { type: "number" },
        },
        required: ["vendor", "date", "line_items", "total"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "extract_invoice" },
  messages: [{ role: "user", content: rawInvoiceText }],
});

const toolUse = response.content.find((c) => c.type === "tool_use");
const data = toolUse?.input as InvoiceData;
```
Schema validation happens server-side: the model's output conforms to the schema or the API returns an error. No `JSON.parse` try/catch ceremony.
This pattern is the right answer for: data extraction, classification with confidence scores, generating structured records, function-call style features.
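Even so, it's worth narrowing the tool input at the boundary with a runtime guard rather than a bare cast. A minimal guard for the `InvoiceData` shape the schema above implies (the interface names are assumptions introduced here):

```typescript
// The shape implied by the extract_invoice schema.
interface LineItem {
  description: string;
  amount: number;
}

export interface InvoiceData {
  vendor: string;
  date: string;
  line_items: LineItem[];
  total: number;
}

// Runtime guard: trust the server-side validation, but verify at the edge.
export function isInvoiceData(value: unknown): value is InvoiceData {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.vendor === "string" &&
    typeof v.date === "string" &&
    typeof v.total === "number" &&
    Array.isArray(v.line_items) &&
    v.line_items.every(
      (item) =>
        typeof item === "object" &&
        item !== null &&
        typeof (item as Record<string, unknown>).description === "string" &&
        typeof (item as Record<string, unknown>).amount === "number",
    )
  );
}
```

Replace the `as InvoiceData` cast with `if (isInvoiceData(toolUse?.input)) { … }` and a malformed response becomes an explicit failure instead of a latent one.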
Streaming for Perceived Latency
For chat UIs and any feature where the user sees the response as it generates, streaming is mandatory. Time-to-first-token on Sonnet is typically a few hundred milliseconds; total response time scales with output length. Streaming makes the perceived experience instant.
```typescript
const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [{ role: "user", content: userMessage }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}
```
In a Next.js / Edge runtime / SSE setup, pipe these events to the client and render them as they arrive. The first character appears in under 500ms even for long responses.
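Piping deltas to the browser usually means framing each one as a server-sent event. A sketch of the framing; the JSON payload shape is a choice of this example, not an SSE requirement:

```typescript
// Frame one text delta as a server-sent event. SSE messages are
// "data: <payload>" lines terminated by a blank line; JSON-encoding the
// payload escapes embedded newlines so they can't split the event.
export function toSSE(delta: string): string {
  return `data: ${JSON.stringify({ text: delta })}\n\n`;
}
```

In a route handler, enqueue `toSSE(event.delta.text)` for each delta into a `ReadableStream` served with `Content-Type: text/event-stream`, and the browser's `EventSource` (or a fetch reader) renders tokens as they arrive.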
Cost Discipline
Three patterns that prevent the cost surprise:
- Set `max_tokens` aggressively. Many features generate 2× more text than they need to. Force conciseness via the prompt and cap with `max_tokens`.
- Cache aggressively. Any prompt prefix used more than twice in a 5-minute window should be cached.
- Log every call's input and output token counts. Build a dashboard. The first time someone changes a prompt and doubles the cost, you want to see it that day, not in next month's bill.
```typescript
console.log("[claude]", {
  model: response.model,
  input_tokens: response.usage.input_tokens,
  output_tokens: response.usage.output_tokens,
  cache_read_tokens: response.usage.cache_read_input_tokens,
  cache_write_tokens: response.usage.cache_creation_input_tokens,
});
```
Pipe to your logging stack. Build the dashboard in week one.
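Token counts become dollars with one multiplication per bucket. A sketch; the per-million-token rates here are placeholder assumptions, so substitute the current numbers from the pricing page before relying on the output:

```typescript
// Placeholder per-million-token rates (USD) -- assumptions, not published
// pricing. The cache rates follow the ~10% read / ~125% write ratios above.
const RATES = {
  input: 3.0,
  output: 15.0,
  cacheRead: 0.3,
  cacheWrite: 3.75,
};

interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Estimate the dollar cost of one call from its usage block.
export function estimateCostUSD(u: Usage): number {
  const M = 1_000_000;
  return (
    (u.input_tokens * RATES.input +
      u.output_tokens * RATES.output +
      (u.cache_read_input_tokens ?? 0) * RATES.cacheRead +
      (u.cache_creation_input_tokens ?? 0) * RATES.cacheWrite) /
    M
  );
}
```

Sum this per request, tag it with the feature name, and the dashboard writes itself.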
When to Move From Sonnet to Opus
The rule: ship on Sonnet, measure, upgrade specific calls if quality demands it.
Don't pre-optimise for Opus. Don't assume "more capable model = better feature." Most features that work on Opus also work on Sonnet at 1/5 the cost. The path that produces the best ROI is: launch on Sonnet, instrument the outputs, identify the specific subtasks where Opus measurably beats Sonnet, route only those calls upward. Everything else stays on Sonnet.
That two-model architecture (Sonnet for the volume, Opus for the depth) is the production shape of mature Claude integrations in 2026.