
Building Production Agents with Claude Opus and Tool Use

Tool use patterns, agent loops, error recovery, and the architectural decisions that make Claude-powered agents reliable in production.

Domain: Software
Format: tutorial
Published: 16 Sept 2025
Tags: claude · claude-opus · tool-use

The gap between "an LLM with tool use" and "a production agent that does real work" is wider than the demos suggest. The model can call your tools, but making it do so reliably, recovering when tools fail, knowing when to stop, and shipping outputs your users can trust is a body of engineering that doesn't show up in the API documentation.

This article is the production playbook for Claude-powered agents.

The Agent Loop, Properly

The minimum viable agent loop, in TypeScript:

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

// A tool pairs its API-facing definition with a local execute function.
type Tool = {
  name: string;
  description: string;
  input_schema: Anthropic.Tool["input_schema"];
  execute: (input: any) => Promise<unknown>;
};

async function runAgent(
  tools: Tool[],
  systemPrompt: string,
  userMessage: string,
  maxIterations = 10,
) {
  // Strip execute before sending the definitions to the API; keep a map for dispatch.
  const toolDefs = tools.map(({ execute, ...rest }) => rest);
  const toolMap = new Map(tools.map((t) => [t.name, t]));

  let messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];

  for (let i = 0; i < maxIterations; i++) {
    const response = await client.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      system: systemPrompt,
      tools: toolDefs,
      messages,
    });

    // Append the assistant turn, including any tool_use blocks, before answering them.
    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      const text = response.content.find((c) => c.type === "text");
      return text?.type === "text" ? text.text : "";
    }

    if (response.stop_reason === "tool_use") {
      const toolUses = response.content.filter(
        (c): c is Anthropic.ToolUseBlock => c.type === "tool_use",
      );
      // Run all requested tool calls in parallel and report each result (or error) back.
      const results = await Promise.all(
        toolUses.map(async (tu) => {
          try {
            const tool = toolMap.get(tu.name);
            if (!tool) throw new Error(`Unknown tool: ${tu.name}`);
            const output = await tool.execute(tu.input);
            return {
              type: "tool_result" as const,
              tool_use_id: tu.id,
              content: JSON.stringify(output),
            };
          } catch (err) {
            return {
              type: "tool_result" as const,
              tool_use_id: tu.id,
              content: `Error: ${(err as Error).message}`,
              is_error: true,
            };
          }
        }),
      );
      messages.push({ role: "user", content: results });
      continue;
    }

    throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
  }

  throw new Error(`Agent did not complete within ${maxIterations} iterations`);
}

This is the foundation. Everything else is hardening it.

Tool Design: The Single Highest-Leverage Decision

The model is only as good as the tools you give it. Three principles:

1. One tool, one job. A tool that "lists or searches or filters or counts depending on parameters" confuses the model. Split it into list_orders, search_orders, count_orders. Each tool's name and description should leave no ambiguity about when to use it.

2. Descriptions written for the model, not your team. The description is part of the prompt. Spend real effort here.

{
  name: "search_orders",
  description: `Search past orders by customer email, date range, or order ID.
  Use this when the user asks about a specific past order or wants to find orders matching criteria.
  Returns up to 50 matching orders with id, date, total, and status.
  If you need full order details, follow up with get_order_details using the id.`,
  input_schema: { /* ... */ },
}

3. Make destructive operations explicit. If a tool can do something destructive, name it that way (delete_account, not update_account_status) and include a guard in the description: "This action is irreversible. Confirm with the user before calling."
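
A minimal sketch of such a tool definition; the tool names come from the example above, but the description wording is illustrative, not a fixed schema:

{
  name: "delete_account",
  description: `Permanently delete a customer account and all associated data.
  This action is irreversible. Confirm with the user before calling.
  For reversible changes such as suspension, use update_account_status instead.`,
  input_schema: { /* ... */ },
}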

Error Handling: Pass the Error Back to the Model

The single biggest mistake teams make: catching tool errors and silently retrying or returning fake success. The right pattern is to surface the error back to the model and let it decide what to do.

{
  type: "tool_result",
  tool_use_id: tu.id,
  content: `Error: Database query failed - connection timeout after 5s.
  The orders database may be temporarily unavailable.`,
  is_error: true,
}

Opus handles errors well when given full information. It will:

  • Retry with adjusted parameters when the error suggests its input was wrong
  • Try a different tool when one fails
  • Stop and ask the user when it can't recover

Hide the error and you get hallucinated success. Surface it and you get genuine recovery.
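
In practice it helps to centralise that formatting. The helper below is a sketch; the function name and the hint wording are illustrative, not part of the SDK:

function toToolErrorResult(toolUseId: string, err: unknown) {
  const message = err instanceof Error ? err.message : String(err);
  return {
    type: "tool_result" as const,
    tool_use_id: toolUseId,
    // Give the model the real failure plus a hint about its options.
    content: `Error: ${message}. If your input looks wrong, adjust it and retry; otherwise tell the user what failed.`,
    is_error: true,
  };
}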

Stopping Conditions

The agent should know when to stop. Three signals:

  1. stop_reason === "end_turn": the model decided it has the answer. Honour it.
  2. Hard iteration cap. A bug or genuinely impossible task should not cost $50 in API calls. Cap iterations (typically 5-15 for production tasks).
  3. Cost cap. Track tokens consumed across the loop. If it exceeds a budget, terminate and return partial output.

let totalInputTokens = 0;
let totalOutputTokens = 0;
const MAX_INPUT = 100_000;

for (let i = 0; i < maxIterations; i++) {
  if (totalInputTokens > MAX_INPUT) {
    return { partial: true, message: "Budget exceeded" };
  }
  const response = await client.messages.create({ /* ... */ });
  totalInputTokens += response.usage.input_tokens;
  totalOutputTokens += response.usage.output_tokens;
  // ...
}

Caching the Tool Definitions

Tool definitions are stable across requests within an agent type. Cache them:

// toolDefs as built earlier; mark the last definition as the cache breakpoint.
const allButLast = toolDefs.slice(0, -1);
const lastTool = toolDefs[toolDefs.length - 1];

const tools = [
  ...allButLast,
  { ...lastTool, cache_control: { type: "ephemeral" as const } },
];

The cached prefix includes all tool definitions up to and including the marked tool. For an agent with 12 tools and ~5K tokens of definitions, that avoids re-processing those tokens at full price on every iteration after the first; cached reads are billed at a fraction of the base input rate.

Streaming for Long-Running Agents

Long agent runs feel broken if the user sees no output for 30 seconds. Stream the assistant's thinking:

// The streamed turn, wrapped as an async generator the UI layer can consume.
async function* streamAssistantTurn(
  systemPrompt: string,
  toolDefs: Anthropic.Tool[],
  messages: Anthropic.MessageParam[],
) {
  const stream = await client.messages.stream({
    model: "claude-opus-4-6",
    max_tokens: 4096,
    system: systemPrompt,
    tools: toolDefs,
    messages,
  });

  for await (const event of stream) {
    // Forward assistant text as it is generated.
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield { type: "assistant_text", text: event.delta.text } as const;
    }
    // Announce each tool call as soon as its block starts.
    if (event.type === "content_block_start" && event.content_block.type === "tool_use") {
      yield { type: "tool_call_start", name: event.content_block.name } as const;
    }
  }
}

Push these events to the UI as they arrive. Show "Calling search_orders..." when a tool call starts. The agent feels alive instead of frozen.
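
A sketch of the consuming side, assuming the streamAssistantTurn generator above and a hypothetical ui object on the front end:

for await (const ev of streamAssistantTurn(systemPrompt, toolDefs, messages)) {
  if (ev.type === "assistant_text") ui.appendText(ev.text);
  if (ev.type === "tool_call_start") ui.showStatus(`Calling ${ev.name}...`);
}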

Multi-Agent vs Single-Agent

The instinct when an agent gets complex: split it into a planner agent + executor agent + critic agent. This is sometimes the right answer and often the wrong one.

The right answer: split when the agents have genuinely different skills, contexts, or permissions. A research agent that can browse the web vs an analysis agent that can't is a real split. A "planner" and "executor" with the same tools and context is just adding latency.

The wrong answer: split for the sake of architectural cleanliness. A single Opus call with a well-designed prompt and tool set outperforms three coordinated calls in most cases, at lower cost and lower latency.

Observability: The Production Difference

A production agent that you can't observe is a production agent that fails silently. Log:

logger.info("agent.iteration", {
  agent_id: runId,
  iteration: i,
  stop_reason: response.stop_reason,
  input_tokens: response.usage.input_tokens,
  output_tokens: response.usage.output_tokens,
  tool_calls: toolUses.map((t) => t.name),
});

Build a dashboard with: average iterations per task, tool call distribution, error rate per tool, p99 total cost per run. The first time the average iterations jumps from 3 to 8, you want to see it that day.
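
A sketch of how those numbers might be computed from the logs, assuming each iteration record has been enriched with a run id, the tools it called, and a computed dollar cost (hypothetical field names):

type IterationLog = { run_id: string; tool_calls: string[]; cost_usd: number };

function dashboardMetrics(logs: IterationLog[]) {
  const perRun = new Map<string, { iterations: number; cost: number }>();
  const toolCounts = new Map<string, number>();

  for (const log of logs) {
    const run = perRun.get(log.run_id) ?? { iterations: 0, cost: 0 };
    run.iterations += 1;
    run.cost += log.cost_usd;
    perRun.set(log.run_id, run);
    for (const name of log.tool_calls) {
      toolCounts.set(name, (toolCounts.get(name) ?? 0) + 1);
    }
  }

  const runs = [...perRun.values()];
  const costs = runs.map((r) => r.cost).sort((a, b) => a - b);
  return {
    avgIterationsPerRun: runs.reduce((s, r) => s + r.iterations, 0) / Math.max(runs.length, 1),
    toolCallDistribution: toolCounts,
    p99CostPerRun: costs[Math.max(Math.ceil(costs.length * 0.99) - 1, 0)] ?? 0,
  };
}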

The Production Checklist

Before any agent ships:

  1. Iteration cap and cost cap enforced
  2. Tool errors surfaced (not swallowed)
  3. Tool definitions cached
  4. Streaming enabled if user-facing
  5. Observability: per-iteration logging, tool call metrics, cost tracking
  6. Tool descriptions reviewed for clarity (read them as if you were the model)
  7. Destructive tools require explicit confirmation language in the prompt
  8. Manual review of 50 real runs to catch behaviours that aren't covered by automated tests

Hit those, and you have an agent that's safe to put in front of real users. Skip them, and you have a demo.