Claude Code in Production: 9 Months In, What Actually Works (2026)

Claude Code shipped quietly. There was no big launch event, no benchmark theatre — just a CLI that let you point Claude Opus at a codebase and ask it to do things. At the time, the right framing for it was "another agent harness, this one from Anthropic, probably fine." Nine months later, the right framing is that it has reshaped how I write code more than any single tool since the LSP. That is not the conclusion I expected to reach at month one, and the gap between then and now is the interesting part.

The right way to read the nine-month arc is not "Claude Code got better" — the tool itself has improved, but most of the gain has come from teams figuring out how to use it. The patterns that worked in October look naive in May. The patterns that work in May would have looked like over-engineering in October. This is the practitioner's retrospective on what actually held up, what broke under scale, and what the four-layer architecture that finally makes the tool earn its keep looks like.

I have been running Claude Code daily since the original release, across product code, infra work, eval pipelines, and migrations. The patterns below come from what I have shipped — and from the audits I have done of teams that have tried and bounced off it. Both data sources point at the same handful of things.

What Held Up From Day One

Three patterns from the original launch have not needed to change.

CLAUDE.md as terse onboarding. The right framing for the project-level context file has always been "what would I tell a new senior engineer joining this codebase tomorrow." Stack, entry points, naming conventions, build/test/lint commands, the two or three load-bearing gotchas. Under 200 lines. The teams that get this right treat CLAUDE.md the way they treat a README — pragmatic, terse, no over-explaining. The teams that get it wrong write a 600-line context dump that Claude obediently reads and then ignores, because past roughly 150 instructions the model's attention degrades sharply.
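A line budget only holds if something checks it. Below is a minimal sketch of such a check, assuming the 200-line guideline above and assuming blank lines do not count toward it; the function names and the default path are illustrative, not part of Claude Code.

```python
from pathlib import Path

MAX_LINES = 200  # budget taken from the guideline above


def claude_md_report(text: str, max_lines: int = MAX_LINES) -> dict:
    """Count non-blank lines and flag the file if it exceeds the budget."""
    # Assumption: blank lines are layout, not instructions, so they are skipped.
    lines = [line for line in text.splitlines() if line.strip()]
    return {"lines": len(lines), "over_budget": len(lines) > max_lines}


def check_file(path: str = "CLAUDE.md") -> dict:
    """Convenience wrapper for a CI step; the path is a hypothetical default."""
    return claude_md_report(Path(path).read_text(encoding="utf-8"))
```

Run as a CI step, a check like this turns "keep it terse" from a norm into a failing build.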

The plan-then-execute loop. Asking Claude to draft a plan first, get human approval, then execute — that pattern shipped on day one and is still the highest-ROI workflow Claude Code offers. The economics make it obvious. The plan costs roughly 2K tokens to draft and 30 seconds of human time to review. The execution costs whatever the task costs in tokens — but only if the plan was right. If the plan was wrong and you let the model execute anyway, you pay the full task cost twice. The plan is the cheapest insurance policy in the stack.
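The insurance argument can be made concrete with a toy expected-cost model. Only the 2K-token plan cost comes from the paragraph above; the task size, the chance of a wrong approach, and the share of wrong plans a reviewer catches are illustrative assumptions, not measurements.

```python
def expected_tokens(task_tokens: float, p_wrong: float,
                    plan_tokens: float = 0.0, catch_rate: float = 0.0) -> float:
    """Expected token spend for one task under a simple two-outcome model.

    Assumptions (all hypothetical):
      - a wrong approach that reaches execution costs the task twice
        (one failed run plus the redo);
      - a wrong plan caught at review costs one extra plan draft instead.
    """
    p_caught = p_wrong * catch_rate            # wrong plan stopped at review
    p_slipped = p_wrong * (1.0 - catch_rate)   # wrong plan executed anyway
    return plan_tokens * (1.0 + p_caught) + task_tokens * (1.0 + p_slipped)


# Illustrative numbers: 50K-token task, 30% chance the first approach is wrong.
no_plan = expected_tokens(50_000, 0.3)
with_plan = expected_tokens(50_000, 0.3, plan_tokens=2_000, catch_rate=0.8)
```

With these assumed inputs the no-plan path comes out near 65,000 expected tokens and the planned path near 55,500. The gap widens as tasks get larger, because the plan cost stays roughly fixed while the cost of a wasted execution scales with the task.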

Treating Claude as a senior engineer in a chat. The mental model that produces the best output is "I am pair-programming with a senior engineer who is fast, knowledgeable, and occasionally overconfident." Not "I am operating an AI tool." The conversational register, the level of context you provide, the way you push back on suggestions — they all calibrate to that mental model. Teams that come to Claude Code from a junior-engineer mental model under-invest in context; teams that come from a tool-operator mental model over-specify and constrain the work. The senior-engineer-in-chat framing is the one that produces the cleanest output.

What Broke Within Three Months

Three patterns from the early days did not survive contact with real production work.

The one-giant-CLAUDE.md-with-everything pattern. Early teams tried to encode every project convention, every API boundary, every team norm into a single CLAUDE.md file. By month three, those files had grown to 800–1,200 lines, and Claude's adherence had collapsed. The fix — which arrived as Skills around the same time — is to keep CLAUDE.md to the always-load context (stack, conventions, commands) and move the load-on-demand context (specific subsystems, infrequent workflows, vendor integrations) into Skills.

Trying to do everything in one session. Early Claude Code sessions tried to carry forty-step workflows in a single conversation: read the ticket, write the code, run the tests, fix the failures, write the PR description, address review comments. By the end of those sessions the context window was a mess, the model was making decisions based on stale state from twenty steps earlier, and the wall-clock was atrocious. The fix is the multi-session pattern — one session per coherent unit of work, with the memory tool carrying continuity across them when needed.

Manual permission management. The early permission model required you to approve each shell command, each file write, each tool call. That looked safe at first; in practice it produced approval fatigue, and people started rubber-stamping permission requests. The fix is the hook layer — deterministic policy expressed in code, not in a human's tired finger. Hooks landed in mid-2025 and changed the security model entirely.

The pattern across all three is the same: the early naive use was tolerable at small scale, broke at production scale, and got rescued by tooling that landed three to six months in. The version of Claude Code that exists now is one that has been hardened against the failure modes of the version that existed at launch.

The Four-Layer Stack That Actually Holds Up

The architecture that has converged across the production teams I have worked with is a four-layer stack. The layers are not equally important — they form a hierarchy from "always loaded" to "loaded on demand" to "deterministic enforcement" to "delegated specialists."

Layer 1 — CLAUDE.md. Always loaded into every session. Terse, action-oriented, under 200 lines. The job of CLAUDE.md is to onboard the model to the codebase: what stack, where entry points live, what conventions to follow, what commands to run. It is not a place for nuanced policy or subsystem details. Treat it as documentation written for an engineer's first day, not their first year.

Layer 2 — Skills. Markdown guides in .claude/skills/, loaded on demand. The job of skills is to carry the load-on-demand context that does not earn a slot in CLAUDE.md. Common shapes: "how to add a new API endpoint in this codebase," "how the eval harness works," "how to invalidate the production cache safely." Each skill is a focused document that loads when relevant. The discipline is to keep them single-purpose; a skill that tries to cover three concerns is a skill that gets called for the wrong reasons. The same framing I have written about in the five levels of using Claude applies here — the third and fourth levels of capability live in the skills layer.

Layer 3 — Hooks. Deterministic callbacks in .claude/settings.json. Where CLAUDE.md is advisory and skills are documentation, hooks are enforcement. They run regardless of what the model decides; they are how you encode policy that cannot be left to model discretion. The three hook patterns that have earned their slot in production are PreToolUse (route a class of operations to a different model or a different approval flow), PostToolUse (auto-format, auto-test, or notify after each edit), and Stop (verify the work meets a checklist before the model says it is done).
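To make "enforcement, not advice" concrete, here is a minimal sketch of the decision logic a PreToolUse hook script might carry. It assumes the hook receives a JSON event with `tool_name` and `tool_input.command` fields and that exit code 2 blocks the call and returns the message to the model; verify both against the current hooks reference before relying on them. The deny-list and function names are hypothetical.

```python
import json
import re
import sys

# Hypothetical deny-list; tune the patterns to your own blast-radius rules.
RISKY_PATTERNS = [
    # rm with both a recursive and a force flag in one cluster (-rf, -fr, -Rf)
    (re.compile(r"\brm\s+-(?=\w*[rR])(?=\w*f)\w+"), "recursive force delete"),
    # also matches --force-with-lease, which is deliberately conservative
    (re.compile(r"\bgit\s+push\b.*\s(--force\b|-f\b)"), "force push"),
    (re.compile(r"\bterraform\s+apply\b"), "terraform apply"),
]


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message); 2 is assumed to mean 'block this call'."""
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    for pattern, label in RISKY_PATTERNS:
        if pattern.search(command):
            return 2, f"Blocked by policy ({label}): {command}"
    return 0, ""


def run(stdin_text: str) -> int:
    """Parse one hook event from text and report the decision on stderr."""
    code, message = decide(json.loads(stdin_text))
    if message:
        print(message, file=sys.stderr)
    return code
```

A wrapper script would end with `sys.exit(run(sys.stdin.read()))` and be registered as the PreToolUse command in .claude/settings.json. Pattern matching on command strings is easy to evade (`rm -r -f` slips past this list), so treat it as a first line of defence rather than a sandbox.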

Layer 4 — Sub-agents. Specialist agents in .claude/agents/, each with its own context, prompt, and tool permissions. The main Claude Code session coordinates; the sub-agents do bounded work — code review, test running, security check, frontend QA. Sub-agents are the highest-power layer and the most over-used. Most teams reach for sub-agents before they have exhausted the other three layers; the result is a tangled multi-agent system where a single careful CLAUDE.md + one well-designed skill would have done the same work.

The right discipline is to build the stack from the bottom up — CLAUDE.md first, skills when the context starts overflowing, hooks when policy needs enforcement, sub-agents only when the work is genuinely parallel and bounded. The teams that invert this order, starting with sub-agents and working downward, tend to ship complexity faster than capability.

The Memory Tool Changed the Pattern

Anthropic shipped the memory tool in March 2026, and its impact has been larger than the launch framing suggested.

Before memory, Claude Code sessions were stateless across invocations. Every session re-onboarded from CLAUDE.md and whatever skills the model chose to load. State that mattered across sessions ("we tried that approach last week and it broke," "the deploy to staging needs three approvals, not two," "this customer's data is sensitive and stays in EU regions only") had to be re-discovered each time, or encoded into CLAUDE.md, where it bloated the file.

With memory, the model writes durable notes to a memory directory and reads them on subsequent sessions. The shape that works in production: keep CLAUDE.md as the static onboarding doc, let the memory tool capture the evolving project state. Decisions get recorded as they happen; the next session starts already knowing what the previous one figured out. The session-to-session continuity that used to require an aggressive context-management discipline now happens on rails.

Two things to internalise about the memory tool. First, it is not a free-write substitute for proper engineering documentation. Notes that should live in the codebase, in ADRs, or in the team wiki should not be hidden in a memory file the model writes for itself. Second, the memory tool is privileged storage. A misconfigured memory layer that captures customer data or production secrets is a leak waiting to happen. The same caution that applies to logging applies here, more loudly than usual.

Background Agents Are Not a Toy

The claude --bg flag and the agent view shipped over the course of 2026 and turned Claude Code from an interactive tool into something more like a background job runner. The shift is bigger than the documentation makes it sound.

The use cases that earn the pattern: long-running migrations (refactor 200 call sites of a deprecated API over four hours), scheduled refactors (run the eval suite nightly and propose fixes for regressions), batched analysis work (generate a summary of every PR merged this week, ranked by risk). These are all workloads where the wall-clock is measured in hours, the human does not need to babysit, and the output is reviewed asynchronously.

The discipline that makes background agents work in production is writing for an agent that runs without supervision. The prompt has to be specific enough that the agent cannot drift; the success criteria have to be machine-checkable so the agent knows when it is done; the failure mode has to be "halt and surface the error" rather than "keep trying with progressively worse approaches." The same discipline I have written about for building production agents with the Claude Agent SDK applies here — the unattended-agent design constraints are the same constraints, just dressed differently for the CLI surface.
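The two constraints that matter most here, a machine-checkable success test and a halt-and-surface failure mode, fit in a small driver loop. This is a sketch of the control flow only; `step` and `is_done` are hypothetical callables standing in for "run one agent pass" and "check the success criteria", and nothing here is Claude Code API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    status: str    # "done" or "halted"
    attempts: int
    detail: str


def run_unattended(step: Callable[[], None],
                   is_done: Callable[[], bool],
                   max_attempts: int = 3) -> Outcome:
    """Run a step until the success check passes or the attempt budget ends."""
    for attempt in range(1, max_attempts + 1):
        try:
            step()
        except Exception as exc:
            # Halt and surface: no fallback strategy, no silent retry.
            return Outcome("halted", attempt, f"{type(exc).__name__}: {exc}")
        if is_done():
            return Outcome("done", attempt, "success criteria met")
    return Outcome("halted", max_attempts,
                   "attempt budget exhausted before criteria were met")
```

The hard attempt cap is the design choice that matters: an unattended run that cannot meet its criteria should stop with a clear record rather than keep trying progressively worse approaches.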

The interesting pattern emerging is scheduled coding routines — Claude Code sessions that run on cron, write to a memory file, and progressively improve a codebase over weeks. Dependency updates, dead-code removal, test-coverage improvements, documentation generation. The work is the kind that humans defer indefinitely; the agent does it on a schedule, and the cumulative effect over a quarter is material.

The Hook Patterns That Actually Earn Their Slot

Hooks are the most under-appreciated layer of the stack. Most teams ship Claude Code without hooks at all and discover six months later that they have re-built the same policy logic three times in three different sessions. The hooks that earn their slot in every production setup I have audited:

PreToolUse for risky-operation routing. When Claude wants to run rm -rf, git push --force, terraform apply, or anything else with blast radius, the PreToolUse hook intercepts the call and either denies it outright or routes it to a stricter approval flow (a more careful model, a human review, a logged-and-audited path). The model is not asked to police itself; the hook does.
PostToolUse for invariant enforcement. After every file write, run the formatter. After every test write, run the test. After every deploy, smoke-check the endpoint. The hook makes the invariant cheap to enforce; the model cannot forget.
Stop hooks for completion verification. Before the model declares the task done, the Stop hook checks: did the tests pass, does the build succeed, are there no uncommitted changes. The model's self-assessment is unreliable; the hook is not. The Stop hook is the difference between "the model said it shipped" and "the work is genuinely shipped."
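The Stop check described above can be sketched as a short verification script. The checklist commands are placeholders for whatever the project uses. The sketch assumes that exit code 2 from a Stop hook prevents the model from finishing and returns the message to it, and that the event carries a `stop_hook_active` flag when a previous Stop hook has already forced a continuation; confirm both against the current hooks reference.

```python
import subprocess

# Hypothetical checklist: substitute the commands your project actually uses.
CHECKS = [
    ("tests", ["pytest", "-q"]),
    ("build", ["make", "build"]),
]


def run_check(argv):
    """Run one command; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return False, f"command not found: {argv[0]}"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def verify(event, checks=CHECKS, run=run_check):
    """Return (exit_code, message) for a Stop event."""
    # Assumed field: letting the stop through on the second pass avoids an
    # endless block-and-retry loop.
    if event.get("stop_hook_active"):
        return 0, ""
    failures = [name for name, argv in checks if not run(argv)[0]]
    # A failed git call is treated the same as a dirty tree.
    ok, output = run(["git", "status", "--porcelain"])
    if not ok or output:
        failures.append("working tree not clean")
    if failures:
        return 2, "Not done yet: " + ", ".join(failures)
    return 0, ""
```

Passing `run` as a parameter keeps the policy testable without shelling out, which matters if this file gets the same review rigor as a CI workflow.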

The discipline is to treat the hook layer as security infrastructure, not as developer convenience. The same review rigor you apply to a CI workflow or a Terraform module applies to .claude/settings.json. A misconfigured hook is a production incident.

Sub-agents: When the Pattern Earns Its Complexity

Sub-agents are powerful and over-prescribed. The shape that works in production is restrained — one or two specialist agents that handle bounded work the main session would otherwise have to load context for. The shapes that fail are the ones that look impressive in a demo: five-agent teams with elaborate coordination protocols, recursive agent spawning, agents calling agents in complex DAGs.

The sub-agent patterns that have held up across the audits I have run:

Code review. A sub-agent that reads a diff and surfaces issues — duplication, style violations, missing test coverage, security concerns — without polluting the main session's context with the codebase load.
Test running and failure analysis. A sub-agent that runs the suite, captures failures, and produces a focused summary the main session can act on.
Documentation generation. A sub-agent that reads code and produces or updates the docs without keeping the entire codebase in the main session's context.

Across roughly forty production Claude Code stacks I have audited, two sub-agents is the median, three is common, and more than five is almost always a mistake. Past five, the coordination overhead exceeds the context-isolation benefit. The same routing-layer discipline I have written about for the Composer 2.5 vs Claude Opus 4.7 head-to-head applies inside Claude Code: the interesting engineering work is the routing layer, not the count of specialists.

What's Still Hard at Nine Months

The honest counter-case. Claude Code is the best tool in the category, and there are still real open problems.

Cross-repo reasoning. The implicit-contract problem I wrote about in when not to use Composer 2.5 applies inside Claude Code too. The agent reasons cleanly within the current repo; reasoning that crosses repository boundaries is patchy. The workaround (load the consumer repos into context, or use Opus with the 1M-token context window to load everything at once) works but is expensive.

Latency on long sessions. Sessions that have been open for several hours, with substantial conversation history, become noticeably slower per response. Some of this is unavoidable context-management overhead; some of it is recoverable through better session hygiene. Either way, the long-session UX is the rough edge that still hurts.

Cost predictability. A session that takes a wrong turn early can rack up a token bill that the user does not see coming. The cost-observability work in the Composer 2.5 cost-engineering playbook applies here directly, and most teams have not done it yet.
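The simplest form of cost observability is a running budget with a hard ceiling. The sketch below assumes per-request token counts are available from whatever usage reporting the setup exposes; the class, its field names, and the prices in the usage example are placeholders, not real rates.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit_usd: float
    usd_per_mtok_in: float    # assumed price per million input tokens
    usd_per_mtok_out: float   # assumed price per million output tokens
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one request's usage and return the running total in USD."""
        cost = (input_tokens * self.usd_per_mtok_in
                + output_tokens * self.usd_per_mtok_out) / 1_000_000
        self.spent_usd += cost
        return self.spent_usd

    def exceeded(self) -> bool:
        """True once cumulative spend has passed the ceiling."""
        return self.spent_usd > self.limit_usd


# Placeholder prices and limit, for illustration only.
budget = TokenBudget(limit_usd=1.0, usd_per_mtok_in=10.0, usd_per_mtok_out=50.0)
budget.record(input_tokens=50_000, output_tokens=5_000)
```

Checking `exceeded()` after each request and halting the session when it trips is enough to turn a surprise bill into a visible stop.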

The seam between memory and security. Memory is powerful and dangerous in the same breath. The discipline of what gets persisted and what does not has not yet been worked out at the industry level; expect production incidents over the next six months as teams figure out the seam by trial and error.

The Practitioner's Take

The honest summary on Claude Code at nine months is that it has crossed the threshold from "interesting tool" to "production infrastructure." The teams that have built the four-layer stack — CLAUDE.md, skills, hooks, sub-agents — are shipping code at a materially different rate than the teams that are still using it as a fancy chat interface. The delta is not small, and it is not subjective; it shows up in PR throughput, in cycle time, in the share of work that ships without a human in the inner loop.

The interesting engineering work is the same as it has been across every wave of dev tooling — the architecture around the tool, not the tool itself. Claude Code is the LSP equivalent for this generation of dev tooling: a substrate that becomes more powerful the better the surrounding infrastructure is. The teams that invest in the surrounding infrastructure capture the value; the teams that wait for the tool to be impressive on its own watch the others compound.

The boundary will move again in the next nine months — newer models, newer features, newer patterns. The architecture that holds up is the one that treats Claude Code as infrastructure and builds the rest of the stack to match. The teams that ship the stack capture the saving; the teams that ship the tool by itself spend the same money and get less out.