
Claude Opus 4.7 1M Context Window: Patterns and Pitfalls

Production patterns for the 1M-token context window - codebase loading, multi-document synthesis, recall placement, and what still doesn't work.

Domain: Software
Format: tutorial
Published: 22 Apr 2026
Tags: claude · claude-opus · long-context

The 1M-token context window in Claude Opus 4.7 is a genuine capability shift, not a marketing increment. But "you can fit it" and "you should fit it" are different questions, and the production patterns for long context are non-obvious. This article walks through what works, what doesn't, and the architecture decisions to make once.

What 1M Tokens Actually Holds

Concrete numbers, because the abstract size is hard to reason about:

  • Code: ~500 typical files of TypeScript, or ~100 files of dense Java with comments and imports. A complete React + Node.js app with 80K LOC fits comfortably.
  • Documents: ~700-800 pages of typical prose. A long technical book, a year of meeting notes, a quarter of customer conversations.
  • Mixed: ~50K lines of code plus the project's documentation plus the last quarter's tickets - all in one prompt, no chunking.

This is enough to fundamentally change how you build.

Pattern 1: Codebase-Aware Code Assistance

Instead of using embeddings to retrieve relevant snippets per query, load the whole project once.

import Anthropic from "@anthropic-ai/sdk";
import { readdirSync, readFileSync, statSync } from "fs";
import { join } from "path";

function loadProject(root: string, ignore: RegExp[] = []): string {
  const files: { path: string; content: string }[] = [];
  walk(root, files, ignore);
  return files
    .map((f) => `<file path="${f.path}">\n${f.content}\n</file>`)
    .join("\n\n");
}

function walk(dir: string, out: { path: string; content: string }[], ignore: RegExp[]) {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (ignore.some((r) => r.test(path))) continue;
    if (statSync(path).isDirectory()) walk(path, out, ignore);
    else out.push({ path, content: readFileSync(path, "utf-8") });
  }
}

const codebase = loadProject("./src", [
  /node_modules/, /\.next/, /dist/, /build/, /\.git/,
]);

const client = new Anthropic();  // reads ANTHROPIC_API_KEY from the environment
const question = "...";          // the user's free-text question about the codebase

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 4096,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: codebase,
          cache_control: { type: "ephemeral" },
        },
        {
          type: "text",
          text: `${question}\n\nWhen citing code, use the format <file>:<line>.`,
        },
      ],
    },
  ],
});

Two production tweaks:

  • Cache the codebase block. A single project load + 50 questions in the same hour makes the math obvious (worked through after this list): pay the write premium once, read at 10% of input cost for every subsequent question.
  • Place the question after the codebase, not before. Recall is best for the most recent context. The model has just read your code; the question is fresh in mind.
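
The back-of-envelope version, a sketch using this article's own figures (25% cache-write premium, reads at 10% of input cost; it ignores output tokens and the questions' own tokens):

const PREFIX = 500_000;  // cached codebase tokens
const QUESTIONS = 50;    // questions against the same prefix, within the cache window

const uncached = QUESTIONS * PREFIX;                            // 25.0M token-equivalents
const cached = 1.25 * PREFIX + (QUESTIONS - 1) * 0.1 * PREFIX;  // ~3.1M token-equivalents

console.log(`${(uncached / cached).toFixed(1)}x cheaper`);      // ~8.1x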

Pattern 2: Multi-Document Synthesis

For "read these 30 documents and produce a comparative analysis" tasks, structured XML containers dramatically outperform raw concatenation:

<documents>
  <document index="1" title="Q3 Earnings Report" type="financial">
    [content...]
  </document>
  <document index="2" title="Risk Memo from Legal" type="memo" date="2026-02-14">
    [content...]
  </document>
  ...
</documents>

<task>
For each of the following questions, cite the document index(es) that
support your answer. If documents disagree, note the disagreement.

1. What are the top 3 risks the executive team should focus on next quarter?
2. Where do the financial team and the legal team disagree?
3. ...
</task>

The structure helps the model attribute claims to sources, surfaces conflicts, and dramatically reduces cross-source hallucinations. The discipline of "cite the index" turns the model into something more like a careful researcher than a confident essayist.
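
If you're assembling that container programmatically, a minimal helper along these lines works; the attribute names mirror the example above, and quote-escaping inside titles is left out for brevity:

type Doc = { title: string; type: string; date?: string; content: string };

function buildDocumentsBlock(docs: Doc[]): string {
  // 1-based indices so the model's citations match human counting.
  const blocks = docs.map((d, i) => {
    const date = d.date ? ` date="${d.date}"` : "";
    return `  <document index="${i + 1}" title="${d.title}" type="${d.type}"${date}>\n${d.content}\n  </document>`;
  });
  return `<documents>\n${blocks.join("\n")}\n</documents>`;
}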

Pattern 3: Long-Form Editing and Review

Loading a book-length manuscript, a 500-page contract, or a multi-quarter strategy document and asking for line-level edits, structural critique, or consistency checking is a category of work that simply wasn't possible at 200K. At 1M, it's a primary use case.

For editing tasks, ask for line numbers, not free-text quoting. Line numbers force the model to anchor every suggestion to a specific location, and the resulting output is mechanically applicable:

const document = readFileSync("manuscript.md", "utf-8");
const numbered = document
  .split("\n")
  .map((line, i) => `${(i + 1).toString().padStart(5)}: ${line}`)
  .join("\n");

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 8192,
  messages: [{
    role: "user",
    content: `<document>
${numbered}
</document>

For each issue, output a JSON object on its own line:
{"line": <number>, "severity": "error"|"warning"|"suggestion", "issue": "...", "suggested_fix": "..."}

Look for: factual inconsistencies, repeated points, unclear phrasing, missing transitions.`,
  }],
});

This pattern produces output that you can pipe directly into a diff tool or PR comments.
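
Parsing that output is mechanical. A sketch, assuming the model followed the one-object-per-line format (any stray prose lines are simply skipped):

type Suggestion = {
  line: number;
  severity: "error" | "warning" | "suggestion";
  issue: string;
  suggested_fix: string;
};

const raw = response.content
  .map((b) => (b.type === "text" ? b.text : ""))
  .join("");

const suggestions: Suggestion[] = raw
  .split("\n")
  .filter((l) => l.trim().startsWith("{"))
  .flatMap((l) => {
    try {
      return [JSON.parse(l) as Suggestion];
    } catch {
      return [];  // tolerate malformed lines rather than failing the batch
    }
  });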

Pattern 4: Long-Running Agent Sessions

For agents that loop through 30-100 iterations, the conversation history grows fast. With 1M context, you can run much longer without summarisation: the agent retains full memory of every tool call, every observation, every decision.

The pattern: don't summarise prematurely. With 4.6's 200K window, agents needed compression strategies (summarise after iteration 10, drop early tool results, etc.). With 4.7, you can typically run to natural completion without compression, and the agent's reasoning is better when it has the full record.

import type { MessageParam } from "@anthropic-ai/sdk/resources/messages";

// `task` is the user's top-level instruction. compressEarlyHistory is
// app-specific; estimateTokens is sketched below.
let messages: MessageParam[] = [{ role: "user", content: task }];
const HARD_TOKEN_CAP = 800_000;  // leave headroom for output

for (let i = 0; i < 50; i++) {
  const tokensSoFar = estimateTokens(messages);
  if (tokensSoFar > HARD_TOKEN_CAP) {
    messages = compressEarlyHistory(messages);  // only compress if needed
  }
  // ... normal agent loop
}

Compression becomes a fallback for genuinely long runs, not a default for every loop.
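
estimateTokens doesn't need to be precise, since the cap already leaves 20% headroom. A minimal sketch, assuming roughly 4 characters per token for English-heavy text and code:

function estimateTokens(messages: MessageParam[]): number {
  // Crude but cheap; use it only for the hard cap, never for billing.
  const chars = messages.reduce(
    (sum, m) => sum + JSON.stringify(m.content).length,
    0
  );
  return Math.ceil(chars / 4);
}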

The Pitfalls

1. Latency scales with input size. A 500K-token prompt takes meaningfully longer to produce its first token than a 5K-token one. For real-time UX, long context is poor architecture; reserve it for offline analysis or async features.

2. Cost scales with input size. A million-token prompt is a million-token bill. Combined with the 25% write premium for caching, the unit economics demand that you actually reuse the cached prefix multiple times; otherwise smart RAG on a smaller model is cheaper.

3. Recall is U-shaped over depth. Content placed in the middle of a 1M-token prompt is recalled less reliably than content at the start or end. For critical instructions, repeat them at the end. For critical context, place it near the question.
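
One mechanical way to honour that, sketched below: duplicate the critical instructions so one copy always sits at the tail, next to the question.

function assemblePrompt(context: string, instructions: string, question: string): string {
  // Instructions appear at the start and again at the end, so at least one
  // copy avoids the low-recall middle of the prompt.
  return [instructions, context, instructions, question].join("\n\n");
}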

4. Context dilution. Loading 200K tokens of irrelevant code "to give the model full context" can actually hurt performance by diluting attention. If 80% of your codebase isn't relevant to the question, exclude it. Bigger isn't always better; more relevant is.
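
With the loadProject helper from Pattern 1, narrowing the context is just a tighter root and a stricter ignore list (the paths here are illustrative):

const focused = loadProject("./src/payments", [
  /node_modules/, /__tests__/, /fixtures/,
]);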

When Long Context Is the Wrong Architecture

Three situations where you should not reach for 1M context:

  1. High-volume real-time queries. Build a RAG system on Sonnet 4.6. Per-query cost will be 1-5% of long-context, and responses will arrive 2-4× faster.
  2. Tasks where retrieval is straightforward. "What does the user mean by 'undo'?" doesn't need the whole codebase; a small embedding index is the right tool.
  3. When the source content is unstable. If your "context" changes often (live database, frequently-edited documents), caching breaks down and the per-request cost of fresh long context becomes prohibitive.

A Production Architecture

The shape that works for most teams that use long context seriously (a routing sketch follows the list):

  • Long-context endpoints for heavy analytical tasks (codebase review, multi-document synthesis, long-form editing). Run async, results cached or stored. Use Opus 4.7 with full context loaded and cache_control enabled.
  • RAG endpoints for real-time queries (search, Q&A, chat). Use Sonnet 4.6 with retrieval. Volume here is 1000× the long-context usage.
  • Agent endpoints for orchestrated tasks. Use Opus 4.7 for complex agents (long sessions, planning); Sonnet for simple agents (3-5 tool calls). Compress only if needed.
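
As a routing sketch only; the model IDs and the tool-call threshold are illustrative assumptions, not fixed rules:

type TaskKind = "analysis" | "realtime" | "agent";

function pickModel(kind: TaskKind, expectedToolCalls = 0): string {
  switch (kind) {
    case "analysis":  // async, full context, cached prefix
      return "claude-opus-4-7";
    case "realtime":  // RAG-backed search, Q&A, chat
      return "claude-sonnet-4-6";
    case "agent":     // long/planning agents to Opus, simple ones to Sonnet
      return expectedToolCalls > 5 ? "claude-opus-4-7" : "claude-sonnet-4-6";
  }
}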

Long context isn't a replacement for retrieval; it's a complement. The teams who get the most out of 4.7 use both, picking the right tool per use case rather than trying to make one architecture fit everything.

The One-Page Summary

  • Use 1M when cross-document or cross-file reasoning is the point.
  • Cache the long prefix; pay the write premium once, read many times.
  • Place critical content at the start or end, not the middle.
  • Use XML structure for multi-source content.
  • Don't dilute the prompt with irrelevant tokens "for context."
  • Reserve long context for offline or async features; use RAG for real-time.

Get those right, and the 1M window is one of the highest-leverage capabilities in modern AI engineering. Get them wrong, and you've built an expensive way to do what RAG already does cheaper.