Building a Legal Research Agent: Gemini Flash + Opus 4.7 + RAG (Use-Case Build)

The interesting thing about legal research as an AI use case is that the work is structurally a perfect fit for the tools and yet most of the AI products in the legal vertical do not architect to that fit. The work is high-volume (thousands of pages per case), high-context (long single documents that have to be reasoned about as one unit), high-stakes (privilege calls, citation accuracy), and high-margin (the firm bills the time, the firm bears the cost of the headcount). The technical primitives that have shipped in 2026 — 1M-token context windows, multi-model routing, RAG over case-law corpora, MCP-integrated case management — line up against each of those four characteristics cleanly. The architecture is sitting there waiting to be built.

The right way to read this piece is not as a vendor pitch or a product spec — it is the build I would architect for a hypothetical 12-attorney commercial litigation boutique that wants to handle three times the caseload without three times the headcount. Every model choice, every routing decision, and every cost number below is grounded in real model capabilities and real published pricing as of May 2026. The legal firm is hypothetical; the architecture is one I would ship.

This is the first in what I am running as a regular use-case series — practical multi-model builds that take a real scenario and walk through what gets built, with which models, at what cost, with which failure modes. The discipline is to keep the scenario concrete enough to feel real and the architecture concrete enough to ship.

The Scenario

A 12-attorney commercial litigation boutique. Mid-market: large enough to handle complex multi-party disputes, small enough that every attorney is hands-on. Average matter: a $2M–$20M commercial contract dispute with 2,000–15,000 pages of discovery, 200–500 case-law citations to evaluate, and 30–80 deposition transcripts to read and cross-reference. Today, the firm runs three paralegals full-time on this work. The bottleneck is unambiguous: paralegal capacity. The firm could take on more cases and the partners are willing to bill the work; the constraint is the human-hours required to prepare each case.

The partners' question is the one every services business asks: can we use AI to scale the case-prep work without proportionally scaling the headcount? The corporate-vendor answer is "buy CoCounsel, buy Harvey, install a SaaS, pay per seat." Those products are genuinely capable, and for many firms they are the right answer. But for a firm with technical sophistication on the partner side, and the willingness to build rather than buy, there is a roughly twelve-thousand-dollar-a-month AI architecture (a figure that includes the one paralegal who stays in the loop) that captures most of the same capability with substantially more control over the failure modes that matter to a litigation practice.

The goal: handle 3× the case-prep volume per paralegal hour. The constraint: never let the AI make a privilege determination or a citation claim without human verification. The mechanism: a multi-model architecture that routes each step of case prep to the model that handles it best, at the cost that justifies the routing.

Why This Is Hard

The naive version of "AI for legal research" is one model doing everything — load the documents, find the relevant cases, write the brief. It fails for three structural reasons, all of which are worth understanding before the architecture makes sense.

Volume vs. depth trade-off. Discovery review is volume work — thousands of pages, most of them irrelevant, with the signal buried in 5% of the pages. Brief preparation is depth work — three or four pages of argument that have to hold up to opposing-counsel scrutiny and a judge's skepticism. The model that is good at volume (fast, cheap, accurate at finding signals in noise) is structurally different from the model that is good at depth (slow, expensive, accurate at multi-step reasoning over carefully curated inputs). A single-model architecture pays the expensive model's price for the volume work, or the cheap model's accuracy ceiling on the depth work. Either way it loses.

Corpus size vs. context window. A single case file might be 50K–200K tokens. The relevant case-law corpus for citation research is 50M tokens or more — federal case law alone is in the high hundreds of millions of tokens, state law adds another order of magnitude. No context window holds the corpus. The "just load everything" approach that works for a single case does not work for the citation research that the case requires. The architecture has to do both — long-context for the single-case work, retrieval for the corpus work — and the framing from the RAG vs long-context piece applies directly.

Asymmetric cost of error. A wrong citation in a brief is embarrassing. A privilege determination that exposes a client communication is malpractice. A hallucinated case is a sanctionable offence. The cost-of-error distribution is heavily skewed toward the tail — most errors are mild, but the tail errors are career-altering for the attorney involved. The architecture has to be designed around the tail, not around the average; the model that gets 95% of citations right is not adequate when the 5% includes citations that do not exist.

The right architecture handles all three asymmetries — volume vs. depth, corpus size vs. context, and tail-risk vs. average-case performance — by routing each step to the model and the surrounding controls that fit the step.

The Model + Tool Routing Decision

The architecture I would ship has four core components (two models, a retrieval layer, and a set of human checkpoints) and three tool surfaces, each with a defined job.

Gemini 3.5 Flash — the document-loader and high-volume reader. Flash's 1M-token context window and multimodal input handle the heaviest part of the work: reading scanned discovery PDFs, parsing deposition transcripts, ingesting contracts and exhibits. The economics, covered in the Composer 2.5 vs Gemini 3.5 Flash comparison, favour Flash on this kind of volume work — $1.50 per million input tokens means a 200K-token deposition costs roughly $0.30 to read end-to-end. Flash is the model that does the bulk of the page-turning.

Claude Opus 4.7 — the critical reasoning layer. Anywhere the work shifts from "find the signal in the noise" to "reason carefully about precedent, jurisdictional differences, or argument structure," Opus 4.7 is the right tool. Its 1M-token context window — covered in the Opus 4.7 long-context patterns — handles the multi-document synthesis. Its calibration discipline (the "I do not know enough, here is what I would need" behaviour) is the right shape for legal reasoning, where overconfidence is the most expensive failure mode.

A RAG layer over case law. Federal and relevant state case law indexed in a vector database — somewhere around 50M tokens of decision text plus headnotes, syllabi, and citation graphs. The retrieval layer narrows the corpus to the 8–20 most-relevant cases for a given query; Opus reasons over those cases in the long-context window. This is the hybrid pattern from the RAG-vs-long-context piece, applied to the specific corpus shape of case law.
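The narrowing step can be sketched in a few lines. This is a minimal illustration, assuming the embeddings have already been computed by whatever embedding model the firm uses and that a real deployment would delegate the ranking to the vector database. The `CaseChunk` type, the `top_k_cases` function, and the toy vectors and case names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseChunk:
    citation: str                  # which decision the chunk came from
    embedding: tuple[float, ...]   # vector from the embedding model (assumed precomputed)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cases(query_embedding, chunks, k=20):
    """Rank chunks by similarity and return up to k distinct citations."""
    ranked = sorted(chunks, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    citations = []
    for chunk in ranked:
        if len(citations) >= k:
            break
        if chunk.citation not in citations:
            citations.append(chunk.citation)
    return citations


# Toy two-dimensional vectors and placeholder case names, for illustration only.
corpus = [
    CaseChunk("Case A", (1.0, 0.0)),
    CaseChunk("Case A", (0.9, 0.1)),
    CaseChunk("Case B", (0.0, 1.0)),
    CaseChunk("Case C", (0.7, 0.7)),
]
shortlist = top_k_cases((1.0, 0.1), corpus, k=2)
print(shortlist)  # ['Case A', 'Case C']
```

Deduplicating by citation matters because a long decision is usually split into many chunks, and the reasoning model should receive 8–20 distinct cases rather than 20 chunks of the same one.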

Human-in-the-loop checkpoints. Two non-negotiable checkpoints: privilege review on any document flagged as potentially privileged, and citation verification on any case cited in produced work. The model does the first pass; the human attorney signs off. The architecture is "model proposes, human disposes" on these two specific axes, by design.

The tool surfaces are an MCP-integrated case-management server (the firm's matter management system, Clio or PracticePanther equivalent), an MCP-integrated document repository (the firm's DMS, NetDocuments or iManage equivalent), and a citation-verification tool that hits the case-law corpus directly to confirm any citation Opus produces. The MCP layer is what I covered in the Composer 2.5 + MCP integration guide — the same patterns apply, with the servers swapped for the legal-vertical equivalents.
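One way to make the routing decision explicit is a plain lookup table. The sketch below is an assumption about how such a table might look, not a description of any vendor's API: the step names and the `Route` labels are hypothetical, and the labels are not real model identifiers.

```python
from enum import Enum


class Route(Enum):
    FLASH = "flash"              # high-volume reading
    OPUS = "opus"                # critical reasoning and drafting
    RAG_THEN_OPUS = "rag+opus"   # retrieve from case law, then reason over the results
    VERIFIER = "verifier"        # deterministic citation check
    ATTORNEY = "attorney"        # human sign-off


ROUTING_TABLE = {
    "read_discovery": Route.FLASH,
    "summarise_deposition": Route.FLASH,
    "precedent_analysis": Route.RAG_THEN_OPUS,
    "draft_outline": Route.OPUS,
    "verify_citations": Route.VERIFIER,
    "privilege_determination": Route.ATTORNEY,
}


def route(step: str) -> Route:
    # Assumption: anything the table does not recognise goes to a human, not a model.
    return ROUTING_TABLE.get(step, Route.ATTORNEY)


print(route("read_discovery").value)        # flash
print(route("settlement_strategy").value)   # attorney
```

Defaulting unknown steps to the attorney keeps the failure direction conservative: a missing table entry costs human time rather than producing unreviewed model output.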

The Architecture in Practice

The end-to-end flow for a typical case-prep task — "prepare the cross-examination outline for the deposition of Mr. Smith on June 12th":

1. Document gathering (MCP). The agent queries the case-management MCP server for the matter, identifies the relevant pleadings, prior depositions, contract exhibits, and Mr. Smith's prior testimony in related matters. Returns a manifest of ~40 documents totalling ~600K tokens.
2. First-pass reading (Gemini Flash). Flash loads the full document set into its 1M context window, produces a structured summary: chronology of relevant events, key admissions from prior testimony, contract clauses Mr. Smith is being deposed about, points where his prior testimony contradicts the documentary record.
3. Citation research (RAG + Opus). From the structured summary, Opus identifies the legal questions Mr. Smith's testimony bears on. For each question, the RAG layer retrieves the 8–20 most-relevant case-law citations. Opus reads those cases in long-context and produces the precedent analysis.
4. Cross-examination outline (Opus). With the structured summary, the precedent analysis, and the attorney's stated strategy in context, Opus drafts the cross-examination outline: questions grouped by topic, supporting documents cited inline, anticipated objections noted.
5. Citation verification (deterministic tool). Every citation in the outline is checked against the case-law corpus. Citations that do not resolve to a real case are flagged for the attorney to review. This step is non-negotiable; Opus is calibrated but not perfect, and hallucinated citations in a deposition outline are the failure mode that kills the project.
6. Human review (the attorney). The attorney reviews the outline. The model has done the volume work and produced the draft; the attorney owns the strategy, the privilege calls, and the final citations.
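The flow above can be sketched as an ordered list of stages sharing one context. This is a skeleton under stated assumptions: the stage functions are stubs standing in for the MCP, Flash, RAG, Opus, and verifier calls, and the names `run_case_prep` and `Stage` are hypothetical.

```python
from typing import Callable

Stage = tuple[str, Callable[[dict], dict]]


def run_case_prep(task: str, stages: list[Stage]) -> dict:
    """Run stages in order; each stage reads the shared context and returns new keys."""
    ctx: dict = {"task": task, "trace": []}
    for name, stage in stages:
        ctx.update(stage(ctx))
        ctx["trace"].append(name)
    # The pipeline never marks work as final; that decision belongs to the attorney.
    ctx["status"] = "awaiting_attorney_review"
    return ctx


# Stub stages with canned outputs; real stages would call the models and tools.
stages: list[Stage] = [
    ("gather", lambda ctx: {"manifest": ["pleading.pdf", "prior_depo.pdf"]}),
    ("first_pass", lambda ctx: {"summary": f"summary of {len(ctx['manifest'])} documents"}),
    ("research", lambda ctx: {"precedents": ["Case A", "Case C"]}),
    ("draft", lambda ctx: {"outline": f"outline citing {', '.join(ctx['precedents'])}"}),
    ("verify", lambda ctx: {"flagged_citations": []}),
]

result = run_case_prep("cross-examination outline", stages)
print(result["trace"])
print(result["status"])
```

Keeping the trace alongside the output gives the reviewing attorney a record of which stages ran, which is useful when a step is skipped or fails.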

End-to-end token usage for a representative case-prep cycle: ~1.2M tokens through Flash, ~400K tokens through Opus, ~50K tokens through the embedding model for retrieval. Cost, at standard-tier pricing: roughly $4 in Flash time, $12 in Opus time, ~$0.50 in retrieval, total ~$17 per cross-examination outline. The paralegal time saved on the same task: 6–10 hours.
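The arithmetic behind these figures is short enough to check directly. The sketch below uses only numbers stated in the article: the $1.50 per million input tokens quoted for Flash and the three per-cycle dollar figures. The per-cycle figures are taken as given rather than derived from token prices, and the function name `input_cost_usd` is hypothetical.

```python
FLASH_INPUT_USD_PER_MTOK = 1.50   # input price quoted earlier in the article


def input_cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Input-side cost only, for a given token count and per-million-token price."""
    return tokens / 1_000_000 * usd_per_mtok


deposition_cost = input_cost_usd(200_000, FLASH_INPUT_USD_PER_MTOK)

# Per-cycle dollar figures as stated in the article (not derived from token prices here).
cycle_cost_usd = {"flash": 4.00, "opus": 12.00, "retrieval": 0.50}
cycle_total = sum(cycle_cost_usd.values())

print(f"200K-token deposition, input only: ${deposition_cost:.2f}")  # $0.30
print(f"per-cycle total: ${cycle_total:.2f}")                        # $16.50
```

The components sum to $16.50, which the article rounds to ~$17 per outline.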

The Cost Economics

The honest math on the architecture's monthly cost, against the baseline.

Baseline (three full-time paralegals). Three paralegals at roughly $55/hour fully-loaded, 160 hours per month each, ~$26,400/month. That covers approximately 80 cross-examination outlines, brief drafts, and discovery reviews per month across the firm's caseload — the current capacity bottleneck.

Architecture (one paralegal + AI infrastructure). One paralegal handles the human-in-the-loop review and the edge cases the model defers — ~$8,800/month. AI costs at the firm's typical case volume: roughly 80 case-prep cycles × $17 per cycle = $1,360/month. Add ~$400/month for the RAG infrastructure (vector DB, embedding API, hosting), ~$200/month for the MCP server hosting, ~$1,500/month for enterprise-tier API agreements with Anthropic and Google (no-train guarantees, business-associate-equivalent commitments, dedicated rate limits). Total architecture cost: ~$12,260/month.

The delta. $26,400 baseline → $12,260 architecture = ~$14,140/month saved at the same case-prep volume. Or, holding cost constant, the firm can handle ~2× the case-prep volume at the same monthly cost. Or, splitting the difference, the firm can grow case volume by 50% while reducing monthly cost by 25%.
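The monthly figures can be reproduced from the stated inputs. In this sketch every constant comes from the paragraphs above; the variable names are hypothetical, and the final ratio rests on a deliberately naive assumption that every line item, including the fixed ones, scales linearly with volume.

```python
HOURLY_RATE_USD = 55
HOURS_PER_MONTH = 160
CYCLES_PER_MONTH = 80
COST_PER_CYCLE_USD = 17

paralegal_month = HOURLY_RATE_USD * HOURS_PER_MONTH
baseline = 3 * paralegal_month

architecture = {
    "one_paralegal": paralegal_month,
    "model_usage": CYCLES_PER_MONTH * COST_PER_CYCLE_USD,
    "rag_infrastructure": 400,
    "mcp_hosting": 200,
    "enterprise_api_agreements": 1_500,
}
architecture_total = sum(architecture.values())
monthly_saving = baseline - architecture_total

# Naive assumption: all costs scale linearly with the number of case-prep cycles.
cost_per_cycle = architecture_total / CYCLES_PER_MONTH
volume_multiple_at_baseline_budget = baseline / architecture_total

print(baseline, architecture_total, monthly_saving)   # 26400 12260 14140
print(cost_per_cycle)                                 # 153.25
print(round(volume_multiple_at_baseline_budget, 2))   # 2.15
```

Under the linear assumption the baseline budget buys about 2.15 times the volume, consistent with the ~2× figure. The one remaining paralegal's salary is roughly 72% of the architecture cost, so human review capacity, not model spend, sets the real scaling limit.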

The cost-engineering discipline I have written about for Composer 2.5 at production scale applies: the headline saving is real, but capturing it requires the routing layer to actually work. A naive build that routes everything to Opus would cost $40K+/month and lose money against the baseline; a naive build that routes everything to Flash would produce work the attorneys cannot ship. The routing layer is where the saving lives.

The Failure Modes That Matter

The architecture is not interesting because it saves money. Many architectures save money. It is interesting because the failure modes are tractable. The four that matter:

Hallucinated citations. Opus 4.7 is calibrated but not infallible; on rare occasions it will produce citations that do not exist or that misrepresent the holding of a real case. The mitigation for the first kind is deterministic: every citation in produced work is checked against the case-law corpus before the work goes to the attorney, and a citation that does not resolve is flagged and cannot pass as verified. That is a hard gate, not a probabilistic mitigation, and it has to be hard for the architecture to be acceptable. An existence check cannot tell whether a real case actually says what the draft claims it says, so mischaracterised holdings remain the job of the attorney's citation-verification checkpoint.
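A minimal sketch of the existence check, assuming the model returns its citations as a structured list and the corpus exposes a set of known citation strings. The names `citation_gate` and `GateResult` are hypothetical, and the sample citations use an invented "Ex. Rep." reporter so that none refers to a real case. The check only establishes that a citation resolves; it says nothing about whether the holding is characterised correctly.

```python
from dataclasses import dataclass, field


def normalise(citation: str) -> str:
    """Collapse whitespace and case so formatting differences do not cause false flags."""
    return " ".join(citation.split()).lower()


@dataclass
class GateResult:
    resolved: list[str] = field(default_factory=list)
    flagged: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The gate passes only when nothing at all was flagged.
        return not self.flagged


def citation_gate(cited: list[str], corpus_citations: set[str]) -> GateResult:
    """Sort each cited string into resolved or flagged by exact lookup in the corpus."""
    known = {normalise(c) for c in corpus_citations}
    result = GateResult()
    for citation in cited:
        target = result.resolved if normalise(citation) in known else result.flagged
        target.append(citation)
    return result


# Placeholder citations in an invented reporter; none refers to a real case.
corpus_index = {"1 Ex. Rep. 100", "2 Ex. Rep. 250"}
draft_citations = ["1 Ex. Rep. 100", "2  ex. rep. 250", "9 Ex. Rep. 999"]

result = citation_gate(draft_citations, corpus_index)
print(result.passed)    # False
print(result.flagged)   # ['9 Ex. Rep. 999']
```

The lookup is a set membership test rather than a similarity score on purpose: a near match to a real citation is exactly what a fabricated citation tends to look like.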

Privilege misclassification. The model can produce a structured summary that flags potentially privileged content; the model cannot make the privilege determination. Anything tagged as potentially privileged routes to mandatory attorney review before it appears in produced work or in any model context where it might leak. The default is "treat as privileged if uncertain"; only the attorney can downgrade a flag, and the model never does.
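The one-way nature of the privilege flag can be expressed as a small state rule. This is a sketch under assumptions: the `Document` type, the function names, and the string-based `actor` check are hypothetical stand-ins for whatever identity and audit mechanism the firm's systems actually provide.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Document:
    doc_id: str
    flagged: bool   # True = potentially privileged, held for attorney review


def first_pass(doc_id: str, looks_privileged: Optional[bool]) -> Document:
    """Model first pass. None means the model was unsure, which is treated as a flag."""
    return Document(doc_id, flagged=looks_privileged is not False)


def clear_flag(doc: Document, actor: str) -> Document:
    """Only an attorney may downgrade a flagged document."""
    if actor != "attorney":
        raise PermissionError(f"{actor} may not clear a privilege flag")
    return replace(doc, flagged=False)


def usable_in_model_context(doc: Document) -> bool:
    """Flagged documents stay out of produced work and out of model context."""
    return not doc.flagged


memo = first_pass("memo-17", looks_privileged=None)   # uncertain, so flagged
print(usable_in_model_context(memo))                  # False

cleared = clear_flag(memo, actor="attorney")
print(usable_in_model_context(cleared))               # True
```

Calling `clear_flag(memo, actor="model")` raises `PermissionError`, which is the code-level form of the rule that the model proposes and the attorney disposes.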

Confidentiality leakage to model providers. Enterprise-tier API agreements with both Anthropic and Google include no-train guarantees and business-associate-equivalent commitments. No client data flows through the consumer-tier endpoints, ever. This is a contracting question first and an engineering question second, but both have to be right.

Calibration drift on new case types. The architecture is tuned for commercial contract disputes. If the firm takes on a securities fraud matter, an IP litigation, or any other practice area, the routing rules need recalibration — different documents, different relevant precedents, different failure modes. The architecture should be treated as a starting point that gets tuned per practice area, not as a turnkey solution.

The honest counter-case from the piece on when not to use Composer 2.5 applies here too: there are categories of legal work where the right call is not to route to the AI at all. Privileged communications, settlement-strategy discussions, and work product that opposing counsel will read and pressure-test all belong to the human attorney, with the model as a support tool at most.

Where This Generalises

The architecture is not a one-off for legal. The same shape — Flash for volume reading, Opus for critical reasoning, RAG for corpus that exceeds the context window, MCP for system integration, deterministic verification on the failure modes that matter — generalises to any document-heavy professional-services workflow with similar asymmetries. Three obvious adjacencies:

Medical research and clinical-trial design. High-volume literature review, high-stakes synthesis decisions, corpus that exceeds any context window, mandatory human review on patient-facing decisions. The same architectural primitives apply; the model routing changes (different RAG corpus, different verification step), the structure does not.

Compliance and regulatory affairs. Reading large bodies of regulation, identifying the provisions relevant to a specific business activity, producing audit-defensible memos. Asymmetric cost of error (regulatory fines), large corpus (the U.S. CFR alone is ~250M tokens), critical reasoning on application to specific facts. Same shape, different vertical.

Due diligence in M&A. Document-room review, target-company analysis, deal-specific risk identification. High-volume early stages, high-depth later stages, hard deadlines, multiple stakeholders. The same Flash-for-volume / Opus-for-depth split fits the workflow naturally.

The use-case framing I am running this series on is not "here is what to build for legal." It is "here is the architectural pattern that captures the cost saving in document-heavy professional services work, applied to legal as the first concrete example." The pattern generalises; the specifics calibrate per vertical.

The Practitioner's Take

The honest summary on building this architecture is that the technical primitives are ready and the discipline is what is missing. Gemini 3.5 Flash handles long-context volume work; Claude Opus 4.7 handles critical reasoning; RAG handles corpus-scale retrieval; MCP handles system integration; enterprise-tier contracts handle confidentiality. None of these are speculative. The architecture is one a small engineering team could ship in six to ten weeks.

The interesting engineering work, as in every other architecture I have written about, is the routing layer: which step goes to which model, with which surrounding controls, at what cost ceiling. The model choices are the boring part; the routing layer is the load-bearing part. A team that ships the routing layer captures the saving; a team that ships a single model and hopes runs into the same failure modes the corporate vendors built around years ago.

The architecture above is what I would build for a hypothetical mid-market litigation boutique willing to invest in technical sophistication over off-the-shelf SaaS. The choice between this build and a CoCounsel or Harvey subscription is the choice every firm faces between build-versus-buy in any vertical — control and cost vs. speed-to-value and vendor-managed risk. There is no universal right answer, but the architecture is a concrete reference point for the build-side of the decision.

The next piece in this use-case series will pick another vertical and walk through the same architectural primitives applied to a different workflow. The shape of the pattern is what compounds across pieces; the specific verticals are the surface that makes the pattern legible.