The interesting thing about Gemini 3.5 Pro is what Google did not say at I/O. Flash shipped on May 19th with full benchmarks, pricing, and API availability. Pro got a one-line mention — "coming next month" — and an admission that Google is already using it internally. No release date, no benchmark numbers, no pricing tier. For a model that Google clearly believes in enough to use in its own products, the public silence is loud.
The right way to read the silence is not "Pro isn't ready." Google ships incomplete launches when it has to, and Pro being in internal use means the model is production-ready in the strict sense. The right reading is "Pro is being held back for a release window that makes more competitive sense than launching alongside Flash." The most likely window is June — possibly to land between OpenAI's expected updates and Anthropic's next Opus point release, possibly to give Flash a month of clean air to establish the new performance bar before Pro raises it again.
This is the pre-read: what Flash already told us about Pro's capability shape, what the pricing will likely be, how the dual-tier architecture changes your routing layer, and what to do now to be ready when the release actually lands. I have been running Flash on shadow traffic for the week since it shipped, and the patterns that are emerging make Pro's shape easier to predict than the public silence suggests.
What Flash Already Told Us
Flash launched as the surprising entrant — not because Google shipped a cheap-tier model, but because the cheap-tier model beat the previous-generation Pro on every agentic and coding benchmark. The headline numbers from launch: Terminal-Bench 2.1 at 76.2%, MCP Atlas at 83.6%, CharXiv Reasoning at 84.2%. All of those beat Gemini 3.1 Pro. The model that was supposed to be the cheap option turned out to be the agentic-and-coding leader of the family.
That outcome tells us something specific about Pro. Flash's strength is a result of post-training tuning that prioritised tool-use and coding-loop performance over the broad reasoning suite. Pro's job, at the architecture level, is to be the model that does not trade away the broad reasoning suite — the model that takes the same training corpus and the same post-training improvements, and instead of optimising for fast/cheap inference, optimises for raw capability across the harder benchmarks.
Two things to internalise about that framing. First, Pro is unlikely to be a strict superset of Flash on coding-specific benchmarks; the trade-off that made Flash strong is one Pro will partially un-make. Second, Pro is almost certainly the superset on hard reasoning, multi-step problem solving, and the workloads where Flash's optimised-for-speed posture costs it raw capability. The dual-tier architecture is by design.
The question that follows — and the one production teams should already be thinking about — is whether Pro matches Flash closely enough on coding that the dual-tier endpoint architecture is worth running.
The Architecture Question — Superset or Dual-Endpoint?
If Pro matches Flash on coding benchmarks (within a couple of points), the production architecture is simple: run Pro for everything. The cost premium is fine for most workloads; the capability headroom on reasoning tasks is upside. This is the architecture I would bet on if I had to guess Google's intent — make Pro the default, position Flash as the cheap-tier escape hatch for high-volume work.
If Pro lags Flash on coding (by more than a couple of points), the production architecture is dual-endpoint: route coding work to Flash, route reasoning work to Pro. This is the architecture that comes naturally if Google followed the same post-training discipline they used for Flash — coding-and-agent specialisation at the cost of broad capability. It is more operationally complex than the single-endpoint architecture and slightly more expensive in routing-layer engineering, but it captures the strength of each tier.
My guess at the most-likely shape: Pro lands within 1–2 points of Flash on coding benchmarks and meaningfully ahead on reasoning, instruction-following, and the hard non-coding suites. That gives Google a clean two-tier story — Pro is the default for the bulk of workloads where reasoning quality matters; Flash is the high-volume cheap-tier for coding agents and tool-using loops. The teams that already routed Flash against Composer 2.5 will simply add Pro as a third option above Flash, with the routing rule "Pro for reasoning, Flash for coding, Composer for in-Cursor coding."
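The two scenarios above reduce to a threshold check on the coding gap. Here is a minimal sketch, assuming "a couple of points" means a two-point tolerance. `choose_architecture` is a hypothetical helper, and the Pro scores are placeholders because no Pro benchmarks have been published.

```python
def choose_architecture(flash_coding: float, pro_coding: float,
                        tolerance: float = 2.0) -> str:
    """Pick an endpoint layout from coding-benchmark scores (in percent).

    tolerance is an assumed reading of "within a couple of points".
    """
    gap = flash_coding - pro_coding  # positive when Pro lags Flash
    if gap <= tolerance:
        return "single-endpoint"  # run Pro for everything
    return "dual-endpoint"        # coding -> Flash, reasoning -> Pro


FLASH_TERMINAL_BENCH = 76.2  # launch figure quoted in this article

# Placeholder Pro scores, NOT published numbers.
for hypothetical_pro in (75.0, 71.0):
    layout = choose_architecture(FLASH_TERMINAL_BENCH, hypothetical_pro)
    print(f"Pro at {hypothetical_pro}: {layout}")
```

Keeping the tolerance as a parameter makes the decision easy to revisit once your own evals, rather than the launch numbers, supply the scores.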
Pricing — What to Expect Based on Flash's Math
Flash is priced at $1.50 input and $9.00 output per million tokens on the standard tier. That is exactly three times the previous Flash generation's pricing: Google raised the per-token price, but raised capability faster. The pricing math implies Google's strategy is "charge what the new capability is worth, not what the old capability cost."
Applying the same logic to Pro: Gemini 3.1 Pro is priced at roughly $3.50 input and $10.50 output per million tokens on the standard tier. If Google applies a more modest 1.5–2× multiplier to Pro for the 3.5 generation, rather than the 3× it applied to Flash, the rate card lands somewhere around $5–7 input and $15–21 output per million. That is genuinely expensive territory: comparable to Claude Opus 4.7's $5/$25 pricing, and meaningfully more than GPT-5.5.
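The projection is simple multiplication, sketched below so the range can be recomputed when real numbers arrive. The base prices are the rough 3.1 Pro figures quoted above, the multipliers are this article's guess, and `projected_rate` is a hypothetical helper.

```python
PRO_3_1_INPUT = 3.50    # USD per million input tokens (rough figure from the text)
PRO_3_1_OUTPUT = 10.50  # USD per million output tokens (rough figure from the text)


def projected_rate(multiplier: float) -> tuple[float, float]:
    """Scale the 3.1 Pro rate card by an assumed generation multiplier."""
    return (PRO_3_1_INPUT * multiplier, PRO_3_1_OUTPUT * multiplier)


# The 1.5x and 2x multipliers are a guess, not an announced price.
for multiplier in (1.5, 2.0):
    price_in, price_out = projected_rate(multiplier)
    print(f"{multiplier}x -> ${price_in:.2f} input / ${price_out:.2f} output per million")
```

The low end works out to $5.25 and $15.75, and the high end to $7.00 and $21.00, which is where the rounded $5–7 and $15–21 range comes from.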
The corollary is that the cost-routing layer becomes mandatory once Pro ships. Sending Pro every query at $5+ input pricing is a bill nobody wants. The work that was optional when the only choice was "Flash or Composer" becomes load-bearing once Pro is the third option. Most production teams should treat the Pro release as the moment to add or harden their tier-routing logic, the same way the broader cost-engineering playbook recommends — the cheaper tiers should handle the bulk of work, the expensive tier should handle only the work that justifies the premium.
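To see why routing becomes load-bearing, it helps to price one workload both ways. The sketch below uses Flash's quoted prices, but everything else is an assumption: the Pro rate is the low end of the projected range, and the request volume, token counts, and 20% Pro share are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of a single request."""
        return (input_tokens * self.input_per_m
                + output_tokens * self.output_per_m) / 1_000_000


FLASH = Rate(1.50, 9.00)  # standard-tier prices quoted in this article
PRO = Rate(5.00, 15.00)   # ASSUMPTION: low end of the projected range

# ASSUMPTION: hypothetical monthly workload shape.
REQUESTS = 1_000_000
INPUT_TOKENS = 4_000   # per request
OUTPUT_TOKENS = 500    # per request
PRO_SHARE = 0.20       # fraction of requests that justify the premium


def monthly_bill(pro_share: float) -> float:
    """Total monthly cost when pro_share of requests go to Pro, the rest to Flash."""
    pro_requests = round(REQUESTS * pro_share)
    flash_requests = REQUESTS - pro_requests
    return (pro_requests * PRO.cost(INPUT_TOKENS, OUTPUT_TOKENS)
            + flash_requests * FLASH.cost(INPUT_TOKENS, OUTPUT_TOKENS))


print(f"all Pro: ${monthly_bill(1.0):,.0f}")
print(f"routed:  ${monthly_bill(PRO_SHARE):,.0f}")
```

With these placeholder numbers the routed bill comes to about half of the all-Pro bill. The exact ratio depends on your own traffic mix, so treat the shape of the calculation as the useful part.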
The corollary to the corollary: prompt caching becomes higher-leverage at Pro pricing. Flash already lists cached input at $0.15/M; Pro will almost certainly carry the same caching tier. The patterns from the Claude prompt-caching piece apply across providers — the engineering investment that pays off at Opus pricing pays off again at Pro pricing.
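The caching leverage is easiest to see on Flash's published numbers. This is a simplified sketch: it assumes every request hits the cache and ignores any cache-write or storage fees, and the 50k-token shared prefix is a hypothetical request shape.

```python
FLASH_INPUT = 1.50         # USD per million uncached input tokens (from the text)
FLASH_CACHED_INPUT = 0.15  # USD per million cached input tokens (from the text)


def input_cost(prefix_tokens: int, unique_tokens: int, cached: bool) -> float:
    """Input-side cost in USD of one request with a shared prompt prefix.

    Simplification: ignores cache-write and storage fees and assumes
    the prefix is always served from cache when cached=True.
    """
    prefix_rate = FLASH_CACHED_INPUT if cached else FLASH_INPUT
    return (prefix_tokens * prefix_rate + unique_tokens * FLASH_INPUT) / 1_000_000


# ASSUMPTION: hypothetical request with a 50k shared prefix and 2k unique tokens.
uncached = input_cost(50_000, 2_000, cached=False)
cached = input_cost(50_000, 2_000, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, "
      f"saving {1 - cached / uncached:.0%}")
```

For this request shape the input spend drops by roughly 87%. If Pro carries the same discount ratio at a higher base price, the absolute saving per request grows with it.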
The Context Window Question
The single most production-relevant variable about Pro that we do not know yet is the context window. Flash shipped with a 1M-token input window — the same size as Opus 4.7. Pro could plausibly land at any of three sizes: 1M (matching Flash and Opus), 2M (a clean step up), or something in the 5M–10M range (which Google has talked about in research contexts but never shipped).
Each of those shapes has different implications for the architecture I have written about in RAG vs long-context. At 1M, the architecture does not move — Pro is one of three 1M-class models, and the hybrid pattern still wins for large corpora. At 2M, Pro becomes the long-context leader and starts to genuinely change the breakeven for some workloads; corpora that did not fit Opus 4.7 or Flash now fit Pro, and the retrieval layer gets simpler for those use cases. At 5M+, the architecture inverts — long-context becomes the default for a much wider set of workloads, RAG retreats to the corpora that genuinely exceed the new ceiling, and the cost-vs-capability trade for "load everything" becomes attractive at a scale it has not been before.
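The first-order version of that breakeven is a fit check, sketched here under stated assumptions. `retrieval_strategy` is a hypothetical helper, the 50k-token reserve for instructions and output is an arbitrary allowance, and the corpus size is invented. Fitting in the window is necessary rather than sufficient: cost and answer quality still decide the final architecture.

```python
def retrieval_strategy(corpus_tokens: int, window_tokens: int,
                       reserve_tokens: int = 50_000) -> str:
    """Return "long-context" if the whole corpus fits, else "hybrid-rag".

    reserve_tokens is an assumed allowance for instructions, the question,
    and the model's answer.
    """
    if corpus_tokens + reserve_tokens <= window_tokens:
        return "long-context"
    return "hybrid-rag"


CORPUS = 1_600_000  # ASSUMPTION: hypothetical corpus size in tokens

# The three candidate window sizes discussed above.
for window in (1_000_000, 2_000_000, 5_000_000):
    print(f"{window:>9,} window -> {retrieval_strategy(CORPUS, window)}")
```

A corpus of this size is exactly the case the 2M scenario changes: it needs retrieval at 1M and loads whole at 2M.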
My guess at the most-likely landing: 2M. It is the obvious competitive move (Anthropic and OpenAI are at 1M; Google has been signalling longer-context ambitions for a year), it is the size that has been technically demonstrated, and it is the size that produces a clean marketing story without committing to the operational complexity of 5M+ in production. But Google has been unpredictable on context-window announcements before, and a surprise 5M or 10M release would not be out of character.
Multimodal — The Variable Google Will Lead On
The category where Pro is least uncertain is multimodal. Flash already shipped with first-class multimodal input: images, diagrams, PDFs, structured documents. Pro will almost certainly match or extend that, and the smart bet is that multimodal is the dimension Google chooses to differentiate Pro most aggressively from Anthropic's and OpenAI's offerings.
The most interesting multimodal capability to watch for is real-time multimodal reasoning over moderately long inputs: loading a 200-page PDF with diagrams and asking analytical questions, reading a Figma export and generating React components, or processing a video transcript with frame samples for context. Flash can do all of these to some degree; Pro is positioned to do them at production quality.
The use-case implication: workloads that were single-model coding loops are about to be multi-step pipelines that pass multimodal artefacts between models. The architecture from the Claude Code in production piece — four-layer stack, sub-agents for bounded specialist work — applies here. Pro slots in as a specialist for multimodal-input steps; Composer or Flash handles the coding output; the routing layer picks per step. The teams that build this routing now will have a working architecture the day Pro ships; the teams that wait for Pro to ship will spend two weeks figuring out how to wire it.
What To Do Now — Preparation Checklist
Five things production teams should do before Pro ships, all of them low-cost and high-ROI:
- Add a Gemini-tier abstraction to your routing layer. If your current architecture has a single Gemini endpoint hardcoded as "Flash," wrap it now in an abstraction that routes by capability: geminiFor("coding"), geminiFor("reasoning"), geminiFor("multimodal"). When Pro ships, the routing change is one line per category instead of a refactor.
- Build the cost-observability instrumentation if you have not. Per-task token spend, per-model breakdown, per-route attribution. The cost discipline matters at Flash pricing; it becomes mandatory at Pro pricing.
- Profile your current workloads for the dual-tier split. What fraction of your queries are "pure reasoning" (Pro candidates) versus "coding / tool use" (Flash candidates) versus "in-Cursor coding" (Composer candidates)? The answer tells you what your post-Pro cost profile will look like.
- Plan for the multimodal pipeline. Identify the workloads where multimodal input would unlock capability you currently work around. Pro will likely make those workloads viable; the engineering investment to build the pipeline is something you can start before the model ships.
- Pre-write your eval suite for Pro. Run your existing evals against Flash to establish the baseline; the moment Pro is available, run the same suite and the comparison is one command rather than two weeks of harness setup.
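The first checklist item can be as small as a lookup table. Below is a minimal Python rendering of the geminiFor idea, written as `gemini_for` to follow Python naming. The model IDs are placeholders, not confirmed API identifiers, and the table layout is one possible design rather than a prescribed one.

```python
from typing import Literal

Capability = Literal["coding", "reasoning", "multimodal"]

# One entry per capability. Model IDs are PLACEHOLDERS, not confirmed API names.
GEMINI_ROUTES: dict[str, str] = {
    "coding": "gemini-flash-placeholder",
    "reasoning": "gemini-flash-placeholder",   # switch to Pro once it ships
    "multimodal": "gemini-flash-placeholder",  # switch to Pro once it ships
}


def gemini_for(capability: Capability) -> str:
    """Return the model ID callers should use for a capability."""
    try:
        return GEMINI_ROUTES[capability]
    except KeyError:
        raise ValueError(f"unknown capability: {capability!r}") from None


print(gemini_for("coding"))
```

Call sites ask for a capability and never name a model, so adopting Pro means editing one table entry per category.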
What Has Not Changed
The unchanging caveats:
- The benchmarks will lie. Google's published numbers will be self-reported and optimised for Pro's strongest dimensions. Run your own evals on your own workload before letting the rate card decide your routing.
- Pricing will move. First-month pricing is rarely the long-term rate card. Build your cost-routing layer for prices that can shift 20–30% in either direction within a quarter of launch.
- Availability will be uneven. Google's enterprise rollout often lags the GA announcement by weeks; the model your AI Studio account can hit on day one may not reach your Vertex AI or Vercel AI Gateway account until weeks later. Plan for the staggered rollout in your launch timeline.
- The competitive landscape will respond. Anthropic and OpenAI will not stand still in June. Whatever Pro ships, the next Opus point-release and the next OpenAI flagship update will recalibrate the field within weeks. The routing-layer discipline matters more than the specific model choice; the layer is permanent, the model winners are seasonal.
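The "pricing will move" caveat can be turned into a quick budget stress test. This sketch assumes token volumes stay fixed and every price shifts by the same fraction, which is a simplification. The budget and baseline figures are hypothetical, and `spend_range` is an illustrative helper.

```python
def spend_range(baseline_usd: float, shift: float = 0.30) -> tuple[float, float]:
    """Return (low, high) monthly spend if every price moves by +/- shift."""
    if not 0 <= shift < 1:
        raise ValueError("shift must be in [0, 1)")
    return (baseline_usd * (1 - shift), baseline_usd * (1 + shift))


BUDGET = 15_000.0    # ASSUMPTION: hypothetical monthly budget in USD
BASELINE = 12_000.0  # ASSUMPTION: hypothetical spend at launch prices

low, high = spend_range(BASELINE)
print(f"spend range ${low:,.0f} to ${high:,.0f}")
if high <= BUDGET:
    print("within budget even at the worst case")
else:
    print("over budget at the worst case; routing rules need headroom")
```

A baseline that fits the budget at launch prices can still break it after a 30% increase, which is the argument for reading prices from configuration and re-running this check whenever the rate card changes.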
The Practitioner's Take
The honest summary on Gemini 3.5 Pro is that the interesting work is what you do now, not what you do when the model ships. Pro itself will be capable, expensive at the top tier, multimodal-strong, and almost certainly competitive on long-context. None of those facts will be surprising; the published rate cards and benchmark numbers will tell the story Google wants to tell, and most teams will spend a week reading the launch coverage before figuring out what to do.
The teams that ship the production architecture before Pro lands — the routing abstraction, the cost observability, the multimodal pipeline, the eval suite — will have working systems on day one. The teams that wait for the announcement will spend the back half of June rebuilding their architecture under launch-week pressure.
The interesting engineering work, as always, is the routing layer. The model is interchangeable; the architecture compounds. Pro is the next iteration, not the last one; the routing logic you build now will outlive the specific model the same way it has outlived every previous flagship release. The teams that internalise this ship architectures that age well. The teams that build for the specific model in front of them rebuild every release cycle, and never compound the engineering investment.
Pro is expected in June. The interesting work is what you do before then.