Every major LLM provider now offers three tiers: a small/lite model (Gemini 3.1 Flash Lite, Claude Haiku, GPT mini variants), a mid-tier workhorse (Gemini 3.1 Pro, Claude Sonnet 4.6, GPT 5.4/5.5), and a frontier flagship (Claude Opus 4.7, reasoning-tier models). Cost between tiers can span a 30-100× range, and choosing the wrong tier is the single largest source of cost surprise in production AI deployments.
This article is the framework for getting the choice right.
## The Three-Tier Mental Model
The price-performance curve across the three tiers is roughly:
| Tier | Cost | Latency | Best for |
|---|---|---|---|
| Lite | 1× | Lowest | Classification, extraction, routing, real-time UX |
| Mid (Pro/Sonnet) | 5-15× | Moderate | Reasoning, generation, agent loops |
| Frontier (Opus, reasoning models) | 30-100× | Higher | Hard reasoning, deep code work, high-stakes outputs |
The cost ratios are approximate and shift with each generation, but the structure is stable. The key insight: cost grows much faster than capability between tiers. A 100× cost difference doesn't deliver 100× the capability; on tasks where the larger model is genuinely better, the gain is usually closer to 1.5-3×.
Implication: for tasks where Lite is enough, Pro is wasteful, and Frontier is wildly wasteful.
## Decision Tree by Task Type
A practical workflow for picking the tier per use case:
- Classification, tagging, intent detection, simple extraction → Lite. These tasks are bounded; small models handle them at near-Pro accuracy with few-shot prompting.
- Routing, triage, simple form-filling, transformation → Lite. The output space is constrained; the model's job is mechanical, not creative.
- Standard summarisation, translation, rephrasing → Lite or Pro, depending on quality requirements. Test both with your eval set.
- Customer-facing chat assistants → Pro. Real users will notice the quality gap on Lite in open-ended interaction.
- Standard code generation from clear specs → Pro. Lite drops too many edge cases; Pro is consistent.
- Multi-step reasoning, complex analysis, code review across files → Pro or Frontier, depending on complexity. Test both.
- Hard logical/mathematical reasoning, planning over many constraints → Frontier. Lite and Pro both struggle here.
- Deep code work spanning many files, architectural reasoning, complex refactors → Frontier with long context (Claude Opus 4.7 1M, Gemini Pro). Pro fails too often on these.
- High-stakes content where a wrong answer is dangerous → Frontier with cross-validation.
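Several of these routes don't need a model call at all. Here is a minimal sketch of a deterministic first pass, under the assumption that requests arrive with a known task kind (the `Request` shape and the categories are illustrative, not a prescription); anything ambiguous falls through to the LLM triage in the next section:

```typescript
type Tier = "lite" | "pro" | "frontier";

// Hypothetical request shape, for illustration only.
interface Request {
  kind: "classify" | "route" | "summarise" | "chat" | "codegen" | "analysis";
  highStakes: boolean;
}

// Deterministic first pass: route the obvious cases for free.
function staticTier(req: Request): Tier | "triage" {
  if (req.highStakes) return "frontier"; // wrong answers are expensive here
  switch (req.kind) {
    case "classify":
    case "route":
      return "lite"; // bounded, mechanical output spaces
    case "chat":
    case "codegen":
      return "pro"; // open-ended and user-visible
    case "summarise":
    case "analysis":
      return "triage"; // depends on quality bar and complexity
  }
}
```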
## The Triage Architecture
The pattern that wins in production: triage to decide tier, not to decide answer.
```typescript
type Tier = "lite" | "pro" | "frontier";

async function answer(input: Input): Promise<Output> {
  // Step 1: classify task complexity on Lite
  const tier = await selectTier(input);

  // Step 2: route to the chosen tier's handler
  switch (tier) {
    case "lite": return await liteHandler(input);
    case "pro": return await proHandler(input);
    case "frontier": return await frontierHandler(input);
  }
}

async function selectTier(input: Input): Promise<Tier> {
  // liteClassifier is a Lite-tier model handle from your provider's SDK;
  // requesting a JSON MIME type keeps the output machine-parseable.
  const result = await liteClassifier.generate({
    contents: [{ role: "user", parts: [{ text: `Classify the complexity of this task:
"${input.summary}"
Output JSON: {"tier": "lite" | "pro" | "frontier", "reason": "..."}` }] }],
    generationConfig: { responseMimeType: "application/json" },
  });
  return JSON.parse(result.response.text()).tier;
}
```
The triage call costs almost nothing on Lite. Most requests stay on Lite. A subset escalates to Pro. A small subset of those escalates to Frontier.
For most production workloads, the distribution lands at:
- 80-95% Lite
- 5-18% Pro
- 1-5% Frontier
Aggregate cost per request is close to Lite's cost. Quality on hard requests stays high.
## Cost Modelling: A Worked Example
A SaaS support feature processing 1M tickets per day:
Naive: route everything to Pro.
- 1M × Pro per-request cost
- Typical bill: $20K-$80K/month
Tiered architecture: triage on Lite, escalate as needed.
- 1M Lite triage calls + 850K Lite handler calls + 130K Pro calls + 20K Frontier calls
- Typical bill: $2K-$8K/month
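To make the arithmetic concrete, here is a sketch of the blended-cost calculation. The per-request prices are placeholder assumptions chosen to land inside the ranges above; substitute your provider's real blended (input + output) rates:

```typescript
// Placeholder per-request prices in USD -- illustrative assumptions only.
const PRICE = { lite: 0.00005, pro: 0.0007, frontier: 0.0035 };

const TICKETS_PER_DAY = 1_000_000;
const DAYS = 30;

// Naive: route everything to Pro.
const naiveMonthly = TICKETS_PER_DAY * PRICE.pro * DAYS; // $21,000/month

// Tiered: every ticket pays for a Lite triage call, then its handler.
const tieredPerDay =
  1_000_000 * PRICE.lite + // triage calls
  850_000 * PRICE.lite +   // Lite handlers
  130_000 * PRICE.pro +    // Pro escalations
  20_000 * PRICE.frontier; // Frontier escalations
const tieredMonthly = tieredPerDay * DAYS; // ~$7,600/month

console.log(`savings: ${((1 - tieredMonthly / naiveMonthly) * 100).toFixed(0)}%`); // ~64%
```

With these placeholder rates the saving is about 64%; workloads that keep 90-95% of traffic on Lite, or carry a thinner Frontier tail, reach the top of the range.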
The savings (often 70-90%) compound across every feature in your AI stack. For a company shipping 5-10 AI features at meaningful volume, this is the difference between a sustainable AI line item and a runaway one.
## Latency Modelling
Cost isn't the only axis. Latency profiles by tier:
- Lite: time-to-first-token typically tens of milliseconds; full responses often under 500ms
- Pro: time-to-first-token a few hundred ms; full responses 1-3s
- Frontier: time-to-first-token often a second; full responses 3-15s
For real-time UX, this matters enormously:
- Autocomplete, search-as-you-type, inline transformations: Lite is the only viable tier
- Chat with streaming output: Pro or Lite (Pro for quality, Lite for snappiness)
- Async analysis, agent runs: Latency matters less; pick on quality and cost
If you're shipping a real-time UX feature on Pro or Frontier, latency is probably your bottleneck before quality is. Move down a tier and check whether the result still meets the quality bar.
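One way to enforce that check in code is a hedged request: fire Lite and the higher tier in parallel and keep whichever answer fits the latency budget. A minimal sketch; `callTier` is an assumed wrapper around your provider SDK, not a real API:

```typescript
type Tier = "lite" | "pro";

// Assumed wrapper around your provider SDK -- not a real API.
declare function callTier(tier: Tier, input: string): Promise<string>;

// Hedged request: return Pro's answer if it lands inside the budget,
// otherwise fall back to Lite, which is almost certainly done by then.
// Trades some extra spend for a hard latency ceiling.
async function hedgedAnswer(input: string, budgetMs: number): Promise<string> {
  const lite = callTier("lite", input);
  const pro = callTier("pro", input)
    .then((text) => ({ text }))
    .catch(() => null); // a Pro failure also falls back to Lite
  const timeout = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), budgetMs),
  );
  const winner = await Promise.race([pro, timeout]);
  return winner ? winner.text : lite;
}
```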
## When the Triage Itself Is Wrong
The most common failure mode: the triage classifier defaults to "pro" too easily. "If unsure, escalate" is a tempting rule that ends up routing 60-80% of traffic to Pro.
The fix: triage with a confidence signal, default to Lite, and only escalate when the classifier is confident the task needs more capability:
```typescript
const triage = await classify(input); // returns { tier, confidence }

// Escalate only on a confident signal; otherwise stay on Lite.
if (triage.tier === "frontier" && triage.confidence === "high") return "frontier";
if (triage.tier === "pro" && triage.confidence === "high") return "pro";
return "lite"; // default
```
This pattern keeps the Lite share at 80%+ even when the triage isn't sure.
## Per-Provider vs Cross-Provider
You don't have to use the same provider across all tiers. The architecture that often wins:
- Lite tier: Gemini Flash Lite or Claude Haiku; pick on per-task quality and cost
- Pro tier: Claude Sonnet 4.6 or GPT 5.5; pick on tooling fit and per-task quality
- Frontier tier: Claude Opus 4.7 for long-context and agentic work; reasoning-tier models for hard problems
The cost of running multiple providers is operational complexity (different SDKs, different observability, different quotas). The benefit is hitting the optimal price-quality point per task. Worth it for systems with meaningful scale.
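To keep that complexity contained, hide each vendor SDK behind one narrow interface per tier, so routing code never touches a provider directly. A sketch; the adapter factories and model ids are placeholders:

```typescript
// One narrow interface per tier; each vendor SDK hides behind an adapter.
interface TierClient {
  complete(prompt: string): Promise<string>;
}

// Hypothetical adapter factories -- wire these to the real SDKs you choose.
declare function geminiClient(model: string): TierClient;
declare function anthropicClient(model: string): TierClient;

const clients: Record<"lite" | "pro" | "frontier", TierClient> = {
  lite: geminiClient("<flash-lite model id>"),
  pro: anthropicClient("<sonnet model id>"),
  frontier: anthropicClient("<opus model id>"),
};

// Routing depends only on the interface, never on a vendor SDK,
// so swapping the provider behind a tier is a one-line change.
function complete(tier: "lite" | "pro" | "frontier", prompt: string): Promise<string> {
  return clients[tier].complete(prompt);
}
```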
## What to Measure
Three metrics every production AI system should track:
- Tier distribution per feature. What % of requests go to each tier? If it's drifting upward, find out why.
- Cost per request, per feature, per tier. Daily granularity. Catches regressions instantly.
- Quality differential per tier. How much better is Pro than Lite on this feature, in measurable terms? If you can't articulate the difference, you might not need Pro.
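A sketch of the bookkeeping behind those three metrics, assuming a generic `metrics` client (the interface and metric names are placeholders, not a specific library):

```typescript
type Tier = "lite" | "pro" | "frontier";

// Placeholder observability interface -- adapt to your metrics stack.
interface Metrics {
  increment(name: string, tags: Record<string, string>): void;
  observe(name: string, value: number, tags: Record<string, string>): void;
}

declare const metrics: Metrics;

// Call once per completed request, from the routing layer.
function recordRequest(feature: string, tier: Tier, costUsd: number): void {
  // Tier distribution per feature: alert if the Lite share drifts down.
  metrics.increment("llm.requests", { feature, tier });
  // Cost per request, per feature, per tier, at daily granularity.
  metrics.observe("llm.cost_usd", costUsd, { feature, tier });
}
```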
## The Production Discipline
Three rules I enforce on every AI architecture review:
- Don't default to the highest tier "to be safe." Test the smaller tier first; upgrade only when measurement shows it matters.
- Build the triage classifier on Lite. Routing logic shouldn't cost as much as the routed work.
- Re-evaluate tier choice quarterly. Provider updates change the calculus; your architecture should evolve with them.
Get those rules into your team's working norms, and AI infrastructure costs become a managed line item rather than a recurring surprise.
## The One-Page Summary
- Lite for the volume: classification, extraction, routing, real-time UX
- Pro for the workhorse: generation, reasoning, agent loops
- Frontier for the depth: hard reasoning, long context, high-stakes outputs
- A triage classifier on Lite decides which tier to invoke
- 80%+ of traffic should stay on Lite for most workloads
- Measure cost and quality per tier; alert on regressions
- The gap between provider tiers shifts every quarter; revisit the routing
Get this right, and your AI features ship sustainably. Skip it, and the bill is the headline before the product is.