OpenAI now ships two distinct families of models for builders to choose between: standard chat models like GPT 5.4, and reasoning-tier models that produce longer, more deliberate outputs by spending more compute per request. They're not interchangeable, and choosing the wrong one for a given task is the most common cost surprise I see in production AI systems.
This is the framework I use to route work between them.
The Fundamental Trade-Off
Standard chat models (GPT 5.4 and its tier-mates):
- Fast time-to-first-token (suitable for streaming UX)
- Cheaper per request
- Excellent at well-defined, bounded tasks
- Strong with clear prompts and examples
Reasoning-tier models:
- Higher latency: they "think" before responding
- More expensive per request, sometimes by 3-10×
- Better at multi-step reasoning, complex planning, and problems with non-obvious decompositions
- Can produce dramatically better results on hard problems
The decision isn't "which is smarter." It's "is this task hard enough that reasoning's overhead is justified?"
When the Standard Model Is Enough
GPT 5.4 (and the workhorse tier of any frontier provider) handles well:
- Classification and tagging: pick a category from a list
- Extraction: pull structured fields from text
- Summarisation: condense a document to its key points
- Translation: across major language pairs
- Stylistic rewriting: adjust tone, length, register
- Standard code generation: implement a function from a clear spec
- Q&A over provided context: answer based on retrieved docs
The defining property: the task is well-bounded, the success criterion is clear, and a thoughtful human could do it in a few minutes. For these, the standard model is fast, cheap, and produces consistently good output.
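For concreteness, here's what a bounded task looks like in code: a minimal classification sketch using the OpenAI Node SDK. The model identifier, prompt wording, and label set are illustrative, not prescriptive.

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function tagTicket(ticket: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-5.4", // illustrative; use your provider's current identifier
    messages: [
      {
        role: "system",
        content:
          "Classify the support ticket as exactly one of: billing, bug, feature-request, other. Reply with the label only.",
      },
      { role: "user", content: ticket },
    ],
    max_tokens: 5, // bounded task, bounded output
    temperature: 0, // classification should be deterministic
  });
  return res.choices[0].message.content?.trim() ?? "other";
}

Note the hard output cap and zero temperature: for well-bounded tasks you want the cheapest, most deterministic call you can get away with.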
When Reasoning Earns the Premium
Reasoning-tier models pull ahead when the task requires:
1. Multi-step problem decomposition. "Diagnose this slow query" requires reading the schema, the query, the explain plan, considering several hypotheses, and ruling them out. The standard model jumps to an answer; reasoning works through the candidates.
2. Constraint satisfaction. Scheduling problems, configuration problems, "find me a route that satisfies these 7 constraints": reasoning models are noticeably better at finding consistent solutions.
3. Genuine math or formal logic. Standard models stumble on multi-step arithmetic. Reasoning models tackle such problems directly (and even more reliably when paired with a code-execution tool).
4. Code that requires understanding interactions across files. "Why does this test fail?" with the test, the code under test, and the recent diffs: reasoning sees the chain of cause, while standard models often hallucinate plausible-sounding but wrong fixes. (A prompt-packing sketch follows this list.)
5. Critical decisions where partial correctness is dangerous. Legal interpretation, medical triage signals, financial computations. The cost of a wrong answer is high enough that the cost of reasoning compute is irrelevant.
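The common thread in these tasks is that the prompt must carry all the evidence the model needs to reason over. A hypothetical sketch of packing a "why does this test fail?" request; every identifier here is a placeholder, not an API.

// Sketch: give the model everything it needs to trace cause and effect.
// All names and section labels are illustrative; adapt to your artifacts.
function buildDebugPrompt(
  failingTest: string,
  sourceFiles: string[],
  recentDiff: string,
): string {
  return [
    "A test in this repository is failing. Trace the chain of cause before proposing a fix.",
    "Failing test:",
    failingTest,
    "Code under test:",
    ...sourceFiles,
    "Recent changes (diff):",
    recentDiff,
    "List the hypotheses you considered and why you ruled each out.",
  ].join("\n\n");
}

If you withhold the diff or the schema to save tokens, you turn a reasoning task back into a guessing task, and no tier does well at guessing.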
The Decision Tree
A practical workflow for deciding per-task:
1. Is the task latency-sensitive? (Real-time chat, search-as-you-type, autocomplete) → standard model. Reasoning's latency makes it unusable for these regardless of capability.
2. Could you describe a clear scoring rubric for "good output"? → standard model. If success is well-defined and you can give examples, the standard model handles it well.
3. Does the task require holding many constraints in mind, or working through alternatives? → reasoning. This is where the depth pays off.
4. Is this a high-stakes or one-off task where cost is irrelevant? → reasoning. When the per-call cost doesn't matter (research, analysis, decision support), use the more capable tool.
5. Is this high-volume? → standard model, even if quality is slightly lower per call. The aggregate cost matters more than the per-call quality delta.
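Encoded as a routing predicate, the tree looks like this. A sketch only: the TaskProfile fields stand in for whatever signals you can cheaply compute per request.

type Tier = "standard" | "reasoning";

interface TaskProfile {
  latencySensitive: boolean; // real-time chat, autocomplete
  hasClearRubric: boolean;   // success is well-defined, examples exist
  manyConstraints: boolean;  // must weigh alternatives or hold constraints
  highStakes: boolean;       // wrong answers are expensive
  highVolume: boolean;       // aggregate cost dominates
}

// Order matters: the checks mirror the tree above, top to bottom.
function chooseTier(t: TaskProfile): Tier {
  if (t.latencySensitive) return "standard"; // reasoning latency disqualifies
  if (t.hasClearRubric) return "standard";
  if (t.manyConstraints) return "reasoning";
  if (t.highStakes) return "reasoning";
  if (t.highVolume) return "standard";
  return "standard"; // default cheap; escalate on observed failure
}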
The Hybrid Pattern
The production architecture that wins for most teams:
async function answer(question: string, context: string) {
  // Triage with the standard model; returns "simple" | "moderate" | "complex"
  const triage = await classifyComplexity(question);

  if (triage === "complex") {
    return await openai.chat.completions.create({
      model: "o5-reasoning-tier", // example identifier; check current docs
      messages: [{ role: "user", content: buildPrompt(question, context) }],
    });
  }

  return await openai.chat.completions.create({
    model: "gpt-5.4",
    messages: [{ role: "user", content: buildPrompt(question, context) }],
  });
}
The triage call costs almost nothing on the standard model. The reasoning call only fires for the 5-15% of requests where it's genuinely needed. Average cost per request stays close to the standard model's cost (illustratively, at a 3× reasoning premium and a 10% routing rate, the blended cost is 0.9 × 1 + 0.1 × 3 = 1.2× the standard baseline); quality on hard requests stays high.
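classifyComplexity above is doing the heavy lifting but is left undefined; here's a minimal sketch of it, again on the standard tier. Prompt wording and labels are illustrative.

async function classifyComplexity(
  question: string,
): Promise<"simple" | "moderate" | "complex"> {
  const res = await openai.chat.completions.create({
    model: "gpt-5.4", // triage runs on the cheap tier
    messages: [
      {
        role: "system",
        content:
          "Rate how much reasoning the question needs. Reply with one word: simple, moderate, or complex.",
      },
      { role: "user", content: question },
    ],
    max_tokens: 3,
    temperature: 0,
  });
  const label = res.choices[0].message.content?.trim().toLowerCase() ?? "simple";
  if (label === "complex" || label === "moderate") return label;
  return "simple"; // fail open to the cheap tier on anything unexpected
}

Failing open matters: a malformed triage response should cost you a slightly worse answer, not a reasoning-tier bill.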
Common Mis-routes
Three patterns I see teams getting wrong:
1. Defaulting to reasoning "to be safe." The latency tanks the UX, the cost spikes, and the standard model would have produced equivalent output 90% of the time.
2. Sticking with the standard model for tasks where it visibly fails. Iterating on the prompt for two weeks instead of routing the task to a more capable model. Sometimes the right answer is "this task needs reasoning"; the prompt isn't the problem.
3. Routing on user tier, not task difficulty. "Enterprise users get the reasoning model" is a UX decision pretending to be an engineering one. Both tiers should get the right model for their task.
Pricing-Time Reality Check
Reasoning-tier models can cost meaningfully more per request, sometimes by an order of magnitude when you account for the longer outputs they produce. Three implications:
- Set hard max_tokens budgets even (especially) on reasoning calls.
- Cache aggressively on the prompt prefix; the cost per output token is the lever you most want to manage.
- Track per-call cost in your logs (a logging sketch follows). A reasoning call that goes 8K output tokens deep is a real budget event.
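A minimal per-call cost logger, using the usage fields the Chat Completions API returns. The prices are hypothetical placeholders; substitute your provider's real rates.

// Placeholder prices in USD per 1M tokens; look up your provider's real rates.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-5.4": { input: 1.0, output: 4.0 },            // hypothetical
  "o5-reasoning-tier": { input: 5.0, output: 40.0 }, // hypothetical
};

function logCallCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
) {
  const p = PRICES[model];
  if (!p) return; // unknown model: skip rather than log a wrong number
  const costUsd =
    (usage.prompt_tokens * p.input + usage.completion_tokens * p.output) /
    1_000_000;
  console.log(JSON.stringify({ model, ...usage, cost_usd: costUsd.toFixed(6) }));
}

Call it with the response's usage object after every request, and alert when a single call crosses a threshold you'd notice on an invoice.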
The One-Page Summary
- Standard chat (GPT 5.4) for the bulk: real-time UX, well-bounded tasks, high volume.
- Reasoning tier for the depth: multi-step planning, constraint satisfaction, code reasoning, high-stakes outputs.
- A triage layer on the standard model decides which to invoke per request.
- Distribution typically lands at 80-95% standard, 5-20% reasoning.
- Latency, cost, and aggregate quality all win compared to defaulting to either tier alone.
Get the routing right, and you ship features that feel fast, behave reliably, and don't bankrupt the budget.