Migrating production LLM features from one model version to the next is mostly mechanical: change a string, ship the build. Until it isn't. The places it bites: subtle output differences your prompt-test suite missed, cost shifts you didn't notice, and behavioural drift on edge cases that only shows up under real traffic.
This is the playbook for migrating from GPT 5.4 to 5.5 (and the same playbook works for any minor-version migration on any frontier model).
Pre-Migration: Build Your Eval Suite
If you don't have an eval suite for the feature you're about to migrate, build one before changing anything. Without it, you can't tell if 5.5 is better, worse, or different in ways that matter.
Minimum viable eval suite:
type EvalCase = {
  name: string;
  input: Input;
  // Either a hand-graded reference output, an assertion fn, or both
  reference?: string;
  assert?: (output: string) => { pass: boolean; reason: string };
};

const evals: EvalCase[] = [
  {
    name: "extracts-customer-name-from-email-signature",
    input: { email: "...standard email with sig..." },
    assert: (out) => ({
      pass: JSON.parse(out).name === "Jane Doe",
      reason: "expected name=Jane Doe",
    }),
  },
  // 30-100 of these covering the feature's behaviours
];
A good eval suite covers:
- Happy path: the typical inputs your feature sees
- Edge cases: inputs that are tricky, ambiguous, or have historically failed
- Format adherence: does the output match the expected schema/length/structure
- Refusal behaviour: does the model decline tasks it should
- Failure modes: known failure modes you've patched around
Run the suite against 5.4 to establish a baseline. Now you have something to compare 5.5 against.
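A minimal runner makes the comparison concrete. This is a sketch, not a framework: it assumes a generate(input, model) helper that wraps your existing call path and returns the raw output string.

// Run every case against one model and collect failures.
// `generate` is an assumed helper wrapping your existing API call.
async function runEvals(
  cases: EvalCase[],
  model: string,
  generate: (input: Input, model: string) => Promise<string>,
): Promise<{ passRate: number; failures: { name: string; reason: string }[] }> {
  const failures: { name: string; reason: string }[] = [];
  for (const c of cases) {
    const output = await generate(c.input, model);
    if (c.assert) {
      const result = c.assert(output);
      if (!result.pass) failures.push({ name: c.name, reason: result.reason });
    } else if (c.reference !== undefined && output.trim() !== c.reference.trim()) {
      failures.push({ name: c.name, reason: "output differs from reference" });
    }
  }
  return { passRate: 1 - failures.length / cases.length, failures };
}

// Baseline on 5.4, candidate on 5.5, same cases.
const baseline = await runEvals(evals, "gpt-5.4", generate);
const candidate = await runEvals(evals, "gpt-5.5", generate);
console.table({ "gpt-5.4": baseline.passRate, "gpt-5.5": candidate.passRate });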
Shadow Traffic: Real-World Validation
Eval suites catch known issues. Production traffic catches the unknown ones. Run 5.5 in shadow mode against a sample of real production traffic before any user sees its output.
async function generateWithShadow(input: Input): Promise<Response> {
  const primary = await openai.chat.completions.create({
    model: "gpt-5.4",
    messages: buildMessages(input),
  });

  // Shadow 5% of traffic
  if (Math.random() < 0.05) {
    openai.chat.completions
      .create({ model: "gpt-5.5", messages: buildMessages(input) })
      .then((shadow) => {
        analytics.track("model_shadow", {
          input_hash: hashInput(input),
          primary_tokens: primary.usage?.completion_tokens,
          shadow_tokens: shadow.usage?.completion_tokens,
          primary_text: extractText(primary),
          shadow_text: extractText(shadow),
        });
      })
      .catch((err) => logger.error("shadow_failed", err));
  }

  return primary;
}
The shadow runs are fire-and-forget: they don't affect the user experience even if they fail. Log them somewhere queryable (a data warehouse, a dedicated table) and analyse:
- Token usage delta. Is 5.5 producing meaningfully longer or shorter outputs? Watch for cost changes.
- Latency delta. Is 5.5 slower on your traffic shape?
- Output similarity. For deterministic tasks, are the outputs equivalent? For generative tasks, are they at least comparably good?
A week of shadow data tells you more than a hundred eval cases.
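What the analysis looks like depends on where you log the events, but the shape is simple. A sketch, assuming the shadow events are loaded back as rows matching the analytics.track call above:

// Summarise logged shadow events. Field names follow the analytics call above;
// add latency fields to the event if you also want the latency delta.
type ShadowEvent = { primary_tokens: number; shadow_tokens: number };

function summariseShadow(events: ShadowEvent[]) {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  const primaryAvg = avg(events.map((e) => e.primary_tokens));
  const shadowAvg = avg(events.map((e) => e.shadow_tokens));
  return {
    sampleSize: events.length,
    avgPrimaryTokens: primaryAvg,
    avgShadowTokens: shadowAvg,
    // A ratio above 1 means 5.5 is producing longer (and more expensive) outputs on your traffic.
    tokenRatio: shadowAvg / primaryAvg,
  };
}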
Compare Outputs: The Diff That Matters
For the shadow data, look at three things (an automated first pass over the first two is sketched after the list):
1. Format adherence. Are JSON outputs valid? Do schema-strict outputs still conform? If you spot a regression here, it's usually a sign that 5.5 has slightly different default formatting tendencies that need a small system prompt tweak.
2. Length and verbosity. A common minor-version delta: outputs become 10-20% longer. This shows up as a cost increase if you don't tune max_tokens or system prompts.
3. Edge case behaviour. The cases your eval suite doesn't cover. Sample 30 random shadow comparisons and read them. Do you spot a pattern of regression on a specific input shape? That's a new eval case to add and validate.
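The third check is manual reading; the first two can be automated. A rough first pass, assuming the shadow rows carry the primary_text and shadow_text fields logged earlier (the 20% threshold is an example, not a rule):

type ShadowPair = { primary_text: string; shadow_text: string };

function isValidJson(s: string): boolean {
  try {
    JSON.parse(s);
    return true;
  } catch {
    return false;
  }
}

function diffShadowPairs(pairs: ShadowPair[]) {
  // Format regression: 5.4 produced valid JSON, 5.5 did not.
  const formatRegressions = pairs.filter(
    (p) => isValidJson(p.primary_text) && !isValidJson(p.shadow_text),
  ).length;
  // Verbosity signal: 5.5 output is more than 20% longer than 5.4's.
  const longerOutputs = pairs.filter(
    (p) => p.shadow_text.length > p.primary_text.length * 1.2,
  ).length;
  return { total: pairs.length, formatRegressions, longerOutputs };
}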
Phased Rollout
Once shadow data and evals look good:
Phase 1: 1% of traffic for 24 hours. Watch error rates, latency, cost. Any issues, instant rollback.
Phase 2: 10% for 3 days. Now you'll have user feedback if there's a regression. If support tickets mention output quality changes, investigate before proceeding.
Phase 3: 50% for a week. A/B compare performance metrics: conversion, retention, whatever your feature optimises for. If 5.5 is genuinely better or equivalent, push to 100%. If worse, hold and investigate.
Phase 4: 100%. Keep 5.4 as a fallback for a month in case you need to revert.
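One way to drive the phases is a deterministic percentage split, so a given user sees the same model for the whole phase. A sketch; the ROLLOUT_PERCENT variable and the hash-bucket approach are illustrative choices, not the only way to do it:

import { createHash } from "node:crypto";

// 1 -> 10 -> 50 -> 100 as the phases progress (hypothetical env var).
const ROLLOUT_PERCENT = Number(process.env.ROLLOUT_PERCENT ?? "1");

function pickModel(userId: string): string {
  // Stable hash of the user id into a 0-99 bucket, so assignment doesn't flip per request.
  const bucket =
    parseInt(createHash("sha256").update(userId).digest("hex").slice(0, 8), 16) % 100;
  return bucket < ROLLOUT_PERCENT ? "gpt-5.5" : "gpt-5.4";
}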
Cost Watching
Even when output quality is equivalent, cost can shift in either direction:
- Lower input price would directly reduce cost.
- Higher capability often means longer outputs: measure your completion_tokens average before and after.
- Better instruction following can mean shorter prompts work: review your system prompts after migration; you may be able to drop guidance you previously needed.
Build a daily-cost dashboard before migration. Watch it for the two weeks after. If something spikes, the dashboard tells you on day one rather than next month's bill.
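A rough version of that dashboard's core calculation, over logged usage rows. The per-million-token prices here are placeholders; substitute the actual rates for each model:

type UsageRow = { model: string; prompt_tokens: number; completion_tokens: number };

// Placeholder prices in USD per 1M tokens -- replace with your real rates.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "gpt-5.4": { input: 2.0, output: 8.0 },
  "gpt-5.5": { input: 2.0, output: 8.0 },
};

function dailyCost(rows: UsageRow[]): number {
  return rows.reduce((sum, r) => {
    const price = PRICE_PER_MTOK[r.model];
    if (!price) return sum;
    return (
      sum + (r.prompt_tokens * price.input + r.completion_tokens * price.output) / 1_000_000
    );
  }, 0);
}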
The Gotchas
Three patterns that bite teams during minor version migrations:
1. Different default temperatures or sampling behaviour. If you didn't pin temperature, the default may have shifted slightly. Always pin the parameters you care about explicitly.
const response = await openai.chat.completions.create({
  model: "gpt-5.5",
  temperature: 0.0, // pin this
  max_tokens: 256, // pin this
  // ...
});
2. Caching invalidation. Switching models invalidates any model-specific caches. The first wave of post-migration traffic hits the API uncached. Plan capacity for the warm-up period.
3. Tool use schema sensitivity. Newer models sometimes interpret schema descriptions slightly differently. If your tool descriptions contained ambiguities, the new model may resolve them differently. Re-test tool-use heavy features especially carefully.
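The fix is usually tightening the description, not rewriting the tool. A made-up example of the kind of ambiguity to hunt for:

// Hypothetical tool -- the point is the description, not the tool itself.
const searchOrdersTool = {
  type: "function" as const,
  function: {
    name: "search_orders",
    // Before: "Search recent orders for a customer" -- "recent" is undefined,
    // and a new model may resolve that ambiguity differently than the old one did.
    // After: spell the window out so there is nothing left to interpret.
    description: "Search a customer's orders placed within the last 30 days.",
    parameters: {
      type: "object",
      properties: {
        customer_id: { type: "string", description: "Internal customer ID" },
      },
      required: ["customer_id"],
    },
  },
};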
Rollback Plan
Keep your code able to switch models with a config flag, not a redeploy:
const MODEL = process.env.PRIMARY_MODEL ?? "gpt-5.5";

const response = await openai.chat.completions.create({
  model: MODEL,
  messages,
});
If a regression hits production, change one environment variable and roll forward in seconds. Without this, every regression is a deploy cycle and an outage window.
After Migration
Once 5.5 is the new baseline, run the post-migration audit:
- Drop workarounds. Any prompt hacks you added to compensate for 5.4 quirks: check whether they're still needed. Often they're not, and removing them reduces tokens and improves quality.
- Re-evaluate model routing. If you previously routed certain hard tasks to a higher tier, re-test on 5.5. Maybe 5.5 handles them now and you can collapse a tier.
- Update the eval suite. Add the new edge cases you found during shadow. Future-you migrating 5.5 → 5.6 will thank present-you.
The Generic Pattern
This playbook works for any LLM migration:
- Build evals before you migrate
- Shadow traffic before flipping the switch
- Phased rollout with cost and quality monitoring
- Pinned parameters and rollback config
- Post-migration audit to capture the upside
The teams who do this well stop fearing model upgrades. The teams who skip it stop upgrading at all because every migration is a risk event. Be the former.