
Cost-Optimising ChatGPT 5.4 Production Deployments

Practical patterns for cutting GPT 5.4 spend without losing quality - caching, model routing, output discipline, and the dashboards that catch surprises.

Domain: Software
Format: tutorial
Published: 21 Mar 2026
Tags: chatgpt · gpt-5 · openai

The fastest path from a working LLM feature to a financially sustainable LLM feature is a set of cost optimisations that don't compromise quality. For most production deployments of GPT 5.4, these patterns cut spend by 60-85% with no measurable user-facing impact.

This is the playbook.

The Cost Equation

Per-request cost on any chat completion API is roughly:

cost = (input_tokens × input_price) + (output_tokens × output_price)

Output tokens are typically 3-5× more expensive than input tokens. Cache reads (where supported) cost a small fraction of the normal input price. The optimisation surface is therefore:

  1. Reduce input tokens you pay full price for
  2. Reduce output tokens
  3. Cache aggressively

Almost every cost win comes from one of these three.
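
To make the equation concrete, here is a minimal sketch of the estimateCost helper that the logging example later in this piece refers to. The per-million-token prices are placeholders, not real GPT 5.4 rates; substitute your actual pricing.

// Rough per-request cost estimate. Prices are per million tokens; the numbers
// below are placeholders, not real GPT 5.4 rates.
const PRICE_PER_MTOK: Record<string, { input: number; cachedInput: number; output: number }> = {
  "gpt-5.4":      { input: 10, cachedInput: 2.5,  output: 40 },
  "gpt-5.4-mini": { input: 1,  cachedInput: 0.25, output: 4 },
};

function estimateCost(
  usage: {
    prompt_tokens: number;
    completion_tokens: number;
    prompt_tokens_details?: { cached_tokens?: number };
  },
  model: string,
): number {
  const price = PRICE_PER_MTOK[model] ?? PRICE_PER_MTOK["gpt-5.4"];
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncached = usage.prompt_tokens - cached;
  return (
    (uncached * price.input +
      cached * price.cachedInput +
      usage.completion_tokens * price.output) /
    1_000_000
  );
}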

1. Caching: The Highest-ROI Lever

OpenAI's automatic caching (and any explicit caching mechanisms) reduces the cost of input tokens that match a previously-seen prompt prefix. The cost reduction is large, typically 50%+ on the cached portion, and it kicks in transparently as long as you structure your prompts correctly.

Three rules:

1. Stable content first, variable content last. Cache hits require an exact-match prefix. Anything that varies between requests must come after the stable portion.

// ✅ Cacheable prefix
messages: [
  { role: "system", content: stableSystemPrompt },     // never changes
  { role: "user", content: stableExamples },           // never changes
  { role: "user", content: thisRequestSpecificInput }, // varies
]

// ❌ Cacheable prefix is broken on the first message
messages: [
  { role: "system", content: `Today is ${new Date()}. ${stableSystemPrompt}` },  // varies on every request
  { role: "user", content: thisRequestSpecificInput },
]

That single timestamp in the system message wastes the entire caching opportunity. Move dynamic context out of the cached prefix and into the user message, near the actual query.

2. Long stable prefixes are worth more. Caching becomes economically meaningful above ~1K tokens in the cached prefix. For RAG applications where the retrieved context can be cached across multiple turns of the same conversation, the savings compound dramatically.

3. Concentrate variation at the end. If multiple things vary per request (user ID, query, timestamp), put the smallest-changing one earliest and the most-changing one last. This is sometimes the difference between a 50% cache hit rate and a 5% cache hit rate.
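
For example, a per-user profile changes less often than the query and timestamp, so it belongs earlier in the message list (the field names here are illustrative):

messages: [
  { role: "system", content: stableSystemPrompt },                 // never changes
  { role: "user", content: `User profile:\n${userProfile}` },      // changes per user
  { role: "user", content: `Today is ${today}.\n\n${userQuery}` }, // changes per request
]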

2. Output Token Discipline

Three patterns to constrain output:

1. Set max_tokens aggressively. Left unconstrained, the model tends toward long responses. For a feature that needs a 100-token response, set max_tokens: 150; a well-behaved response fits comfortably within the cap, and a runaway one gets truncated instead of billed in full.

2. Specify length in the prompt. Instructions like "In one sentence" or "Return a JSON object with no extra prose" reduce output by 30-60% on average compared to no length guidance.

3. Use structured outputs. When the response is constrained to JSON, the model drops most of its verbose prose tendencies. A free-form response asking for "the answer plus reasoning" might produce 800 tokens; the same content as a {answer, reasoning} JSON object often comes in at 300.

const response = await openai.chat.completions.create({
  model: "gpt-5.4",
  max_tokens: 300,
  response_format: { type: "json_schema", json_schema: { name: "answer", strict: true, schema: { /* ... */ } } },
  messages: [/* ... */],
});

These three together typically reduce output token spend by 50-70% with no loss of useful information.

3. Model Routing: The Two-Tier Architecture

The single biggest architectural cost optimisation: don't use GPT 5.4 for every request. Use the smallest model that produces acceptable quality for each task.

async function classify(input: string): Promise<Category> {
  // Cheap classification: smaller-tier model
  const response = await openai.chat.completions.create({
    model: "gpt-5.4-mini",  // or whichever smaller variant
    messages: [{ role: "user", content: `Classify: ${input}` }],
  });
  return response.choices[0].message.content?.trim() as Category;
}

async function generate(input: string, category: Category): Promise<string> {
  // For categories where quality matters, use the bigger model
  if (category === "complex") {
    const response = await openai.chat.completions.create({
      model: "gpt-5.4",
      messages: [/* ... */],
    });
    return response.choices[0].message.content ?? "";
  }
  // For simpler categories, the smaller model produces fine results
  const response = await openai.chat.completions.create({
    model: "gpt-5.4-mini",
    messages: [/* ... */],
  });
  return response.choices[0].message.content ?? "";
}

For most production deployments:

  • 70-90% of traffic can be handled by a smaller model in the family
  • 10-30% needs the full GPT 5.4 (or higher)

The triage-then-route pattern is the difference between a $50K/month bill and a $5K/month bill at meaningful volume.
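
Wired together, the whole path is one cheap triage call followed by the routed generation; a sketch using the two functions above:

// Triage-then-route: the mini model classifies, and the expensive model is only
// invoked when the classification says the request actually needs it.
async function handleRequest(input: string): Promise<string> {
  const category = await classify(input);  // a few tokens on the smaller model
  return generate(input, category);        // full GPT 5.4 only for "complex" requests
}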

4. Batch Processing for Async Workloads

For tasks that don't need real-time responses, such as overnight summarisation, periodic analysis, or batch enrichment, the OpenAI Batch API offers a significant discount (typically ~50%) in exchange for slower turnaround (up to 24 hours).

// Submit a batch
const batch = await openai.batches.create({
  input_file_id: uploadedFile.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

// Poll for completion (or use a webhook)
const status = await openai.batches.retrieve(batch.id);
if (status.status === "completed") {
  const results = await openai.files.content(status.output_file_id!);
}
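
The uploadedFile above is a JSONL request file prepared ahead of time: one self-contained chat-completion request per line, in the documented Batch API shape. A minimal sketch of building and uploading it, assuming a documents array to summarise (the file name and custom_id scheme are placeholders):

import fs from "node:fs";

// Build the JSONL input: one request object per line, each with its own custom_id
// so results can be matched back to the source document.
const lines = documents.map((doc, i) =>
  JSON.stringify({
    custom_id: `doc-${i}`,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "gpt-5.4-mini",
      max_tokens: 300,
      messages: [{ role: "user", content: `Summarise:\n\n${doc}` }],
    },
  })
);
fs.writeFileSync("requests.jsonl", lines.join("\n"));

// Upload it with purpose "batch"; the returned file ID goes into batches.create above
const uploadedFile = await openai.files.create({
  file: fs.createReadStream("requests.jsonl"),
  purpose: "batch",
});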

Audit which of your LLM workflows are actually real-time and which could be batched. Many workflows that run through the real-time API out of habit (nightly report generation, weekly digest emails, periodic data enrichment) can be batched without any UX impact.

5. Prompt Compression for Dynamic Context

When a per-request prompt has lots of dynamic content (e.g., user-specific context that can't be cached), compression techniques can reduce input tokens 40-70% with minimal quality loss:

  • Strip whitespace and formatting from JSON or structured input the model doesn't need formatted.
  • Summarise long histories before including them. A 5K-token conversation history can become a 500-token "Previously the user discussed X, asked about Y, then..." summary.
  • Drop low-information fields: if 40% of your context is API metadata the model never references, omit it.

Caveat: don't compress at the cost of correctness. The quality cost of an over-compressed prompt is much higher than the token cost of a fully-detailed one.
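
A minimal sketch of the history-summarisation pattern, reusing the smaller model tier; the prompt wording and 500-token budget are illustrative, and it assumes the same openai client as the earlier examples:

// Collapse a long conversation history into a short summary before it is re-sent
// on every turn. Falls back to the raw transcript if the model returns no content.
async function compressHistory(
  history: { role: string; content: string }[],
): Promise<string> {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const response = await openai.chat.completions.create({
    model: "gpt-5.4-mini",  // cheap model; its output feeds the expensive one
    max_tokens: 500,
    messages: [
      {
        role: "user",
        content: `Summarise this conversation in under 300 words, keeping decisions, open questions, and key facts:\n\n${transcript}`,
      },
    ],
  });
  return response.choices[0].message.content ?? transcript;
}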

The Dashboards That Catch Surprises

Build these on day one:

// Per-request log
logger.info("openai.completion", {
  feature: "support_classifier",
  model: response.model,
  prompt_tokens: response.usage.prompt_tokens,
  completion_tokens: response.usage.completion_tokens,
  cached_tokens: response.usage.prompt_tokens_details?.cached_tokens ?? 0,
  cost_usd: estimateCost(response.usage, response.model),
  duration_ms: durationMs,
});

Three dashboards every production LLM system needs:

  1. Cost per feature, daily. Broken down per feature, per model, per day. The first time a feature spikes, you want to see it the same day.
  2. Cache hit rate, per feature. A drop usually means someone changed a "stable" prompt. Alert on it; a minimal aggregation sketch follows this list.
  3. Average tokens per request, weekly. A creeping increase usually means the prompt is growing a few tokens at a time without anyone noticing. Catch it before it doubles.
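
The cache-hit-rate aggregation mentioned above can be computed straight from the per-request logs; a sketch, assuming the log records carry the same fields as the logger.info call:

// Per-feature cache hit rate: cached prompt tokens as a share of all prompt tokens.
type CompletionLog = { feature: string; prompt_tokens: number; cached_tokens: number };

function cacheHitRate(logs: CompletionLog[]): Map<string, number> {
  const totals = new Map<string, { cached: number; prompt: number }>();
  for (const log of logs) {
    const t = totals.get(log.feature) ?? { cached: 0, prompt: 0 };
    t.cached += log.cached_tokens;
    t.prompt += log.prompt_tokens;
    totals.set(log.feature, t);
  }
  return new Map(
    [...totals].map(([feature, t]) => [feature, t.prompt > 0 ? t.cached / t.prompt : 0]),
  );
}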

The Production Discipline

The cost-optimised production stack:

  • Cacheable prefix architecture: system and stable context first, dynamic content last
  • max_tokens and structured outputs as defaults
  • Triage layer routing 70%+ of traffic to a smaller model
  • Batch API for any non-real-time workload
  • Dashboards: cost, cache hit rate, tokens-per-request, alerted on regressions

Apply those, and you can ship LLM features that are both high-quality and financially sustainable. Skip them, and you ship features that work great in beta and bankrupt you in production.