
ChatGPT 5.5 Multimodal Patterns: Vision, Audio, and Mixed Inputs

Production patterns for multimodal GPT 5.5 - image inputs, audio handling, mixed-modality prompts, and the architecture that keeps quality and cost in line.

Domain: Software
Format: tutorial
Published: 2 May 2026
Tags: chatgpt · gpt-5 · openai

Multimodal LLM features have moved from "interesting demo" to "real production capability" over the last two years. With GPT 5.5, vision and audio inputs are reliable enough to ship into customer-facing features for use cases like document analysis, visual support tickets, voice transcription with structured extraction, and accessibility tooling.

This article covers the production patterns I've found that work, and the failure modes to plan for.

The Three Categories of Multimodal Use

Most production multimodal features fall into one of three shapes:

1. Visual question answering. A user uploads an image; the system answers questions about it. Examples: identify a product, read a receipt, describe a diagram, extract data from a chart.

2. Document understanding. A multi-page PDF or scanned document; the system extracts structured information. Examples: invoice processing, contract review, medical report parsing.

3. Audio + structured output. A voice recording or live audio; the system transcribes the audio and extracts structured information. Examples: meeting notes with action items, support call summarisation, voice form filling.

Each has its own production pattern.

Pattern 1: Visual Question Answering

The minimum viable production setup:

import OpenAI from "openai";
const openai = new OpenAI();

async function describeImage(imageUrl: string, question: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-5.5",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: imageUrl, detail: "auto" } },
          { type: "text", text: question },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}

Three production tweaks:

1. Set detail deliberately. Most APIs offer low, high, or auto detail levels. low is dramatically cheaper but lossy on small text; high is what you want for OCR-style work. The default isn't always optimal, so pick per use case.

2. Pre-process images that have predictable structure. A receipt photographed in poor lighting, rotated 30°, with shadows? Run it through a basic image preprocessor (rotation correction, contrast adjustment, deskewing) before sending. The model handles it better and the response quality improves materially. A preprocessing sketch follows at the end of this pattern.

3. Use structured output for extraction. When the goal is "extract these specific fields from this image," use schema-strict JSON mode rather than free-text answers.

const response = await openai.chat.completions.create({
  model: "gpt-5.5",
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "receipt",
      schema: {
        type: "object",
        properties: {
          vendor: { type: "string" },
          date: { type: "string" },
          total: { type: "number" },
          line_items: { type: "array", items: { /* ... */ } },
        },
        required: ["vendor", "date", "total"],
      },
      strict: true,
    },
  },
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: receiptUrl, detail: "high" } },
        { type: "text", text: "Extract the receipt data." },
      ],
    },
  ],
});

This is the single highest-quality pattern for any "image → structured data" feature.
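
Back to point 2: a minimal preprocessing sketch, assuming a Node service with sharp available. sharp covers EXIF rotation, contrast stretching, and resizing; true deskewing needs a dedicated step (OpenCV or similar), so it is only flagged here.

import sharp from "sharp";

// Normalise an uploaded photo before it goes to the vision model:
// auto-rotate from EXIF, stretch contrast, cap the longest edge.
async function preprocessForVision(input: Buffer): Promise<string> {
  const processed = await sharp(input)
    .rotate() // auto-orient using EXIF Orientation metadata
    .normalize() // stretch luminance to help low-contrast photos
    .resize({ width: 1600, withoutEnlargement: true })
    .jpeg({ quality: 85 })
    .toBuffer();
  // Return a data URL usable directly as an image_url value.
  return `data:image/jpeg;base64,${processed.toString("base64")}`;
}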

Pattern 2: Multi-Page Document Understanding

For documents larger than a single image (multi-page PDFs, contracts, reports), the production pattern is:

  1. Convert each page to an image (via pdf-poppler, pdf2image, or a similar tool; a conversion sketch follows the extraction function below)
  2. Send all pages in a single multi-image request, with a system prompt that instructs page-aware extraction
  3. Use schema-strict output for the structured fields

async function extractContract(pages: string[]) {
  const content = pages.flatMap((url, i) => [
    { type: "image_url" as const, image_url: { url, detail: "high" as const } },
    { type: "text" as const, text: `Page ${i + 1} above.` },
  ]);
  content.push({
    type: "text" as const,
    text: "Extract: parties, effective date, term, governing law, all monetary amounts.",
  });

  const response = await openai.chat.completions.create({
    model: "gpt-5.5",
    // contractSchema follows the same { name, schema, strict } shape as the receipt example
    response_format: { type: "json_schema", json_schema: contractSchema },
    messages: [{ role: "user", content }],
  });

  return JSON.parse(response.choices[0].message.content!);
}
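
For step 1, one option is shelling out to poppler's pdftoppm CLI. This is a rough sketch that assumes poppler is installed on the host; the returned paths still need to be uploaded or base64-encoded before they can be used as image_url values.

import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { readdir } from "node:fs/promises";
import path from "node:path";

const execFileAsync = promisify(execFile);

// Render every page of a PDF to a PNG at 150 DPI. pdftoppm zero-pads page
// numbers, so a lexical sort keeps pages in order.
async function pdfToPageImages(pdfPath: string, outDir: string): Promise<string[]> {
  await execFileAsync("pdftoppm", ["-png", "-r", "150", pdfPath, path.join(outDir, "page")]);
  const files = await readdir(outDir);
  return files
    .filter((f) => f.startsWith("page") && f.endsWith(".png"))
    .sort()
    .map((f) => path.join(outDir, f));
}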

For very long documents (50+ pages), chunk by section and aggregate. Sending 100 high-detail pages in one request is expensive and often hits token limits.
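
A sketch of that chunk-and-aggregate shape, reusing extractContract from above. The merge here is a deliberately naive placeholder; real aggregation depends on your schema.

// Extract per chunk of pages, then merge the partial results.
async function extractLongContract(pages: string[], chunkSize = 10) {
  const partials: Record<string, unknown>[] = [];
  for (let i = 0; i < pages.length; i += chunkSize) {
    partials.push(await extractContract(pages.slice(i, i + chunkSize)));
  }
  // Naive merge: later non-null values win. Array fields (e.g. monetary
  // amounts) usually need concatenation instead; adjust per schema.
  return partials.reduce((merged, part) => ({
    ...merged,
    ...Object.fromEntries(
      Object.entries(part).filter(([, v]) => v !== null && v !== undefined)
    ),
  }));
}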

Pattern 3: Audio + Structured Output

Audio handling typically uses a transcription model (Whisper or its successors) followed by a chat completion that processes the transcript. The pattern:

async function processSupportCall(audioUrl: string) {
  // Step 1 - transcribe
  const transcript = await openai.audio.transcriptions.create({
    file: await fetchAudio(audioUrl), // fetchAudio: your helper that returns an uploadable File/stream
    model: "whisper-large",
    language: "en",
  });

  // Step 2 - extract structure
  const extraction = await openai.chat.completions.create({
    model: "gpt-5.5",
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "support_call",
        schema: {
          type: "object",
          properties: {
            customer_issue: { type: "string" },
            resolution: { type: "string", enum: ["resolved", "escalated", "pending"] },
            next_steps: { type: "array", items: { type: "string" } },
            sentiment: { type: "string", enum: ["positive", "neutral", "negative"] },
          },
          required: ["customer_issue", "resolution", "next_steps"],
        },
        strict: true,
      },
    },
    messages: [
      { role: "system", content: "You analyse support call transcripts." },
      { role: "user", content: transcript.text },
    ],
  });

  return JSON.parse(extraction.choices[0].message.content!);
}

Two production patterns worth applying:

1. Cache the system prompt. The extraction system prompt is stable; cache it across calls.

2. Decouple transcription from analysis. If the transcription pipeline has issues (silence, background noise, accents), surface that quality signal alongside the extracted output. A confident extraction from a low-quality transcript is the dangerous case.
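
One way to surface that signal, assuming the transcription call requests response_format: "verbose_json" so per-segment metadata (no_speech_prob, avg_logprob) is available, as Whisper exposes today. The thresholds are illustrative, not tuned.

// Rough transcript-quality heuristic from per-segment Whisper metadata.
type TranscriptSegment = { no_speech_prob: number; avg_logprob: number };

function transcriptQuality(segments: TranscriptSegment[]): "high" | "medium" | "low" | "empty" {
  if (segments.length === 0) return "empty";
  const noisy = segments.filter(
    (s) => s.no_speech_prob > 0.5 || s.avg_logprob < -1.0
  ).length;
  const noisyRatio = noisy / segments.length;
  if (noisyRatio > 0.3) return "low";
  if (noisyRatio > 0.1) return "medium";
  return "high";
}

Return the result alongside the extraction so downstream consumers can treat "confident extraction, low-quality transcript" as a review case rather than trusting it.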

Cost Discipline for Multimodal

Multimodal requests are typically more expensive than text-only:

  • Images consume tokens proportional to size and detail level.
  • Audio is billed by duration (transcription) plus tokens (analysis).
  • Multi-page documents can rack up serious token bills if sent at high detail.

Three patterns to keep cost in line:

  1. Resize images server-side before sending. A 4K image carries no extra information for most use cases vs a 1024px version. Pre-resize.
  2. Use detail: low whenever you can get away with it. For visual classification where the answer doesn't depend on small text, low is often equivalent in quality at a fraction of the cost. A tiny sketch follows this list.
  3. Cache the system prompt and any stable instruction text. Even more important on multimodal calls because the per-call cost is higher.
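
A tiny sketch of point 2, making the detail choice an explicit per-task decision rather than a blanket default (the task names are illustrative):

type VisionTask = "classification" | "description" | "ocr" | "chart_extraction";

// Tasks that depend on small text need "high"; the rest usually survive "low".
function detailFor(task: VisionTask): "low" | "high" {
  return task === "ocr" || task === "chart_extraction" ? "high" : "low";
}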

Failure Modes to Plan For

Three multimodal-specific failure modes:

1. Hallucinated content from unclear images. When an image is blurry or partially occluded, the model will sometimes invent content rather than say "I can't tell." Counter with explicit instruction: "If any field is not clearly visible, return null. Do not guess."

2. OCR errors propagating to extraction. If the model misreads a digit (5 vs 6), the extracted value is wrong with high confidence. For financial extractions, ask the model to also return its confidence per field, and validate downstream. A schema sketch follows this list.

3. Audio transcription failures on accents or terminology. Domain-specific terminology (medical, legal, brand names) often gets transcribed incorrectly. Provide a glossary in the prompt or use a domain-tuned model when stakes are high.
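
For failure modes 1 and 2, a schema fragment along these lines makes "I can't read it" a legal answer and gives downstream validation a per-field signal. This is an illustrative shape, not the only one.

// Wrap sensitive fields as { value, confidence } and allow null values,
// paired with an instruction like: "If a field is not clearly visible,
// return null for its value and 'low' confidence. Do not guess."
const totalField = {
  type: "object",
  properties: {
    value: { type: ["number", "null"] },
    confidence: { type: "string", enum: ["high", "medium", "low"] },
  },
  required: ["value", "confidence"],
};

Slot fragments like this into the json_schema properties from Pattern 1, and validate confidence downstream before the value is used.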

The Production Architecture

A mature multimodal feature ships with:

  • Schema-strict structured outputs
  • Image preprocessing pipeline (resize, rotate, deskew) before API calls
  • Confidence/uncertainty signal returned alongside extracted data
  • Caching on the stable instruction prefix
  • A fallback path when the multimodal call fails (manual review queue, retry on a different model, etc.; a minimal sketch follows below)
  • Quality monitoring: hallucination rate, OCR error rate, transcription accuracy on representative samples

Multimodal features that hit those checkpoints become reliable parts of a product. Skipping any of them produces features that demo well and fail in surprising ways under real traffic.
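
As a sketch of the fallback path, reusing extractContract from earlier; manualReviewQueue is a hypothetical stand-in for whatever job queue you already run.

async function extractWithFallback(pages: string[]) {
  try {
    return await extractContract(pages);
  } catch (err) {
    try {
      // One retry covers most transient failures (timeouts, rate limits).
      return await extractContract(pages);
    } catch {
      // Hypothetical review queue: humans pick up what the model could not.
      await manualReviewQueue.enqueue({ pages, reason: String(err) });
      return null;
    }
  }
}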

What's Next

Multimodal capability has been improving fast. The features that were impractical 18 months ago (real-time video understanding, multimodal agents that mix vision and tool use, complex audio scene analysis) are increasingly within reach. The patterns above generalise: structure your inputs, structure your outputs, validate your assumptions, and ship the production discipline that catches the surprises.