
High-Volume Classification and Extraction with Gemini Flash Lite

Production patterns for using small-tier models like Gemini 3.1 Flash Lite for high-volume classification, extraction, and transformation tasks.

Domain: Software
Format: tutorial
Published: 11 Mar 2026
Tags: gemini · gemini-flash · llm

The two highest-volume LLM use cases in production today are classification (assign a category to an input) and extraction (pull structured fields from unstructured input). For both, small-tier models like Gemini 3.1 Flash Lite often match frontier models on output quality at 5-30× lower cost. This article is the production playbook.

Why Small Models Win Here

Classification and extraction share three properties that play to small-tier models' strengths:

  1. The output space is constrained. A category enum, a JSON schema. The model isn't generating creative text; it's mapping input to one of a known set of outputs.
  2. The task is local. Whether a ticket is "billing" or "technical" doesn't require deep reasoning across documents; the answer is usually in the first sentence or two.
  3. Few-shot examples close the quality gap. A small model with 5 well-chosen examples often matches a large model with no examples.

For tasks that share these properties, you can run them at 1/10th the cost without losing user-visible quality.

Pattern 1: Classification

The minimum viable production setup:

import { GoogleGenerativeAI } from "@google/generative-ai";
const genai = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);

const model = genai.getGenerativeModel({
  model: "gemini-3.1-flash-lite",
  systemInstruction: `You classify customer support tickets into one of these categories:
- billing - questions about invoices, charges, or refunds
- technical - bug reports, errors, or how-to questions about features
- account - login issues, password resets, account changes
- sales - pricing questions, plan comparisons, upgrade requests
- other - anything not fitting the above

Examples:
"My credit card was charged twice last month" → billing
"The export button isn't working in Chrome" → technical
"How do I upgrade to the team plan?" → sales

Respond with the single category name only. No explanation.`,
  generationConfig: {
    temperature: 0.0,
    maxOutputTokens: 16,
  },
});

async function classify(ticket: string): Promise<string> {
  const result = await model.generateContent(ticket);
  return result.response.text().trim();
}

Three production tweaks:

1. temperature: 0.0 for determinism. Classification benefits from consistency: the same input should always get the same category.

2. Tight maxOutputTokens. A category name fits in 16 tokens. Letting the model generate freely invites prose.

3. Examples in the system prompt. Few-shot examples often raise accuracy 10-20 percentage points on small models.

For higher reliability, use schema-constrained output:

const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: ticket }] }],
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: "object",
      properties: {
        category: {
          type: "string",
          enum: ["billing", "technical", "account", "sales", "other"],
        },
        confidence: {
          type: "string",
          enum: ["high", "medium", "low"],
        },
      },
      required: ["category", "confidence"],
    },
  },
});

The confidence field is the single most useful addition: it gives you a routing signal for low-confidence cases (escalate to a Pro model, queue for human review, etc.).
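One routing sketch, assuming a classifyOnLite helper that wraps the schema-constrained call above and returns the parsed object, plus hypothetical classifyOnPro and queueForReview escalation paths:

// Sketch only: classifyOnLite wraps the schema-constrained call above and returns
// the parsed JSON; classifyOnPro and queueForReview are hypothetical escalation helpers.
async function classifyAndRoute(ticket: string): Promise<string> {
  const { category, confidence } = await classifyOnLite(ticket);

  if (confidence === "high") return category;      // trust the Lite answer
  if (confidence === "medium") {
    const pro = await classifyOnPro(ticket);       // escalate to a Pro model
    return pro.category;
  }
  await queueForReview(ticket, category);          // low confidence: human review
  return category;
}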

Pattern 2: Extraction

Extraction is the cleanest fit for small-tier models. With a strict response schema, the model only has to fill in fields from the input text, and the schema does most of the work.

const model = genai.getGenerativeModel({ model: "gemini-3.1-flash-lite" });

async function extractInvoice(text: string) {
  const result = await model.generateContent({
    contents: [{ role: "user", parts: [{ text: `Extract invoice data:\n\n${text}` }] }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "object",
        properties: {
          vendor: { type: "string" },
          invoice_number: { type: "string" },
          date: { type: "string" },
          due_date: { type: "string" },
          line_items: {
            type: "array",
            items: {
              type: "object",
              properties: {
                description: { type: "string" },
                quantity: { type: "number" },
                unit_price: { type: "number" },
                line_total: { type: "number" },
              },
              required: ["description", "line_total"],
            },
          },
          subtotal: { type: "number" },
          tax: { type: "number" },
          total: { type: "number" },
          currency: { type: "string" },
        },
        required: ["vendor", "date", "total", "line_items"],
      },
    },
  });

  return JSON.parse(result.response.text());
}

For extraction:

  • The schema does most of the heavy lifting
  • Small models match large models on quality for well-structured inputs
  • Where small models fall short is on highly irregular inputs (handwritten notes, badly formatted emails); these are the cases to route to a more capable tier, as sketched below
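
A cheap guard is to sanity-check each extraction arithmetically before accepting it. A minimal sketch, assuming a hypothetical extractInvoiceOnPro fallback to a more capable tier:

// Sketch: validate the Lite extraction before trusting it. extractInvoiceOnPro
// is a hypothetical fallback wrapper around a more capable model tier.
function looksConsistent(inv: any): boolean {
  if (!inv.vendor || !inv.total || !Array.isArray(inv.line_items)) return false;
  const sum = inv.line_items.reduce((acc: number, li: any) => acc + (li.line_total ?? 0), 0);
  // Allow modest drift: rounding, tax, or shipping can explain small gaps.
  return Math.abs(sum - (inv.subtotal ?? inv.total)) <= 0.05 * inv.total;
}

async function extractWithFallback(text: string) {
  const lite = await extractInvoice(text);
  if (looksConsistent(lite)) return { invoice: lite, tier: "lite" as const };
  return { invoice: await extractInvoiceOnPro(text), tier: "pro" as const };
}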

Pattern 3: Transformation

Cleaning data, normalising formats, converting between conventions, simple summarisation: all of these play to small-tier strengths.

async function normalisePhoneNumbers(input: string) {
  const result = await model.generateContent(`Normalise all phone numbers in the
following text to E.164 format (e.g., +14155551234). Preserve all other text.

Input:
${input}

Output:`);
  return result.response.text();
}

For transformation:

  • Specify the input and output formats explicitly
  • Provide examples of correct input/output pairs in the system prompt
  • Use temperature 0.0 for deterministic transformations
  • Validate the output programmatically (regex, schema); small models occasionally drift, as sketched below
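
For the phone-number example above, a minimal check might count phone-like patterns in the input and require at least as many E.164 numbers in the output, with a bound on how much the surrounding text can change:

// Sketch: cheap programmatic validation for the transformation above. The regexes
// and the 20% length-drift threshold are illustrative, not tuned values.
const PHONE_LIKE = /\+?\d[\d\s().-]{6,}\d/g;
const E164 = /\+[1-9]\d{6,14}/g;

async function normaliseWithCheck(input: string): Promise<string> {
  const output = await normalisePhoneNumbers(input);
  const expected = (input.match(PHONE_LIKE) ?? []).length;
  const found = (output.match(E164) ?? []).length;
  const drift = Math.abs(output.length - input.length) / Math.max(input.length, 1);
  if (found < expected || drift > 0.2) {
    throw new Error("Normalisation failed validation; retry or escalate");
  }
  return output;
}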

Batch Processing

For high-volume offline tasks, batch many inputs into a single request. Small models handle this gracefully:

async function classifyBatch(tickets: string[]) {
  const numbered = tickets.map((t, i) => `[${i + 1}] ${t}`).join("\n\n");
  const result = await model.generateContent({
    contents: [{ role: "user", parts: [{ text: `Classify each of these tickets:\n\n${numbered}` }] }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "object",
        properties: {
          classifications: {
            type: "array",
            items: {
              type: "object",
              properties: {
                ticket: { type: "number" },
                category: { type: "string", enum: ["billing", "technical", "account", "sales", "other"] },
              },
            },
          },
        },
      },
    },
  });
  return JSON.parse(result.response.text()).classifications;
}

Batching 20-50 inputs per request reduces overhead substantially. For datasets with hundreds of thousands of items, this is the difference between an overnight job and a multi-day one.
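
A sketch of how that looks end to end, chunking a large dataset into batches of 40 and running a few requests at a time (both numbers are illustrative, not tuned):

// Sketch: drive classifyBatch over a large dataset. Chunk size and concurrency
// are placeholders; tune them against your rate limits and latency targets.
async function classifyAll(tickets: string[], chunkSize = 40, concurrency = 4) {
  const chunks: string[][] = [];
  for (let i = 0; i < tickets.length; i += chunkSize) {
    chunks.push(tickets.slice(i, i + chunkSize));
  }

  const results: { ticket: number; category: string }[] = [];
  for (let i = 0; i < chunks.length; i += concurrency) {
    const window = chunks.slice(i, i + concurrency);
    const batches = await Promise.all(window.map((chunk) => classifyBatch(chunk)));
    batches.forEach((batch, j) => {
      const offset = (i + j) * chunkSize;
      // Map each batch's 1-based indices back to positions in the full dataset.
      for (const r of batch) results.push({ ticket: offset + r.ticket, category: r.category });
    });
  }
  return results;
}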

Quality Discipline

Three patterns that catch quality regressions:

1. Programmatic validation. Whatever the model outputs, validate it against your schema/expected format before trusting it. Reject malformed responses; retry once with a slightly varied prompt; route to Pro if it fails twice.

// classifyOnLite wraps the schema-constrained Lite call above; classifyOnPro and
// validCategory are assumed helpers: a Pro-tier equivalent and a check against the category enum.
async function classifyWithFallback(ticket: string): Promise<{ category: string; tier: "lite" | "pro" }> {
  try {
    const result = await classifyOnLite(ticket);
    if (validCategory(result.category)) return { category: result.category, tier: "lite" };
  } catch { /* fall through */ }

  const proResult = await classifyOnPro(ticket);
  return { category: proResult.category, tier: "pro" };
}

2. Sampling for ongoing quality monitoring. Take 1% of production traffic and send it through both Lite and Pro. Compare outputs. If they diverge meaningfully on a category of input, that's signal to investigate.
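
A minimal shadow-comparison sketch, reusing the classifyOnLite / classifyOnPro wrappers from above plus a hypothetical logDivergence sink:

// Sketch: shadow ~1% of traffic to Pro and log divergences for later review.
// logDivergence is a hypothetical sink (database row, metrics event, etc.).
async function classifyWithShadow(ticket: string): Promise<string> {
  const lite = await classifyOnLite(ticket);
  if (Math.random() < 0.01) {
    // Fire-and-forget so the shadow call never adds latency to the main path.
    classifyOnPro(ticket)
      .then((pro) => {
        if (pro.category !== lite.category) logDivergence(ticket, lite.category, pro.category);
      })
      .catch(() => { /* shadow failures are safe to ignore */ });
  }
  return lite.category;
}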

3. Human-labeled eval sets. Maintain 200-500 labelled real inputs. Re-run them periodically. Track accuracy over time. The first time a model update regresses, you'll see it before users complain.
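
A sketch of the eval loop, assuming you keep the labelled examples in your own store and reuse the classifyOnLite wrapper:

// Sketch: re-run a human-labelled eval set and report accuracy over time.
type LabelledExample = { text: string; label: string };

async function runEval(evalSet: LabelledExample[]): Promise<number> {
  let correct = 0;
  for (const example of evalSet) {
    const { category } = await classifyOnLite(example.text);
    if (category === example.label) correct++;
  }
  const accuracy = correct / evalSet.length;
  console.log(`Eval accuracy: ${(accuracy * 100).toFixed(1)}% on ${evalSet.length} examples`);
  return accuracy;
}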

The Cost Numbers

Concrete example of why this matters. A SaaS company with 100K daily classified support tickets:

  • On a Pro model: typically $5K-$15K/month
  • On a Lite model: typically $200-$700/month

The quality gap, properly measured, is often within 1-2 percentage points. The cost gap is 10-30×. For a workload with this profile, defaulting to Pro is a budget mistake, not a quality choice.
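
To see how ranges like these fall out, a back-of-envelope sketch. The token counts and per-million-token prices below are illustrative placeholders, not actual Gemini pricing; substitute the current rate card for real numbers.

// Back-of-envelope sketch. All constants are illustrative placeholders.
const TICKETS_PER_DAY = 100_000;
const TOKENS_PER_TICKET = 500;        // prompt + few-shot examples + short JSON output
const LITE_PRICE_PER_M = 0.15;        // $ per million tokens (placeholder)
const PRO_PRICE_PER_M = 4.0;          // $ per million tokens (placeholder)

const tokensPerMonth = TICKETS_PER_DAY * TOKENS_PER_TICKET * 30;      // 1.5B tokens
const liteMonthly = (tokensPerMonth / 1_000_000) * LITE_PRICE_PER_M;  // ≈ $225
const proMonthly = (tokensPerMonth / 1_000_000) * PRO_PRICE_PER_M;    // ≈ $6,000
console.log({ liteMonthly, proMonthly });  // the ratio tracks the per-token price gap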

Integration With Other Models

Small models work best as part of a tiered architecture:

  • Lite for the volume: classification, extraction, routing
  • Pro for the depth: actual customer responses, complex reasoning, edge cases routed by the triage layer

The triage layer itself runs on Lite. The routing logic is the architecture.

What's Next

Small-tier models keep getting more capable. The gap with the Pro tier on bounded tasks is narrowing each year. For most production AI features, the right architecture is Lite-first, and the threshold for "needs Pro" keeps moving up.

Build for that future: design your features assuming Lite handles them, fall back to Pro only when measurement shows it's necessary. That's the architecture that scales economically as your volume grows.