The two highest-volume LLM use cases in production today are classification (assign a category to an input) and extraction (pull structured fields from unstructured input). For both, small-tier models like Gemini 3.1 Flash Lite often produce identical-quality output to frontier models at 5-30× lower cost. This article is the production playbook.
Why Small Models Win Here
Classification and extraction share three properties that play to small-tier models' strengths:
- The output space is constrained. A category enum, a JSON schema. The model isn't generating creative text; it's mapping input to one of a known set of outputs.
- The task is local. Deciding whether a ticket is "billing" or "technical" doesn't require deep reasoning across documents; the answer is usually in the first sentence or two.
- Few-shot examples close the quality gap. A small model with 5 well-chosen examples often matches a large model with no examples.
For tasks that share these properties, you can run them at 1/10th the cost without losing user-visible quality.
Pattern 1: Classification
The minimum viable production setup:
import { GoogleGenerativeAI } from "@google/generative-ai";

const genai = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);

const model = genai.getGenerativeModel({
  model: "gemini-3.1-flash-lite",
  systemInstruction: `You classify customer support tickets into one of these categories:
- billing - questions about invoices, charges, or refunds
- technical - bug reports, errors, or how-to questions about features
- account - login issues, password resets, account changes
- sales - pricing questions, plan comparisons, upgrade requests
- other - anything not fitting the above

Examples:
"My credit card was charged twice last month" → billing
"The export button isn't working in Chrome" → technical
"How do I upgrade to the team plan?" → sales

Respond with the single category name only. No explanation.`,
  generationConfig: {
    temperature: 0.0,
    maxOutputTokens: 16,
  },
});

async function classify(ticket: string): Promise<string> {
  const result = await model.generateContent(ticket);
  return result.response.text().trim();
}
Three production tweaks:
1. temperature: 0.0 for determinism. Classification benefits from consistency: the same input should always get the same category.
2. Tight maxOutputTokens. A category name fits in 16 tokens. Letting the model generate freely invites prose.
3. Examples in the system prompt. Few-shot examples often raise accuracy 10-20 percentage points on small models.
For higher reliability, use schema-constrained output:
const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: ticket }] }],
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: "object",
      properties: {
        category: {
          type: "string",
          enum: ["billing", "technical", "account", "sales", "other"],
        },
        confidence: {
          type: "string",
          enum: ["high", "medium", "low"],
        },
      },
      required: ["category", "confidence"],
    },
  },
});
The confidence field is the single most useful addition: it gives you a routing signal for low-confidence cases (escalate to a Pro model, queue for human review, etc.).
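In practice the routing is a few lines downstream of the parse. A minimal sketch, picking up from the request above; queueForReview and applyCategory are hypothetical placeholders for whatever your pipeline does:

const { category, confidence } = JSON.parse(result.response.text());

if (confidence === "low") {
  // Escalate: hand off to a Pro-tier model or a human review queue.
  await queueForReview(ticket, category); // hypothetical helper
} else {
  await applyCategory(ticket, category);  // hypothetical helper
}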
Pattern 2: Extraction
Extraction is the cleanest fit for small-tier models. With schema-constrained output, the model fills in fields from the input text, and the schema does most of the work.
const model = genai.getGenerativeModel({ model: "gemini-3.1-flash-lite" });

async function extractInvoice(text: string) {
  const result = await model.generateContent({
    contents: [{ role: "user", parts: [{ text: `Extract invoice data:\n\n${text}` }] }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "object",
        properties: {
          vendor: { type: "string" },
          invoice_number: { type: "string" },
          date: { type: "string" },
          due_date: { type: "string" },
          line_items: {
            type: "array",
            items: {
              type: "object",
              properties: {
                description: { type: "string" },
                quantity: { type: "number" },
                unit_price: { type: "number" },
                line_total: { type: "number" },
              },
              required: ["description", "line_total"],
            },
          },
          subtotal: { type: "number" },
          tax: { type: "number" },
          total: { type: "number" },
          currency: { type: "string" },
        },
        required: ["vendor", "date", "total", "line_items"],
      },
    },
  });
  return JSON.parse(result.response.text());
}
For extraction:
- The schema does most of the heavy lifting
- Small models match large models on quality for well-structured inputs
- Where small models fall short is highly irregular inputs (handwritten notes, badly formatted emails); these are the cases to route to a more capable tier
Pattern 3: Transformation
Cleaning data, normalising formats, reformatting between conventions, simple summarisation. All play to small-tier strengths.
async function normalisePhoneNumbers(input: string) {
  const result = await model.generateContent(`Normalise all phone numbers in the
following text to E.164 format (e.g., +14155551234). Preserve all other text.
Input:
${input}
Output:`);
  return result.response.text();
}
For transformation:
- Specify the input and output formats explicitly
- Provide examples of correct input/output pairs in the system prompt
- Use temperature 0.0 for deterministic transformations
- Validate the output programmatically (regex, schema); small models occasionally drift (see the sketch below)
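As one example of that last point, a loose regex check on the phone-number normalisation above. This is a sketch of the validation step, not a complete phone-number validator:

const E164 = /^\+[1-9]\d{1,14}$/;

function looksNormalised(output: string): boolean {
  // Pull out anything that looks like a phone number, strip separators,
  // and check that every candidate is now in E.164 form.
  const candidates = output.match(/\+?\d[\d\s().-]{6,}\d/g) ?? [];
  return candidates.every((c) => E164.test(c.replace(/[\s().-]/g, "")));
}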
Batch Processing
For high-volume offline tasks, batch many inputs into a single request. Small models handle this gracefully:
async function classifyBatch(tickets: string[]) {
  const numbered = tickets.map((t, i) => `[${i + 1}] ${t}`).join("\n\n");
  const result = await model.generateContent({
    contents: [{ role: "user", parts: [{ text: `Classify each of these tickets:\n\n${numbered}` }] }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "object",
        properties: {
          classifications: {
            type: "array",
            items: {
              type: "object",
              properties: {
                ticket: { type: "number" },
                category: { type: "string", enum: ["billing", "technical", "account", "sales", "other"] },
              },
            },
          },
        },
      },
    },
  });
  return JSON.parse(result.response.text()).classifications;
}
Batching 20-50 inputs per request reduces overhead substantially. For datasets with hundreds of thousands of items, this is the difference between an overnight job and a multi-day one.
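The outer loop is a plain chunking pass over classifyBatch above; a minimal sketch, with a batch size of 30 chosen arbitrarily from that 20-50 range:

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function classifyAll(tickets: string[]) {
  const results: { ticket: number; category: string }[] = [];
  for (const batch of chunk(tickets, 30)) {
    // One request per batch instead of one request per ticket.
    results.push(...(await classifyBatch(batch)));
  }
  return results;
}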
Quality Discipline
Three patterns that catch quality regressions:
1. Programmatic validation. Whatever the model outputs, validate it against your schema/expected format before trusting it. Reject malformed responses; retry once with a slightly varied prompt; route to Pro if it fails twice.
async function classifyWithFallback(ticket: string): Promise<{ category: string; tier: "lite" | "pro" }> {
  try {
    const result = await classifyOnLite(ticket);
    if (validCategory(result.category)) return { category: result.category, tier: "lite" };
  } catch { /* fall through */ }
  const proResult = await classifyOnPro(ticket);
  return { category: proResult.category, tier: "pro" };
}
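classifyOnLite, classifyOnPro, and validCategory are left as wiring in that snippet. One plausible shape for the validator, shown here as an assumption rather than anything from the SDK:

const CATEGORIES = ["billing", "technical", "account", "sales", "other"] as const;

function validCategory(category: string): boolean {
  // Accept only exact matches against the known category enum.
  return (CATEGORIES as readonly string[]).includes(category);
}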
2. Sampling for ongoing quality monitoring. Take 1% of production traffic and send it through both Lite and Pro. Compare outputs. If they diverge meaningfully on a category of input, that's signal to investigate.
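A sketch of that shadow sampling, reusing the classifyOnLite/classifyOnPro helpers from the fallback snippet; the 1% rate and the console.warn sink are placeholders for your own sampling and logging:

async function classifyWithShadowSample(ticket: string) {
  const lite = await classifyOnLite(ticket);
  if (Math.random() < 0.01) {
    const pro = await classifyOnPro(ticket);
    if (pro.category !== lite.category) {
      // Disagreements are the interesting rows: log them for later analysis.
      console.warn("tier disagreement", { ticket, lite: lite.category, pro: pro.category });
    }
  }
  return lite;
}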
3. Human-labelled eval sets. Maintain 200-500 labelled real inputs. Re-run them periodically. Track accuracy over time. The first time a model update regresses, you'll see it before users complain.
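The eval harness can be equally plain. A minimal sketch against the classify function from Pattern 1; the LabelledExample shape is an assumption:

interface LabelledExample {
  input: string;
  expected: string;
}

async function runEval(examples: LabelledExample[]): Promise<number> {
  let correct = 0;
  for (const ex of examples) {
    if ((await classify(ex.input)) === ex.expected) correct++;
  }
  // Track this number over time; a drop after a model update is your early warning.
  return correct / examples.length;
}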
The Cost Numbers
Concrete example of why this matters. A SaaS company with 100K daily classified support tickets:
- On a Pro model: typically $5K-$15K/month
- On a Lite model: typically $200-$700/month
The quality gap, properly measured, is often within 1-2 percentage points. The cost gap is 10-30×. For a workload with this profile, defaulting to Pro is a budget mistake, not a quality choice.
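The arithmetic is worth running against your own workload rather than trusting anyone's ranges. A back-of-envelope estimator; the token counts and per-million-token prices are inputs you supply from your traffic and the current price sheet, not figures from this article:

function estimateMonthlyCost(opts: {
  requestsPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
}): number {
  const perRequest =
    (opts.inputTokensPerRequest / 1e6) * opts.usdPerMillionInputTokens +
    (opts.outputTokensPerRequest / 1e6) * opts.usdPerMillionOutputTokens;
  // Assumes a 30-day month; swap in your own billing period.
  return perRequest * opts.requestsPerDay * 30;
}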
Integration With Other Models
Small models work best as part of a tiered architecture:
- Lite for the volume: classification, extraction, routing
- Pro for the depth: actual customer responses, complex reasoning, edge cases routed by the triage layer
The triage layer itself runs on Lite. The routing logic is the architecture.
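Concretely, the triage layer can be as small as this sketch. It assumes the schema-constrained classifier from Pattern 1 (returning category and confidence); draftWithPro and routeToQueue stand in for whatever your response and queueing layers look like:

async function handleTicket(ticket: string) {
  const { category, confidence } = await classifyOnLite(ticket); // Lite handles the volume
  if (confidence === "low" || category === "other") {
    return draftWithPro(ticket);          // Pro handles the depth: edge cases, real responses
  }
  return routeToQueue(category, ticket);  // everything else stays on the cheap path
}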
What's Next
Small-tier models keep getting more capable. The gap with the Pro tier on bounded tasks is narrowing each year. For most production AI features, the right architecture is Lite-first, and the threshold for "needs Pro" keeps moving up.
Build for that future: design your features assuming Lite handles them, fall back to Pro only when measurement shows it's necessary. That's the architecture that scales economically as your volume grows.