This is the question I get most often from clients who have been running LLM-powered internal tools for six months: should we keep paying OpenAI or move to a self-hosted model? The honest answer is that it depends on four specific variables, and I can give you the numbers to make the decision without guessing.
I have run both architectures in production across six client environments. Here is what I actually know about the cost structure.
The Variables That Drive the Decision
Before the cost comparison, you need to know four numbers about your specific usage:
- Monthly request volume: How many LLM calls per month?
- Average tokens per request: Input plus output, combined
- Latency requirement: Can users wait 5 seconds? 30 seconds?
- Data sensitivity: Can prompts leave your infrastructure at all?
Variable 4 overrides the cost calculation entirely. If you are processing patient records, financial data, or anything under a data residency requirement, self-hosted is not an optimization; it is a compliance necessity. Stop there and plan for self-hosting.
If variable 4 is not a constraint, run the math on variables 1, 2, and 3.
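That gating logic is simple enough to write down. Here is a minimal sketch, with illustrative names and a placeholder volume cutoff (the real break-even is derived in the next section); latency is left out because it is a separate constraint rather than a cost input:

```python
# Sketch of the four-variable gate described above; names and the placeholder
# threshold are illustrative assumptions, not a library API.

def initial_recommendation(data_must_stay_on_prem: bool,
                           monthly_requests: int,
                           avg_tokens_per_request: int) -> str:
    """First-pass recommendation before the detailed break-even math."""
    if data_must_stay_on_prem:
        # Variable 4 overrides everything: compliance forces self-hosting.
        return "self-hosted (compliance requirement)"
    monthly_tokens = monthly_requests * avg_tokens_per_request
    # Placeholder cutoff; the real break-even is computed below.
    return "run the break-even math" if monthly_tokens > 100_000_000 else "API"

print(initial_recommendation(False, 180_000, 800))  # -> run the break-even math
```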
The Full Cost Comparison
| Factor | OpenAI API (GPT-4o mini) | Anthropic API (Haiku) | Self-hosted (Llama 3.1 8B) | Self-hosted (Llama 3.1 70B) |
|---|---|---|---|---|
| Input cost | $0.15/1M tokens | $0.25/1M tokens | $0 | $0 |
| Output cost | $0.60/1M tokens | $1.25/1M tokens | $0 | $0 |
| Infra cost (light) | $0 | $0 | $80/month (CPU server) | $400/month (GPU server) |
| Infra cost (medium) | $0 | $0 | $200/month (GPU instance) | $800/month (A100 instance) |
| Setup time | 0 hours | 0 hours | 8-16 hours | 16-40 hours |
| Maintenance | Minimal | Minimal | 2-4h/month | 4-8h/month |
| Quality (general tasks) | Excellent | Excellent | Good | Very good |
| Quality (complex reasoning) | Excellent | Excellent | Fair | Good |
| Latency | 1-3s | 1-4s | 0.5-2s (GPU) | 3-15s (GPU) |
The Break-Even Calculation
At current API pricing, the break-even monthly token volume for self-hosting Llama 3.1 8B on a $200/month GPU instance (using vLLM or Ollama) versus GPT-4o mini:
GPT-4o mini cost = (input_tokens * 0.15 + output_tokens * 0.60) / 1,000,000
Break-even where API cost = $200/month infra cost:
Assuming 2:1 input:output ratio for typical internal tool use
(input $0.15 * 2/3 + output $0.60 * 1/3 = $0.30 blended per 1M tokens)
$200 / $0.30 = 666M tokens/month break-even for 8B model
Six hundred and sixty-six million tokens per month is a high bar. That is roughly 15,000 requests per day at 1,500 tokens per request. Most internal tools are nowhere near that volume.
For the 70B model on a $800/month A100 instance:
$800 / $0.30 = 2.67B tokens/month break-even
That is roughly 59,000 requests per day at the same 1,500 tokens per request. Unless you are running a high-volume internal tool with thousands of daily users, API pricing is almost always cheaper on pure token cost.
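For anyone who wants to plug in their own rates, here is the same arithmetic as a short script. The 2:1 input:output split, the $0.30 blended rate, and the 1,500 tokens per request all come from the example above; the function names are mine, not a library:

```python
# Break-even volume for self-hosting vs. a pay-per-token API.
# Rates and the 2:1 input:output split match the worked example above.

def blended_rate_per_million(input_rate: float, output_rate: float,
                             input_share: float = 2 / 3) -> float:
    """Blended $/1M tokens given per-1M input/output rates and input share."""
    return input_rate * input_share + output_rate * (1 - input_share)

def break_even_tokens(monthly_infra_cost: float, rate_per_million: float) -> float:
    """Monthly token volume at which infra cost equals API spend."""
    return monthly_infra_cost / rate_per_million * 1_000_000

rate = blended_rate_per_million(0.15, 0.60)          # $0.30 per 1M tokens
for label, infra in [("8B on $200/mo GPU", 200), ("70B on $800/mo A100", 800)]:
    tokens = break_even_tokens(infra, rate)
    requests_per_day = tokens / 1_500 / 30           # 1,500 tokens/request, 30 days
    print(f"{label}: {tokens / 1e6:,.0f}M tokens/month "
          f"(~{requests_per_day:,.0f} requests/day)")
```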
The Hidden Costs of Self-Hosting
The token cost is only part of the picture. Self-hosting carries costs that do not appear in a simple break-even calculation.
Setup and configuration time. Getting Ollama or vLLM running on a cloud GPU instance takes a day for an experienced engineer. Model quantization, context length configuration, batching settings, and load balancing add another day. The first time you do this, expect to spend a week to get a production-quality setup.
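For a sense of what the inference side looks like once the instance is provisioned, here is a minimal vLLM offline-inference sketch. The model name, context length, and memory setting are illustrative choices, and a production setup adds the OpenAI-compatible server, batching tuning, and load balancing on top:

```python
# Minimal vLLM sketch: load Llama 3.1 8B Instruct and run a batch of prompts.
# Assumes a GPU instance with vLLM installed and access to the model weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,           # context length is a deliberate config choice
    gpu_memory_utilization=0.90,  # leave headroom to avoid CUDA OOM under load
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Classify this document: ..."], params)
print(outputs[0].outputs[0].text)
```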
Ongoing maintenance. Model updates, security patches on the inference server, monitoring for GPU memory issues, and restarting services after crashes. At steady state with one model, this is roughly four hours per month. Add more models and it scales linearly.
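Most of that maintenance amounts to a periodic health check plus a restart runbook. A rough sketch, assuming vLLM's OpenAI-compatible server (which exposes a /health endpoint) and an NVIDIA GPU; the memory threshold and alert action are placeholders:

```python
# Sketch of a periodic health check for a self-hosted inference server.
# The /health endpoint matches vLLM's OpenAI-compatible server; the GPU query
# shells out to nvidia-smi. Alerting/restart logic is left as a stub.
import subprocess
import urllib.request

def server_healthy(url: str = "http://localhost:8000/health") -> bool:
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def gpu_memory_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

if not server_healthy() or gpu_memory_used_mib() > 75_000:  # threshold is an assumption
    print("alert: restart inference service / investigate GPU memory")
```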
Model quality gap on complex tasks. Llama 3.1 8B is genuinely good at classification, summarization, and structured extraction. It is noticeably worse than GPT-4o on multi-step reasoning, nuanced instruction following, and complex code generation. For internal tools that use LLMs for extraction and classification, the quality gap is negligible. For tools that use LLMs for reasoning, the gap is real.
Incident response. When an API provider has an outage, you wait. When your self-hosted server has a GPU failure, you diagnose it. I have spent a Saturday afternoon debugging a CUDA out-of-memory error on a vLLM instance that was caused by a memory leak in the quantized model. That time cost is invisible in a break-even table.
Real Numbers: Three Production Deployments
Karachi clinic internal tool (document classification):
- Volume: 180,000 requests/month at 800 avg tokens
- Total tokens: 144M/month
- API cost (GPT-4o mini): 144M tokens * $0.30/M ≈ $43/month
- Self-hosting cost: $200/month GPU + $20 setup amortized = $220/month
- Decision: API wins clearly. Volume is 4.6x below break-even. Task is classification (no quality gap).
Singapore accounting firm (internal Q&A over financial documents):
- Volume: 2.1M requests/month at 2,200 avg tokens
- Total tokens: 4.6B/month
- API cost (GPT-4o mini): 4.6B * $0.30/M = $1,380/month
- Self-hosting cost: $800 A100 + maintenance = $900/month
- Decision: Self-hosting wins, but only with the 70B model; the task requires nuanced document understanding, and the 8B model's quality was insufficient.
- Note: This client also had data residency requirements that made API use non-viable regardless.
Amsterdam travel agency (multilingual content generation):
- Volume: 420,000 requests/month at 3,000 avg tokens
- Total tokens: 1.26B/month
- API cost (GPT-4o mini): $378/month
- Self-hosting cost: $800/month for quality-sufficient 70B model
- Decision: API wins on cost. Client chose self-hosting anyway due to data policy.
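Plugging the three volumes into the blended-rate math from the break-even section makes the pure cost comparison explicit. The figures are the ones listed above; the compliance overrides in Singapore and Amsterdam are deliberately not modeled:

```python
# Monthly API spend vs. infra cost for the three deployments above,
# at the $0.30/1M blended GPT-4o mini rate. Figures are from the case notes.
deployments = [
    ("Karachi clinic",   144e6,  220),   # tokens/month, self-host $/month
    ("Singapore firm",   4.6e9,  900),
    ("Amsterdam agency", 1.26e9, 800),
]

for name, tokens, self_host_cost in deployments:
    api_cost = tokens / 1e6 * 0.30
    winner = "API" if api_cost < self_host_cost else "self-hosted"
    print(f"{name}: API ${api_cost:,.0f}/mo vs self-hosted ${self_host_cost}/mo "
          f"-> {winner} on cost")
```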
When Self-Hosting Is the Right Answer
Based on the deployments I have run, self-hosting makes sense when one or more of these is true:
- Data residency or compliance requirements prohibit API use
- Monthly token volume is at or above the break-even point: roughly 650M tokens with a quality-sufficient 8B model, or roughly 2.7B with a 70B model
- Latency requirements are under 1 second and you have GPU hardware available
- You need a model fine-tuned on proprietary data that cannot be shared with an API provider
Self-hosting is the wrong answer when:
- Your volume is low or unpredictable
- You need frontier model quality (GPT-4o class) and do not have the hardware budget for it
- You do not have the engineering capacity for infrastructure maintenance
- Your use case benefits from the latest model updates (API providers update continuously)
The Hybrid Architecture
For three of my clients, the right answer is a hybrid: self-hosted for high-volume, low-stakes classification tasks (extraction, categorization, summarization), and API for low-volume, high-stakes tasks (complex reasoning, customer-facing responses, anything where quality variance is costly).
```yaml
llm_routing_config:
  high_volume_classification:
    model: llama-3.1-8b-instruct
    provider: self_hosted_vllm
    max_tokens: 200
    use_for: [document_classification, entity_extraction, data_validation]
  complex_reasoning:
    model: gpt-4o
    provider: openai_api
    max_tokens: 2000
    use_for: [contract_analysis, exception_handling, report_generation]
  default:
    model: gpt-4o-mini
    provider: openai_api
    max_tokens: 800
    use_for: [general_queries, summaries, standard_responses]
```
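A minimal routing sketch over that config follows. The config is inlined as a Python dict to keep the example self-contained, and the function is illustrative rather than any particular framework's API:

```python
# Route a task to the first entry whose use_for list covers it;
# mirrors the YAML routing config above and falls back to the default.
ROUTING = {
    "high_volume_classification": {
        "model": "llama-3.1-8b-instruct", "provider": "self_hosted_vllm",
        "max_tokens": 200,
        "use_for": ["document_classification", "entity_extraction", "data_validation"],
    },
    "complex_reasoning": {
        "model": "gpt-4o", "provider": "openai_api", "max_tokens": 2000,
        "use_for": ["contract_analysis", "exception_handling", "report_generation"],
    },
    "default": {
        "model": "gpt-4o-mini", "provider": "openai_api", "max_tokens": 800,
        "use_for": ["general_queries", "summaries", "standard_responses"],
    },
}

def route(task_type: str) -> dict:
    for name, cfg in ROUTING.items():
        if name != "default" and task_type in cfg["use_for"]:
            return cfg
    return ROUTING["default"]

print(route("entity_extraction")["model"])   # -> llama-3.1-8b-instruct
print(route("contract_analysis")["model"])   # -> gpt-4o
```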
This hybrid approach reduced one client's monthly LLM spend from $1,100 to $380 while maintaining quality on the tasks where quality mattered.