This is the question I get most often from clients who have been running LLM-powered internal tools for six months: should we keep paying OpenAI or move to a self-hosted model? The honest answer is that it depends on four specific variables, and I can give you the numbers to make the decision without guessing.
I have run both architectures in production across six client environments. Here is what I actually know about the cost structure.
The Variables That Drive the Decision
Before the cost comparison, you need to know four numbers about your specific usage:
- Monthly request volume: How many LLM calls per month?
- Average tokens per request: Input plus output, combined
- Latency requirement: Can users wait 5 seconds? 30 seconds?
- Data sensitivity: Can prompts leave your infrastructure at all?
Variable 4 overrides the cost calculation entirely. If you are processing patient records, financial data, or anything under a data residency requirement, self-hosted is not an optimization; it is a compliance necessity. Stop there and plan for self-hosting.
If variable 4 is not a constraint, run the math on variables 1, 2, and 3.
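That gating logic is simple enough to write down. Here is a minimal sketch, with illustrative names and a placeholder volume cutoff (the real break-even is derived in the next section); latency is left out because it is a separate constraint rather than a cost input:

```python
# Sketch of the four-variable gate described above; names and the placeholder
# threshold are illustrative assumptions, not a library API.

def initial_recommendation(data_must_stay_on_prem: bool,
                           monthly_requests: int,
                           avg_tokens_per_request: int) -> str:
    """First-pass recommendation before the detailed break-even math."""
    if data_must_stay_on_prem:
        # Variable 4 overrides everything: compliance forces self-hosting.
        return "self-hosted (compliance requirement)"
    monthly_tokens = monthly_requests * avg_tokens_per_request
    # Placeholder cutoff; the real break-even is computed below.
    return "run the break-even math" if monthly_tokens > 100_000_000 else "API"

print(initial_recommendation(False, 180_000, 800))  # -> run the break-even math
```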
The Full Cost Comparison
| Factor | OpenAI API (GPT-4o mini) | Anthropic API (Haiku) | Self-hosted (Llama 3.1 8B) | Self-hosted (Llama 3.1 70B) |
|---|---|---|---|---|
| Input cost | $0.15/1M tokens | $0.25/1M tokens | $0 | $0 |
| Output cost | $0.60/1M tokens | $1.25/1M tokens | $0 | $0 |
| Infra cost (light) | $0 | $0 | $80/month (CPU server) | $400/month (GPU server) |
| Infra cost (medium) | $0 | $0 | $200/month (GPU instance) | $800/month (A100 instance) |
| Setup time | 0 hours | 0 hours | 8-16 hours | 16-40 hours |
| Maintenance | Minimal | Minimal | 2-4h/month | 4-8h/month |
| Quality (general tasks) | Excellent | Excellent | Good | Very good |
| Quality (complex reasoning) | Excellent | Excellent | Fair | Good |
| Latency | 1-3s | 1-4s | 0.5-2s (GPU) | 3-15s (GPU) |
The Break-Even Calculation
At current API pricing, the break-even monthly token volume for self-hosting Llama 3.1 8B on a $200/month GPU instance (using vLLM or Ollama) versus GPT-4o mini:
GPT-4o mini cost = (input_tokens * 0.15 + output_tokens * 0.60) / 1,000,000
Break-even where API cost = $200/month infra cost:
Assuming 2:1 input:output ratio for typical internal tool use
(input $0.15 * 2/3 + output $0.60 * 1/3 = $0.30 blended per 1M tokens)
$200 / $0.30 = 666M tokens/month break-even for 8B model
Six hundred and sixty-six million tokens per month is a high bar. That is roughly 15,000 requests per day at 1,500 tokens per request. Most internal tools are nowhere near that volume.
For the 70B model on a $800/month A100 instance:
$800 / $0.30 = 2.67B tokens/month break-even
That is roughly 59,000 requests per day at the same 1,500 tokens per request. Unless you are running a high-volume internal tool with thousands of daily users, API pricing is almost always cheaper on pure token cost.
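For anyone who wants to plug in their own rates, here is the same arithmetic as a short script. The 2:1 input:output split, the $0.30 blended rate, and the 1,500 tokens per request all come from the example above; the function names are mine, not a library:

```python
# Break-even volume for self-hosting vs. a pay-per-token API.
# Rates and the 2:1 input:output split match the worked example above.

def blended_rate_per_million(input_rate: float, output_rate: float,
                             input_share: float = 2 / 3) -> float:
    """Blended $/1M tokens given per-1M input/output rates and input share."""
    return input_rate * input_share + output_rate * (1 - input_share)

def break_even_tokens(monthly_infra_cost: float, rate_per_million: float) -> float:
    """Monthly token volume at which infra cost equals API spend."""
    return monthly_infra_cost / rate_per_million * 1_000_000

rate = blended_rate_per_million(0.15, 0.60)          # $0.30 per 1M tokens
for label, infra in [("8B on $200/mo GPU", 200), ("70B on $800/mo A100", 800)]:
    tokens = break_even_tokens(infra, rate)
    requests_per_day = tokens / 1_500 / 30           # 1,500 tokens/request, 30 days
    print(f"{label}: {tokens / 1e6:,.0f}M tokens/month "
          f"(~{requests_per_day:,.0f} requests/day)")
```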
The Hidden Costs of Self-Hosting
The token cost is only part of the picture. Self-hosting carries costs that do not appear in a simple break-even calculation.
Setup and configuration time. Getting Ollama or vLLM running on a cloud GPU instance takes a day for an experienced engineer. Model quantization, context length configuration, batching settings, and load balancing add another day. The first time you do this, expect to spend a week to get a production-quality setup.
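For a sense of what the inference side looks like once the instance is provisioned, here is a minimal vLLM offline-inference sketch. The model name, context length, and memory setting are illustrative choices, and a production setup adds the OpenAI-compatible server, batching tuning, and load balancing on top:

```python
# Minimal vLLM sketch: load Llama 3.1 8B Instruct and run a batch of prompts.
# Assumes a GPU instance with vLLM installed and access to the model weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=8192,           # context length is a deliberate config choice
    gpu_memory_utilization=0.90,  # leave headroom to avoid CUDA OOM under load
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Classify this document: ..."], params)
print(outputs[0].outputs[0].text)
```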
Ongoing maintenance. Model updates, security patches on the inference server, monitoring for GPU memory issues, and restarting services after crashes. At steady state with one model, this is roughly four hours per month. Add more models and it scales linearly.
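Most of that maintenance amounts to a periodic health check plus a restart runbook. A rough sketch, assuming vLLM's OpenAI-compatible server (which exposes a /health endpoint) and an NVIDIA GPU; the memory threshold and alert action are placeholders:

```python
# Sketch of a periodic health check for a self-hosted inference server.
# The /health endpoint matches vLLM's OpenAI-compatible server; the GPU query
# shells out to nvidia-smi. Alerting/restart logic is left as a stub.
import subprocess
import urllib.request

def server_healthy(url: str = "http://localhost:8000/health") -> bool:
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def gpu_memory_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

if not server_healthy() or gpu_memory_used_mib() > 75_000:  # threshold is an assumption
    print("alert: restart inference service / investigate GPU memory")
```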
Model quality gap on complex tasks. Llama 3.1 8B is genuinely good at classification, summarization, and structured extraction. It is noticeably worse than GPT-4o on multi-step reasoning, nuanced instruction following, and complex code generation. For internal tools that use LLMs for extraction and classification, the quality gap is negligible. For tools that use LLMs for reasoning, the gap is real.
Incident response. When an API provider has an outage, you wait. When your self-hosted server has a GPU failure, you diagnose it. I have spent a Saturday afternoon debugging a CUDA out-of-memory error on a vLLM instance that was caused by a memory leak in the quantized model. That time cost is invisible in a break-even table.
Real Numbers: Three Production Deployments
Karachi clinic internal tool (document classification):
- Volume: 180,000 requests/month at 800 avg tokens
- Total tokens: 144M/month
- API cost (GPT-4o mini): 144M tokens * $0.30/M ≈ $43/month
- Self-hosting cost: $200/month GPU + $20 setup amortized = $220/month
- Decision: API wins clearly. Volume is 4.6x below break-even. Task is classification (no quality gap).
Singapore accounting firm (internal Q&A over financial documents):
- Volume: 2.1M requests/month at 2,200 avg tokens
- Total tokens: 4.6B/month
- API cost (GPT-4o mini): 4.6B * $0.30/M = $1,380/month
- Self-hosting cost: $800 A100 + maintenance = $900/month
- Decision: Self-hosting wins, but only with the 70B model; the task requires nuanced document understanding, and the 8B model's quality was insufficient.
- Note: This client also had data residency requirements that made API use non-viable regardless.
Amsterdam travel agency (multilingual content generation):
- Volume: 420,000 requests/month at 3,000 avg tokens
- Total tokens: 1.26B/month
- API cost (GPT-4o mini): $378/month
- Self-hosting cost: $800/month for quality-sufficient 70B model
- Decision: API wins on cost. Client chose self-hosting anyway due to data policy.
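Plugging the three volumes into the blended-rate math from the break-even section makes the pure cost comparison explicit. The figures are the ones listed above; the compliance overrides in Singapore and Amsterdam are deliberately not modeled:

```python
# Monthly API spend vs. infra cost for the three deployments above,
# at the $0.30/1M blended GPT-4o mini rate. Figures are from the case notes.
deployments = [
    ("Karachi clinic",   144e6,  220),   # tokens/month, self-host $/month
    ("Singapore firm",   4.6e9,  900),
    ("Amsterdam agency", 1.26e9, 800),
]

for name, tokens, self_host_cost in deployments:
    api_cost = tokens / 1e6 * 0.30
    winner = "API" if api_cost < self_host_cost else "self-hosted"
    print(f"{name}: API ${api_cost:,.0f}/mo vs self-hosted ${self_host_cost}/mo "
          f"-> {winner} on cost")
```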
When Self-Hosting Is the Right Answer
Based on the deployments I have run, self-hosting makes sense when one or more of these is true:
- Data residency or compliance requirements prohibit API use
- Monthly token volume is at or above the break-even point: roughly 650M tokens with a quality-sufficient 8B model, or roughly 2.7B with a 70B model
- Latency requirements are under 1 second and you have GPU hardware available
- You need a model fine-tuned on proprietary data that cannot be shared with an API provider
Self-hosting is the wrong answer when:
- Your volume is low or unpredictable
- You need frontier model quality (GPT-4o class) and do not have the hardware budget for it
- You do not have the engineering capacity for infrastructure maintenance
- Your use case benefits from the latest model updates (API providers update continuously)
The Hybrid Architecture
For three of my clients, the right answer is a hybrid: self-hosted for high-volume, low-stakes classification tasks (extraction, categorization, summarization), and API for low-volume, high-stakes tasks (complex reasoning, customer-facing responses, anything where quality variance is costly).
```yaml
llm_routing_config:
  high_volume_classification:
    model: llama-3.1-8b-instruct
    provider: self_hosted_vllm
    max_tokens: 200
    use_for: [document_classification, entity_extraction, data_validation]
  complex_reasoning:
    model: gpt-4o
    provider: openai_api
    max_tokens: 2000
    use_for: [contract_analysis, exception_handling, report_generation]
  default:
    model: gpt-4o-mini
    provider: openai_api
    max_tokens: 800
    use_for: [general_queries, summaries, standard_responses]
```
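A minimal routing sketch over that config follows. The config is inlined as a Python dict to keep the example self-contained, and the function is illustrative rather than any particular framework's API:

```python
# Route a task to the first entry whose use_for list covers it;
# mirrors the YAML routing config above and falls back to the default.
ROUTING = {
    "high_volume_classification": {
        "model": "llama-3.1-8b-instruct", "provider": "self_hosted_vllm",
        "max_tokens": 200,
        "use_for": ["document_classification", "entity_extraction", "data_validation"],
    },
    "complex_reasoning": {
        "model": "gpt-4o", "provider": "openai_api", "max_tokens": 2000,
        "use_for": ["contract_analysis", "exception_handling", "report_generation"],
    },
    "default": {
        "model": "gpt-4o-mini", "provider": "openai_api", "max_tokens": 800,
        "use_for": ["general_queries", "summaries", "standard_responses"],
    },
}

def route(task_type: str) -> dict:
    for name, cfg in ROUTING.items():
        if name != "default" and task_type in cfg["use_for"]:
            return cfg
    return ROUTING["default"]

print(route("entity_extraction")["model"])   # -> llama-3.1-8b-instruct
print(route("contract_analysis")["model"])   # -> gpt-4o
```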
This hybrid approach reduced one client's monthly LLM spend from $1,100 to $380 while maintaining quality on the tasks where quality mattered.