AI Engineering · 2025-11-18 · 7 min read

Self-Hosted LLMs vs Cloud APIs: The Real Cost Nobody Talks About

llm · self-hosting · cloud apis · cost analysis

Every other week, a CTO asks us whether they should self-host an open-source LLM instead of using cloud APIs from Anthropic or OpenAI. The reasoning is always the same: "We are spending $15,000 a month on API calls, and a couple of GPUs would pay for themselves in three months." This reasoning is wrong. Not because self-hosting never makes sense, but because the math people use to justify it ignores about 60% of the actual costs.

We have helped four clients evaluate and, in two cases, implement self-hosted LLM deployments. Here is what the real cost comparison looks like.

Let us start with the obvious costs that everyone accounts for: GPU hardware or cloud GPU instances, electricity, and the base model weights. For a production-quality deployment of a 70B parameter model like Llama 3, you need at minimum two NVIDIA A100 80GB GPUs for inference. That is about $30,000 in hardware if you buy, or roughly $4,000 per month if you rent from a cloud provider like Lambda Labs or CoreWeave. Most teams look at their $15,000 monthly API bill, compare it to $4,000 in GPU costs, and see an easy win.
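
Here is that naive comparison as a back-of-the-envelope sketch, using the figures above. It is the calculation most teams stop at, and it ignores everything that follows.

```python
# Naive self-hosting comparison; figures from above, everything else ignored.
monthly_api_bill = 15_000   # current cloud API spend, USD/month
gpu_rental = 4_000          # 2x A100 80GB rented (Lambda Labs / CoreWeave), USD/month
gpu_purchase = 30_000       # 2x A100 80GB bought outright, USD

print(f"Renting: apparent savings of ${monthly_api_bill - gpu_rental:,}/month")
print(f"Buying:  apparent payback in {gpu_purchase / monthly_api_bill:.1f} months")
```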

Now let us add the costs nobody talks about.

First, engineering time. Getting a model running on a single machine for demos takes about a day. Getting it running reliably in production with proper load balancing, health checks, automatic restarts, and monitoring takes 2 to 4 weeks of senior engineering time. At a fully loaded cost of $150 per hour for a senior ML engineer, that is roughly $12,000 to $24,000 just for the initial deployment. And this is not a one-time cost, because you will need ongoing maintenance as you update models, patch security vulnerabilities, and deal with infrastructure issues.
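
Extending the earlier sketch with labor makes the gap visible. The hourly rate and deployment range are from above; the monthly maintenance hours are a placeholder you should replace with your own estimate.

```python
# The same comparison once senior engineering time is priced in.
hourly_rate = 150                  # fully loaded senior ML engineer, USD/hour
deployment_weeks = (2, 4)          # production-grade initial deployment, from above
maintenance_hours_per_month = 25   # placeholder: patches, model updates, incidents

initial_low, initial_high = (w * 40 * hourly_rate for w in deployment_weeks)
monthly_upkeep = maintenance_hours_per_month * hourly_rate

print(f"Initial deployment: ${initial_low:,} to ${initial_high:,}")                   # $12,000 to $24,000
print(f"Ongoing upkeep:     ${monthly_upkeep:,}/month before any serious incident")   # $3,750/month
```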

Second, inference optimization. Out-of-the-box inference with a 70B model on two A100s gives you roughly 30 tokens per second for a single request. That is fine for a demo, but it does not hold up in production, where you are serving many concurrent users who expect sub-second response times. To get competitive performance, you need to implement or configure quantization, batching, speculative decoding, or KV cache optimization. Tools like vLLM and TensorRT-LLM help, but they require expertise to configure correctly, and they introduce their own operational complexity.
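
For a sense of what the serving layer involves, here is a minimal vLLM sketch for a two-GPU Llama 3 70B deployment. Treat it as illustrative: the checkpoint, memory settings, and context limit all depend on your hardware, your quantization choice, and your vLLM version.

```python
from vllm import LLM, SamplingParams

# Illustrative two-GPU setup; tune these values for your hardware and checkpoint.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # or an AWQ/GPTQ-quantized variant
    tensor_parallel_size=2,       # split the weights across both A100s
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
    max_model_len=8192,           # cap context length to keep KV cache memory affordable
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Explain the trade-offs of self-hosting LLMs."], params)
print(outputs[0].outputs[0].text)
```

This is the offline batch API; in production you would run vLLM's OpenAI-compatible server behind a load balancer instead, which is exactly where the load balancing, health checks, and monitoring from the previous point come in.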

One of our clients spent 6 weeks optimizing their self-hosted Llama 3 70B deployment to match the throughput of Claude's API. The engineer working on it was one of their best, and during those 6 weeks he was not working on product features. The opportunity cost was enormous.

Third, reliability and uptime. Cloud APIs from Anthropic and OpenAI run on massive infrastructure with redundancy, automatic failover, and 24/7 operations teams. When you self-host, you are your own operations team. GPU failures happen. CUDA driver updates break things. Memory leaks in inference servers cause gradual degradation. One client experienced a silent failure where their model started returning increasingly degraded outputs due to a thermal throttling issue on one GPU. It took them 3 days to notice because their monitoring was not sophisticated enough to detect quality degradation, only complete failures.
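
A scheduled canary check is one cheap mitigation for that failure mode: send a fixed prompt with a known-good answer on a timer and alert the moment the response stops matching, even while the server itself reports healthy. A minimal sketch, with the endpoint, model name, and prompt as placeholder assumptions:

```python
import requests

# Hypothetical canary against an OpenAI-compatible self-hosted endpoint.
CANARY_PROMPT = "What is the capital of France? Answer with one word."
EXPECTED = "paris"

def canary_ok(endpoint: str = "http://localhost:8000/v1/completions") -> bool:
    resp = requests.post(endpoint, json={
        "model": "llama-3-70b",      # placeholder model name
        "prompt": CANARY_PROMPT,
        "max_tokens": 5,
        "temperature": 0,
    }, timeout=30)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["text"].strip().lower()
    return EXPECTED in text

if __name__ == "__main__":
    if not canary_ok():
        # The server is up, but answers have degraded: page someone.
        raise SystemExit("Canary failed: output no longer matches the expected answer")
```

A single canary will not catch subtle drift, but it turns many classes of silent degradation into an alert instead of a multi-day mystery.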

Fourth, model updates and iteration speed. When Anthropic releases a new version of Claude, you get access immediately through the same API. When Meta releases a new version of Llama, you need to download the weights, test them against your benchmarks, optimize inference again, update your deployment, and run regression tests. This cycle typically takes 1 to 2 weeks. During that time, you are running an older model while your competitors using cloud APIs have already upgraded.

Fifth, compliance and security overhead. If you are in a regulated industry, self-hosting means you are responsible for the entire security posture of your AI infrastructure. That includes data encryption at rest and in transit, access controls, audit logging, and potentially SOC 2 or HIPAA compliance for the infrastructure itself. Cloud API providers handle most of this for you and can provide compliance documentation.

Now let us look at when self-hosting actually does make sense. We identified three scenarios where the math works out.

Scenario one is extreme volume. If you are processing more than 50 million tokens per day consistently, self-hosting starts to win on pure economics. At that scale, the per-token cost of self-hosted inference drops below cloud API pricing even after accounting for all the hidden costs we listed above. But very few companies are at this scale, and if you are, you probably already have an ML infrastructure team.
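
To sanity-check that threshold: 50 million tokens per day is roughly 1.5 billion tokens per month, and the blended API rate at which self-hosting breaks even falls out of a one-line calculation. The self-hosting figure here is a hypothetical all-in number in the same range as Client B's costs discussed below; substitute your own rates.

```python
# Rough volume break-even check; all figures are illustrative.
tokens_per_month = 50_000_000 * 30   # 1.5B tokens/month at the volume above
selfhost_all_in = 24_000             # hypothetical infra + engineering, USD/month

breakeven_rate = selfhost_all_in / (tokens_per_month / 1_000_000)
print(f"Self-hosting wins if your blended API rate exceeds ${breakeven_rate:.2f} per million tokens")
# -> $16.00 per million tokens at this volume and cost structure
```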

Scenario two is data sovereignty requirements. Some clients, particularly in healthcare and government, have hard requirements that no data leaves their infrastructure. In these cases, self-hosting is not a cost optimization; it is a compliance requirement. The cost comparison is irrelevant because the cloud API option does not exist.

Scenario three is latency-sensitive applications. If you need consistent sub-100ms time-to-first-token latency and your users are in a specific geographic region, self-hosting with local GPUs can provide latency guarantees that cloud APIs cannot. We built a real-time coding assistant for one client where the model needed to respond within 80ms to feel instantaneous. No cloud API could guarantee that consistently.

For everyone else, here is the math from our four client evaluations. Client A was spending $8,000 per month on Claude API calls. Their self-hosting estimate came to $6,500 per month in infrastructure plus $4,000 per month in amortized engineering time, for a total of $10,500. They stayed with the API.

Client B was spending $45,000 per month on API calls with very high volume. Their self-hosting cost came to $18,000 per month in infrastructure plus $6,000 in engineering time, totaling $24,000. They switched to self-hosting and are happy with the decision, though the migration took 3 months.

Client C had data sovereignty requirements and spent $28,000 per month on a self-hosted deployment when a comparable cloud API would have cost $12,000. They accepted the premium as a cost of compliance.

Client D was spending $3,000 per month on API calls and wanted to self-host "to learn." We talked them out of it. Learning is great, but do it on a side project, not your production system.

The break-even point for most companies is somewhere around $25,000 to $30,000 per month in API spending, assuming you have engineering capacity to spare and you are not in a rush. Below that, the operational overhead of self-hosting eats your savings. Above that, the math starts working in your favor, but only if you account for all the costs we have discussed.
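
If you want that heuristic written down somewhere visible, it fits in a few lines. This is a toy decision helper, not a substitute for running the full numbers with your own figures.

```python
def should_evaluate_self_hosting(monthly_api_spend: float,
                                 has_spare_ml_engineering: bool,
                                 data_must_stay_onprem: bool) -> bool:
    """Toy encoding of the break-even heuristic in this article."""
    if data_must_stay_onprem:
        return True  # compliance decides it, not cost
    return monthly_api_spend >= 25_000 and has_spare_ml_engineering
```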

Our default recommendation: start with cloud APIs, build your product, find product-market fit, and only consider self-hosting when your API bill consistently exceeds $25,000 per month and you have the engineering team to support it. Premature optimization of your AI infrastructure is just as dangerous as premature optimization of your code.

About the Author

Fordel Studios

AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
