The inference cost crisis — audited and addressed.
When token volume scales with users, API costs scale with it, and margins don't. The optimization surface is usually model routing and semantic caching before hardware changes: a semantic cache hit costs less than a cent, and routing a simple query to a smaller model rather than GPT-4o can save roughly 95% per call. Neither requires re-architecting your application. They require understanding your query distribution and making deliberate routing decisions.
LLM API costs scale linearly with token volume. Products that launched with verbose system prompts, GPT-4-class models for every request, no caching layer, and no routing accumulate costs that scale directly with user growth. At prototype scale this is invisible. At production scale it becomes a unit economics problem: the AI features that make the product work cost more per user than the subscription revenue supports.
The optimization strategies are well-understood — but each comes with a quality risk that teams often ignore. Switching from GPT-4o to GPT-4o-mini for reasoning-heavy tasks degrades output quality. Semantic caching with too low a similarity threshold returns stale or semantically different answers to questions that deserve fresh responses. The mistake is implementing cost optimizations without measuring whether quality held — which requires having quality baselines before you optimize.
| Cost driver | Typical waste pattern | Optimization approach |
|---|---|---|
| Model selection | GPT-4o for every request including simple ones | Routing: classify complexity, send simple to GPT-4o-mini or Haiku |
| Prompt verbosity | Large system prompts repeated on every call | LLMLingua prompt compression + Anthropic/OpenAI prompt caching |
| Duplicate requests | Same question answered multiple times | Semantic cache (GPTCache, custom Redis) with similarity threshold tuning |
| Context window | Full conversation history appended every turn | Conversation summarization or sliding window for long sessions |
| Synchronous processing | Document processing tasks run in real time | Async batch endpoints at lower per-token cost on OpenAI and Anthropic |
| Self-hosting decision | API for all volume | vLLM/TGI self-hosting where volume justifies infrastructure overhead |
We audit AI spend by instrumenting every LLM call with token counts, model, latency, and task type. The audit produces a cost breakdown by request type — identifying which requests account for the most spend and which have the highest optimization potential without quality impact. Optimizations are implemented with quality gates: we measure current quality before changing anything, then re-measure after. Cost reduction is only accepted when quality meets the pre-defined threshold.
AI cost optimization process
Instrument every LLM call: tokens in/out, model, latency, task type, user tier. Build cost attribution dashboard. Identify the top request types by total spend — these are the highest-leverage optimization targets.
For the top-cost request types, build evaluation datasets and measure current quality, using LangSmith or a custom eval pipeline. This baseline is what optimizations must preserve.
Apply targeted optimizations: model routing for complexity-stratified requests, LLMLingua prompt compression, prompt caching configuration, semantic caching with tuned similarity threshold, batch endpoint migration for async-eligible workloads.
Model the cost at current and projected volume for API vs. self-hosted vLLM/TGI. Include infrastructure and operational costs, not just GPU cost. Produce the crossover point with confidence ranges.
Re-measure quality after each optimization. Accept if quality meets threshold. Rollback if it does not. Maintain cost and quality dashboards with alerting when cost per request drifts up or quality metrics decline.
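The instrumentation and cost-attribution steps above can be sketched as a thin wrapper around each LLM call. This is a minimal illustration under assumptions: the `PRICE_PER_1K` figures and the `call_llm` stub are placeholders, not current provider pricing or a real client.

```python
import time
from collections import defaultdict

# Assumed per-1K-token (input, output) prices for illustration only;
# check your provider's current pricing.
PRICE_PER_1K = {"gpt-4o": (0.0025, 0.010), "gpt-4o-mini": (0.00015, 0.0006)}

LOG = []  # in production this feeds your metrics pipeline, not a list


def instrumented_call(model, task_type, prompt, call_llm):
    """Wrap an LLM call, recording tokens, cost, latency, and task type."""
    start = time.monotonic()
    response, tokens_in, tokens_out = call_llm(model, prompt)
    price_in, price_out = PRICE_PER_1K[model]
    cost = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
    LOG.append({
        "model": model, "task_type": task_type,
        "tokens_in": tokens_in, "tokens_out": tokens_out,
        "cost": cost, "latency_s": time.monotonic() - start,
    })
    return response


def cost_by_task_type():
    """Aggregate spend per request type -- the audit's key breakdown."""
    totals = defaultdict(float)
    for record in LOG:
        totals[record["task_type"]] += record["cost"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

The sorted breakdown is what surfaces the highest-leverage optimization targets: the request types at the top of the list are where routing, caching, or compression pays back first.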
- 01
Model routing tiers
We train a small, fast classifier that tags each request by complexity and routes it to the cheapest model that can handle it — GPT-4o-mini or Claude Haiku for straightforward tasks, larger models only when justified. The classifier's inference cost is typically recovered within the first few dozen routed calls. Routing thresholds are tuned against your quality baselines, not set arbitrarily.
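Before the trained classifier exists, a routing tier can start as a heuristic pre-classifier. The sketch below is a heuristic stand-in using length and keyword cues, not the trained model described above; the tier names and cue list are illustrative assumptions.

```python
SIMPLE_MODEL = "gpt-4o-mini"   # illustrative cheap tier
COMPLEX_MODEL = "gpt-4o"       # illustrative capable tier

# Cues that often indicate a request needs the larger model (assumed list).
COMPLEX_CUES = ("prove", "step by step", "analyze", "compare", "legal", "refactor")


def route(query: str, max_simple_len: int = 300) -> str:
    """Heuristic stand-in for a trained complexity classifier:
    long queries or queries with reasoning cues go to the larger model."""
    lowered = query.lower()
    if len(query) > max_simple_len or any(cue in lowered for cue in COMPLEX_CUES):
        return COMPLEX_MODEL
    return SIMPLE_MODEL
```

In practice the cue list and length cutoff are replaced by a trained classifier, and both are tuned against the quality baselines rather than set by hand.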
- 02
Semantic caching
We deploy a semantic cache that stores LLM responses indexed by embedding similarity — incoming requests within a configurable cosine distance of a cached result return without making an API call. Similarity thresholds and TTL policies are calibrated to your use case: a customer support bot tolerates different staleness risk than a legal document assistant. Cache hit rates and latency impact are instrumented from day one.
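The cache lookup reduces to a nearest-neighbor search over stored embeddings with a similarity cutoff and a TTL check. A minimal in-memory sketch, assuming an `embed` function supplied by your embedding model (a real deployment would use Redis or similar, not a Python list):

```python
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92, ttl_s=3600):
        self.embed, self.threshold, self.ttl_s = embed, threshold, ttl_s
        self.entries = []  # (embedding, response, stored_at)

    def get(self, query):
        """Return a cached response if a fresh entry is similar enough."""
        query_emb = self.embed(query)
        now = time.time()
        best, best_sim = None, 0.0
        for emb, response, stored_at in self.entries:
            if now - stored_at > self.ttl_s:
                continue  # expired; staleness tolerance is use-case specific
            sim = cosine(query_emb, emb)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = response, sim
        return best

    def put(self, query, response):
        self.entries.append((self.embed(query), response, time.time()))
```

The `threshold` and `ttl_s` parameters are exactly the knobs described above: loosen the threshold and hit rates rise but so does the risk of returning a semantically different answer.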
- 03
Prompt compression and caching
LLMLingua compresses verbose system prompts by 2-5x with measurable quality loss only above aggressive compression ratios — we find the safe ceiling for your prompts. For static or slowly-changing system prompts, we restructure them to qualify for Anthropic and OpenAI prompt caching, which saves 50-90% of input token costs on cached prefixes. Both optimizations are transparent to end users and reversible if baselines slip.
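The savings from provider-side prompt caching are straightforward to model. A sketch under assumed numbers: the discount rate, hit rate, token counts, and price are illustrative, and real provider pricing differs by model and cache tier.

```python
def cached_prompt_cost(system_tokens, user_tokens, price_per_1k_in,
                       cache_discount=0.5, hit_rate=0.9):
    """Expected input-token cost per call when the static system-prompt
    prefix is served from the provider's prompt cache on hit_rate of calls.
    cache_discount=0.5 is an assumed discount, not a quoted rate."""
    full = (system_tokens + user_tokens) / 1000 * price_per_1k_in
    cached = (system_tokens * (1 - cache_discount) + user_tokens) / 1000 * price_per_1k_in
    return hit_rate * cached + (1 - hit_rate) * full


# Illustrative: 4,000-token system prompt, 300-token user turn,
# $0.0025 per 1K input tokens.
baseline = (4000 + 300) / 1000 * 0.0025
optimized = cached_prompt_cost(4000, 300, 0.0025)
```

The larger the static prefix relative to the per-turn payload, the more of the input bill the cache absorbs, which is why restructuring prompts to maximize the stable prefix matters.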
- 04
vLLM self-hosting crossover analysis
We model the fully-loaded cost of self-hosting your model on vLLM or TGI — GPU instance type, reserved vs. on-demand pricing, operational overhead, and engineering maintenance — against your current and projected API spend. The crossover point is specific to your model, request volume, latency SLA, and team capacity; there's no generic answer. The output is a spreadsheet model you can update as your traffic grows.
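The crossover analysis reduces to comparing a volume-proportional API cost against a mostly fixed self-hosting cost. A simplified sketch with illustrative numbers; the real model also needs latency SLAs, redundancy, autoscaling limits, and engineering time.

```python
def monthly_api_cost(requests, tokens_per_request, price_per_1k):
    """API spend scales linearly with request volume."""
    return requests * tokens_per_request / 1000 * price_per_1k


def monthly_selfhost_cost(gpu_hourly, gpus, ops_overhead):
    """Self-hosting is dominated by fixed GPU and operational cost."""
    return gpu_hourly * gpus * 24 * 30 + ops_overhead


def crossover_requests(tokens_per_request, price_per_1k,
                       gpu_hourly, gpus, ops_overhead):
    """Monthly request volume above which self-hosting is cheaper,
    assuming the GPU fleet can actually absorb that volume."""
    fixed = monthly_selfhost_cost(gpu_hourly, gpus, ops_overhead)
    per_request = tokens_per_request / 1000 * price_per_1k
    return fixed / per_request
```

With assumed inputs of 2,000 tokens per request at $0.005 per 1K, two GPUs at $2/hour, and $3,000/month of operational overhead, the crossover lands at 588,000 requests per month; every input here is a placeholder you would replace with audited figures.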
- 05
Quality-gated optimization
We build evaluation datasets for your top-cost request types before implementing any optimization — not after. Every change ships only if it clears defined quality thresholds on those evals: ROUGE, embedding similarity, or task-specific metrics depending on what your outputs require. If a cost reduction degrades measurable quality below threshold, it doesn't ship.
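The gate itself is a small amount of logic; the hard part is building the eval dataset. A minimal sketch, assuming your task-specific metric produces a per-example score in [0, 1] and that a 0.02 mean drop is the agreed tolerance (both are assumptions to replace with your own thresholds):

```python
def quality_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Accept an optimization only if mean eval quality stays within
    max_drop of the pre-optimization baseline."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return candidate >= baseline - max_drop


def decide(baseline_scores, candidate_scores, cost_saving_pct):
    """Ship-or-rollback decision: cost savings never override the gate."""
    if quality_gate(baseline_scores, candidate_scores):
        return f"ship: {cost_saving_pct:.0f}% saving, quality held"
    return "rollback: quality below threshold"
```

The point of encoding the decision is that it removes discretion at ship time: a 60% saving that fails the gate gets the same rollback as a 5% one.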
- Cost audit broken down by request type, model, and optimization opportunity rank
- Quality evaluation dataset built before any optimization ships
- Model routing implementation: classifier, thresholds, and routing logic
- Semantic cache deployment with similarity threshold and TTL tuning
- vLLM/TGI self-hosting crossover model with infrastructure cost breakdown
- Cost and quality monitoring dashboard with alerting on regression
Workloads with mixed request complexity typically see 50–80% cost reduction after routing and caching. The distribution matters more than the average: most savings come from correctly identifying the 60–70% of queries that a much cheaper model handles equally well.
Frequently asked questions
How much can we realistically reduce AI costs?
It varies significantly by starting state. Applications using GPT-4o for every request, with verbose prompts and no caching, have more headroom than applications already using model routing and caching. We do not quote percentage savings before auditing: the audit is how we establish what is achievable for your specific workload, because the variables matter too much for generic ranges to be useful.
What is LLMLingua and how does prompt compression work?
LLMLingua (Microsoft Research) compresses long prompts by identifying and removing tokens with low information content — the filler words, redundant context, and verbose phrasing that LLMs carry without significantly affecting output quality. Compression ratios of 2-4x are common on verbose system prompts with minimal quality impact on most tasks. We measure quality before and after compression — not all prompts compress well.
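As a toy illustration of the idea only (not the actual LLMLingua algorithm, which scores token importance with a small language model rather than a fixed word list), dropping low-information filler already shrinks a verbose prompt noticeably:

```python
# Toy stand-in for information-based token pruning. The FILLER set is an
# illustrative assumption; LLMLingua decides what to drop per-prompt.
FILLER = {"please", "kindly", "very", "really", "just", "basically",
          "in", "order", "to", "that", "the", "a", "an"}


def toy_compress(prompt: str) -> str:
    """Keep only words outside the low-information filler set."""
    kept = [w for w in prompt.split() if w.lower().strip(",.") not in FILLER]
    return " ".join(kept)


verbose = ("Please kindly summarize the following document in order to "
           "produce a very concise answer")
```

Here `toy_compress` cuts 14 words to 6 while keeping the instruction intact; real compressors make this trade-off token by token, which is why quality must be measured per prompt rather than assumed.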
Does model routing degrade user experience?
If implemented correctly, no — users should not notice which model handled their request. The key is accurate routing: requests that require large model capability must be routed to large models. Errors in routing classification that send complex requests to small models produce visible quality degradation. Quality measurement before and after routing implementation is not optional.
What about fine-tuning smaller models as a cost optimization?
Fine-tuning a smaller model on a specific task can produce quality that matches larger general-purpose models at a fraction of the inference cost. This is viable for well-defined, high-volume tasks with sufficient training data. It requires more upfront investment — data collection, training, evaluation — but pays back at sufficient volume. We evaluate this option as part of the cost optimization audit.
Ready to get started?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Free 30-min scoping call
