The inference cost crisis — audited and addressed.
When token volume scales with users, API costs scale with it, and margins don't. The optimization surface is usually model routing and semantic caching before hardware changes: a semantic cache hit costs less than a cent, and routing a simple query to a smaller model rather than GPT-4o can save roughly 95% per call. Neither requires re-architecting your application. They require understanding your query distribution and making deliberate routing decisions.
LLM API costs scale linearly with token volume. Products that launched with verbose system prompts, GPT-4-class models for every request, no caching layer, and no routing accumulate costs that scale directly with user growth. At prototype scale this is invisible. At production scale it becomes a unit economics problem: the AI features that make the product work cost more per user than the subscription revenue supports.
The optimization strategies are well-understood — but each comes with a quality risk that teams often ignore. Switching from GPT-4o to GPT-4o-mini for reasoning-heavy tasks degrades output quality. Semantic caching with too low a similarity threshold returns stale or semantically different answers to questions that deserve fresh responses. The mistake is implementing cost optimizations without measuring whether quality held — which requires having quality baselines before you optimize.
| Cost driver | Typical waste pattern | Optimization approach |
|---|---|---|
| Model selection | GPT-4o for every request including simple ones | Routing: classify complexity, send simple to GPT-4o-mini or Haiku |
| Prompt verbosity | Large system prompts repeated on every call | LLMLingua prompt compression + Anthropic/OpenAI prompt caching |
| Duplicate requests | Same question answered multiple times | Semantic cache (GPTCache, custom Redis) with similarity threshold tuning |
| Context window | Full conversation history appended every turn | Conversation summarization or sliding window for long sessions |
| Synchronous processing | Document processing tasks run in real time | Async batch endpoints at lower per-token cost on OpenAI and Anthropic |
| Self-hosting decision | API for all volume | vLLM/TGI self-hosting where volume justifies infrastructure overhead |
We audit AI spend by instrumenting every LLM call with token counts, model, latency, and task type. The audit produces a cost breakdown by request type — identifying which requests account for the most spend and which have the highest optimization potential without quality impact. Optimizations are implemented with quality gates: we measure current quality before changing anything, then re-measure after. Cost reduction is only accepted when quality meets the pre-defined threshold.
AI cost optimization process
Instrument every LLM call: tokens in/out, model, latency, task type, user tier. Build cost attribution dashboard. Identify the top request types by total spend — these are the highest-leverage optimization targets.
For the top-cost request types, build evaluation datasets and measure current quality, using LangSmith or a custom eval pipeline. This baseline is what optimizations must preserve.
Apply targeted optimizations: model routing for complexity-stratified requests, LLMLingua prompt compression, prompt caching configuration, semantic caching with tuned similarity threshold, batch endpoint migration for async-eligible workloads.
Model the cost at current and projected volume for API vs. self-hosted vLLM/TGI. Include infrastructure and operational costs, not just GPU cost. Produce the crossover point with confidence ranges.
Re-measure quality after each optimization. Accept if quality meets threshold. Rollback if it does not. Maintain cost and quality dashboards with alerting when cost per request drifts up or quality metrics decline.
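The instrumentation and cost-attribution steps above can be sketched as a thin wrapper around each LLM call. This is a minimal illustration under assumptions: the `PRICE_PER_1K` figures and the `call_llm` stub are placeholders, not current provider pricing or a real client.

```python
import time
from collections import defaultdict

# Assumed per-1K-token (input, output) prices for illustration only;
# check your provider's current pricing.
PRICE_PER_1K = {"gpt-4o": (0.0025, 0.010), "gpt-4o-mini": (0.00015, 0.0006)}

LOG = []  # in production this feeds your metrics pipeline, not a list


def instrumented_call(model, task_type, prompt, call_llm):
    """Wrap an LLM call, recording tokens, cost, latency, and task type."""
    start = time.monotonic()
    response, tokens_in, tokens_out = call_llm(model, prompt)
    price_in, price_out = PRICE_PER_1K[model]
    cost = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
    LOG.append({
        "model": model, "task_type": task_type,
        "tokens_in": tokens_in, "tokens_out": tokens_out,
        "cost": cost, "latency_s": time.monotonic() - start,
    })
    return response


def cost_by_task_type():
    """Aggregate spend per request type -- the audit's key breakdown."""
    totals = defaultdict(float)
    for record in LOG:
        totals[record["task_type"]] += record["cost"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

The sorted breakdown is what surfaces the highest-leverage optimization targets: the request types at the top of the list are where routing, caching, or compression pays back first.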
- 01
Model routing tiers
We train a small, fast classifier that tags each request by complexity and routes it to the cheapest model that can handle it — GPT-4o-mini or Claude Haiku for straightforward tasks, larger models only when justified. The classifier's inference cost is typically recovered within the first few dozen routed calls. Routing thresholds are tuned against your quality baselines, not set arbitrarily.
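Before the trained classifier exists, a routing tier can start as a heuristic pre-classifier. The sketch below is a heuristic stand-in using length and keyword cues, not the trained model described above; the tier names and cue list are illustrative assumptions.

```python
SIMPLE_MODEL = "gpt-4o-mini"   # illustrative cheap tier
COMPLEX_MODEL = "gpt-4o"       # illustrative capable tier

# Cues that often indicate a request needs the larger model (assumed list).
COMPLEX_CUES = ("prove", "step by step", "analyze", "compare", "legal", "refactor")


def route(query: str, max_simple_len: int = 300) -> str:
    """Heuristic stand-in for a trained complexity classifier:
    long queries or queries with reasoning cues go to the larger model."""
    lowered = query.lower()
    if len(query) > max_simple_len or any(cue in lowered for cue in COMPLEX_CUES):
        return COMPLEX_MODEL
    return SIMPLE_MODEL
```

In practice the cue list and length cutoff are replaced by a trained classifier, and both are tuned against the quality baselines rather than set by hand.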
- 02
Semantic caching
We deploy a semantic cache that stores LLM responses indexed by embedding similarity — incoming requests within a configurable cosine distance of a cached result return without making an API call. Similarity thresholds and TTL policies are calibrated to your use case: a customer support bot tolerates different staleness risk than a legal document assistant. Cache hit rates and latency impact are instrumented from day one.
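The cache lookup reduces to a nearest-neighbor search over stored embeddings with a similarity cutoff and a TTL check. A minimal in-memory sketch, assuming an `embed` function supplied by your embedding model (a real deployment would use Redis or similar, not a Python list):

```python
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92, ttl_s=3600):
        self.embed, self.threshold, self.ttl_s = embed, threshold, ttl_s
        self.entries = []  # (embedding, response, stored_at)

    def get(self, query):
        """Return a cached response if a fresh entry is similar enough."""
        query_emb = self.embed(query)
        now = time.time()
        best, best_sim = None, 0.0
        for emb, response, stored_at in self.entries:
            if now - stored_at > self.ttl_s:
                continue  # expired; staleness tolerance is use-case specific
            sim = cosine(query_emb, emb)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = response, sim
        return best

    def put(self, query, response):
        self.entries.append((self.embed(query), response, time.time()))
```

The `threshold` and `ttl_s` parameters are exactly the knobs described above: loosen the threshold and hit rates rise but so does the risk of returning a semantically different answer.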
- 03
Prompt compression and caching
LLMLingua compresses verbose system prompts by 2-5x with measurable quality loss only above aggressive compression ratios — we find the safe ceiling for your prompts. For static or slowly-changing system prompts, we restructure them to qualify for Anthropic and OpenAI prompt caching, which saves 50-90% of input token costs on cached prefixes. Both optimizations are transparent to end users and reversible if baselines slip.
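The savings from provider-side prompt caching are straightforward to model. A sketch under assumed numbers: the discount rate, hit rate, token counts, and price are illustrative, and real provider pricing differs by model and cache tier.

```python
def cached_prompt_cost(system_tokens, user_tokens, price_per_1k_in,
                       cache_discount=0.5, hit_rate=0.9):
    """Expected input-token cost per call when the static system-prompt
    prefix is served from the provider's prompt cache on hit_rate of calls.
    cache_discount=0.5 is an assumed discount, not a quoted rate."""
    full = (system_tokens + user_tokens) / 1000 * price_per_1k_in
    cached = (system_tokens * (1 - cache_discount) + user_tokens) / 1000 * price_per_1k_in
    return hit_rate * cached + (1 - hit_rate) * full


# Illustrative: 4,000-token system prompt, 300-token user turn,
# $0.0025 per 1K input tokens.
baseline = (4000 + 300) / 1000 * 0.0025
optimized = cached_prompt_cost(4000, 300, 0.0025)
```

The larger the static prefix relative to the per-turn payload, the more of the input bill the cache absorbs, which is why restructuring prompts to maximize the stable prefix matters.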
- 04
vLLM self-hosting crossover analysis
We model the fully-loaded cost of self-hosting your model on vLLM or TGI — GPU instance type, reserved vs. on-demand pricing, operational overhead, and engineering maintenance — against your current and projected API spend. The crossover point is specific to your model, request volume, latency SLA, and team capacity; there's no generic answer. The output is a spreadsheet model you can update as your traffic grows.
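The crossover analysis reduces to comparing a volume-proportional API cost against a mostly fixed self-hosting cost. A simplified sketch with illustrative numbers; the real model also needs latency SLAs, redundancy, autoscaling limits, and engineering time.

```python
def monthly_api_cost(requests, tokens_per_request, price_per_1k):
    """API spend scales linearly with request volume."""
    return requests * tokens_per_request / 1000 * price_per_1k


def monthly_selfhost_cost(gpu_hourly, gpus, ops_overhead):
    """Self-hosting is dominated by fixed GPU and operational cost."""
    return gpu_hourly * gpus * 24 * 30 + ops_overhead


def crossover_requests(tokens_per_request, price_per_1k,
                       gpu_hourly, gpus, ops_overhead):
    """Monthly request volume above which self-hosting is cheaper,
    assuming the GPU fleet can actually absorb that volume."""
    fixed = monthly_selfhost_cost(gpu_hourly, gpus, ops_overhead)
    per_request = tokens_per_request / 1000 * price_per_1k
    return fixed / per_request
```

With assumed inputs of 2,000 tokens per request at $0.005 per 1K, two GPUs at $2/hour, and $3,000/month of operational overhead, the crossover lands at 588,000 requests per month; every input here is a placeholder you would replace with audited figures.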
- 05
Quality-gated optimization
We build evaluation datasets for your top-cost request types before implementing any optimization — not after. Every change ships only if it clears defined quality thresholds on those evals: ROUGE, embedding similarity, or task-specific metrics depending on what your outputs require. If a cost reduction degrades measurable quality below threshold, it doesn't ship.
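The gate itself is a small amount of logic; the hard part is building the eval dataset. A minimal sketch, assuming your task-specific metric produces a per-example score in [0, 1] and that a 0.02 mean drop is the agreed tolerance (both are assumptions to replace with your own thresholds):

```python
def quality_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Accept an optimization only if mean eval quality stays within
    max_drop of the pre-optimization baseline."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return candidate >= baseline - max_drop


def decide(baseline_scores, candidate_scores, cost_saving_pct):
    """Ship-or-rollback decision: cost savings never override the gate."""
    if quality_gate(baseline_scores, candidate_scores):
        return f"ship: {cost_saving_pct:.0f}% saving, quality held"
    return "rollback: quality below threshold"
```

The point of encoding the decision is that it removes discretion at ship time: a 60% saving that fails the gate gets the same rollback as a 5% one.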
- Cost audit broken down by request type, model, and optimization opportunity rank
- Quality evaluation dataset built before any optimization ships
- Model routing implementation: classifier, thresholds, and routing logic
- Semantic cache deployment with similarity threshold and TTL tuning
- vLLM/TGI self-hosting crossover model with infrastructure cost breakdown
- Cost and quality monitoring dashboard with alerting on regression
Workloads with mixed request complexity typically see 50–80% cost reduction after routing and caching. The distribution matters more than the average: most savings come from correctly identifying the 60–70% of queries that a much cheaper model handles equally well.
Frequently asked questions
How much can we realistically reduce AI costs?
It varies significantly by starting state. Applications using GPT-4o for every request, with verbose prompts and no caching, have more headroom than applications already using model routing and caching. We do not quote percentage savings before auditing: the audit is how we establish what is achievable for your specific workload, because the variables matter too much for generic ranges to be useful.
What is LLMLingua and how does prompt compression work?
LLMLingua (Microsoft Research) compresses long prompts by identifying and removing tokens with low information content — the filler words, redundant context, and verbose phrasing that LLMs carry without significantly affecting output quality. Compression ratios of 2-4x are common on verbose system prompts with minimal quality impact on most tasks. We measure quality before and after compression — not all prompts compress well.
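As a toy illustration of the idea only (not the actual LLMLingua algorithm, which scores token importance with a small language model rather than a fixed word list), dropping low-information filler already shrinks a verbose prompt noticeably:

```python
# Toy stand-in for information-based token pruning. The FILLER set is an
# illustrative assumption; LLMLingua decides what to drop per-prompt.
FILLER = {"please", "kindly", "very", "really", "just", "basically",
          "in", "order", "to", "that", "the", "a", "an"}


def toy_compress(prompt: str) -> str:
    """Keep only words outside the low-information filler set."""
    kept = [w for w in prompt.split() if w.lower().strip(",.") not in FILLER]
    return " ".join(kept)


verbose = ("Please kindly summarize the following document in order to "
           "produce a very concise answer")
```

Here `toy_compress` cuts 14 words to 6 while keeping the instruction intact; real compressors make this trade-off token by token, which is why quality must be measured per prompt rather than assumed.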
Does model routing degrade user experience?
If implemented correctly, no — users should not notice which model handled their request. The key is accurate routing: requests that require large model capability must be routed to large models. Errors in routing classification that send complex requests to small models produce visible quality degradation. Quality measurement before and after routing implementation is not optional.
What about fine-tuning smaller models as a cost optimization?
Fine-tuning a smaller model on a specific task can produce quality that matches larger general-purpose models at a fraction of the inference cost. This is viable for well-defined, high-volume tasks with sufficient training data. It requires more upfront investment — data collection, training, evaluation — but pays back at sufficient volume. We evaluate this option as part of the cost optimization audit.
Ready to get started?
Tell us what you are building. We will scope it, price it honestly, and give you a clear plan.
Free 30-min scoping call
