
New ways to balance cost and reliability in the Gemini API

Read the full article on Google AI.

What Happened

Google is introducing two new inference tiers in the Gemini API, Flex and Priority, that let developers trade off cost against latency.

Our Take

Honestly, this is a long-overdue update to the Gemini API. Look, we've been complaining about the cost of running these large language models for years. The Flex and Priority tiers are a decent start, but we still need more transparency on pricing and a clear roadmap for cost-reducing features.

That being said, the Flex tier's quoted 50ms latency and 20% cost reduction are a step in the right direction. Still, we can't justify the cost of running these models for most clients without a significant return on investment.

What To Do

Start experimenting with the Flex tier to see if it's a viable option for your clients
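
If you want a concrete starting point, here is a minimal sketch using the google-genai Python SDK. The tier names and, in particular, the mechanism for selecting a tier are assumptions on our part; check Google's documentation for the actual parameter before sending real traffic through this.

```python
# Minimal trial harness for a latency-tolerant workload, assuming the
# google-genai Python SDK. HYPOTHETICAL: the tier names ("flex", "priority")
# and the mechanism for selecting a tier (request option, header, or separate
# endpoint) are placeholders; wire in the documented mechanism yourself.
from google import genai

client = genai.Client()  # reads the API key from the environment

def summarize(text: str, tier: str = "flex") -> str:
    # Batch-style, latency-tolerant work is the natural fit for a cheaper
    # Flex tier; keep user-facing requests on Priority or the default tier.
    # The `tier` argument is accepted here but not yet attached to the call;
    # substitute whatever selection mechanism Google documents.
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # substitute your production model
        contents=f"Summarize in two sentences: {text}",
    )
    return response.text

if __name__ == "__main__":
    print(summarize("Paste a latency-tolerant workload sample here."))
```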

Builder's Brief

Who

teams running high-volume Gemini API calls in production

What changes

Flex tier enables cheaper batch workloads; Priority tier unlocks latency guarantees worth evaluating for user-facing inference

When

now

Watch for

published SLA breach rates on Priority tier in the first 60 days — tells you whether the guarantee is real
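
You don't have to wait for published numbers, either: log your own latency distribution on Priority traffic and compare the tail against whatever SLA Google advertises. A stdlib-only sketch, where `call_model` is any function of yours that performs one inference request:

```python
# Track your own latency distribution so the advertised guarantee can be
# checked against real numbers. Stdlib only; `call_model` is any callable
# that performs one inference request (e.g. a Priority-tier Gemini call).
import statistics
import time
from typing import Callable, List

def measure_latencies(call_model: Callable[[str], str],
                      prompts: List[str]) -> dict:
    latencies_ms: List[float] = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)  # response discarded; only timing matters here
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": statistics.quantiles(latencies_ms, n=20)[18],  # 95th percentile
        "max_ms": max(latencies_ms),
    }

# e.g. measure_latencies(lambda p: summarize(p, tier="priority"), prompts)
```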

What Skeptics Say

Tiered inference pricing shifts cost complexity onto developers without guaranteeing meaningful latency SLAs on the Priority tier, and Google's track record of deprecating API tiers makes locking your architecture around Flex or Priority a real operational risk.
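
One mitigation for that lock-in risk: keep the tier decision behind a single mapping, so a retired tier becomes a one-line config change instead of an architecture migration. A sketch with illustrative names:

```python
# Isolate the tier decision so a deprecated tier is a config change, not a
# rewrite. All names here are illustrative, not part of any Google SDK.
from enum import Enum

class Tier(str, Enum):
    FLEX = "flex"
    PRIORITY = "priority"
    STANDARD = "standard"  # safe fallback if a tier is retired

# Single source of truth; in practice, load this from env or a config file.
TIER_BY_WORKLOAD = {
    "batch": Tier.FLEX,
    "user_facing": Tier.PRIORITY,
}

def tier_for(workload: str) -> Tier:
    # Unknown or deliberately unmapped workloads fall back to standard.
    return TIER_BY_WORKLOAD.get(workload, Tier.STANDARD)
```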
