Rate Limiting Patterns That Actually Scale
Rate limiting sounds simple until you hit distributed systems, bursty AI agent traffic, and token-based billing models. This is a precise look at the four core algorithms and when each one is the right answer.

Every API needs rate limiting. Most teams implement it once, with fixed-window counters, and do not revisit it until something breaks — usually when a customer's integration loops, an AI agent sends 10,000 requests in a minute, or a competitor scrapes the API dry. By then, the problem is harder to fix than it would have been to get right initially.
This is a tour through the four algorithms that matter in production, the infrastructure decisions that determine which one you can implement, and the specific challenges that AI agent traffic introduces to rate limiting systems that were designed for human-paced API consumers.
The Four Algorithms
Fixed Window
Fixed window counting is the simplest: count requests in a time bucket (say, 100 requests per minute), reset the counter at the start of each bucket, reject requests once the counter hits the limit. It is easy to implement in any caching layer and requires minimal state.
The problem is the boundary attack. A client can make 100 requests at 11:59:59 and another 100 at 12:00:01, sending 200 requests in a roughly two-second window while technically staying within the per-minute limit. For most B2B APIs this is an acceptable tradeoff. For systems where the downstream cost is per-call (LLM APIs, payment processors), it is not.
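As a concrete sketch, the fixed-window check fits in a few lines. This single-process Python version (class and parameter names are illustrative, with an injectable clock for testing) stands in for the shared counter you would normally keep in Redis or memcached:

```python
import time

class FixedWindowLimiter:
    """Minimal single-process sketch of a fixed-window counter.
    In production the counter would live in a shared store such as Redis."""

    def __init__(self, limit: int, window_seconds: int, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable for deterministic tests
        self.counts = {}            # (key, window_start) -> request count

    def allow(self, key: str) -> bool:
        # Bucket the current time into a discrete window.
        window_start = int(self.clock()) // self.window * self.window
        bucket = (key, window_start)
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self.counts[bucket] = count + 1
        return True
```

The counter resets implicitly at each window boundary, which is exactly what enables the boundary attack described above.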
Sliding Window Log
The sliding window log stores a timestamp for every request within the window and counts how many fall within the last N seconds. Perfectly accurate. The memory cost grows linearly with request volume — storing timestamps for every request from every client at scale requires significant Redis memory. Viable for low-traffic APIs, impractical at millions of requests per hour.
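A single-process sketch makes the memory tradeoff visible: every allowed request leaves a timestamp behind. Names and structure here are illustrative:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sketch of the sliding-window-log algorithm: one timestamp per request.
    Memory grows with request volume, which is this algorithm's main cost."""

    def __init__(self, limit: int, window_seconds: float, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = {}  # key -> deque of request timestamps

    def allow(self, key: str) -> bool:
        now = self.clock()
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have fallen out of the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```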
Sliding Window Counter
The hybrid: store only two counters (current window count, previous window count) and approximate the sliding window by weighting the previous counter by how much of it still overlaps the current window. Nearly all of the accuracy of the sliding window log at O(1) memory. This is what most production systems use.
Token Bucket
A client starts with N tokens. Each request consumes one token. Tokens refill at a fixed rate. The bucket has a maximum capacity. This is the algorithm Stripe uses in production, and it is the right model for APIs that need to allow bursts while maintaining average rate limits.
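The refill-on-read trick means no background timer is needed: each check computes how many tokens accrued since the last one. A minimal single-process Python sketch (names illustrative; the distributed version is covered in the Redis section below):

```python
import time

class TokenBucket:
    """Single-process token bucket sketch: capacity B, refill rate R tokens/sec.
    Tokens are refilled lazily on each check rather than by a timer."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.time):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

The `cost` parameter is what later makes this algorithm a natural fit for token-based LLM rate limiting: an expensive request can simply consume more than one token.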
“Token bucket allows short bursts while enforcing average throughput. Fixed window enforces hard limits. The right choice depends on whether your cost model is burst-sensitive or average-sensitive.”
| Algorithm | Memory per client | Burst handling | Accuracy | Implementation complexity |
|---|---|---|---|---|
| Fixed window | O(1) | Poor (boundary attack) | Approximate | Low |
| Sliding window log | O(requests) | Good | Exact | Medium |
| Sliding window counter | O(1) | Good | High (approximate) | Medium |
| Token bucket | O(1) | Excellent (configurable) | Exact for averages | Medium |
Redis and Lua for Distributed Atomic Rate Limiting
The implementation challenge in distributed systems is atomicity. If you have three API server instances and each instance checks and updates a shared Redis counter independently, you can have race conditions where multiple servers read the same count, both decide the request is below the limit, and both allow it — resulting in over-limit traffic getting through.
The fix is Lua scripts in Redis. Redis executes Lua scripts atomically — no other commands can interleave during script execution. Your entire check-and-increment logic runs as a single atomic operation. This is the correct approach for any distributed rate limiting implementation.
A token bucket implementation in Redis Lua: the script reads the current token count and last refill timestamp, calculates how many tokens should have refilled since the last call, adds them (up to the bucket capacity), consumes one token for the current request, writes back the new state with a TTL, and returns whether the request was allowed. All atomically.
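A sketch of such a script, embedded as a string for use with redis-py. The key layout (a hash with `tokens` and `ts` fields) and the ARGV order are assumptions for illustration, not a canonical implementation:

```python
# Illustrative Lua script for the refill-and-consume steps described above.
# KEYS[1] = bucket key; ARGV = capacity, refill rate (tokens/sec), now (sec), TTL (sec)
TOKEN_BUCKET_LUA = """
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now
-- Refill proportionally to elapsed time, capped at bucket capacity.
tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], ARGV[4])
return allowed
"""

# With redis-py, register once and call per request; the whole script runs atomically:
#   script = client.register_script(TOKEN_BUCKET_LUA)
#   allowed = script(keys=["rl:" + api_key], args=[100, 10, time.time(), 3600]) == 1
```

Because Redis executes the script as one unit, concurrent requests cannot interleave between the read and the write, which is the race condition described above.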
HTTP 429 Best Practices
The response format matters as much as the enforcement logic. A 429 with no context forces clients into exponential backoff guessing. A 429 with proper headers lets well-behaved clients adapt immediately.
- Retry-After: 42 — seconds until the client can retry (or an HTTP date)
- X-RateLimit-Limit: 100 — the maximum allowed in the window
- X-RateLimit-Remaining: 0 — how many remain in the current window
- X-RateLimit-Reset: 1710000000 — Unix timestamp when the window resets
- Include a body with a machine-readable error code and human-readable message
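Assembled in code, a compliant 429 might be built like this. This is a framework-agnostic Python sketch; the error-code string and body shape are illustrative conventions, and X-RateLimit-* headers are a de facto convention rather than an IETF standard:

```python
import json
import time

def make_429(limit, reset_epoch, now=None):
    """Build a 429 response carrying the headers listed above."""
    now = time.time() if now is None else now
    headers = {
        "Retry-After": str(max(0, int(reset_epoch - now))),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",  # machine-readable code
        "message": "Rate limit exceeded; retry after the indicated delay.",
    })
    return 429, headers, body
```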
Kong and Apache APISIX both have rate limiting plugins that handle this header injection automatically. Kong's rate-limiting plugin implements fixed-window counting, with a sliding window available in the commercial rate-limiting-advanced plugin. APISIX's limit-req plugin implements the leaky bucket algorithm, and limit-count implements window counting. If you are already running a gateway, use the plugin — do not reimplement in application code.
AI Agent Traffic Is a Different Problem
Classical rate limiting counts requests. AI-era rate limiting needs to count tokens, because an LLM API request carrying a 100-token prompt has a fundamentally different cost profile than one carrying a 50,000-token context window. A limit of "100 requests per minute" means nothing when one request can consume the equivalent resources of a thousand others.
For APIs consumed by AI agents rather than humans, the practical recommendations are:
- Rate limit on token consumption, not request count, where possible
- Use token bucket with a burst capacity of 5-10x the per-second average
- Implement a separate rate limit tier for agent API keys, distinct from human API keys
- Monitor for runaway agent loops (1000+ requests in 10 minutes from a single key is almost always a bug, not legitimate load)
Implementing AI-Aware Rate Limiting
Return X-Tokens-Used and X-Tokens-Remaining headers so AI agent frameworks can self-throttle. Client frameworks such as LangChain and AutoGen can be configured to respect signals like these when present.
Track cumulative token spend per key per day in addition to per-second rate limits. This catches agents that are technically within per-second limits but burning through a month of budget in a day.
A key that sends exactly 99 requests per minute for an hour is suspicious even if it never hits a 100/min limit. Build anomaly detection alongside threshold enforcement.
At 80% of the rate limit, return warnings in the response body. At 95%, start adding artificial latency. At 100%, return 429. This gives well-behaved agents a chance to back off before being cut off.
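The graduated policy above reduces to a small mapping from utilisation to action. The threshold values and action names here follow the article's suggestion and are tunable, not fixed:

```python
def enforcement_action(used: int, limit: int) -> str:
    """Map current utilisation to a graduated enforcement action:
    warn at 80%, slow down at 95%, reject at 100%."""
    utilisation = used / limit
    if utilisation >= 1.0:
        return "reject_429"
    if utilisation >= 0.95:
        return "add_latency"
    if utilisation >= 0.80:
        return "warn_in_body"
    return "allow"
```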
Gateway vs Application Rate Limiting
The right place to implement rate limiting is at the gateway, not in application code. Gateway rate limiting catches traffic before it hits your application servers, can be updated without code deploys, and can share state across services automatically. Kong, APISIX, Traefik, and AWS API Gateway all support rate limiting at the gateway layer.
Application-level rate limiting (middleware in Express, FastAPI, or Go) is appropriate for cases where the rate limit logic is business logic — for example, limiting a specific user to 10 AI generations per day as part of a subscription tier. That is not a traffic management concern; it is a product concern that belongs in application code.
The common failure mode is implementing rate limiting twice — once at the gateway for DDoS protection and once in the application for business logic — without the two systems being aware of each other. A user who hits both limits in sequence gets degraded service that is hard to debug. Keep the two concerns separate and document which layer enforces which limit.
Algorithm Deep-Dive: Token Bucket vs Sliding Window vs Fixed Window
Rate limiting sounds simple until you need to choose an algorithm under production constraints. The three dominant approaches each make different tradeoffs between accuracy, memory, and implementation complexity.
Fixed window divides time into discrete buckets (e.g., one bucket per minute). A counter increments per request and resets at the window boundary. Memory: O(1) per key. Problem: a client can send N requests at the end of window 1 and N requests at the start of window 2 — effectively 2N requests in a two-second window around the boundary. For most consumer APIs this is acceptable. For financial APIs or abuse-prone endpoints, it is not.
Sliding window log tracks the exact timestamp of every request. To check the limit, count requests in the last N seconds. Memory: O(requests per window) per key — at high request rates this becomes expensive. Redis sorted sets implement this precisely: ZADD requests:user123 <timestamp> <uuid>, then ZCOUNT with the time range. Accurate to the millisecond, but memory scales with request volume.
Sliding window counter is a hybrid: it blends the current window's count with the previous window's count, weighted by how far into the current window you are. If the window is 60 seconds and you are 30 seconds in, the effective count is: (previous_count × 0.5) + current_count. Cloudflare uses this algorithm at scale and found that only ~0.003% of requests were wrongly allowed or denied compared with an exact sliding window log, at a fraction of the memory cost.
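The weighting can be computed directly; a tiny helper (function name illustrative) makes the arithmetic explicit:

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            elapsed_in_window: float, window: float) -> float:
    """Estimate requests in the sliding window: the previous window's count
    is scaled by the fraction of it that still overlaps the sliding window."""
    overlap = 1.0 - (elapsed_in_window / window)
    return prev_count * overlap + curr_count
```

With a 60-second window, 30 seconds in, 100 requests in the previous window and 20 so far in the current one, the estimate is 100 × 0.5 + 20 = 70.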
Token bucket: a bucket holds a maximum of B tokens, refilled at rate R tokens per second. Each request consumes one token (or more for expensive operations). If the bucket is empty, the request is denied. Token bucket allows bursting up to B requests — useful for clients with bursty-but-bounded traffic patterns. Implementation: store (tokens, last_refill_timestamp) per key. On each request, calculate tokens added since last refill, cap at B, subtract 1.
| Algorithm | Memory per key | Burst behaviour | Boundary spike risk | Best for |
|---|---|---|---|---|
| Fixed window | O(1) | Full limit at window start | Yes — 2x at boundary | Internal APIs, low-stakes rate limiting |
| Sliding window log | O(requests/window) | None — per-request accuracy | No | Billing APIs, high-accuracy requirements |
| Sliding window counter | O(1) | Minimal — approximate | Negligible | General-purpose, high-scale |
| Token bucket | O(1) | Configurable burst depth | No | Bursty clients, streaming APIs |
| Leaky bucket | O(queue size) | Smoothed output rate | No | Traffic shaping, bandwidth throttling |
Redis Implementation: Sliding Window with Sorted Sets
A production Redis implementation of the sliding window log algorithm uses sorted sets (ZSET). The score is the request timestamp in milliseconds, and the member is a unique request identifier. The entire check-and-increment is wrapped in a Lua script to ensure atomicity — a critical detail that most blog posts skip.
The Lua script: remove all members older than (now - window_ms), count remaining members, if count < limit then add the new request and return 1 (allowed), else return 0 (denied). Execute as a single EVALSHA call. The atomicity guarantee means concurrent requests from the same client cannot race past the limit.
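The script described above, sketched as redis-py-ready Lua. The key and argument layout are illustrative assumptions for this sketch:

```python
# KEYS[1] = sorted-set key; ARGV = now (ms), window (ms), limit, unique request id
SLIDING_WINDOW_LUA = """
local cutoff = tonumber(ARGV[1]) - tonumber(ARGV[2])
-- Drop timestamps that have aged out of the window.
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', cutoff)
if redis.call('ZCARD', KEYS[1]) < tonumber(ARGV[3]) then
  redis.call('ZADD', KEYS[1], tonumber(ARGV[1]), ARGV[4])
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
  return 1
end
return 0
"""

# Register with redis-py and invoke via EVALSHA under the hood:
#   script = client.register_script(SLIDING_WINDOW_LUA)
#   allowed = script(keys=["rl:user123"], args=[now_ms, 60000, 100, request_id]) == 1
```

The PEXPIRE keeps idle keys from accumulating: if a client goes quiet for a full window, its sorted set simply expires.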
For distributed deployments with multiple Redis nodes, use Redis Cluster with consistent hashing on the rate limit key (typically a combination of API key + endpoint). All requests for a given client-endpoint pair route to the same shard, preserving accuracy. Avoid MULTI/EXEC transactions across shards — they do not work in Redis Cluster mode.
Distributed Rate Limiting and Consistent Hashing
When your API fleet spans multiple data centres or availability zones, rate limiting becomes a distributed consistency problem. Three approaches exist, each with a clear tradeoff profile.
Centralised Redis: all rate limit checks go to a single Redis instance (or Redis Cluster). Accurate counts, but adds a network hop to every request. Each check costs 1-2ms of round-trip latency, and at 50,000 requests per second the rate-limit checks alone are a substantial load on the Redis cluster. Acceptable for most APIs; problematic for sub-10ms latency requirements.
Local counters with periodic sync: each API server maintains local counters and synchronises with a central store every few seconds. Allows brief over-limit bursting between sync intervals, but eliminates the per-request Redis hop. Used by companies like Stripe for high-throughput endpoints where brief bursting is acceptable.
Token bucket with consistent hashing: hash each client ID to a designated rate-limit node in your cluster. All requests from a given client route to that node for rate limit checks. No cross-node coordination needed. If a node fails, rehashing kicks in — requests temporarily bypass rate limiting during the rehash window. Pair this with circuit breakers at the API gateway layer to contain the blast radius of a rate-limit node failure.
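A minimal consistent-hash ring for routing clients to rate-limit nodes might look like this. The vnode count and hash function are arbitrary choices for the sketch; production rings tune both:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each client ID maps to one node,
    with virtual nodes to smooth the distribution."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node appears vnodes times on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, client_id: str) -> str:
        # First ring position at or after the client's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]
```

When a node is removed, only the clients that hashed to its ring positions move, which is exactly the rehash window during which requests may temporarily bypass rate limiting.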
Client-Side Rate Limiting
Server-side rate limiting is a backstop, not a first line of defence. Well-designed API clients implement their own rate limiting to avoid hitting server-side limits in the first place. The pattern: maintain a local token bucket that mirrors the server's limits (documented in the API spec), drain it with each request, and back off exponentially on 429 responses.
Exponential backoff with jitter is the correct retry strategy for rate-limited requests. Without jitter, all clients that hit the limit simultaneously retry at the same interval — causing a thundering herd that immediately re-triggers the limit. Add randomised jitter (uniform random between 0 and the backoff interval) to spread retries over time. AWS and Google Cloud SDK clients implement this pattern by default.
- Read the API's rate limit headers on every response (X-RateLimit-Remaining, Retry-After)
- Implement local token bucket mirroring documented limits before you hit them
- Use exponential backoff with jitter on 429 responses — never fixed retry intervals
- Circuit-break after N consecutive 429s — the server is telling you to stop
- Track rate limit consumption per endpoint, not globally — limits are often per-endpoint
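The backoff-with-jitter recommendation can be sketched in a few lines. This is the "full jitter" variant AWS documents; base and cap values are illustrative defaults:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0,
                        rng=random.random) -> float:
    """Full-jitter exponential backoff: sleep a uniform random duration
    between 0 and the exponentially growing (capped) interval."""
    interval = min(cap, base * (2 ** attempt))
    return rng() * interval
```

The `rng` parameter is injectable so tests are deterministic; in production you would leave the default.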
Rate Limiting for AI and LLM APIs: Token-Based vs Request-Based
LLM APIs introduce a new dimension: token consumption. A single GPT-4 request can consume 500 tokens or 32,000 tokens depending on prompt length and response size. Request-based rate limits (100 requests per minute) dramatically under-constrain high-token-consumption users and over-constrain low-consumption users. OpenAI, Anthropic, and Google all use a dual limit: requests per minute (RPM) and tokens per minute (TPM). Your rate limiter needs to track both.
For teams building on top of LLM APIs — whether through agentic systems or direct API calls — implement a token budget system: estimate token consumption before the request (from prompt length), check against your TPM budget, proceed or queue accordingly.
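A pre-flight check for the dual RPM/TPM limit might look like this. The ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and the function names are illustrative:

```python
def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-flight token estimate: ~4 characters per token for the
    prompt, plus the worst-case output allowance."""
    return len(prompt) // 4 + max_output_tokens

def check_tpm_budget(estimated_tokens: int, used_this_minute: int, tpm_limit: int,
                     rpm_used: int, rpm_limit: int) -> str:
    """Dual check: a request must fit both the RPM and the TPM budget."""
    if rpm_used >= rpm_limit:
        return "queue"
    if used_this_minute + estimated_tokens > tpm_limit:
        return "queue"
    return "proceed"
```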
Streaming responses complicate token counting — you do not know the final token count until the stream ends. Two approaches: optimistic budgeting (estimate, proceed, reconcile after) or conservative budgeting (reserve worst-case tokens upfront, release unused budget after). Optimistic budgeting has higher throughput; conservative budgeting prevents TPM overruns at the cost of reduced concurrency.
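The conservative approach can be sketched as a reserve-then-reconcile pair. Class and method names are illustrative; a production version would also expire reservations on a rolling one-minute basis:

```python
class TokenBudget:
    """Conservative streaming budget: reserve the worst case up front,
    release the unused portion once the stream finishes."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.reserved = 0

    def reserve(self, worst_case_tokens: int) -> bool:
        """Admit the request only if the worst case fits the budget."""
        if self.reserved + worst_case_tokens > self.tpm_limit:
            return False
        self.reserved += worst_case_tokens
        return True

    def reconcile(self, worst_case_tokens: int, actual_tokens: int) -> None:
        """After the stream ends, release the tokens that were not used."""
        self.reserved -= (worst_case_tokens - actual_tokens)
```

Optimistic budgeting would flip the order: proceed immediately and subtract the actual count in `reconcile`, accepting occasional overruns for higher concurrency.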
Graceful Degradation Under Rate Limits
Rate limiting is not just about blocking excess traffic — it is about what happens to the traffic you block. A well-designed rate limiter returns a 429 response with a Retry-After header, giving the client a concrete time to wait before retrying. Without this header, clients fall back on their own retry logic, often with no jitter, and when multiple clients hit the limit simultaneously they all retry at the same interval, producing a thundering herd.
For internal services, circuit breakers complement rate limiters. When a downstream service starts rejecting requests due to rate limits, the circuit breaker opens to prevent further calls, returns a cached or degraded response, and periodically checks whether the rate limit has reset. This pattern prevents cascading failures where rate limiting one service causes timeouts in every service that depends on it. Resilience4j (Java) and Polly (.NET) provide production-grade circuit breaker implementations; for Go services, sony/gobreaker is a popular choice.
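The open/half-open cycle described above fits in a small sketch. This is a deliberately minimal Python illustration of the pattern, not a substitute for the libraries named; thresholds and names are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for the rate-limit scenario: open after N
    consecutive 429s, then allow a probe through after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0, clock=time.time):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a probe through once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, rate_limited: bool) -> None:
        if not rate_limited:
            self.failures, self.opened_at = 0, None  # success closes the circuit
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

While the circuit is open, the caller serves a cached or degraded response instead of hitting the rate-limited dependency.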