Rate Limiting Patterns That Actually Scale
Rate limiting sounds simple until you hit distributed systems, bursty AI agent traffic, and token-based billing models. This is a precise look at the four core algorithms and when each one is the right answer.

Every API needs rate limiting. Most teams implement it once, with fixed-window counters, and do not revisit it until something breaks — usually when a customer's integration loops, an AI agent sends 10,000 requests in a minute, or a competitor scrapes the API dry. By then, the problem is harder to fix than it would have been to get right initially.
This is a tour through the four algorithms that matter in production, the infrastructure decisions that determine which one you can implement, and the specific challenges that AI agent traffic introduces to rate limiting systems that were designed for human-paced API consumers.
The Four Algorithms
Fixed Window
Fixed window counting is the simplest: count requests in a time bucket (say, 100 requests per minute), reset the counter at the start of each bucket, reject requests once the counter hits the limit. It is easy to implement in any caching layer and requires minimal state.
The problem is the boundary attack. A client can make 100 requests at 11:59:59 and another 100 at 12:00:01, sending 200 requests in a roughly two-second window while technically staying within the per-minute limit. For most B2B APIs this is an acceptable tradeoff. For systems where the downstream cost is per-call (LLM APIs, payment processors), it is not.
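As a concrete sketch, the fixed-window check fits in a few lines. This single-process Python version (class and parameter names are illustrative, with an injectable clock for testing) stands in for the shared counter you would normally keep in Redis or memcached:

```python
import time

class FixedWindowLimiter:
    """Minimal single-process sketch of a fixed-window counter.
    In production the counter would live in a shared store such as Redis."""

    def __init__(self, limit: int, window_seconds: int, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable for deterministic tests
        self.counts = {}            # (key, window_start) -> request count

    def allow(self, key: str) -> bool:
        # Bucket the current time into a discrete window.
        window_start = int(self.clock()) // self.window * self.window
        bucket = (key, window_start)
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self.counts[bucket] = count + 1
        return True
```

The counter resets implicitly at each window boundary, which is exactly what enables the boundary attack described above.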
Sliding Window Log
The sliding window log stores a timestamp for every request within the window and counts how many fall within the last N seconds. Perfectly accurate. The memory cost grows linearly with request volume — storing timestamps for every request from every client at scale requires significant Redis memory. Viable for low-traffic APIs, impractical at millions of requests per hour.
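A single-process sketch makes the memory tradeoff visible: every allowed request leaves a timestamp behind. Names and structure here are illustrative:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sketch of the sliding-window-log algorithm: one timestamp per request.
    Memory grows with request volume, which is this algorithm's main cost."""

    def __init__(self, limit: int, window_seconds: float, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = {}  # key -> deque of request timestamps

    def allow(self, key: str) -> bool:
        now = self.clock()
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have fallen out of the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```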
Sliding Window Counter
The hybrid: store only two counters (current window count, previous window count) and approximate the sliding window by weighting the previous counter by how much of it still overlaps the current window. Nearly all of the accuracy of the sliding window log at O(1) memory. This is what most production systems use.
Token Bucket
A client starts with N tokens. Each request consumes one token. Tokens refill at a fixed rate. The bucket has a maximum capacity. This is the algorithm Stripe uses in production, and it is the right model for APIs that need to allow bursts while maintaining average rate limits.
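The refill-on-read trick means no background timer is needed: each check computes how many tokens accrued since the last one. A minimal single-process Python sketch (names illustrative; the distributed version is covered in the Redis section below):

```python
import time

class TokenBucket:
    """Single-process token bucket sketch: capacity B, refill rate R tokens/sec.
    Tokens are refilled lazily on each check rather than by a timer."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.time):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

The `cost` parameter is what later makes this algorithm a natural fit for token-based LLM rate limiting: an expensive request can simply consume more than one token.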
“Token bucket allows short bursts while enforcing average throughput. Fixed window enforces hard limits. The right choice depends on whether your cost model is burst-sensitive or average-sensitive.”
| Algorithm | Memory per client | Burst handling | Accuracy | Implementation complexity |
|---|---|---|---|---|
| Fixed window | O(1) | Poor (boundary attack) | Approximate | Low |
| Sliding window log | O(requests) | Good | Exact | Medium |
| Sliding window counter | O(1) | Good | High (approximate) | Medium |
| Token bucket | O(1) | Excellent (configurable) | Exact for averages | Medium |
Redis and Lua for Distributed Atomic Rate Limiting
The implementation challenge in distributed systems is atomicity. If you have three API server instances and each instance checks and updates a shared Redis counter independently, you can have race conditions where multiple servers read the same count, both decide the request is below the limit, and both allow it — resulting in over-limit traffic getting through.
The fix is Lua scripts in Redis. Redis executes Lua scripts atomically — no other commands can interleave during script execution. Your entire check-and-increment logic runs as a single atomic operation. This is the correct approach for any distributed rate limiting implementation.
A token bucket implementation in Redis Lua: the script reads the current token count and last refill timestamp, calculates how many tokens should have refilled since the last call, adds them (up to the bucket capacity), consumes one token for the current request, writes back the new state with a TTL, and returns whether the request was allowed. All atomically.
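A sketch of such a script, embedded as a string for use with redis-py. The key layout (a hash with `tokens` and `ts` fields) and the ARGV order are assumptions for illustration, not a canonical implementation:

```python
# Illustrative Lua script for the refill-and-consume steps described above.
# KEYS[1] = bucket key; ARGV = capacity, refill rate (tokens/sec), now (sec), TTL (sec)
TOKEN_BUCKET_LUA = """
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now
-- Refill proportionally to elapsed time, capped at bucket capacity.
tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], ARGV[4])
return allowed
"""

# With redis-py, register once and call per request; the whole script runs atomically:
#   script = client.register_script(TOKEN_BUCKET_LUA)
#   allowed = script(keys=["rl:" + api_key], args=[100, 10, time.time(), 3600]) == 1
```

Because Redis executes the script as one unit, concurrent requests cannot interleave between the read and the write, which is the race condition described above.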
HTTP 429 Best Practices
The response format matters as much as the enforcement logic. A 429 with no context forces clients into exponential backoff guessing. A 429 with proper headers lets well-behaved clients adapt immediately.
- Retry-After: 42 — seconds until the client can retry (or an HTTP date)
- X-RateLimit-Limit: 100 — the maximum allowed in the window
- X-RateLimit-Remaining: 0 — how many remain in the current window
- X-RateLimit-Reset: 1710000000 — Unix timestamp when the window resets
- Include a body with a machine-readable error code and human-readable message
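Assembled in code, a compliant 429 might be built like this. This is a framework-agnostic Python sketch; the error-code string and body shape are illustrative conventions, and X-RateLimit-* headers are a de facto convention rather than an IETF standard:

```python
import json
import time

def make_429(limit, reset_epoch, now=None):
    """Build a 429 response carrying the headers listed above."""
    now = time.time() if now is None else now
    headers = {
        "Retry-After": str(max(0, int(reset_epoch - now))),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",  # machine-readable code
        "message": "Rate limit exceeded; retry after the indicated delay.",
    })
    return 429, headers, body
```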
Kong and Apache APISIX both have rate limiting plugins that handle this header injection automatically. Kong's rate-limiting plugin implements fixed-window counting, with a sliding window available in the commercial rate-limiting-advanced plugin. APISIX's limit-req plugin implements the leaky bucket algorithm, and limit-count implements window counting. If you are already running a gateway, use the plugin — do not reimplement in application code.
AI Agent Traffic Is a Different Problem
Classical rate limiting counts requests. AI-era rate limiting needs to count tokens, because an LLM API request carrying a 100-token prompt has a fundamentally different cost profile than one carrying a 50,000-token context window. A limit of "100 requests per minute" means nothing when one request can consume the equivalent resources of a thousand others.
For APIs consumed by AI agents rather than humans, the practical recommendations are:
- Rate limit on token consumption, not request count, where possible
- Use token bucket with a burst capacity of 5-10x the per-second average
- Implement a separate rate limit tier for agent API keys, distinct from human API keys
- Monitor for runaway agent loops (1000+ requests in 10 minutes from a single key is almost always a bug, not legitimate load)
Implementing AI-Aware Rate Limiting
Return X-Tokens-Used and X-Tokens-Remaining headers so AI agent frameworks can self-throttle. Client frameworks such as LangChain and AutoGen can be configured to respect signals like these when present.
Track cumulative token spend per key per day in addition to per-second rate limits. This catches agents that are technically within per-second limits but burning through a month of budget in a day.
A key that sends exactly 99 requests per minute for an hour is suspicious even if it never hits a 100/min limit. Build anomaly detection alongside threshold enforcement.
At 80% of the rate limit, return warnings in the response body. At 95%, start adding artificial latency. At 100%, return 429. This gives well-behaved agents a chance to back off before being cut off.
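The graduated policy above reduces to a small mapping from utilisation to action. The threshold values and action names here follow the article's suggestion and are tunable, not fixed:

```python
def enforcement_action(used: int, limit: int) -> str:
    """Map current utilisation to a graduated enforcement action:
    warn at 80%, slow down at 95%, reject at 100%."""
    utilisation = used / limit
    if utilisation >= 1.0:
        return "reject_429"
    if utilisation >= 0.95:
        return "add_latency"
    if utilisation >= 0.80:
        return "warn_in_body"
    return "allow"
```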
Gateway vs Application Rate Limiting
The right place to implement rate limiting is at the gateway, not in application code. Gateway rate limiting catches traffic before it hits your application servers, can be updated without code deploys, and can share state across services automatically. Kong, APISIX, Traefik, and AWS API Gateway all support rate limiting at the gateway layer.
Application-level rate limiting (middleware in Express, FastAPI, or Go) is appropriate for cases where the rate limit logic is business logic — for example, limiting a specific user to 10 AI generations per day as part of a subscription tier. That is not a traffic management concern; it is a product concern that belongs in application code.
The common failure mode is implementing rate limiting twice — once at the gateway for DDoS protection and once in the application for business logic — without the two systems being aware of each other. A user who hits both limits in sequence gets degraded service that is hard to debug. Keep the two concerns separate and document which layer enforces which limit.
Algorithm Deep-Dive: Token Bucket vs Sliding Window vs Fixed Window
Rate limiting sounds simple until you need to choose an algorithm under production constraints. The three dominant approaches each make different tradeoffs between accuracy, memory, and implementation complexity.
Fixed window divides time into discrete buckets (e.g., one bucket per minute). A counter increments per request and resets at the window boundary. Memory: O(1) per key. Problem: a client can send N requests at the end of window 1 and N requests at the start of window 2 — effectively 2N requests in a two-second window around the boundary. For most consumer APIs this is acceptable. For financial APIs or abuse-prone endpoints, it is not.
Sliding window log tracks the exact timestamp of every request. To check the limit, count requests in the last N seconds. Memory: O(requests per window) per key — at high request rates this becomes expensive. Redis sorted sets implement this precisely: ZADD requests:user123 <timestamp> <uuid>, then ZCOUNT with the time range. Accurate to the millisecond, but memory scales with request volume.
Sliding window counter is a hybrid: it blends the current window's count with the previous window's count, weighted by how far into the current window you are. If the window is 60 seconds and you are 30 seconds in, the effective count is: (previous_count × 0.5) + current_count. Cloudflare uses this algorithm at scale and found that only ~0.003% of requests were wrongly allowed or denied compared with an exact sliding window log, at a fraction of the memory cost.
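The weighting can be computed directly; a tiny helper (function name illustrative) makes the arithmetic explicit:

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            elapsed_in_window: float, window: float) -> float:
    """Estimate requests in the sliding window: the previous window's count
    is scaled by the fraction of it that still overlaps the sliding window."""
    overlap = 1.0 - (elapsed_in_window / window)
    return prev_count * overlap + curr_count
```

With a 60-second window, 30 seconds in, 100 requests in the previous window and 20 so far in the current one, the estimate is 100 × 0.5 + 20 = 70.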
Token bucket: a bucket holds a maximum of B tokens, refilled at rate R tokens per second. Each request consumes one token (or more for expensive operations). If the bucket is empty, the request is denied. Token bucket allows bursting up to B requests — useful for clients with bursty-but-bounded traffic patterns. Implementation: store (tokens, last_refill_timestamp) per key. On each request, calculate tokens added since last refill, cap at B, subtract 1.
| Algorithm | Memory per key | Burst behaviour | Boundary spike risk | Best for |
|---|---|---|---|---|
| Fixed window | O(1) | Full limit at window start | Yes — 2x at boundary | Internal APIs, low-stakes rate limiting |
| Sliding window log | O(requests/window) | None — per-request accuracy | No | Billing APIs, high-accuracy requirements |
| Sliding window counter | O(1) | Minimal — approximate | Negligible | General-purpose, high-scale |
| Token bucket | O(1) | Configurable burst depth | No | Bursty clients, streaming APIs |
| Leaky bucket | O(queue size) | Smoothed output rate | No | Traffic shaping, bandwidth throttling |
Redis Implementation: Sliding Window with Sorted Sets
A production Redis implementation of the sliding window log algorithm uses sorted sets (ZSET). The score is the request timestamp in milliseconds, and the member is a unique request identifier. The entire check-and-increment is wrapped in a Lua script to ensure atomicity — a critical detail that most blog posts skip.
The Lua script: remove all members older than (now - window_ms), count remaining members, if count < limit then add the new request and return 1 (allowed), else return 0 (denied). Execute as a single EVALSHA call. The atomicity guarantee means concurrent requests from the same client cannot race past the limit.
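The script described above, sketched as redis-py-ready Lua. The key and argument layout are illustrative assumptions for this sketch:

```python
# KEYS[1] = sorted-set key; ARGV = now (ms), window (ms), limit, unique request id
SLIDING_WINDOW_LUA = """
local cutoff = tonumber(ARGV[1]) - tonumber(ARGV[2])
-- Drop timestamps that have aged out of the window.
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', cutoff)
if redis.call('ZCARD', KEYS[1]) < tonumber(ARGV[3]) then
  redis.call('ZADD', KEYS[1], tonumber(ARGV[1]), ARGV[4])
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
  return 1
end
return 0
"""

# Register with redis-py and invoke via EVALSHA under the hood:
#   script = client.register_script(SLIDING_WINDOW_LUA)
#   allowed = script(keys=["rl:user123"], args=[now_ms, 60000, 100, request_id]) == 1
```

The PEXPIRE keeps idle keys from accumulating: if a client goes quiet for a full window, its sorted set simply expires.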
For distributed deployments with multiple Redis nodes, use Redis Cluster with consistent hashing on the rate limit key (typically a combination of API key + endpoint). All requests for a given client-endpoint pair route to the same shard, preserving accuracy. Avoid MULTI/EXEC transactions across shards — they do not work in Redis Cluster mode.
Distributed Rate Limiting and Consistent Hashing
When your API fleet spans multiple data centres or availability zones, rate limiting becomes a distributed consistency problem. Three approaches exist, each with a clear tradeoff profile.
Centralised Redis: all rate limit checks go to a single Redis instance (or Redis Cluster). Accurate counts, but adds a network hop to every request. Each check costs 1-2ms of round-trip latency, and at 50,000 requests per second the rate-limit checks alone are a substantial load on the Redis cluster. Acceptable for most APIs; problematic for sub-10ms latency requirements.
Local counters with periodic sync: each API server maintains local counters and synchronises with a central store every few seconds. Allows brief over-limit bursting between sync intervals, but eliminates the per-request Redis hop. Used by companies like Stripe for high-throughput endpoints where brief bursting is acceptable.
Token bucket with consistent hashing: hash each client ID to a designated rate-limit node in your cluster. All requests from a given client route to that node for rate limit checks. No cross-node coordination needed. If a node fails, rehashing kicks in — requests temporarily bypass rate limiting during the rehash window. Pair this with circuit breakers at the API gateway layer to contain the blast radius of a rate-limit node failure.
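A minimal consistent-hash ring for routing clients to rate-limit nodes might look like this. The vnode count and hash function are arbitrary choices for the sketch; production rings tune both:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each client ID maps to one node,
    with virtual nodes to smooth the distribution."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node appears vnodes times on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, client_id: str) -> str:
        # First ring position at or after the client's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]
```

When a node is removed, only the clients that hashed to its ring positions move, which is exactly the rehash window during which requests may temporarily bypass rate limiting.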
Client-Side Rate Limiting
Server-side rate limiting is a backstop, not a first line of defence. Well-designed API clients implement their own rate limiting to avoid hitting server-side limits in the first place. The pattern: maintain a local token bucket that mirrors the server's limits (documented in the API spec), drain it with each request, and back off exponentially on 429 responses.
Exponential backoff with jitter is the correct retry strategy for rate-limited requests. Without jitter, all clients that hit the limit simultaneously retry at the same interval — causing a thundering herd that immediately re-triggers the limit. Add randomised jitter (uniform random between 0 and the backoff interval) to spread retries over time. AWS and Google Cloud SDK clients implement this pattern by default.
- Read the API's rate limit headers on every response (X-RateLimit-Remaining, Retry-After)
- Implement local token bucket mirroring documented limits before you hit them
- Use exponential backoff with jitter on 429 responses — never fixed retry intervals
- Circuit-break after N consecutive 429s — the server is telling you to stop
- Track rate limit consumption per endpoint, not globally — limits are often per-endpoint
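The backoff-with-jitter recommendation can be sketched in a few lines. This is the "full jitter" variant AWS documents; base and cap values are illustrative defaults:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0,
                        rng=random.random) -> float:
    """Full-jitter exponential backoff: sleep a uniform random duration
    between 0 and the exponentially growing (capped) interval."""
    interval = min(cap, base * (2 ** attempt))
    return rng() * interval
```

The `rng` parameter is injectable so tests are deterministic; in production you would leave the default.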
Rate Limiting for AI and LLM APIs: Token-Based vs Request-Based
LLM APIs introduce a new dimension: token consumption. A single GPT-4 request can consume 500 tokens or 32,000 tokens depending on prompt length and response size. Request-based rate limits (100 requests per minute) dramatically under-constrain high-token-consumption users and over-constrain low-consumption users. OpenAI, Anthropic, and Google all use a dual limit: requests per minute (RPM) and tokens per minute (TPM). Your rate limiter needs to track both.
For teams building on top of LLM APIs — whether through agentic systems or direct API calls — implement a token budget system: estimate token consumption before the request (from prompt length), check against your TPM budget, proceed or queue accordingly.
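A pre-flight check for the dual RPM/TPM limit might look like this. The ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and the function names are illustrative:

```python
def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-flight token estimate: ~4 characters per token for the
    prompt, plus the worst-case output allowance."""
    return len(prompt) // 4 + max_output_tokens

def check_tpm_budget(estimated_tokens: int, used_this_minute: int, tpm_limit: int,
                     rpm_used: int, rpm_limit: int) -> str:
    """Dual check: a request must fit both the RPM and the TPM budget."""
    if rpm_used >= rpm_limit:
        return "queue"
    if used_this_minute + estimated_tokens > tpm_limit:
        return "queue"
    return "proceed"
```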
Streaming responses complicate token counting — you do not know the final token count until the stream ends. Two approaches: optimistic budgeting (estimate, proceed, reconcile after) or conservative budgeting (reserve worst-case tokens upfront, release unused budget after). Optimistic budgeting has higher throughput; conservative budgeting prevents TPM overruns at the cost of reduced concurrency.
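The conservative approach can be sketched as a reserve-then-reconcile pair. Class and method names are illustrative; a production version would also expire reservations on a rolling one-minute basis:

```python
class TokenBudget:
    """Conservative streaming budget: reserve the worst case up front,
    release the unused portion once the stream finishes."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.reserved = 0

    def reserve(self, worst_case_tokens: int) -> bool:
        """Admit the request only if the worst case fits the budget."""
        if self.reserved + worst_case_tokens > self.tpm_limit:
            return False
        self.reserved += worst_case_tokens
        return True

    def reconcile(self, worst_case_tokens: int, actual_tokens: int) -> None:
        """After the stream ends, release the tokens that were not used."""
        self.reserved -= (worst_case_tokens - actual_tokens)
```

Optimistic budgeting would flip the order: proceed immediately and subtract the actual count in `reconcile`, accepting occasional overruns for higher concurrency.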
Graceful Degradation Under Rate Limits
Rate limiting is not just about blocking excess traffic — it is about what happens to the traffic you block. A well-designed rate limiter returns a 429 response with a Retry-After header, giving the client a concrete time to wait before retrying. Without this header, clients fall back on their own retry logic, often with no jitter, and when multiple clients hit the limit simultaneously they all retry at the same interval, producing a thundering herd.
For internal services, circuit breakers complement rate limiters. When a downstream service starts rejecting requests due to rate limits, the circuit breaker opens to prevent further calls, returns a cached or degraded response, and periodically checks whether the rate limit has reset. This pattern prevents cascading failures where rate limiting one service causes timeouts in every service that depends on it. Resilience4j (Java) and Polly (.NET) provide production-grade circuit breaker implementations; for Go services, sony/gobreaker is a popular choice.
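The open/half-open cycle described above fits in a small sketch. This is a deliberately minimal Python illustration of the pattern, not a substitute for the libraries named; thresholds and names are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for the rate-limit scenario: open after N
    consecutive 429s, then allow a probe through after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0, clock=time.time):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a probe through once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, rate_limited: bool) -> None:
        if not rate_limited:
            self.failures, self.opened_at = 0, None  # success closes the circuit
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

While the circuit is open, the caller serves a cached or degraded response instead of hitting the rate-limited dependency.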