Rate Limiting APIs: Algorithms, Headers, and Implementation Patterns
Rate limiting is one of those features that looks optional until the moment it becomes mandatory. Without it, a single misbehaving client — a misconfigured retry loop, a runaway script, a bad actor — can degrade or take down your API for every other consumer. With it, you define the boundaries of acceptable usage and enforce them automatically. For any API exposed to more than one consumer, rate limiting is infrastructure, not a feature.
Why Rate Limiting Exists
The naive concern is abuse: someone hammering your API with thousands of requests per second to extract data or disrupt the service. That is real. But rate limiting also exists to protect against well-intentioned failures — a developer whose code has an infinite retry loop with no backoff, a mobile client that reinstalls and re-fetches a full history on every launch, a cron job with the wrong schedule. The internet produces these by accident as often as by malice.
Rate limiting also enforces pricing tiers. If your free plan includes 1,000 requests per month and your paid plan includes 100,000, rate limiting is what makes that distinction real rather than advisory.
The Core Algorithms
Fixed Window
The simplest approach: count requests per consumer in fixed time windows (per minute, per hour, per day). When the count exceeds the limit, reject until the window resets.
The problem is boundary exploitation. A consumer can fire 1,000 requests in the last second of one window and 1,000 in the first second of the next, resulting in 2,000 requests in a two-second span against a nominal 1,000-per-minute limit. Fixed window is easy to implement and adequate for loose limits; it is not suitable for tight traffic shaping.
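A fixed window counter can be sketched in a few lines. This is a minimal in-process illustration, not a production implementation (real deployments share state across instances, as discussed later); the class and parameter names are invented for this sketch.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Counts requests per (consumer, window) pair; the count resets when
    the clock rolls into the next window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (consumer, window_index) -> count

    def allow(self, consumer, now=None):
        now = time.time() if now is None else now
        window_index = int(now // self.window)  # which window this instant falls in
        key = (consumer, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

Note that nothing links the end of one window to the start of the next, which is exactly the boundary-exploitation weakness described above.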
Sliding Window
A sliding window tracks request timestamps and counts only those within the trailing interval. At any moment, the effective window is the last N seconds or minutes of actual time, not a reset-at-midnight bucket. This eliminates boundary exploitation at the cost of more memory per consumer (you need to store timestamps, not just a count).
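A sketch of the timestamp-based variant, assuming the same invented names as the fixed-window example; the deque of timestamps is what costs the extra memory per consumer.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Stores per-consumer request timestamps and counts only those
    inside the trailing window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = defaultdict(deque)  # consumer -> timestamps, oldest first

    def allow(self, consumer, now=None):
        now = time.time() if now is None else now
        q = self.timestamps[consumer]
        # Evict timestamps that have aged out of the trailing window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```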
Token Bucket
The token bucket algorithm models each consumer as having a bucket with a maximum capacity. Tokens accumulate at a steady rate up to the maximum. Each request consumes a token. If the bucket is empty, the request is rejected.
This is the most flexible algorithm for handling burst traffic. A consumer who has been idle accumulates tokens and can burst up to the bucket capacity. A consumer who has been active steadily can continue at the refill rate but cannot exceed capacity. Stripe and most major APIs use token bucket or a variant of it because it gives consumers burst headroom without allowing sustained overuse.
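The refill-on-demand form of token bucket is compact: rather than a timer adding tokens, each check computes how many tokens accrued since the last check. A minimal sketch with invented names; `rate` is tokens per second and `capacity` is the burst ceiling.

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request
    spends one token."""

    def __init__(self, capacity, rate, now=None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # an idle consumer starts with full burst headroom
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The cap in the refill line is what prevents sustained overuse: idleness earns burst headroom only up to `capacity`, never beyond it.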
Leaky Bucket
Leaky bucket is token bucket inverted: requests enter a queue (the bucket) and are processed at a fixed output rate. Excess requests that overflow the bucket are dropped. The result is smooth output at the cost of latency for requests sitting in the queue. It is more useful for traffic shaping at the infrastructure level than for API rate limiting directly.
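The queue-and-drain shape can be sketched as below. This is a simplified single-threaded illustration with invented names; a real implementation would drain on a timer or event loop rather than lazily inside `offer`.

```python
from collections import deque

class LeakyBucket:
    """Requests queue up to `capacity`; the queue drains at `rate`
    requests per second. Overflowing requests are dropped."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now):
        self._drain(now)
        if len(self.queue) >= self.capacity:
            return False  # overflow: drop the request
        self.queue.append(request)
        return True

    def _drain(self, now):
        # Release queued requests at the fixed output rate.
        ready = int((now - self.last_drain) * self.rate)
        if ready:
            for _ in range(min(ready, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
```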
Communicating Limits to Callers
Rate limiting is hostile if it is invisible. The standard approach is to include rate limit state in every response via headers, whether the request succeeded or failed:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 743
X-RateLimit-Reset: 1746230400
Limit is the maximum for the window. Remaining is how many requests are left. Reset is the Unix timestamp at which the limit resets. This lets well-behaved clients pace themselves without guessing.
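A client can turn these three headers into a pacing decision directly. A minimal sketch, assuming the headers arrive as a string-valued mapping; the function name and signature are illustrative, not a standard API.

```python
import time

def seconds_until_safe(headers, now=None):
    """Return how long the client should wait before its next request,
    given X-RateLimit-* headers from the last response."""
    now = time.time() if now is None else now
    remaining = int(headers["X-RateLimit-Remaining"])
    if remaining > 0:
        return 0.0  # budget left in the window: proceed immediately
    # Budget exhausted: wait until the Unix timestamp in Reset.
    return max(0.0, int(headers["X-RateLimit-Reset"]) - now)
```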
When a limit is exceeded, return HTTP 429 (Too Many Requests) — not 403 (Forbidden), which implies permanent denial rather than temporary throttling. Include a Retry-After header with the number of seconds or a timestamp indicating when the consumer can try again:
HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1746230400
A well-implemented client that receives a 429 with Retry-After should back off automatically. A consumer that ignores Retry-After and retries immediately in a loop should itself be treated as a policy violation.
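The well-behaved client loop looks roughly like this. A sketch only: `send` is an assumed callable returning a `(status, headers, body)` tuple, and the fallback doubling delay for a missing Retry-After is a design choice, not part of any spec.

```python
import time

def request_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send(); on 429, wait for the Retry-After interval and retry.
    `send` returns a (status, headers, body) tuple in this sketch."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        # Honor Retry-After (delta-seconds form); fall back to a
        # doubling delay if the server omitted the header.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        sleep(delay)
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

Injecting `sleep` keeps the retry logic testable without real waiting, which is also how you verify your client honors the header before shipping it.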
Scope: What You Rate Limit Against
The simplest scope is per API key or per account. This is appropriate for most cases. More sophisticated implementations add per-endpoint limits (reads cheaper than writes, expensive aggregation endpoints separately throttled), per-IP limits to catch unauthenticated abuse, and global limits to protect the system regardless of how limits are distributed across consumers.
For multi-tenant APIs, rate limiting per tenant rather than per key matters when a single tenant might have multiple keys. Aggregate the limit at the account level, not the key level, or a consumer can trivially circumvent limits by generating more keys.
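The aggregation point is simply which string you count against. A minimal sketch; `key_to_account` is an assumed lookup (for example a cached database mapping), and the key format is invented for illustration.

```python
def rate_limit_key(api_key, key_to_account):
    """Derive the rate limit counter key from the owning account, not the
    API key, so every key a tenant creates shares one budget."""
    return "ratelimit:account:" + key_to_account[api_key]
```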
Implementation: Where the Limit State Lives
Rate limit counters need to be fast and shared across all instances of your API. In-memory counters are fast but lost on restart and invisible to other instances. Redis is the standard answer: atomic increment operations, TTL-based expiration for window resets, and fast enough to add negligible latency to each request. Most rate limiting libraries — in every major language — are built around Redis as the backing store.
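The classic Redis pattern is INCR plus a TTL set on the window's first request. A sketch assuming a redis-py-style client (with `incr` and `expire` methods); the key format is invented, and production code would typically wrap the two commands in a Lua script or pipeline to close the small race between them.

```python
def check_rate_limit(redis_client, consumer, limit, window_seconds):
    """Fixed-window limit backed by Redis: atomically increment the
    per-consumer counter; let Redis expire it when the window ends."""
    key = f"ratelimit:{consumer}:{window_seconds}"
    count = redis_client.incr(key)  # atomic across all API instances
    if count == 1:
        # First request of this window: start the expiry clock.
        redis_client.expire(key, window_seconds)
    return count <= limit
```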
For distributed systems with high global request volume, rate limiting at the edge (in an API gateway or CDN layer) rather than inside the application is worth the architecture investment. It catches excess traffic before it touches your application servers.
What to Tell Consumers
Document your rate limits explicitly: limits by tier, limits by endpoint if they differ, the algorithm used (consumers who understand token bucket will write better clients than those who think in fixed windows), and the exact headers you return. Good documentation reduces support burden substantially. Most consumers hitting rate limits are doing so because they do not know where the limits are, not because they are trying to exceed them.
Rate limiting done well is infrastructure that enables trust: consumers know the rules, know their current state, and can build clients that operate within the system reliably. Rate limiting done poorly is a black box that turns failures into mysteries.