Edge vs. application-level limiting, and the API-gateway role
Rate limiting can be enforced at several layers of the request path, and mature systems enforce it at more than one. The guiding principle: reject junk as early and as cheaply as possible, but enforce nuanced, identity-aware limits where you have the context to do so. The earlier a request is rejected, the fewer downstream resources it consumes.
The layers, from outside in
1. Edge (CDN / WAF) — Cloudflare, AWS WAF, Akamai. Runs in points of presence close to clients, before traffic reaches your origin. Best for coarse, high-volume defenses: per-IP floods, volumetric DDoS, bot mitigation, geo rules. It is the only layer that can absorb an attack without consuming your bandwidth or compute — the request never reaches your data center. The downside: minimal application context (it does not know who the authenticated user is or which tenant they belong to).
2. API gateway — Kong, NGINX, Spring Cloud Gateway, AWS API Gateway.
The gateway is the single ingress to your services. This is the natural home for most rate limiting because it sees every request, already terminates TLS and authenticates callers, and can key limits on the API key, route, or plan tier. Shared state across gateway instances is typically kept in Redis. Centralizing here keeps individual services from each re-implementing limiting. Spring Cloud Gateway ships a RequestRateLimiter filter (Redis + token bucket) keyed by a KeyResolver; Kong and NGINX have comparable plugins/modules.
3. Application — Bucket4j, Resilience4j, custom. In-process limiting for business-specific rules the gateway can't easily express: per-tenant monthly quotas, limits that depend on the user's subscription state pulled from the DB, or protecting one expensive internal endpoint. It has the richest context but is the most expensive place to reject — the request has already traversed every prior layer.
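The per-key limiting described for the gateway layer can be illustrated with a minimal token bucket. This is only a sketch under stated assumptions: the names TokenBucket and KeyedRateLimiter are hypothetical, state lives in a local dict instead of Redis (so it would not be shared across gateway instances), and it is not how Spring Cloud Gateway, Kong, or NGINX implement their limiters.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained request rate
    tokens: float          # tokens currently available
    updated_at: float      # clock reading at the last refill

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedRateLimiter:
    """One bucket per key (API key, route, plan tier, ...).

    Hypothetical in-memory stand-in for a shared store; single-threaded only.
    """

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(key)
        if bucket is None:
            # A new key starts with a full bucket.
            bucket = TokenBucket(self.capacity, self.refill_per_sec,
                                 tokens=self.capacity, updated_at=now)
            self._buckets[key] = bucket
        return bucket.try_consume(now)


if __name__ == "__main__":
    limiter = KeyedRateLimiter(capacity=5, refill_per_sec=1)
    for i in range(7):
        print(i, limiter.allow("api-key-123"))
```

The key passed to allow() plays the role the text assigns to a key resolver: switching from the API key to the route or the plan tier changes who shares a bucket without changing the algorithm.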
Why layer them
- Defense in depth. The edge blunts volumetric attacks; the gateway enforces fair per-key limits; the app enforces business quotas. No single layer has both the position and the context to do all three.
- Cost gradient. A request rejected at the edge consumes none of your origin's bandwidth or compute; one rejected in the app has already consumed load balancer, gateway, auth, and connection resources. Push coarse rejections outward.
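The layering can be sketched as an ordered chain of checks in which the outermost layer that refuses a request is the one that answers. Everything here is an assumption made for illustration: the Request fields, the fixed-window counter (chosen for brevity, not because any product above uses it), and the limits passed to build_layers. In a real deployment the three checks run in separate systems, not in one process.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    ip: str       # what the edge can see
    api_key: str  # what the gateway can see after authentication
    tenant: str   # what the application knows from its own data


class FixedWindowCounter:
    """Counts requests per key per time window. Old windows are never evicted here."""

    def __init__(self, limit: int, window_sec: int) -> None:
        self.limit = limit
        self.window_sec = window_sec
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        slot = (key, int(now // self.window_sec))
        if self._counts[slot] >= self.limit:
            return False
        self._counts[slot] += 1
        return True


Layer = Tuple[str, FixedWindowCounter, Callable[[Request], str]]


def build_layers(edge_limit: int, gateway_limit: int, app_limit: int) -> List[Layer]:
    # Outermost first: coarse per-IP, then per-key, then per-tenant.
    return [
        ("edge", FixedWindowCounter(edge_limit, window_sec=1), lambda r: r.ip),
        ("gateway", FixedWindowCounter(gateway_limit, window_sec=60), lambda r: r.api_key),
        ("app", FixedWindowCounter(app_limit, window_sec=3600), lambda r: r.tenant),
    ]


def first_rejecting_layer(req: Request, now: float, layers: List[Layer]) -> Optional[str]:
    """Return the name of the first layer that rejects, or None if all allow."""
    for name, limiter, key_of in layers:
        if not limiter.allow(key_of(req), now):
            return name  # inner layers never see this request
    return None


if __name__ == "__main__":
    layers = build_layers(edge_limit=1000, gateway_limit=100, app_limit=10)
    req = Request(ip="203.0.113.7", api_key="key-1", tenant="acme")
    print(first_rejecting_layer(req, now=0.0, layers=layers))
```

Note that a request rejected by an inner layer has still been counted by every outer layer it passed through, which mirrors the cost gradient: the later the rejection, the more work has already been done.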
The 429 contract
Whichever layer rejects the request, the response must follow a clean, machine-readable contract so well-behaved clients can self-throttle (the # notes are annotations, not part of the response):
HTTP/1.1 429 Too Many Requests
Retry-After: 30 # seconds to wait (or an HTTP-date)
X-RateLimit-Limit: 100 # ceiling for the window
X-RateLimit-Remaining: 0 # requests left
X-RateLimit-Reset: 1700000060 # epoch when the window resets
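Retry-After is the important one: it tells clients when to retry, turning blind retry storms (which amplify the overload) into coordinated backoff. Returning 429 without Retry-After invites clients to retry immediately. The X-RateLimit-* headers are a widely used convention rather than a standard, so document their exact semantics for your API (for example, whether Reset is an epoch timestamp or a number of seconds).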
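Both sides of this contract can be sketched in a few lines. The function names rate_limit_headers and retry_delay_seconds are hypothetical, the server side assumes a window that resets at a known epoch second, and the client side handles the two Retry-After forms mentioned above (a number of seconds or an HTTP-date). Treat it as an illustration, not a complete client retry policy.

```python
import math
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int,
                       now_epoch: float) -> Dict[str, str]:
    """Server side: build the headers for a response in the current window."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Only a rejected request needs to be told how long to wait.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now_epoch)))
    return headers


def retry_delay_seconds(retry_after: str, now: datetime) -> Optional[float]:
    """Client side: turn a Retry-After value into a delay, or None if unparseable.

    `now` is assumed to be a timezone-aware datetime.
    """
    value = retry_after.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


if __name__ == "__main__":
    hdrs = rate_limit_headers(limit=100, remaining=0,
                              reset_epoch=1700000060, now_epoch=1700000030)
    print(hdrs)
    print(retry_delay_seconds(hdrs["Retry-After"], datetime.now(timezone.utc)))
```

A client would typically sleep for the returned delay, ideally with some added jitter, before retrying; when the function returns None it can fall back to its own backoff schedule.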