Rate Limiting
Continuum Router rate-limits requests to prevent abuse, allocate resources fairly, and protect backends from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.
Overview
The router implements multi-tier rate limiting:
- Per-client limits: Prevent individual clients from overwhelming the system
- Per-backend limits: Protect individual backend services from overload
- Per-API-key limits: Apply limits per authenticated key, with per-key overrides
- Per-model limits: Protect specific (expensive) models
- Global limits: Ensure overall system stability
Table of Contents
- Configuration
- Client Identification
- Rate Limiting Strategies
- Response Headers
- Monitoring
- Bypass Mechanisms
- Storage Backends
- Best Practices
Configuration
Basic Configuration
rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups
  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000
Per-API-Key Rate Limiting
Two mechanisms apply to API keys:
- A per_api_key dimension inside rate_limiting.limits that applies token-bucket limits to every keyed client.
- A per-key rate_limit override (requests per minute) on individual entries in the api_keys section.
Bypass Configuration
Certain clients can bypass rate limiting entirely:
rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"
  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
Client Identification
The router identifies clients using the following priority order:
1. API key from the Authorization: Bearer <token> or x-api-key header (preferred)
   - Provides accurate tracking across different IPs
2. X-Forwarded-For header (proxy/load balancer scenarios)
   - Honored only when the request comes from a trusted proxy
   - With use_rightmost_xff: true (default), the rightmost IP in the chain is used, which is harder to spoof
3. X-Real-IP header (alternative proxy header)
   - Also honored only from trusted proxies
4. Direct IP address (when no proxy headers are present)
   - Used when the request comes directly to the router
Trusted Proxy Configuration
rate_limiting:
  # Proxies allowed to set X-Forwarded-For / X-Real-IP headers
  trusted_proxies:
    - "10.0.0.0/8"
    - "192.168.1.1"
  # Use the rightmost IP in the X-Forwarded-For chain (default: true)
  use_rightmost_xff: true
Forwarding headers from untrusted sources are ignored, so clients cannot evade per-client limits by spoofing X-Forwarded-For.
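The priority order above can be sketched in a few lines of Python. This is an illustrative sketch only, not the router's implementation: the function names, the lower-cased header dictionary, and the "key:" / "ip:" prefixes on the returned identifier are all assumptions made for the example.

```python
import ipaddress

# Mirrors the trusted_proxies example; a bare address becomes a /32 network.
TRUSTED_PROXIES = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "192.168.1.1")]


def is_trusted(peer_ip: str) -> bool:
    """Return True if the directly connected peer is a configured proxy."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in TRUSTED_PROXIES)


def identify_client(headers: dict, peer_ip: str, use_rightmost_xff: bool = True) -> str:
    """Pick a client identifier. `headers` is assumed to have lower-cased keys."""
    # 1. API key (Authorization: Bearer <token> or x-api-key)
    auth = headers.get("authorization", "")
    if auth.startswith("Bearer ") and auth[7:].strip():
        return "key:" + auth[7:].strip()
    if headers.get("x-api-key"):
        return "key:" + headers["x-api-key"]

    # 2 and 3. Forwarding headers, honored only from trusted proxies
    if is_trusted(peer_ip):
        chain = [p.strip() for p in headers.get("x-forwarded-for", "").split(",") if p.strip()]
        if chain:
            return "ip:" + (chain[-1] if use_rightmost_xff else chain[0])
        if headers.get("x-real-ip"):
            return "ip:" + headers["x-real-ip"].strip()

    # 4. Fall back to the direct peer address
    return "ip:" + peer_ip
```

In this sketch, a request from 203.0.113.9 that carries a forged X-Forwarded-For header is still identified as ip:203.0.113.9, because that peer is not in the trusted list.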
Rate Limiting Strategies
Token Bucket Algorithm
The router uses the token bucket algorithm, which allows for burst traffic while maintaining long-term rate limits:
- Bucket capacity: Maximum number of tokens (burst_capacity)
- Refill rate: Tokens added per second (requests_per_second)
- Token cost: Each request consumes one token
How It Works
- Each client starts with a full bucket of tokens
- Each request consumes one token
- Tokens refill at a constant rate, up to the bucket capacity
- Requests are rejected when the bucket is empty
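The steps above can be sketched as a small Python class. This is a minimal illustration of the token bucket technique under stated assumptions, not the router's code: the class name, the try_acquire method, and the injectable now argument (used to make the behavior deterministic in examples) are hypothetical.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills continuously, one token per request."""

    def __init__(self, requests_per_second, burst_capacity, now=None):
        self.rate = float(requests_per_second)   # refill rate (tokens/second)
        self.capacity = float(burst_capacity)    # maximum tokens held
        self.tokens = self.capacity              # each client starts full
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        """Return True and consume a token if one is available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill at a constant rate, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: the request would be rejected
```

With the per_client values from the basic configuration (10 requests per second, burst capacity 20), a bucket like this admits 20 back-to-back requests, rejects the 21st, and half a second later has refilled enough for 5 more.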
Per-Model Limits
In addition to the per-client, per-backend, per-API-key, and global dimensions, limits can target individual models. This is useful for protecting expensive models while leaving cheaper ones unconstrained:
rate_limiting:
  limits:
    per_model:
      gpt-4:
        requests_per_second: 2
        burst_capacity: 5
      gpt-3.5-turbo:
        requests_per_second: 20
        burst_capacity: 50
Response Headers
Success Response
When a request is within rate limits:
Rate Limited Response
When the rate limit is exceeded:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
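On the client side, a 429 response like the one above is usually handled by waiting for the period given in Retry-After before retrying. The following is a hedged sketch of that logic; the function name, the default delay, and the plain dictionary of headers (keyed exactly as "Retry-After") are assumptions for illustration. It accepts both forms the HTTP specification allows for Retry-After: a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, headers, default=1.0, now=None):
    """Return how long a client should wait before retrying, in seconds."""
    if status != 429:
        return 0.0  # not rate limited, no wait needed
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

For the example response above, retry_delay_seconds(429, {"Retry-After": "30"}) yields 30.0.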
Monitoring
Metrics
Rate limiting activity is tracked in Prometheus metrics:
# Total rate limit checks performed
rate_limit_checks_total 10342
# Requests rejected, labeled by limit dimension
rate_limit_exceeded_total{limit_type="per_client"} 42
# Current token bucket levels and capacity
rate_limit_tokens_remaining{bucket_type="per_client",identifier="..."} 15
rate_limit_bucket_capacity{bucket_type="per_client"} 20
# Whitelisted and bypassed (API key) requests
rate_limit_whitelisted_requests_total 123
rate_limit_bypassed_requests_total 45
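As a rough illustration of how these metrics might be consumed, the sketch below sums samples from Prometheus text-format output and derives the share of checks that were rejected. The helper names are hypothetical, the parser is deliberately simplified (it ignores optional timestamps and label values containing a closing brace), and the ratio assumes rate_limit_checks_total counts every check, which the metrics listing does not state explicitly.

```python
import re

# metric_name{optional="labels"} value
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def sum_metric(text, name):
    """Sum all samples of `name` across label sets in exposition text."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/HELP/TYPE lines
        match = SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def rejection_ratio(text):
    """Fraction of rate limit checks that ended in a rejection."""
    checks = sum_metric(text, "rate_limit_checks_total")
    rejected = sum_metric(text, "rate_limit_exceeded_total")
    return rejected / checks if checks else 0.0
```

Applied to the sample values above (42 rejections out of 10342 checks), the ratio is about 0.4%. In a production setup the same calculation would more commonly be expressed as a query in the monitoring system itself.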
Logging
Rate limit events are logged with context:
{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}
Bypass Mechanisms
IP Whitelist
Whitelist trusted IP addresses or CIDR ranges:
rate_limiting:
  whitelist:
    - "192.168.1.0/24"  # Internal network
    - "10.0.0.1"        # Admin server
    - "172.16.0.0/16"   # Corporate network
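To show how a whitelist mixing single addresses and CIDR ranges can be matched, here is a minimal sketch using Python's standard ipaddress module. The function names are hypothetical and this is not taken from the router's source; it only demonstrates the matching rule.

```python
import ipaddress


def parse_whitelist(entries):
    """Convert whitelist strings into network objects.

    A bare address such as "10.0.0.1" becomes a single-host network.
    """
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_whitelisted(client_ip, networks):
    """Return True if the client address falls inside any whitelisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)
```

With the three entries from the example above, 192.168.1.77 and 172.16.200.5 match a range, 10.0.0.1 matches exactly, and 10.0.0.2 does not match.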
API Key Bypass
API keys listed under bypass_keys (see Bypass Configuration) bypass all rate limits.
Health Check Exemption
Health check and metrics endpoints are automatically exempt from rate limiting:
- /health
- /healthz
- /metrics
Storage Backends
Memory Storage (Default)
In-memory storage (storage: memory) is fast but not shared across instances.
Pros:
- No external dependencies
- Low latency
- Simple setup
Cons:
- Not shared across router instances
- Lost on restart
- Limited to single-instance deployments
Redis Storage (Distributed)
Redis storage enables distributed rate limiting across multiple router instances:
rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    key_prefix: "continuum:ratelimit:"  # Prefix for rate limit keys (default)
    ttl: 3600                           # TTL for rate limit keys in seconds
Credentials can be embedded in the URL (redis://:password@host:6379), with environment variable substitution available via ${REDIS_PASSWORD}-style references.
Pros:
- Shared across all router instances
- Persistent across restarts
- Accurate global limits
Cons:
- Requires Redis infrastructure
- Slightly higher latency
- Additional operational complexity
Hot Reload Support
Rate limiting configuration supports hot reload, so changes take effect without a restart:
# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly
Best Practices
1. Start Conservative
Begin with stricter limits and relax them based on actual usage:
2. Monitor and Adjust
Use metrics to understand actual traffic patterns:
- Track the rate_limit_exceeded_total metric
- Identify legitimate vs. abusive traffic
- Adjust limits based on data
3. Use Different Tiers
Implement tiered rate limits for different user classes with per-key overrides (requests per minute):
api_keys:
  api_keys:
    - key: "free-tier-key"
      rate_limit: 60
    - key: "pro-tier-key"
      rate_limit: 600
    - key: "enterprise-tier-key"
      rate_limit: 6000
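4. Protect Expensive Models
Apply stricter limits to costly models with per_model limits, as described under Per-Model Limits.
5. Use Redis for Production
For multi-instance deployments, use Redis storage (storage: redis) so that limits are shared across all router instances.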
Troubleshooting
Common Issues
Rate limits not applying
Symptom: Clients can exceed configured limits
Solutions:
1. Check if the client is whitelisted
2. Verify rate_limiting.enabled is true
3. Check logs for rate limiting initialization
4. Ensure the API key format is correct
Too many false positives
Symptom: Legitimate traffic being rate limited
Solutions:
1. Increase burst_capacity for bursty traffic
2. Review client identification (it may be grouping multiple clients)
3. Consider using per-API-key limits
4. Add legitimate IPs to the whitelist
Redis connection issues
Symptom: Rate limiting not working with Redis storage
Solutions:
1. Verify Redis connectivity
2. Check Redis authentication
3. Review connection pool settings
4. Monitor Redis performance
Debug Logging
Enable debug logging to see rate limiting decisions in the logs:
Related Documentation
- Rate Limiting Architecture - Implementation details and design decisions
- Configuration Guide - Complete configuration reference
- Admin API - Runtime configuration management
- Metrics - Monitoring and observability
- Error Handling - Error codes and retry strategies