Rate Limiting¶
Continuum Router provides advanced rate limiting capabilities to prevent abuse, ensure fair resource allocation, and protect backend services from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.
Overview¶
The router implements multi-tier rate limiting:
- Per-client limits: Prevent individual clients from overwhelming the system
- Per-backend limits: Protect individual backend services from overload
- Global limits: Ensure overall system stability
- Endpoint-specific limits: Special handling for critical endpoints
Table of Contents¶
- Configuration
- Client Identification
- Rate Limiting Strategies
- Response Headers
- Monitoring
- Bypass Mechanisms
- Storage Backends
- Best Practices
Configuration¶
Basic Configuration¶
```yaml
rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups
  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000
```
Per-API-Key Rate Limiting¶
You can configure custom rate limits for specific API keys:
```yaml
api_keys:
  - key: "premium-user-key"
    name: "Premium User"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200
  - key: "standard-user-key"
    name: "Standard User"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20
```
Bypass Configuration¶
Certain clients can bypass rate limiting entirely:
```yaml
rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"
  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
```
Client Identification¶
The router identifies clients using the following priority order:
1. API Key from `Authorization: Bearer <token>` header (preferred)
   - First 16 characters used as client identifier
   - Provides accurate tracking across different IPs
2. `X-Forwarded-For` header (proxy/load balancer scenarios)
   - Extracts real client IP from proxy headers
3. `X-Real-IP` header (alternative proxy header)
   - Fallback for different proxy configurations
4. Direct IP address (when no proxy headers present)
   - Used when request comes directly to the router
Client Identification Example¶
```yaml
# Configuration for client identification
rate_limiting:
  client_identification:
    priority:
      - api_key          # Bearer token (first 16 chars used as ID)
      - x_forwarded_for  # Proxy/load balancer header
      - x_real_ip       # Alternative IP header
    fallback: "unknown"  # When no identifier available
```
Rate Limiting Strategies¶
Token Bucket Algorithm¶
The router uses the token bucket algorithm, which allows for burst traffic while maintaining long-term rate limits:
- Bucket capacity: Maximum number of tokens (`burst_capacity`)
- Refill rate: Tokens added per second (`requests_per_second`)
- Token cost: Each request consumes one token
How It Works¶
- Each client starts with a full bucket of tokens
- Tokens are consumed with each request
- Tokens refill at a constant rate
- Requests are rejected when bucket is empty
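The steps above can be sketched as a minimal token bucket. This is an illustrative model only; the class and field names are assumptions, not the router's implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: burst capacity plus steady refill."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second (requests_per_second)
        self.capacity = capacity  # maximum tokens (burst_capacity)
        self.tokens = capacity    # each client starts with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Tokens refill at a constant rate, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False          # bucket empty: reject the request
```

With `rate=10` and `capacity=20`, a client can burst 20 requests immediately, then settles to a sustained 10 requests per second.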
Dual-Window Approach¶
For critical endpoints like /v1/models, the router uses a dual-window approach:
- Sustained limit: Prevents excessive usage over time (100 req/min)
- Burst protection: Catches rapid-fire requests (20 req/5s)
```yaml
# Example: /v1/models endpoint rate limiting
rate_limiting:
  models_endpoint:
    sustained_limit: 100  # Maximum requests per minute
    burst_limit: 20       # Maximum requests in any 5-second window
    window_duration: 60s  # Sliding window for sustained limit
    burst_window: 5s      # Window for burst detection
```
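The dual-window check can be modeled with two sliding windows, admitting a request only when both permit it. This is a sketch of the idea, not the router's code:

```python
import time
from collections import deque

class DualWindowLimiter:
    """Allow a request only if both the sustained and burst windows permit it."""

    def __init__(self, sustained_limit=100, window=60.0, burst_limit=20, burst_window=5.0):
        self.windows = [
            (sustained_limit, window, deque()),    # e.g. 100 req/min
            (burst_limit, burst_window, deque()),  # e.g. 20 req/5s
        ]

    def allow(self) -> bool:
        now = time.monotonic()
        for limit, span, events in self.windows:
            # Drop timestamps that have aged out of this window.
            while events and now - events[0] > span:
                events.popleft()
            if len(events) >= limit:
                return False  # either window exhausted -> reject
        for _, _, events in self.windows:
            events.append(now)
        return True
```

A rapid-fire burst trips the 5-second window long before the per-minute window fills, which is exactly the protection the dual-window approach provides.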
Response Headers¶
Success Response¶
When a request is within rate limits:
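The original example is not shown here; for illustration, a success response conventionally carries remaining-quota headers like the following (the `X-RateLimit-*` names are a common convention and an assumption, not confirmed for this router):

```
HTTP/1.1 200 OK
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 7
X-RateLimit-Reset: 1717777777
```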
Rate Limited Response¶
When rate limit is exceeded:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```
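Clients should honor the `Retry-After` header when they receive a 429. A minimal retry loop might look like this (a sketch; `send_request` is a placeholder for your actual HTTP call):

```python
import time

def call_with_retry(send_request, max_attempts=3):
    """Retry on 429, sleeping for the server-suggested Retry-After interval."""
    status, headers, body = None, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send_request()
        if status != 429:
            return status, body
        # Honor the server's Retry-After hint; default to 1s if absent.
        time.sleep(float(headers.get("Retry-After", 1)))
    return status, body
```

On repeated 429s the loop gives up after `max_attempts`, returning the last status so the caller can surface the error.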
Monitoring¶
Metrics¶
Rate limit violations are tracked in Prometheus metrics:
```
# Total rejected requests by client
rate_limit_violations_total{client_id="abc123...",endpoint="/v1/chat/completions"} 42

# Current token bucket levels
rate_limit_tokens_available{client_id="abc123...",tier="per_client"} 15

# Rate limit bypass events
rate_limit_bypassed_total{reason="whitelisted_ip"} 123
```
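These metrics can feed a Prometheus alert. A sketch of such a rule, assuming standard Prometheus alerting syntax (the threshold and group name are illustrative):

```yaml
groups:
  - name: rate-limiting
    rules:
      - alert: HighRateLimitViolations
        # Fire when any client is rejected more than 1 req/s, sustained for 5 minutes.
        expr: rate(rate_limit_violations_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained rate-limit violations for {{ $labels.client_id }}"
```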
Logging¶
Rate limit events are logged with context:
```json
{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}
```
Bypass Mechanisms¶
IP Whitelist¶
Whitelist trusted IP addresses or CIDR ranges:
```yaml
rate_limiting:
  whitelist:
    - "192.168.1.0/24"  # Internal network
    - "10.0.0.1"        # Admin server
    - "172.16.0.0/16"   # Corporate network
```
API Key Bypass¶
Certain API keys can bypass all rate limits:
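For example, reusing the `bypass_keys` setting shown in the configuration section:

```yaml
rate_limiting:
  bypass_keys:
    - "admin-key-123"       # Administrative access
    - "monitoring-key-456"  # Monitoring systems
```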
Health Check Exemption¶
Health check endpoints are automatically exempt from rate limiting:
- `/health`
- `/health/ready`
- `/health/live`
- `/metrics`
Storage Backends¶
Memory Storage (Default)¶
In-memory storage is fast but not shared across instances:
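Memory storage is the default and needs no extra configuration beyond selecting it:

```yaml
rate_limiting:
  storage: memory
```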
Pros:

- No external dependencies
- Low latency
- Simple setup

Cons:

- Not shared across router instances
- Lost on restart
- Limited to single-instance deployments
Redis Storage (Distributed)¶
Redis storage enables distributed rate limiting across multiple router instances:
```yaml
rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    db: 0
    password: "${REDIS_PASSWORD}"
    pool_size: 10
```
Pros:

- Shared across all router instances
- Persistent across restarts
- Accurate global limits

Cons:

- Requires Redis infrastructure
- Slightly higher latency
- Additional operational complexity
Hot Reload Support¶
Rate limiting configuration supports hot reload for immediate updates:
```yaml
# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly
```
Best Practices¶
1. Start Conservative¶
Begin with stricter limits and relax them based on actual usage:
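For instance, starting well below the defaults shown earlier (the exact numbers are illustrative, not a recommendation for your workload):

```yaml
rate_limiting:
  limits:
    per_client:
      requests_per_second: 5  # tighten first, relax once traffic is understood
      burst_capacity: 10
```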
2. Monitor and Adjust¶
Use metrics to understand actual traffic patterns:
- Track the `rate_limit_violations_total` metric
- Identify legitimate vs. abusive traffic
- Adjust limits based on data
3. Use Different Tiers¶
Implement tiered rate limits for different user classes:
```yaml
api_keys:
  - key: "free-tier-key"
    rate_limit:
      requests_per_second: 1
      burst_capacity: 5
  - key: "pro-tier-key"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20
  - key: "enterprise-tier-key"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200
```
4. Protect Critical Endpoints¶
Apply stricter limits to expensive operations:
```yaml
rate_limiting:
  endpoint_overrides:
    "/v1/chat/completions":
      per_client:
        requests_per_second: 5   # Stricter for expensive endpoints
    "/v1/models":
      per_client:
        requests_per_second: 10  # More lenient for cheap endpoints
```
5. Use Redis for Production¶
For multi-instance deployments, use Redis:
```yaml
rate_limiting:
  storage: redis
  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 20
    connect_timeout: 5s
```
Troubleshooting¶
Common Issues¶
Rate limits not applying¶
Symptom: Clients can exceed configured limits
Solutions:

1. Check if the client is whitelisted
2. Verify `rate_limiting.enabled` is `true`
3. Check logs for rate limiting initialization
4. Ensure the API key format is correct
Too many false positives¶
Symptom: Legitimate traffic being rate limited
Solutions:

1. Increase `burst_capacity` for bursty traffic
2. Review client identification (may be grouping multiple clients)
3. Consider using per-API-key limits
4. Add legitimate IPs to the whitelist
Redis connection issues¶
Symptom: Rate limiting not working with Redis storage
Solutions:

1. Verify Redis connectivity
2. Check Redis authentication
3. Review connection pool settings
4. Monitor Redis performance
Debug Logging¶
Enable debug logging for rate limiting:
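A logging configuration along these lines is typical, but the `logging` key and module name below are assumptions; consult the Configuration Guide for the exact schema:

```yaml
logging:
  level: debug
  modules:
    rate_limiting: debug
```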
Future Enhancements¶
Planned improvements to the rate limiting system:
- Per-endpoint configuration: Custom limits for each API endpoint
- Dynamic rate adjustment: Automatic scaling based on backend capacity
- Distributed coordination: Better Redis integration for multi-region deployments
- Cost-based limiting: Different token costs for different operations
- Quota management: Monthly/daily quota limits in addition to rate limits
- Advanced analytics: Real-time dashboards for rate limit monitoring
Related Documentation¶
- Rate Limiting Architecture - Implementation details and design decisions
- Configuration Guide - Complete configuration reference
- Admin API - Runtime configuration management
- Metrics - Monitoring and observability
- Error Handling - Error codes and retry strategies