Rate Limiting¶
The router implements sophisticated rate limiting to protect against abuse and ensure fair resource allocation across clients.
Architecture¶
Rate limiting is implemented at the HTTP layer using a dual-window approach:
```rust
pub struct ModelEndpointRateLimiter {
    cache: Arc<DashMap<String, (Instant, u32, u32)>>, // (window_start, request_count, burst_count)
    window_duration: Duration,     // 60 seconds for sustained limit
    max_requests_per_window: u32,  // 100 requests per minute
    burst_limit: u32,              // 20 requests maximum
    burst_window: Duration,        // 5 seconds for burst detection
}
```
Client Identification¶
The system uses a priority-based client identification strategy:
- API Key (Preferred): Uses the first 16 characters of the Bearer token
- IP Address (Fallback): Extracted from headers in this order:
  - `X-Forwarded-For` (for proxy/load-balancer environments)
  - `X-Real-IP` (alternative header)
  - Falls back to `"unknown"` if no IP can be determined
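The priority chain above might look like this sketch. The function name and signature are illustrative; the real handler reads the headers from the request object.

```rust
/// Sketch of priority-based client identification.
/// `auth` is the Authorization header; `xff`/`xri` are the proxy headers.
pub fn client_id(auth: Option<&str>, xff: Option<&str>, xri: Option<&str>) -> String {
    // 1. Prefer the API key: the first 16 characters of the Bearer token.
    if let Some(token) = auth.and_then(|a| a.strip_prefix("Bearer ")) {
        return token.chars().take(16).collect();
    }
    // 2. Fall back to the client IP. X-Forwarded-For lists the original
    //    client first, so take the first entry before any comma.
    if let Some(ip) = xff.and_then(|v| v.split(',').next()) {
        return ip.trim().to_string();
    }
    if let Some(ip) = xri {
        return ip.trim().to_string();
    }
    // 3. Last resort.
    "unknown".to_string()
}
```

Truncating the API key to 16 characters keeps full credentials out of the rate-limit cache and out of metrics labels.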
Rate Limit Rules¶
Each client is independently rate-limited with:
- Sustained Limit: 100 requests per 60-second window
- Burst Protection: Maximum 20 requests in any 5-second window
- Per-Client Isolation: Each API key or IP has separate quotas
Configuration¶
```yaml
rate_limiting:
  enabled: true

  # Global defaults
  default:
    requests_per_minute: 100
    burst_limit: 20
    burst_window_seconds: 5

  # Per-endpoint overrides
  endpoints:
    /v1/chat/completions:
      requests_per_minute: 60
      burst_limit: 10
    /v1/models:
      requests_per_minute: 300
      burst_limit: 50

  # Per-client overrides (by API key prefix)
  clients:
    sk-premium-*:
      requests_per_minute: 500
      burst_limit: 100
    sk-free-*:
      requests_per_minute: 20
      burst_limit: 5
```
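Resolving the effective limits for a request might look like the sketch below. The config file does not state the merge order explicitly, so this sketch assumes client overrides win over endpoint overrides, which win over the defaults; all names here are illustrative.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Limits {
    pub requests_per_minute: u32,
    pub burst_limit: u32,
}

/// Sketch of override resolution: client prefix > endpoint > default.
/// Client patterns are glob-like prefixes such as "sk-premium-*".
pub fn effective_limits(
    default: Limits,
    endpoints: &HashMap<String, Limits>,
    clients: &HashMap<String, Limits>,
    endpoint: &str,
    api_key: &str,
) -> Limits {
    // Most specific first: a matching client-key pattern.
    for (pattern, limits) in clients {
        if let Some(prefix) = pattern.strip_suffix('*') {
            if api_key.starts_with(prefix) {
                return *limits;
            }
        }
    }
    // Then a per-endpoint override.
    if let Some(limits) = endpoints.get(endpoint) {
        return *limits;
    }
    // Finally the global default.
    default
}
```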
Cache Management¶
The rate limiter includes automatic cleanup mechanisms:
- Memory Efficiency: DashMap for lock-free concurrent access
- Automatic Cleanup: Removes expired entries when cache > 1000 entries
- TTL Differentiation:
- Empty model responses: 5-second cache TTL (DoS prevention)
- Normal responses: 60-second cache TTL
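The size-triggered cleanup can be sketched as below. The real limiter operates on a `DashMap`; this std-only sketch uses a plain `HashMap`, and the function name and threshold constant are illustrative.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cleanup is only attempted once the cache exceeds this many entries,
/// so the common case pays no scanning cost.
const MAX_ENTRIES: usize = 1000;

/// Drop every entry whose rate-limit window has fully expired.
pub fn cleanup(
    cache: &mut HashMap<String, (Instant, u32, u32)>,
    window: Duration,
    now: Instant,
) {
    if cache.len() > MAX_ENTRIES {
        cache.retain(|_, (start, _, _)| now.duration_since(*start) < window);
    }
}
```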
Security Considerations¶
The rate limiting system addresses several security concerns:
- DoS Prevention: Short TTL for empty responses prevents cache poisoning
- Fair Resource Allocation: Per-client limits prevent monopolization
- Burst Protection: Dual-window approach catches both sustained and burst attacks
- Client Spoofing Mitigation: API key prioritization over IP addresses
Metrics and Monitoring¶
Rate limiting integrates with the metrics system:
```rust
pub struct ModelMetrics {
    pub rate_limit_violations: AtomicU64,    // Track rejected requests
    pub empty_responses_returned: AtomicU64, // Monitor empty response rate
    pub transient_errors: AtomicU64,         // Network/timeout failures
    pub permanent_errors: AtomicU64,         // Auth/config failures
}
```
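The counters are updated with lock-free atomic increments; a minimal sketch (method names assumed, struct trimmed to two fields):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Trimmed sketch of the metrics struct and its update path.
pub struct ModelMetrics {
    pub rate_limit_violations: AtomicU64,
    pub empty_responses_returned: AtomicU64,
}

impl ModelMetrics {
    pub fn new() -> Self {
        Self {
            rate_limit_violations: AtomicU64::new(0),
            empty_responses_returned: AtomicU64::new(0),
        }
    }

    /// Relaxed ordering suffices: the counters are only read for reporting,
    /// so no cross-counter ordering guarantees are needed.
    pub fn record_violation(&self) {
        self.rate_limit_violations.fetch_add(1, Ordering::Relaxed);
    }
}
```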
Prometheus Metrics¶
```
# Rate limit violations
rate_limit_violations_total{client_id="sk-abc...", endpoint="/v1/chat/completions"} 15

# Current request counts
rate_limit_current_requests{client_id="sk-abc...", window="minute"} 45
rate_limit_current_requests{client_id="sk-abc...", window="burst"} 8

# Rate limit status
rate_limit_remaining{client_id="sk-abc...", window="minute"} 55
```
Response Headers¶
Rate limit information is included in response headers:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1699574460
X-RateLimit-Burst-Limit: 20
X-RateLimit-Burst-Remaining: 12
```
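Deriving the header values from window state is straightforward; a sketch (the function name and the Unix-timestamp parameter are assumptions for illustration):

```rust
/// Sketch: build the X-RateLimit-* header values from the current window.
/// `window_start_unix` is the window start as a Unix timestamp.
pub fn rate_limit_headers(
    limit: u32,
    used: u32,
    window_start_unix: u64,
    window_secs: u64,
) -> Vec<(String, String)> {
    vec![
        ("X-RateLimit-Limit".into(), limit.to_string()),
        // saturating_sub avoids underflow if the count races past the limit.
        ("X-RateLimit-Remaining".into(), limit.saturating_sub(used).to_string()),
        // Reset is the Unix time at which the current window ends.
        ("X-RateLimit-Reset".into(), (window_start_unix + window_secs).to_string()),
    ]
}
```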
Error Response¶
When rate limited, the API returns:
```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 45 seconds.",
    "type": "rate_limit_exceeded",
    "code": 429,
    "details": {
      "limit": 100,
      "remaining": 0,
      "reset_at": "2024-01-15T10:30:00Z",
      "retry_after": 45
    }
  }
}
```
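A minimal sketch of assembling that body. Real code would serialize a struct with `serde_json`; this version hand-formats the JSON (field names taken from the example above, function name assumed):

```rust
/// Sketch: build the 429 error body by hand.
/// Doubled braces are literal braces inside a format! string.
pub fn rate_limit_error_body(
    limit: u32,
    remaining: u32,
    reset_at: &str,
    retry_after: u64,
) -> String {
    format!(
        "{{\"error\":{{\"message\":\"Rate limit exceeded. Please retry after {retry_after} seconds.\",\"type\":\"rate_limit_exceeded\",\"code\":429,\"details\":{{\"limit\":{limit},\"remaining\":{remaining},\"reset_at\":\"{reset_at}\",\"retry_after\":{retry_after}}}}}}}"
    )
}
```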
Implementation Details¶
The rate limiter is implemented as a singleton service:
```rust
static MODEL_RATE_LIMITER: once_cell::sync::Lazy<ModelEndpointRateLimiter> =
    once_cell::sync::Lazy::new(ModelEndpointRateLimiter::new);
```
This ensures:

- A single point of rate limit enforcement
- Consistent state across all request handlers
- Minimal memory overhead
Related Documentation¶
- Circuit Breaker - Backend failure handling
- Model Fallback - Fallback strategies
- Architecture Overview - Main architecture guide