
Rate Limiting

The router implements rate limiting to protect against abuse and to ensure fair resource allocation across clients.

Architecture

Rate limiting is implemented at the HTTP layer using a dual-window approach:

pub struct ModelEndpointRateLimiter {
    cache: Arc<DashMap<String, (Instant, u32, u32)>>, // (window_start, request_count, burst_count)
    window_duration: Duration,        // 60 seconds for sustained limit
    max_requests_per_window: u32,     // 100 requests per minute
    burst_limit: u32,                 // 20 requests per burst window
    burst_window: Duration,           // 5 seconds for burst detection
}
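The dual-window check implied by this struct can be sketched in std-only Rust. This is a hedged illustration: the real limiter uses a lock-free DashMap and a single tuple per client, while the sketch below uses a HashMap behind a Mutex and tracks a separate start Instant for each window.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Illustrative stand-in for ModelEndpointRateLimiter: per-client state is
// (window_start, request_count, burst_start, burst_count).
struct RateLimiter {
    cache: Mutex<HashMap<String, (Instant, u32, Instant, u32)>>,
    window: Duration,
    max_per_window: u32,
    burst_window: Duration,
    burst_limit: u32,
}

impl RateLimiter {
    fn new() -> Self {
        RateLimiter {
            cache: Mutex::new(HashMap::new()),
            window: Duration::from_secs(60),
            max_per_window: 100,
            burst_window: Duration::from_secs(5),
            burst_limit: 20,
        }
    }

    /// Returns true if the request is allowed for `client`.
    fn allow(&self, client: &str) -> bool {
        let now = Instant::now();
        let mut cache = self.cache.lock().unwrap();
        let entry = cache.entry(client.to_string()).or_insert((now, 0, now, 0));

        // Reset each counter once its window has elapsed.
        if now.duration_since(entry.0) >= self.window {
            entry.0 = now;
            entry.1 = 0;
        }
        if now.duration_since(entry.2) >= self.burst_window {
            entry.2 = now;
            entry.3 = 0;
        }

        // Both the sustained budget and the burst budget must have room.
        if entry.1 >= self.max_per_window || entry.3 >= self.burst_limit {
            return false;
        }
        entry.1 += 1;
        entry.3 += 1;
        true
    }
}
```

Resetting each counter when its window elapses yields fixed windows rather than truly sliding ones; a sliding-window variant would track per-request timestamps instead.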

Client Identification

The system uses a priority-based client identification strategy:

  1. API Key (Preferred): Uses the first 16 characters of the Bearer token
  2. IP Address (Fallback): Extracted from headers in order:
     • X-Forwarded-For (for proxy/load balancer environments)
     • X-Real-IP (alternative header)
     • Falls back to "unknown" if no IP can be determined
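The priority order above can be illustrated with a small helper. The function name and signature are hypothetical; only the header names, the 16-character key prefix, and the "unknown" fallback come from the text.

```rust
// Hypothetical helper mirroring the documented priority order.
fn identify_client(
    authorization: Option<&str>,
    x_forwarded_for: Option<&str>,
    x_real_ip: Option<&str>,
) -> String {
    // 1. Prefer an API key: the first 16 characters of the Bearer token.
    if let Some(token) = authorization.and_then(|a| a.strip_prefix("Bearer ")) {
        return token.chars().take(16).collect();
    }
    // 2. X-Forwarded-For may carry a comma-separated proxy chain; take the first hop.
    if let Some(ip) = x_forwarded_for.and_then(|v| v.split(',').next()).map(str::trim) {
        if !ip.is_empty() {
            return ip.to_string();
        }
    }
    // 3. X-Real-IP as the alternative header.
    if let Some(ip) = x_real_ip {
        return ip.to_string();
    }
    // 4. Last resort when no identity can be determined.
    "unknown".to_string()
}
```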

Rate Limit Rules

Each client is independently rate-limited with:

  • Sustained Limit: 100 requests per minute (sliding window)
  • Burst Protection: Maximum 20 requests in any 5-second window
  • Per-Client Isolation: Each API key or IP has separate quotas

Configuration

rate_limiting:
  enabled: true

  # Global defaults
  default:
    requests_per_minute: 100
    burst_limit: 20
    burst_window_seconds: 5

  # Per-endpoint overrides
  endpoints:
    /v1/chat/completions:
      requests_per_minute: 60
      burst_limit: 10
    /v1/models:
      requests_per_minute: 300
      burst_limit: 50

  # Per-client overrides (by API key prefix)
  clients:
    sk-premium-*:
      requests_per_minute: 500
      burst_limit: 100
    sk-free-*:
      requests_per_minute: 20
      burst_limit: 5
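The precedence among these three levels is not spelled out above. Assuming client overrides beat endpoint overrides, which beat the default, resolution might look like the following sketch, with values hard-coded from the YAML example and the glob-style key prefixes (e.g. `sk-premium-*`) reduced to a simple prefix check:

```rust
// Illustrative override resolution: client > endpoint > default (assumed order).
fn effective_rpm(client_key: &str, endpoint: &str) -> u32 {
    // Values mirror the YAML example above.
    let client_overrides = [("sk-premium-", 500u32), ("sk-free-", 20)];
    let endpoint_overrides = [("/v1/chat/completions", 60u32), ("/v1/models", 300)];
    let default_rpm = 100;

    for (prefix, rpm) in client_overrides {
        if client_key.starts_with(prefix) {
            return rpm;
        }
    }
    for (path, rpm) in endpoint_overrides {
        if endpoint == path {
            return rpm;
        }
    }
    default_rpm
}
```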

Cache Management

The rate limiter includes automatic cleanup mechanisms:

  • Memory Efficiency: DashMap for lock-free concurrent access
  • Automatic Cleanup: Removes expired entries when cache > 1000 entries
  • TTL Differentiation:
    • Empty model responses: 5-second cache TTL (DoS prevention)
    • Normal responses: 60-second cache TTL
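The threshold-triggered cleanup can be sketched as follows; the function shape and the single-Instant cache value are simplifications for illustration:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Sketch of the size-gated cleanup: expired entries are dropped, but only
// once the cache grows past 1000 entries, keeping the hot path cheap.
fn maybe_cleanup(cache: &mut HashMap<String, Instant>, window: Duration, now: Instant) {
    if cache.len() > 1000 {
        cache.retain(|_, started| now.duration_since(*started) < window);
    }
}
```

Gating cleanup on cache size trades a little memory for avoiding a full scan on every request.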

Security Considerations

The rate limiting system addresses several security concerns:

  1. DoS Prevention: Short TTL for empty responses prevents cache poisoning
  2. Fair Resource Allocation: Per-client limits prevent monopolization
  3. Burst Protection: Dual-window approach catches both sustained and burst attacks
  4. Client Spoofing Mitigation: API key prioritization over IP addresses

Metrics and Monitoring

Rate limiting integrates with the metrics system:

pub struct ModelMetrics {
    pub rate_limit_violations: AtomicU64,  // Track rejected requests
    pub empty_responses_returned: AtomicU64, // Monitor empty response rate
    pub transient_errors: AtomicU64,       // Network/timeout failures
    pub permanent_errors: AtomicU64,       // Auth/config failures
}
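A minimal usage sketch (only one field of the struct is shown): counters are plain AtomicU64 values, bumped with relaxed ordering from any handler and read when exporting metrics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Trimmed-down ModelMetrics for illustration.
pub struct ModelMetrics {
    pub rate_limit_violations: AtomicU64,
}

// Called from a handler whenever the limiter rejects a request.
fn record_violation(metrics: &ModelMetrics) {
    metrics.rate_limit_violations.fetch_add(1, Ordering::Relaxed);
}
```

Relaxed ordering suffices here because the counters are independent statistics, not synchronization points.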

Prometheus Metrics

# Rate limit violations
rate_limit_violations_total{client_id="sk-abc...", endpoint="/v1/chat/completions"} 15

# Current request counts
rate_limit_current_requests{client_id="sk-abc...", window="minute"} 45
rate_limit_current_requests{client_id="sk-abc...", window="burst"} 8

# Rate limit status
rate_limit_remaining{client_id="sk-abc...", window="minute"} 55

Response Headers

Rate limit information is included in response headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1699574460
X-RateLimit-Burst-Limit: 20
X-RateLimit-Burst-Remaining: 12
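One plausible way these values are derived (an assumption; exact semantics depend on the implementation): remaining is the sustained budget minus requests used, and reset is the unix time at which the current 60-second window ends.

```rust
// Hypothetical header math; reproduces the example values above when the
// window started at unix time 1_699_574_400 with 45 requests used.
fn rate_limit_headers(limit: u32, used: u32, window_start_unix: u64) -> (u32, u64) {
    let remaining = limit.saturating_sub(used); // X-RateLimit-Remaining
    let reset_at = window_start_unix + 60;      // X-RateLimit-Reset
    (remaining, reset_at)
}
```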

Error Response

When rate limited, the API returns:

{
  "error": {
    "message": "Rate limit exceeded. Please retry after 45 seconds.",
    "type": "rate_limit_exceeded",
    "code": 429,
    "details": {
      "limit": 100,
      "remaining": 0,
      "reset_at": "2024-01-15T10:30:00Z",
      "retry_after": 45
    }
  }
}

Implementation Details

The rate limiter is implemented as a singleton service:

static MODEL_RATE_LIMITER: once_cell::sync::Lazy<ModelEndpointRateLimiter> =
    once_cell::sync::Lazy::new(ModelEndpointRateLimiter::new);

This ensures:

  • Single point of rate limit enforcement
  • Consistent state across all request handlers
  • Minimal memory overhead
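With recent Rust, the same pattern can be expressed with the standard library's std::sync::OnceLock instead of the once_cell crate. This is a sketch; the struct body is a placeholder.

```rust
use std::sync::OnceLock;

// Placeholder limiter with a single illustrative field.
struct ModelEndpointRateLimiter {
    max_requests_per_window: u32,
}

impl ModelEndpointRateLimiter {
    fn new() -> Self {
        ModelEndpointRateLimiter { max_requests_per_window: 100 }
    }
}

static MODEL_RATE_LIMITER: OnceLock<ModelEndpointRateLimiter> = OnceLock::new();

// Every handler goes through this accessor, so all requests share one instance.
fn limiter() -> &'static ModelEndpointRateLimiter {
    MODEL_RATE_LIMITER.get_or_init(ModelEndpointRateLimiter::new)
}
```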