Rate Limiting¶
Continuum Router provides advanced rate limiting capabilities to prevent abuse, ensure fair resource allocation, and protect backend services from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.
Overview¶
The router implements multi-tier rate limiting:
- Per-client limits: Prevent individual clients from overwhelming the system
- Per-backend limits: Protect individual backend services from overload
- Global limits: Ensure overall system stability
- Endpoint-specific limits: Special handling for critical endpoints
Table of Contents¶
- Configuration
- Client Identification
- Rate Limiting Strategies
- Response Headers
- Monitoring
- Bypass Mechanisms
- Storage Backends
- Best Practices
Configuration¶
Basic Configuration¶
```yaml
rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups
  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000
```
Per-API-Key Rate Limiting¶
You can configure custom rate limits for specific API keys:
```yaml
api_keys:
  - key: "premium-user-key"
    name: "Premium User"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200
  - key: "standard-user-key"
    name: "Standard User"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20
```
Bypass Configuration¶
Certain clients can bypass rate limiting entirely:
```yaml
rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"
  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
```
Client Identification¶
The router identifies clients using the following priority order:
1. API Key from `Authorization: Bearer <token>` header (preferred)
   - First 16 characters used as client identifier
   - Provides accurate tracking across different IPs
2. `X-Forwarded-For` header (proxy/load balancer scenarios)
   - Extracts real client IP from proxy headers
3. `X-Real-IP` header (alternative proxy header)
   - Fallback for different proxy configurations
4. Direct IP address (when no proxy headers present)
   - Used when request comes directly to the router
Client Identification Example¶
```yaml
# Configuration for client identification
rate_limiting:
  client_identification:
    priority:
      - api_key          # Bearer token (first 16 chars used as ID)
      - x_forwarded_for  # Proxy/load balancer header
      - x_real_ip       # Alternative IP header
    fallback: "unknown"  # When no identifier available
```
Rate Limiting Strategies¶
Token Bucket Algorithm¶
The router uses the token bucket algorithm, which allows for burst traffic while maintaining long-term rate limits:
- Bucket capacity: Maximum number of tokens (`burst_capacity`)
- Refill rate: Tokens added per second (`requests_per_second`)
- Token cost: Each request consumes one token
How It Works¶
- Each client starts with a full bucket of tokens
- Tokens are consumed with each request
- Tokens refill at a constant rate
- Requests are rejected when bucket is empty
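The steps above can be sketched as a minimal token bucket. This is an illustrative model only; the class and field names are assumptions, not the router's implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: burst capacity plus steady refill."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second (requests_per_second)
        self.capacity = capacity  # maximum tokens (burst_capacity)
        self.tokens = capacity    # each client starts with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Tokens refill at a constant rate, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False          # bucket empty: reject the request
```

With `rate=10` and `capacity=20`, a client can burst 20 requests immediately, then settles to a sustained 10 requests per second.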
Dual-Window Approach¶
For critical endpoints like /v1/models, the router uses a dual-window approach:
- Sustained limit: Prevents excessive usage over time (100 req/min)
- Burst protection: Catches rapid-fire requests (20 req/5s)
```yaml
# Example: /v1/models endpoint rate limiting
rate_limiting:
  models_endpoint:
    sustained_limit: 100  # Maximum requests per minute
    burst_limit: 20       # Maximum requests in any 5-second window
    window_duration: 60s  # Sliding window for sustained limit
    burst_window: 5s      # Window for burst detection
```
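The dual-window check can be modeled with two sliding windows, admitting a request only when both permit it. This is a sketch of the idea, not the router's code:

```python
import time
from collections import deque

class DualWindowLimiter:
    """Allow a request only if both the sustained and burst windows permit it."""

    def __init__(self, sustained_limit=100, window=60.0, burst_limit=20, burst_window=5.0):
        self.windows = [
            (sustained_limit, window, deque()),    # e.g. 100 req/min
            (burst_limit, burst_window, deque()),  # e.g. 20 req/5s
        ]

    def allow(self) -> bool:
        now = time.monotonic()
        for limit, span, events in self.windows:
            # Drop timestamps that have aged out of this window.
            while events and now - events[0] > span:
                events.popleft()
            if len(events) >= limit:
                return False  # either window exhausted -> reject
        for _, _, events in self.windows:
            events.append(now)
        return True
```

A rapid-fire burst trips the 5-second window long before the per-minute window fills, which is exactly the protection the dual-window approach provides.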
Response Headers¶
Success Response¶
When a request is within rate limits:
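The original example is not shown here; for illustration, a success response conventionally carries remaining-quota headers like the following (the `X-RateLimit-*` names are a common convention and an assumption, not confirmed for this router):

```
HTTP/1.1 200 OK
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 7
X-RateLimit-Reset: 1717777777
```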
Rate Limited Response¶
When rate limit is exceeded:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```
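Clients should honor the `Retry-After` header when they receive a 429. A minimal retry loop might look like this (a sketch; `send_request` is a placeholder for your actual HTTP call):

```python
import time

def call_with_retry(send_request, max_attempts=3):
    """Retry on 429, sleeping for the server-suggested Retry-After interval."""
    status, headers, body = None, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send_request()
        if status != 429:
            return status, body
        # Honor the server's Retry-After hint; default to 1s if absent.
        time.sleep(float(headers.get("Retry-After", 1)))
    return status, body
```

On repeated 429s the loop gives up after `max_attempts`, returning the last status so the caller can surface the error.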
Monitoring¶
Metrics¶
Rate limit violations are tracked in Prometheus metrics:
```
# Total rejected requests by client
rate_limit_violations_total{client_id="abc123...",endpoint="/v1/chat/completions"} 42

# Current token bucket levels
rate_limit_tokens_available{client_id="abc123...",tier="per_client"} 15

# Rate limit bypass events
rate_limit_bypassed_total{reason="whitelisted_ip"} 123
```
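These metrics can feed a Prometheus alert. A sketch of such a rule, assuming standard Prometheus alerting syntax (the threshold and group name are illustrative):

```yaml
groups:
  - name: rate-limiting
    rules:
      - alert: HighRateLimitViolations
        # Fire when any client is rejected more than 1 req/s, sustained for 5 minutes.
        expr: rate(rate_limit_violations_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sustained rate-limit violations for {{ $labels.client_id }}"
```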
Logging¶
Rate limit events are logged with context:
```json
{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}
```
Bypass Mechanisms¶
IP Whitelist¶
Whitelist trusted IP addresses or CIDR ranges:
```yaml
rate_limiting:
  whitelist:
    - "192.168.1.0/24"  # Internal network
    - "10.0.0.1"        # Admin server
    - "172.16.0.0/16"   # Corporate network
```
API Key Bypass¶
Certain API keys can bypass all rate limits:
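For example, reusing the `bypass_keys` setting shown in the configuration section:

```yaml
rate_limiting:
  bypass_keys:
    - "admin-key-123"       # Administrative access
    - "monitoring-key-456"  # Monitoring systems
```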
Health Check Exemption¶
Health check endpoints are automatically exempt from rate limiting:
- `/health`
- `/health/ready`
- `/health/live`
- `/metrics`
Storage Backends¶
Memory Storage (Default)¶
In-memory storage is fast but not shared across instances:
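Memory storage is the default and needs no extra configuration beyond selecting it:

```yaml
rate_limiting:
  storage: memory
```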
Pros:

- No external dependencies
- Low latency
- Simple setup

Cons:

- Not shared across router instances
- Lost on restart
- Limited to single-instance deployments
Redis Storage (Distributed)¶
Redis storage enables distributed rate limiting across multiple router instances:
```yaml
rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    db: 0
    password: "${REDIS_PASSWORD}"
    pool_size: 10
```
Pros:

- Shared across all router instances
- Persistent across restarts
- Accurate global limits

Cons:

- Requires Redis infrastructure
- Slightly higher latency
- Additional operational complexity
Hot Reload Support¶
Rate limiting configuration supports hot reload for immediate updates:
```yaml
# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly
```
Best Practices¶
1. Start Conservative¶
Begin with stricter limits and relax them based on actual usage:
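For instance, starting well below the defaults shown earlier (the exact numbers are illustrative, not a recommendation for your workload):

```yaml
rate_limiting:
  limits:
    per_client:
      requests_per_second: 5  # tighten first, relax once traffic is understood
      burst_capacity: 10
```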
2. Monitor and Adjust¶
Use metrics to understand actual traffic patterns:
- Track the `rate_limit_violations_total` metric
- Identify legitimate vs. abusive traffic
- Adjust limits based on data
3. Use Different Tiers¶
Implement tiered rate limits for different user classes:
```yaml
api_keys:
  - key: "free-tier-key"
    rate_limit:
      requests_per_second: 1
      burst_capacity: 5
  - key: "pro-tier-key"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20
  - key: "enterprise-tier-key"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200
```
4. Protect Critical Endpoints¶
Apply stricter limits to expensive operations:
```yaml
rate_limiting:
  endpoint_overrides:
    "/v1/chat/completions":
      per_client:
        requests_per_second: 5   # Stricter for expensive endpoints
    "/v1/models":
      per_client:
        requests_per_second: 10  # More lenient for cheap endpoints
```
5. Use Redis for Production¶
For multi-instance deployments, use Redis:
```yaml
rate_limiting:
  storage: redis
  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 20
    connect_timeout: 5s
```
Troubleshooting¶
Common Issues¶
Rate limits not applying¶
Symptom: Clients can exceed configured limits
Solutions:

1. Check if the client is whitelisted
2. Verify `rate_limiting.enabled` is `true`
3. Check logs for rate limiting initialization
4. Ensure the API key format is correct
Too many false positives¶
Symptom: Legitimate traffic being rate limited
Solutions:

1. Increase `burst_capacity` for bursty traffic
2. Review client identification (may be grouping multiple clients)
3. Consider using per-API-key limits
4. Add legitimate IPs to the whitelist
Redis connection issues¶
Symptom: Rate limiting not working with Redis storage
Solutions:

1. Verify Redis connectivity
2. Check Redis authentication
3. Review connection pool settings
4. Monitor Redis performance
Debug Logging¶
Enable debug logging for rate limiting:
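A logging configuration along these lines is typical, but the `logging` key and module name below are assumptions; consult the Configuration Guide for the exact schema:

```yaml
logging:
  level: debug
  modules:
    rate_limiting: debug
```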
Future Enhancements¶
Planned improvements to the rate limiting system:
- Per-endpoint configuration: Custom limits for each API endpoint
- Dynamic rate adjustment: Automatic scaling based on backend capacity
- Distributed coordination: Better Redis integration for multi-region deployments
- Cost-based limiting: Different token costs for different operations
- Quota management: Monthly/daily quota limits in addition to rate limits
- Advanced analytics: Real-time dashboards for rate limit monitoring
Related Documentation¶
- Rate Limiting Architecture - Implementation details and design decisions
- Configuration Guide - Complete configuration reference
- Admin API - Runtime configuration management
- Metrics - Monitoring and observability
- Error Handling - Error codes and retry strategies