Rate Limiting

Continuum Router provides advanced rate limiting capabilities to prevent abuse, ensure fair resource allocation, and protect backend services from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.

Overview

The router implements multi-tier rate limiting:

  • Per-client limits: Prevent individual clients from overwhelming the system
  • Per-backend limits: Protect individual backend services from overload
  • Global limits: Ensure overall system stability
  • Endpoint-specific limits: Special handling for critical endpoints
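
To illustrate how the tiers above compose, here is a minimal sketch (not the router's actual code): a request is admitted only when every applicable tier has capacity, and a rejection at one tier must not consume tokens at the others. Refill is omitted for brevity; the `Tier` and `admit` names are illustrative.

```python
class Tier:
    """Minimal fixed-capacity tier for illustration (no refill)."""
    def __init__(self, capacity: int):
        self.remaining = capacity

def admit(tiers: list) -> bool:
    """Admit a request only if every tier (client, backend, global) allows it."""
    # Check all tiers first, so a rejection at one tier doesn't
    # consume tokens at the others.
    if all(t.remaining > 0 for t in tiers):
        for t in tiers:
            t.remaining -= 1
        return True
    return False
```

The exhausted tier determines the rejection: a single abusive client hits its per-client limit long before the backend or global tiers are affected.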

Configuration

Basic Configuration

rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups

  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000

Per-API-Key Rate Limiting

You can configure custom rate limits for specific API keys:

api_keys:
  - key: "premium-user-key"
    name: "Premium User"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200

  - key: "standard-user-key"
    name: "Standard User"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20

Bypass Configuration

Certain clients can bypass rate limiting entirely:

rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"

  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"

Client Identification

The router identifies clients using the following priority order:

  1. API key from the Authorization: Bearer <token> header (preferred)
     • The first 16 characters of the token are used as the client identifier
     • Provides accurate tracking across different IPs
  2. X-Forwarded-For header (proxy/load balancer scenarios)
     • Extracts the real client IP from proxy headers
  3. X-Real-IP header (alternative proxy header)
     • Fallback for different proxy configurations
  4. Direct IP address (when no proxy headers are present)
     • Used when the request comes directly to the router

Client Identification Example

# Configuration for client identification
rate_limiting:
  client_identification:
    priority:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
    fallback: "unknown"           # When no identifier available
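
The priority order can be sketched as a small resolver function. This is a minimal illustration of the documented order, not the router's actual implementation; the function name and `(headers, remote_addr)` signature are assumptions.

```python
def identify_client(headers: dict, remote_addr: str) -> str:
    """Resolve a rate-limit client ID using the documented priority order."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        # API key: the first 16 characters of the token become the client ID
        return auth[len("Bearer "):][:16]
    if headers.get("X-Forwarded-For"):
        # Proxy/load balancer: the first hop is the original client IP
        return headers["X-Forwarded-For"].split(",")[0].strip()
    if headers.get("X-Real-IP"):
        return headers["X-Real-IP"]
    return remote_addr or "unknown"  # direct connection, or fallback
```

Note that X-Forwarded-For can carry a comma-separated chain of IPs; only the leftmost entry identifies the original client.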

Rate Limiting Strategies

Token Bucket Algorithm

The router uses the token bucket algorithm, which allows for burst traffic while maintaining long-term rate limits:

  • Bucket capacity: Maximum number of tokens (burst_capacity)
  • Refill rate: Tokens added per second (requests_per_second)
  • Token cost: Each request consumes one token

How It Works

  1. Each client starts with a full bucket of tokens
  2. Tokens are consumed with each request
  3. Tokens refill at a constant rate
  4. Requests are rejected when bucket is empty
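
The four steps above can be sketched as follows. This is a simplified single-threaded model for illustration, not the router's implementation (which would also need locking or atomic storage operations):

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/s."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity           # 1. start with a full bucket
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # 3. refill at a constant rate, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost          # 2. each request consumes one token
            return True
        return False                     # 4. bucket empty: reject (HTTP 429)
```

With `requests_per_second: 10` and `burst_capacity: 20`, a quiet client can fire 20 requests at once, then sustain 10/s thereafter.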

Dual-Window Approach

For critical endpoints like /v1/models, the router uses a dual-window approach:

  • Sustained limit: Prevents excessive usage over time (100 req/min)
  • Burst protection: Catches rapid-fire requests (20 req/5s)

# Example: /v1/models endpoint rate limiting
rate_limiting:
  models_endpoint:
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection
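
A sliding-window sketch of the dual-window check might look like the following. This is an illustrative model only (it stores a timestamp per accepted request, which the real implementation may avoid); the class name is an assumption.

```python
import time
from collections import deque

class DualWindowLimiter:
    """Sliding-window check: a 60s sustained limit plus a 5s burst limit."""
    def __init__(self, sustained=100, burst=20, window=60.0, burst_window=5.0):
        self.sustained, self.burst = sustained, burst
        self.window, self.burst_window = window, burst_window
        self.hits = deque()  # timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop hits that fell out of the sustained window
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        recent = sum(1 for t in self.hits if now - t <= self.burst_window)
        # Reject if either window is exhausted
        if len(self.hits) >= self.sustained or recent >= self.burst:
            return False
        self.hits.append(now)
        return True
```

A client spacing requests out stays under the burst window and can use the full sustained budget; rapid-fire requests trip the 5-second window first.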

Response Headers

Success Response

When a request is within rate limits:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200

Rate Limited Response

When rate limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
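
Clients should honor the Retry-After header rather than retrying immediately. A minimal client-side retry loop, assuming a `send` callable that returns `(status_code, headers, body)` (an abstraction for whatever HTTP client you use):

```python
import time

def send_with_retry(send, max_retries=3):
    """Retry a request when the router answers 429, honoring Retry-After."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        # Wait as instructed before retrying; default to 1s if header is absent
        retry_after = float(headers.get("Retry-After", 1))
        if attempt < max_retries:
            time.sleep(retry_after)
    return status, headers, body
```

Adding jitter to the sleep is a common refinement so that many throttled clients do not all retry at the same instant.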

Monitoring

Metrics

Rate limit violations are tracked in Prometheus metrics:

# Total rejected requests by client
rate_limit_violations_total{client_id="abc123...",endpoint="/v1/chat/completions"} 42

# Current token bucket levels
rate_limit_tokens_available{client_id="abc123...",tier="per_client"} 15

# Rate limit bypass events
rate_limit_bypassed_total{reason="whitelisted_ip"} 123

Logging

Rate limit events are logged with context:

{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}

Bypass Mechanisms

IP Whitelist

Whitelist trusted IP addresses or CIDR ranges:

rate_limiting:
  whitelist:
    - "192.168.1.0/24"      # Internal network
    - "10.0.0.1"            # Admin server
    - "172.16.0.0/16"       # Corporate network

API Key Bypass

Certain API keys can bypass all rate limits:

rate_limiting:
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
    - "load-test-key-789"

Health Check Exemption

Health check endpoints are automatically exempt from rate limiting:

  • /health
  • /health/ready
  • /health/live
  • /metrics

Storage Backends

Memory Storage (Default)

In-memory storage is fast but not shared across instances:

rate_limiting:
  storage: memory

Pros:

  • No external dependencies
  • Low latency
  • Simple setup

Cons:

  • Not shared across router instances
  • Lost on restart
  • Limited to single-instance deployments

Redis Storage (Distributed)

Redis storage enables distributed rate limiting across multiple router instances:

rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    db: 0
    password: "${REDIS_PASSWORD}"
    pool_size: 10

Pros:

  • Shared across all router instances
  • Persistent across restarts
  • Accurate global limits

Cons:

  • Requires Redis infrastructure
  • Slightly higher latency
  • Additional operational complexity

Hot Reload Support

Rate limiting configuration supports hot reload for immediate updates:

# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly

Best Practices

1. Start Conservative

Begin with stricter limits and relax them based on actual usage:

rate_limiting:
  limits:
    per_client:
      requests_per_second: 5      # Start low
      burst_capacity: 10

2. Monitor and Adjust

Use metrics to understand actual traffic patterns:

  • Track rate_limit_violations_total metric
  • Identify legitimate vs. abusive traffic
  • Adjust limits based on data

3. Use Different Tiers

Implement tiered rate limits for different user classes:

api_keys:
  - key: "free-tier-key"
    rate_limit:
      requests_per_second: 1
      burst_capacity: 5

  - key: "pro-tier-key"
    rate_limit:
      requests_per_second: 10
      burst_capacity: 20

  - key: "enterprise-tier-key"
    rate_limit:
      requests_per_second: 100
      burst_capacity: 200

4. Protect Critical Endpoints

Apply stricter limits to expensive operations:

rate_limiting:
  endpoint_overrides:
    "/v1/chat/completions":
      per_client:
        requests_per_second: 5    # Stricter for expensive endpoints
    "/v1/models":
      per_client:
        requests_per_second: 10   # More lenient for cheap endpoints

5. Use Redis for Production

For multi-instance deployments, use Redis:

rate_limiting:
  storage: redis
  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 20
    connect_timeout: 5s

Troubleshooting

Common Issues

Rate limits not applying

Symptom: Clients can exceed configured limits

Solutions:

  1. Check whether the client is whitelisted
  2. Verify that rate_limiting.enabled is true
  3. Check logs for rate limiting initialization
  4. Ensure the API key format is correct

Too many false positives

Symptom: Legitimate traffic being rate limited

Solutions:

  1. Increase burst_capacity for bursty traffic
  2. Review client identification (it may be grouping multiple clients together)
  3. Consider using per-API-key limits
  4. Add legitimate IPs to the whitelist

Redis connection issues

Symptom: Rate limiting not working with Redis storage

Solutions:

  1. Verify Redis connectivity
  2. Check Redis authentication
  3. Review connection pool settings
  4. Monitor Redis performance

Debug Logging

Enable debug logging for rate limiting:

logging:
  level: debug
  modules:
    rate_limiting: debug

Future Enhancements

Planned improvements to the rate limiting system:

  • Per-endpoint configuration: Custom limits for each API endpoint
  • Dynamic rate adjustment: Automatic scaling based on backend capacity
  • Distributed coordination: Better Redis integration for multi-region deployments
  • Cost-based limiting: Different token costs for different operations
  • Quota management: Monthly/daily quota limits in addition to rate limits
  • Advanced analytics: Real-time dashboards for rate limit monitoring