Rate Limiting

Continuum Router rate-limits requests to prevent abuse, allocate resources fairly, and protect backends from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.

Overview

The router implements multi-tier rate limiting:

  • Per-client limits: Prevent individual clients from overwhelming the system
  • Per-backend limits: Protect individual backend services from overload
  • Per-API-key limits: Apply limits per authenticated key, with per-key overrides
  • Per-model limits: Protect specific (expensive) models
  • Global limits: Ensure overall system stability

Configuration

Basic Configuration

rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups

  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000

Per-API-Key Rate Limiting

Two mechanisms apply to API keys:

  1. A per_api_key dimension inside rate_limiting.limits that applies token-bucket limits to every keyed client:

    rate_limiting:
      limits:
        per_api_key:
          requests_per_second: 10
          burst_capacity: 20
    
  2. A per-key rate_limit override (requests per minute) on individual entries in the api_keys section:

    api_keys:
      api_keys:
        - key: "premium-user-key"
          name: "Premium User"
          rate_limit: 600        # requests per minute
    
        - key: "standard-user-key"
          name: "Standard User"
          rate_limit: 60         # requests per minute
    

Bypass Configuration

Certain clients can bypass rate limiting entirely:

rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"

  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"

Client Identification

The router identifies clients using the following priority order:

  1. API Key from the Authorization: Bearer <token> or x-api-key header (preferred)
     • Provides accurate tracking across different IPs
  2. X-Forwarded-For header (proxy/load balancer scenarios)
     • Honored only when the request comes from a trusted proxy
     • With use_rightmost_xff: true (default), the rightmost IP in the chain is used, which is harder to spoof
  3. X-Real-IP header (alternative proxy header)
     • Also honored only from trusted proxies
  4. Direct IP address (when no proxy headers present)
     • Used when the request comes directly to the router
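
The priority order above can be sketched in Python as follows. This is an illustration under stated assumptions, not the router's implementation: identify_client, its parameters, and the "key:" / "ip:" prefixes are hypothetical, header names are assumed to arrive lower-cased, and checking Authorization before x-api-key is an assumption, since the two are listed together without an order.

```python
import ipaddress
from typing import Mapping, Sequence


def _from_trusted_proxy(peer_ip: str, trusted_proxies: Sequence[str]) -> bool:
    """True if the direct peer matches a trusted IP or CIDR range."""
    addr = ipaddress.ip_address(peer_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in trusted_proxies
    )


def identify_client(
    headers: Mapping[str, str],
    peer_ip: str,
    trusted_proxies: Sequence[str] = (),
    use_rightmost_xff: bool = True,
) -> str:
    """Hypothetical helper: derive a rate-limit identity for one request."""
    # 1. API key (Bearer token first; that ordering is an assumption).
    auth = headers.get("authorization", "")
    if auth.lower().startswith("bearer ") and auth[7:].strip():
        return "key:" + auth[7:].strip()
    api_key = headers.get("x-api-key", "").strip()
    if api_key:
        return "key:" + api_key

    # 2 and 3. Forwarding headers count only when a trusted proxy sent them.
    if _from_trusted_proxy(peer_ip, trusted_proxies):
        chain = [
            part.strip()
            for part in headers.get("x-forwarded-for", "").split(",")
            if part.strip()
        ]
        if chain:
            return "ip:" + (chain[-1] if use_rightmost_xff else chain[0])
        real_ip = headers.get("x-real-ip", "").strip()
        if real_ip:
            return "ip:" + real_ip

    # 4. Fall back to the address of the direct connection.
    return "ip:" + peer_ip
```

A request that carries a spoofed X-Forwarded-For header but arrives from an address outside trusted_proxies falls through to step 4, so it is tracked by its real peer address.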

Trusted Proxy Configuration

rate_limiting:
  # Proxies allowed to set X-Forwarded-For / X-Real-IP headers
  trusted_proxies:
    - "10.0.0.0/8"
    - "192.168.1.1"
  # Use the rightmost IP in the X-Forwarded-For chain (default: true)
  use_rightmost_xff: true

Forwarding headers from untrusted sources are ignored, so clients cannot evade per-client limits by spoofing X-Forwarded-For.

Rate Limiting Strategies

Token Bucket Algorithm

The router uses the token bucket algorithm, which permits short bursts of traffic while enforcing a long-term average rate:

  • Bucket capacity: Maximum number of tokens (burst_capacity)
  • Refill rate: Tokens added per second (requests_per_second)
  • Token cost: Each request consumes one token

How It Works

  1. Each client starts with a full bucket of tokens
  2. Tokens are consumed with each request
  3. Tokens refill at a constant rate
  4. Requests are rejected when the bucket is empty
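
The four steps above map onto a few lines of Python. This is a minimal sketch of the general token bucket technique, not the router's code: the TokenBucket class and its allow method are hypothetical names, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Minimal token bucket: refill continuously, spend one token per request."""

    def __init__(self, requests_per_second: float, burst_capacity: float,
                 now: float = 0.0) -> None:
        self.rate = requests_per_second      # refill rate, tokens per second
        self.capacity = burst_capacity       # maximum tokens the bucket holds
        self.tokens = float(burst_capacity)  # step 1: start with a full bucket
        self.updated_at = now                # time of the last refill

    def allow(self, now: float) -> bool:
        """Return True and consume a token if the request is admitted."""
        # Step 3: add tokens for the elapsed time, capped at the capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(self.updated_at, now)

        # Step 2: each admitted request costs exactly one token.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Step 4: an empty bucket rejects the request.
        return False
```

Under this model, requests_per_second: 10 with burst_capacity: 20 lets an idle client send 20 requests at once; after that, one further request is admitted for every 100 ms that passes.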

Per-Model Limits

In addition to the per-client, per-backend, per-API-key, and global dimensions, limits can target individual models. This is useful for protecting expensive models while leaving cheaper ones unconstrained:

rate_limiting:
  limits:
    per_model:
      gpt-4:
        requests_per_second: 2
        burst_capacity: 5
      gpt-3.5-turbo:
        requests_per_second: 20
        burst_capacity: 50
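
How the dimensions combine for a single request is not spelled out here. One plausible reading, shown below as an assumption only, is all-or-nothing admission: a request passes only if every applicable dimension still has a token, and a rejected request consumes nothing. The admit function and the bucket names are hypothetical, and refill is omitted to keep the sketch short.

```python
from typing import Dict, Optional, Sequence, Tuple


def admit(tokens: Dict[str, float],
          dimensions: Sequence[str]) -> Tuple[bool, Optional[str]]:
    """Admit a request only if every listed dimension has a token left.

    Returns (True, None) on success, or (False, name) with the first
    dimension that is exhausted. Assumed semantics, not confirmed ones.
    """
    # Check every dimension before spending anything, so a rejection
    # by one tier does not drain the others.
    for name in dimensions:
        if tokens.get(name, 0.0) < 1.0:
            return False, name
    for name in dimensions:
        tokens[name] -= 1.0
    return True, None
```

With this shape, a tight per_model bucket for gpt-4 rejects a request even when the client's own bucket and the global bucket still have room.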

Response Headers

Success Response

When a request is within rate limits:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
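
A client can read these headers to pace itself before it hits a limit. The sketch below assumes X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but the text does not state; read_rate_limit_headers and RateLimitStatus are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Mapping, NamedTuple


class RateLimitStatus(NamedTuple):
    limit: int          # X-RateLimit-Limit
    remaining: int      # X-RateLimit-Remaining
    reset_at: datetime  # X-RateLimit-Reset, read as Unix seconds (assumption)


def read_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitStatus:
    """Parse the three X-RateLimit-* headers from a response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RateLimitStatus(
        limit=int(lowered["x-ratelimit-limit"]),
        remaining=int(lowered["x-ratelimit-remaining"]),
        reset_at=datetime.fromtimestamp(
            int(lowered["x-ratelimit-reset"]), tz=timezone.utc
        ),
    )
```

Read this way, the sample value 1640995200 corresponds to 2022-01-01 00:00:00 UTC.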

Rate Limited Response

When the rate limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
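
On the client side, a 429 response like the one above is usually handled by waiting for the Retry-After interval before retrying. The helper below is a hedged sketch: parse_rate_limit_response and the one-second fallback are hypothetical, and only the delay-in-seconds form of Retry-After is handled (HTTP also permits a date).

```python
import json
from typing import Mapping, Optional, Tuple

DEFAULT_DELAY_SECONDS = 1.0  # fallback when Retry-After is missing (assumption)


def parse_rate_limit_response(
    status: int, headers: Mapping[str, str], body: str
) -> Optional[Tuple[float, str]]:
    """Return (seconds_to_wait, error_message) for a 429, otherwise None."""
    if status != 429:
        return None

    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        delay = max(0.0, float(lowered.get("retry-after", "")))
    except ValueError:
        delay = DEFAULT_DELAY_SECONDS

    # The body follows the {"error": {"message", "type", "code"}} shape above;
    # anything else is treated as "no message".
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        error = {}
    if not isinstance(error, dict):
        error = {}
    return delay, str(error.get("message", ""))
```

A caller would sleep for the returned delay and then retry, ideally with a cap on the number of attempts.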

Monitoring

Metrics

Rate limiting activity is tracked in Prometheus metrics:

# Total rate limit checks performed
rate_limit_checks_total 10342

# Requests rejected, labeled by limit dimension
rate_limit_exceeded_total{limit_type="per_client"} 42

# Current token bucket levels and capacity
rate_limit_tokens_remaining{bucket_type="per_client",identifier="..."} 15
rate_limit_bucket_capacity{bucket_type="per_client"} 20

# Whitelisted and bypassed (API key) requests
rate_limit_whitelisted_requests_total 123
rate_limit_bypassed_requests_total 45
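
A useful derived figure is the share of checks that end in a rejection. The sketch below computes it from a scrape of the /metrics endpoint in the Prometheus text format; sum_metric and rejection_ratio are hypothetical helpers, and SAMPLE simply repeats the values shown above.

```python
import re

SAMPLE = """\
# Total rate limit checks performed
rate_limit_checks_total 10342
# Requests rejected, labeled by limit dimension
rate_limit_exceeded_total{limit_type="per_client"} 42
rate_limit_tokens_remaining{bucket_type="per_client",identifier="..."} 15
"""

# name, optional {labels}, value, optional integer timestamp
_SAMPLE_LINE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)(\s+-?\d+)?$'
)


def sum_metric(text: str, name: str) -> float:
    """Sum one metric across all of its label combinations."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        match = _SAMPLE_LINE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def rejection_ratio(text: str) -> float:
    """Fraction of rate limit checks that were rejected (0.0 if no checks)."""
    checks = sum_metric(text, "rate_limit_checks_total")
    if checks == 0:
        return 0.0
    return sum_metric(text, "rate_limit_exceeded_total") / checks
```

For the sample values this is 42 rejections out of 10342 checks, about 0.4 percent. Both metrics are counters, so in a monitoring system the ratio is normally computed over a time window rather than over lifetime totals.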

Logging

Rate limit events are logged with context:

{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}
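
Log events in this shape are straightforward to aggregate. The sketch below counts rejections per client and endpoint, assuming the logs are emitted as one JSON object per line; that layout and the count_rate_limit_events name are assumptions.

```python
import json
from collections import Counter
from typing import Iterable


def count_rate_limit_events(lines: Iterable[str]) -> Counter:
    """Count "Rate limit exceeded" events by (client_id, endpoint)."""
    counts: Counter = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # ignore lines that are not JSON
        if isinstance(event, dict) and event.get("msg") == "Rate limit exceeded":
            counts[(event.get("client_id"), event.get("endpoint"))] += 1
    return counts
```

Calling most_common() on the result lists the clients that hit limits most often, which helps separate abusive traffic from limits that are simply too tight.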

Bypass Mechanisms

IP Whitelist

Whitelist trusted IP addresses or CIDR ranges:

rate_limiting:
  whitelist:
    - "192.168.1.0/24"      # Internal network
    - "10.0.0.1"            # Admin server
    - "172.16.0.0/16"       # Corporate network
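
To check which addresses a whitelist like this covers, entries can be matched with Python's standard ipaddress module. This is a sketch of the matching rule only, not the router's code, and is_whitelisted is a hypothetical helper; a bare address such as 10.0.0.1 is treated as a single-host range.

```python
import ipaddress
from typing import Sequence


def is_whitelisted(client_ip: str, whitelist: Sequence[str]) -> bool:
    """True if client_ip equals a listed address or falls inside a listed CIDR."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in whitelist
    )
```

With the three entries above, 192.168.1.77 and 172.16.5.4 match, while 10.0.0.2 does not, because 10.0.0.1 is listed as a single host rather than a range.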

API Key Bypass

Certain API keys can bypass all rate limits:

rate_limiting:
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
    - "load-test-key-789"

Health Check Exemption

Health check and metrics endpoints are automatically exempt from rate limiting:

  • /health
  • /healthz
  • /metrics

Storage Backends

Memory Storage (Default)

In-memory storage is fast but not shared across instances:

rate_limiting:
  storage: memory

Pros:

  • No external dependencies
  • Low latency
  • Simple setup

Cons:

  • Not shared across router instances
  • Lost on restart
  • Limited to single-instance deployments

Redis Storage (Distributed)

Redis storage enables distributed rate limiting across multiple router instances:

rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    key_prefix: "continuum:ratelimit:"   # Prefix for rate limit keys (default)
    ttl: 3600                            # TTL for rate limit keys in seconds

Credentials can be embedded in the URL (redis://:password@host:6379), with environment variable substitution available via ${REDIS_PASSWORD}-style references.

Pros:

  • Shared across all router instances
  • Persistent across restarts
  • Accurate global limits

Cons:

  • Requires Redis infrastructure
  • Slightly higher latency
  • Additional operational complexity

Hot Reload Support

Rate limiting configuration supports hot reload for immediate updates:

# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly

Best Practices

1. Start Conservative

Begin with stricter limits and relax them based on actual usage:

rate_limiting:
  limits:
    per_client:
      requests_per_second: 5      # Start low
      burst_capacity: 10

2. Monitor and Adjust

Use metrics to understand actual traffic patterns:

  • Track the rate_limit_exceeded_total metric
  • Identify legitimate vs. abusive traffic
  • Adjust limits based on data

3. Use Different Tiers

Implement tiered rate limits for different user classes with per-key overrides (requests per minute):

api_keys:
  api_keys:
    - key: "free-tier-key"
      rate_limit: 60

    - key: "pro-tier-key"
      rate_limit: 600

    - key: "enterprise-tier-key"
      rate_limit: 6000

4. Protect Expensive Models

Apply stricter limits to costly models with per_model limits:

rate_limiting:
  limits:
    per_model:
      gpt-4:
        requests_per_second: 2
        burst_capacity: 5

5. Use Redis for Production

For multi-instance deployments, use Redis:

rate_limiting:
  storage: redis
  redis:
    url: "redis://redis-cluster:6379"

Troubleshooting

Common Issues

Rate limits not applying

Symptom: Clients can exceed configured limits

Solutions:

  1. Check if client is whitelisted
  2. Verify rate_limiting.enabled is true
  3. Check logs for rate limiting initialization
  4. Ensure API key format is correct

Too many false positives

Symptom: Legitimate traffic being rate limited

Solutions:

  1. Increase burst_capacity for bursty traffic
  2. Review client identification (may be grouping multiple clients)
  3. Consider using per-API-key limits
  4. Add legitimate IPs to whitelist

Redis connection issues

Symptom: Rate limiting not working with Redis storage

Solutions:

  1. Verify Redis connectivity
  2. Check Redis authentication
  3. Review connection pool settings
  4. Monitor Redis performance

Debug Logging

Enable debug logging to see rate limiting decisions in the logs:

logging:
  level: debug