Rate Limiting¶

Continuum Router rate-limits requests to prevent abuse, allocate resources fairly, and protect backends from overload. The rate limiting system uses a token bucket algorithm with multiple layers of protection.

Overview¶

The router implements multi-tier rate limiting:

Per-client limits: Prevent individual clients from overwhelming the system
Per-backend limits: Protect individual backend services from overload
Per-API-key limits: Apply limits per authenticated key, with per-key overrides
Per-model limits: Protect specific (expensive) models
Global limits: Ensure overall system stability

Configuration¶

Basic Configuration¶

rate_limiting:
  enabled: true
  storage: memory  # or "redis" for distributed setups

  limits:
    per_client:
      requests_per_second: 10
      burst_capacity: 20
    per_backend:
      requests_per_second: 100
      burst_capacity: 200
    global:
      requests_per_second: 1000
      burst_capacity: 2000

Per-API-Key Rate Limiting¶

Two mechanisms apply to API keys:

A per_api_key dimension inside rate_limiting.limits that applies token-bucket limits to every keyed client:

rate_limiting:
  limits:
    per_api_key:
      requests_per_second: 10
      burst_capacity: 20

A per-key rate_limit override (requests per minute) on individual entries in the api_keys section:

api_keys:
  api_keys:
    - key: "premium-user-key"
      name: "Premium User"
      rate_limit: 600        # requests per minute

    - key: "standard-user-key"
      name: "Standard User"
      rate_limit: 60         # requests per minute

Bypass Configuration¶

Certain clients can bypass rate limiting entirely:

rate_limiting:
  # Whitelist IPs that bypass rate limiting
  whitelist:
    - "192.168.1.0/24"
    - "10.0.0.1"

  # API keys that bypass rate limiting
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"

Client Identification¶

The router identifies clients using the following priority order:

1. API key from the Authorization: Bearer <token> or x-api-key header (preferred)
   - Provides accurate tracking across different IPs
2. X-Forwarded-For header (proxy/load balancer scenarios)
   - Honored only when the request comes from a trusted proxy
   - With use_rightmost_xff: true (default), the rightmost IP in the chain is used, which is harder to spoof
3. X-Real-IP header (alternative proxy header)
   - Also honored only from trusted proxies
4. Direct IP address (when no proxy headers are present)
   - Used when the request comes directly to the router

Trusted Proxy Configuration¶

rate_limiting:
  # Proxies allowed to set X-Forwarded-For / X-Real-IP headers
  trusted_proxies:
    - "10.0.0.0/8"
    - "192.168.1.1"
  # Use the rightmost IP in the X-Forwarded-For chain (default: true)
  use_rightmost_xff: true

Forwarding headers from untrusted sources are ignored, so clients cannot evade per-client limits by spoofing X-Forwarded-For.
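The priority order and the trusted-proxy rule can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the router's code: identify_client, is_trusted, and the "key:" / "ip:" prefixes are hypothetical names chosen for the example.

```python
import ipaddress
from typing import Mapping, Sequence


def is_trusted(peer_ip: str, trusted_proxies: Sequence[str]) -> bool:
    """True if the directly connected peer matches a trusted IP or CIDR range."""
    addr = ipaddress.ip_address(peer_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in trusted_proxies
    )


def identify_client(
    headers: Mapping[str, str],
    peer_ip: str,
    trusted_proxies: Sequence[str],
    use_rightmost_xff: bool = True,
) -> str:
    """Return a client identifier following the documented priority order."""
    h = {name.lower(): value for name, value in headers.items()}

    # 1. API key (Authorization: Bearer <token> or x-api-key).
    auth = h.get("authorization", "")
    if auth.lower().startswith("bearer ") and auth[7:].strip():
        return "key:" + auth[7:].strip()
    if h.get("x-api-key"):
        return "key:" + h["x-api-key"]

    # 2 and 3. Forwarding headers, honored only from trusted proxies.
    if is_trusted(peer_ip, trusted_proxies):
        chain = [p.strip() for p in h.get("x-forwarded-for", "").split(",") if p.strip()]
        if chain:
            return "ip:" + (chain[-1] if use_rightmost_xff else chain[0])
        if h.get("x-real-ip"):
            return "ip:" + h["x-real-ip"].strip()

    # 4. Fall back to the address of the direct connection.
    return "ip:" + peer_ip
```

With trusted_proxies set to the values from the configuration above, a request from 203.0.113.9 that carries its own X-Forwarded-For header is still identified by 203.0.113.9, because that peer is not a trusted proxy.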

Rate Limiting Strategies¶

Token Bucket Algorithm¶

The router uses the token bucket algorithm, which allows for burst traffic while maintaining long-term rate limits:

Bucket capacity: Maximum number of tokens (burst_capacity)
Refill rate: Tokens added per second (requests_per_second)
Token cost: Each request consumes one token

How It Works¶

Each client starts with a full bucket of tokens
Tokens are consumed with each request
Tokens refill at a constant rate (requests_per_second), up to the bucket capacity (burst_capacity)
Requests are rejected with HTTP 429 when the bucket is empty
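The steps above can be sketched as a small Python class. This is a generic token bucket written for illustration, assuming a cost of one token per request. It is not the router's implementation, and the class and method names are hypothetical.

```python
import time
from typing import Optional


class TokenBucket:
    """Illustrative token bucket: refills continuously, capped at burst_capacity."""

    def __init__(self, requests_per_second: float, burst_capacity: float,
                 now: Optional[float] = None) -> None:
        self.requests_per_second = requests_per_second  # refill rate
        self.burst_capacity = burst_capacity            # maximum tokens
        self.tokens = burst_capacity                    # start with a full bucket
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token and return True, or return False if the bucket is empty."""
        if now is None:
            now = time.monotonic()
        # Add the tokens earned since the last call, never exceeding capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst_capacity,
                          self.tokens + elapsed * self.requests_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With requests_per_second: 10 and burst_capacity: 20, a client can send 20 requests back to back, after which it is limited to roughly 10 requests per second until the bucket has had time to refill.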

Per-Model Limits¶

In addition to the per-client, per-backend, per-API-key, and global dimensions, limits can target individual models. This is useful for protecting expensive models while leaving cheaper ones unconstrained:

rate_limiting:
  limits:
    per_model:
      gpt-4:
        requests_per_second: 2
        burst_capacity: 5
      gpt-3.5-turbo:
        requests_per_second: 20
        burst_capacity: 50

Response Headers¶

Success Response¶

When a request is within rate limits:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200

Rate Limited Response¶

When the rate limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "message": "Rate limit exceeded: burst limit of 20 requests per 5 seconds exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
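On the client side, a 429 response can be handled by waiting for the Retry-After interval before retrying. The following is a minimal sketch under stated assumptions, not part of the router: retry_delay and call_with_retry are hypothetical helpers, send stands in for whatever HTTP call the client makes, and Retry-After is assumed to carry a number of seconds, as in the example above.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# (status code, response headers, body) as returned by the caller's HTTP layer.
Response = Tuple[int, Mapping[str, str], str]


def retry_delay(status: int, headers: Mapping[str, str],
                default: float = 1.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    value = {k.lower(): v for k, v in headers.items()}.get("retry-after", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        # Header missing or not a plain number of seconds: use the fallback.
        return default


def call_with_retry(send: Callable[[], Response], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it is not rate limited or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        delay = retry_delay(response[0], response[1])
        if delay is None or attempt == max_attempts:
            return response
        sleep(delay)
    raise AssertionError("unreachable")
```

Capping the number of attempts keeps a client from retrying indefinitely against a limit it cannot satisfy.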

Monitoring¶

Metrics¶

Rate limiting activity is tracked in Prometheus metrics:

# Total rate limit checks performed
rate_limit_checks_total 10342

# Requests rejected, labeled by limit dimension
rate_limit_exceeded_total{limit_type="per_client"} 42

# Current token bucket levels and capacity
rate_limit_tokens_remaining{bucket_type="per_client",identifier="..."} 15
rate_limit_bucket_capacity{bucket_type="per_client"} 20

# Whitelisted and bypassed (API key) requests
rate_limit_whitelisted_requests_total 123
rate_limit_bypassed_requests_total 45
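One way to use these counters is to compute the share of checks that ended in a rejection. The sketch below parses a scrape of the metrics endpoint in the text format shown above. It is a simplified parser written for this example (it assumes no trailing timestamps on sample lines), and it assumes that every rejection is also counted in rate_limit_checks_total, which this page does not state explicitly.

```python
from collections import defaultdict
from typing import Dict


def sum_by_metric(exposition: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; labels may contain spaces.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] += float(value)
    return dict(totals)


def rejection_ratio(exposition: str) -> float:
    """Rejected requests divided by total checks (0.0 when no checks were made)."""
    totals = sum_by_metric(exposition)
    checks = totals.get("rate_limit_checks_total", 0.0)
    if checks == 0:
        return 0.0
    return totals.get("rate_limit_exceeded_total", 0.0) / checks
```

In a Prometheus deployment the same ratio would normally be computed with a query over rate() of the two counters rather than in application code.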

Logging¶

Rate limit events are logged with context:

{
  "level": "warn",
  "msg": "Rate limit exceeded",
  "client_id": "abc123...",
  "endpoint": "/v1/chat/completions",
  "limit_type": "burst",
  "limit_value": 20,
  "window": "5s"
}

Bypass Mechanisms¶

IP Whitelist¶

Whitelist trusted IP addresses or CIDR ranges:

rate_limiting:
  whitelist:
    - "192.168.1.0/24"      # Internal network
    - "10.0.0.1"            # Admin server
    - "172.16.0.0/16"       # Corporate network
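When checking why a client is or is not being rate limited, it can help to test an address against the whitelist entries offline. The helper below is a hypothetical debugging aid built on Python's standard ipaddress module; it treats a bare IP as a single-host range and is not taken from the router.

```python
import ipaddress
from typing import Iterable, Optional


def matching_entry(client_ip: str, whitelist: Iterable[str]) -> Optional[str]:
    """Return the first whitelist entry that contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for entry in whitelist:
        # strict=False accepts both CIDR ranges and single addresses.
        if addr in ipaddress.ip_network(entry, strict=False):
            return entry
    return None
```

For the list above, 192.168.1.77 matches "192.168.1.0/24", while 10.0.0.2 matches nothing because "10.0.0.1" covers a single host.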

API Key Bypass¶

Certain API keys can bypass all rate limits:

rate_limiting:
  bypass_keys:
    - "admin-key-123"
    - "monitoring-key-456"
    - "load-test-key-789"

Health Check Exemption¶

Health check and metrics endpoints are automatically exempt from rate limiting:

/health
/healthz
/metrics

Storage Backends¶

Memory Storage (Default)¶

In-memory storage is fast but not shared across instances:

rate_limiting:
  storage: memory

Pros:

- No external dependencies
- Low latency
- Simple setup

Cons:

- Not shared across router instances
- Lost on restart
- Limited to single-instance deployments

Redis Storage (Distributed)¶

Redis storage enables distributed rate limiting across multiple router instances:

rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    key_prefix: "continuum:ratelimit:"   # Prefix for rate limit keys (default)
    ttl: 3600                            # TTL for rate limit keys in seconds

Credentials can be embedded in the URL (redis://:password@host:6379), with environment variable substitution available via ${REDIS_PASSWORD}-style references.
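To make the ${REDIS_PASSWORD}-style substitution concrete, the sketch below expands such references in a string. It illustrates the idea only: expand_env is a hypothetical function, and raising an error for an unset variable is a choice made for this example, since this page does not say how the router treats unset variables.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with the corresponding environment variable."""
    variables = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"environment variable {name} is not set")
        return variables[name]

    return _REFERENCE.sub(replace, value)
```

With REDIS_PASSWORD set to s3cret, the string redis://:${REDIS_PASSWORD}@redis:6379 expands to redis://:s3cret@redis:6379. Passwords containing characters such as @ or / generally need percent-encoding to form a valid URL.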

Pros:

- Shared across all router instances
- Persistent across restarts
- Accurate global limits

Cons:

- Requires Redis infrastructure
- Slightly higher latency
- Additional operational complexity

Hot Reload Support¶

Rate limiting configuration supports hot reload for immediate updates:

# These settings update immediately without restart
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly
    per_backend:
      requests_per_second: 100   # ✅ Immediate: Backend limits update instantly

Best Practices¶

1. Start Conservative¶

Begin with stricter limits and relax them based on actual usage:

rate_limiting:
  limits:
    per_client:
      requests_per_second: 5      # Start low
      burst_capacity: 10

2. Monitor and Adjust¶

Use metrics to understand actual traffic patterns:

Track the rate_limit_exceeded_total metric
Identify legitimate vs. abusive traffic
Adjust limits based on data

3. Use Different Tiers¶

Implement tiered rate limits for different user classes with per-key overrides (requests per minute):

api_keys:
  api_keys:
    - key: "free-tier-key"
      rate_limit: 60

    - key: "pro-tier-key"
      rate_limit: 600

    - key: "enterprise-tier-key"
      rate_limit: 6000

4. Protect Expensive Models¶

Apply stricter limits to costly models with per_model limits:

rate_limiting:
  limits:
    per_model:
      gpt-4:
        requests_per_second: 2
        burst_capacity: 5

5. Use Redis for Production¶

For multi-instance deployments, use Redis:

rate_limiting:
  storage: redis
  redis:
    url: "redis://redis-cluster:6379"

Troubleshooting¶

Common Issues¶

Rate limits not applying¶

Symptom: Clients can exceed configured limits

Solutions:

1. Check if client is whitelisted
2. Verify rate_limiting.enabled is true
3. Check logs for rate limiting initialization
4. Ensure API key format is correct

Too many false positives¶

Symptom: Legitimate traffic being rate limited

Solutions:

1. Increase burst_capacity for bursty traffic
2. Review client identification (may be grouping multiple clients)
3. Consider using per-API-key limits
4. Add legitimate IPs to whitelist

Redis connection issues¶

Symptom: Rate limiting not working with Redis storage

Solutions:

1. Verify Redis connectivity
2. Check Redis authentication
3. Review connection pool settings
4. Monitor Redis performance

Debug Logging¶

Enable debug logging to see rate limiting decisions in the logs:

logging:
  level: debug

Rate Limiting Architecture - Implementation details and design decisions
Configuration Guide - Complete configuration reference
Admin API - Runtime configuration management
Metrics - Monitoring and observability
Error Handling - Error codes and retry strategies