Configuration Guide

This guide provides comprehensive documentation for configuring Continuum Router. The router supports multiple configuration methods with a clear priority system to provide maximum flexibility for different deployment scenarios.

Configuration Methods

Continuum Router supports three configuration methods:

  1. Configuration File (YAML) - Recommended for production
  2. Environment Variables - Ideal for containerized deployments
  3. Command Line Arguments - Useful for testing and overrides

Configuration Discovery

The router automatically searches for configuration files in these locations (in order):

  1. Path specified by --config flag
  2. ./config.yaml (current directory)
  3. ./config.yml
  4. /etc/continuum-router/config.yaml
  5. /etc/continuum-router/config.yml
  6. ~/.config/continuum-router/config.yaml
  7. ~/.config/continuum-router/config.yml

Configuration Priority

Configuration is applied in the following priority order (highest to lowest):

  1. Command-line arguments (highest priority)
  2. Environment variables
  3. Configuration file
  4. Default values (lowest priority)

This allows you to:

  • Set base configuration in a file
  • Override specific settings via environment variables in containers
  • Make temporary adjustments using command-line arguments
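
For example, the bind address can be set at every level; under the rules above, the command-line value wins. A minimal sketch using the CONTINUUM_BIND_ADDRESS variable and --bind flag documented later in this guide:

# config.yaml sets the base value
server:
  bind_address: "0.0.0.0:8080"

# CONTINUUM_BIND_ADDRESS="0.0.0.0:9000" in the environment overrides the file,
# and --bind "0.0.0.0:9090" on the command line overrides both,
# so the router listens on 0.0.0.0:9090.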

Configuration File Format

Complete Configuration Example

# Continuum Router Configuration
# This example shows all available configuration options with their default values

# Server configuration
server:
  # bind_address accepts a single string or an array of addresses
  # TCP format: "host:port", Unix socket format: "unix:/path/to/socket"
  bind_address: "0.0.0.0:8080"          # Single address (backward compatible)
  # bind_address:                        # Or multiple addresses:
  #   - "0.0.0.0:8080"                   #   TCP on all interfaces
  #   - "unix:/var/run/router.sock"     #   Unix socket (Unix/Linux/macOS only)
  # socket_mode: 0o660                   # Optional: Unix socket file permissions
  workers: 4                             # Number of worker threads (0 = auto-detect)
  connection_pool_size: 100              # Max idle connections per backend

# Model metadata configuration (optional)
model_metadata_file: "model-metadata.yaml"  # Path to external model metadata file

# Backend configuration
backends:
  # Native OpenAI API with built-in configuration
    - name: "openai"
    type: openai                         # Use native OpenAI backend
    api_key: "${CONTINUUM_OPENAI_API_KEY}"  # Loaded from environment
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"    # Optional organization ID
    weight: 3
    models:                              # Specify which models to use
      - gpt-4o
      - gpt-4o-mini
      - o3-mini
      - text-embedding-3-large
    retry_override:                      # Backend-specific retry settings (optional)
      max_attempts: 5
      base_delay: "200ms"
      max_delay: "30s"
      exponential_backoff: true
      jitter: true

  # Generic OpenAI-compatible backend with custom metadata
    - name: "openai-compatible"
    url: "https://custom-llm.example.com"
    weight: 1
    models:
      - "gpt-4"
      - "gpt-3.5-turbo"
    model_configs:                       # Enhanced model configuration with metadata
      - id: "gpt-4"
        aliases:                         # Alternative IDs that share this metadata (optional)
          - "gpt-4-0125-preview"
          - "gpt-4-turbo-preview"
        metadata:
          display_name: "GPT-4"
          summary: "Most capable GPT-4 model for complex tasks"
          capabilities: ["text", "image", "function_calling"]
          knowledge_cutoff: "2024-04"
          pricing:
            input_tokens: 0.03
            output_tokens: 0.06
          limits:
            context_window: 128000
            max_output: 4096

  # Ollama local server with automatic URL detection
    - name: "local-ollama"
    type: ollama                         # Defaults to http://localhost:11434
    weight: 2
    models:
      - "llama2"
      - "mistral"
      - "codellama"

  # vLLM server
    - name: "vllm-server"
    type: vllm
    url: "http://localhost:8000"
    weight: 1
    # Models will be discovered automatically if not specified
    # Models with namespace prefixes (e.g., "custom/gpt-4") will automatically
    # match metadata for base names (e.g., "gpt-4")

  # Google Gemini API (native backend)
    - name: "gemini"
    type: gemini                           # Use native Gemini backend
    api_key: "${CONTINUUM_GEMINI_API_KEY}" # Loaded from environment
    weight: 2
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

# Health monitoring configuration
health_checks:
  enabled: true                          # Enable/disable health checks
  interval: "30s"                        # How often to check backend health
  timeout: "10s"                         # Timeout for health check requests
  unhealthy_threshold: 3                 # Failures before marking unhealthy
  healthy_threshold: 2                   # Successes before marking healthy
  endpoint: "/v1/models"                 # Endpoint used for health checks

# Request handling and timeout configuration
timeouts:
  connection: "10s"                      # TCP connection establishment timeout
  request:
    standard:                            # Non-streaming requests
      first_byte: "30s"                  # Time to receive first byte
      total: "180s"                      # Total request timeout (3 minutes)
    streaming:                           # Streaming (SSE) requests
      first_byte: "60s"                  # Time to first SSE chunk
      chunk_interval: "30s"              # Max time between chunks
      total: "600s"                      # Total streaming timeout (10 minutes)
    image_generation:                    # Image generation requests (DALL-E, etc.)
      first_byte: "60s"                  # Time to receive first byte
      total: "180s"                      # Total timeout (3 minutes default)
    model_overrides:                     # Model-specific timeout overrides
      gpt-5-latest:
        streaming:
          total: "1200s"                 # 20 minutes for GPT-5
      gpt-4o:
        streaming:
          total: "900s"                  # 15 minutes for GPT-4o
  health_check:
    timeout: "5s"                        # Health check timeout
    interval: "30s"                      # Health check interval

request:
  max_retries: 3                         # Maximum retry attempts for requests
  retry_delay: "1s"                      # Initial delay between retries

# Global retry and resilience configuration
retry:
  max_attempts: 3                        # Maximum retry attempts
  base_delay: "100ms"                    # Base delay between retries
  max_delay: "30s"                       # Maximum delay between retries
  exponential_backoff: true              # Use exponential backoff
  jitter: true                          # Add random jitter to delays

# Caching and optimization configuration
cache:
  model_cache_ttl: "300s"               # Cache model lists for 5 minutes
  deduplication_ttl: "60s"              # Deduplicate requests for 1 minute
  enable_deduplication: true            # Enable request deduplication

# Logging configuration
logging:
  level: "info"                         # Log level: trace, debug, info, warn, error
  format: "json"                        # Log format: json, pretty
  enable_colors: false                  # Enable colored output (for pretty format)

# Files API configuration
files:
  enabled: true                         # Enable/disable Files API endpoints
  max_file_size: 536870912              # Maximum file size in bytes (default: 512MB)
  storage_path: "./data/files"          # Storage path for uploaded files (supports ~)
  retention_days: 0                     # File retention in days (0 = keep forever)
  metadata_storage: persistent          # Metadata backend: "memory" or "persistent" (default)
  cleanup_orphans_on_startup: false     # Auto-cleanup orphaned files on startup

  # Authentication and authorization
  auth:
    method: api_key                     # "none" or "api_key" (default)
    required_scope: files               # API key scope required for access
    enforce_ownership: true             # Users can only access their own files
    admin_can_access_all: true          # Admin scope grants access to all files

# Load balancing configuration
load_balancer:
  strategy: "round_robin"               # Strategy: round_robin, weighted, random
  health_aware: true                    # Only route to healthy backends

# Distributed tracing configuration
tracing:
  enabled: true                         # Enable/disable distributed tracing
  w3c_trace_context: true               # Support W3C Trace Context (traceparent header)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID
    request_id: "X-Request-ID"          # Header name for request ID
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID

# Circuit breaker configuration (future feature)
circuit_breaker:
  enabled: false                        # Enable circuit breaker
  failure_threshold: 5                  # Failures to open circuit
  recovery_timeout: "60s"               # Time before attempting recovery
  half_open_retries: 3                  # Retries in half-open state

# Rate limiting configuration (future feature)
rate_limiting:
  enabled: false                        # Enable rate limiting
  requests_per_second: 100              # Global requests per second
  burst_size: 200                       # Burst capacity

# Metrics and monitoring configuration (future feature)
metrics:
  enabled: false                        # Enable metrics collection
  endpoint: "/metrics"                  # Metrics endpoint path
  include_labels: true                  # Include detailed labels

Minimal Configuration

# Minimal configuration - other settings will use defaults
server:
  bind_address: "0.0.0.0:8080"

backends:
    - name: "ollama"
    url: "http://localhost:11434"
    - name: "lm-studio"  
    url: "http://localhost:1234"

Environment Variables

All configuration options can be overridden using environment variables with the CONTINUUM_ prefix:

Server Configuration

Variable Type Default Description
CONTINUUM_BIND_ADDRESS string "0.0.0.0:8080" Server bind address
CONTINUUM_WORKERS integer 4 Number of worker threads
CONTINUUM_CONNECTION_POOL_SIZE integer 100 HTTP connection pool size

Backend Configuration

Variable Type Default Description
CONTINUUM_BACKEND_URLS string - Comma-separated backend URLs
CONTINUUM_BACKEND_WEIGHTS string - Comma-separated weights (must match URLs)

Health Check Configuration

Variable Type Default Description
CONTINUUM_HEALTH_CHECKS_ENABLED boolean true Enable health checks
CONTINUUM_HEALTH_CHECK_INTERVAL string "30s" Health check interval
CONTINUUM_HEALTH_CHECK_TIMEOUT string "10s" Health check timeout
CONTINUUM_UNHEALTHY_THRESHOLD integer 3 Failures before unhealthy
CONTINUUM_HEALTHY_THRESHOLD integer 2 Successes before healthy

Request Configuration

Variable Type Default Description
CONTINUUM_REQUEST_TIMEOUT string "300s" Maximum request timeout
CONTINUUM_MAX_RETRIES integer 3 Maximum retry attempts
CONTINUUM_RETRY_DELAY string "1s" Initial retry delay

Logging Configuration

Variable Type Default Description
CONTINUUM_LOG_LEVEL string "info" Log level
CONTINUUM_LOG_FORMAT string "json" Log format
CONTINUUM_LOG_COLORS boolean false Enable colored output
RUST_LOG string - Rust-specific logging configuration

Cache Configuration

Variable Type Default Description
CONTINUUM_MODEL_CACHE_TTL string "300s" Model cache TTL
CONTINUUM_DEDUPLICATION_TTL string "60s" Deduplication TTL
CONTINUUM_ENABLE_DEDUPLICATION boolean true Enable deduplication

Files API Configuration

Variable Type Default Description
CONTINUUM_FILES_ENABLED boolean true Enable/disable Files API
CONTINUUM_FILES_MAX_SIZE integer 536870912 Maximum file size in bytes (512MB)
CONTINUUM_FILES_STORAGE_PATH string "./data/files" Storage path for uploaded files
CONTINUUM_FILES_RETENTION_DAYS integer 0 File retention in days (0 = forever)
CONTINUUM_FILES_METADATA_STORAGE string "persistent" Metadata backend: "memory" or "persistent"
CONTINUUM_FILES_CLEANUP_ORPHANS boolean false Auto-cleanup orphaned files on startup
CONTINUUM_FILES_AUTH_METHOD string "api_key" Authentication method: "none" or "api_key"
CONTINUUM_FILES_AUTH_SCOPE string "files" Required API key scope for Files API access
CONTINUUM_FILES_ENFORCE_OWNERSHIP boolean true Users can only access their own files
CONTINUUM_FILES_ADMIN_ACCESS_ALL boolean true Admin scope grants access to all files
CONTINUUM_DEV_MODE boolean false Enable development API keys (DO NOT use in production)

API Key Management Configuration

Variable Type Default Description
CONTINUUM_API_KEY string - Single API key for simple deployments
CONTINUUM_API_KEY_SCOPES string "read,write" Comma-separated scopes for the API key
CONTINUUM_API_KEY_USER_ID string "admin" User ID associated with the API key
CONTINUUM_API_KEY_ORG_ID string "default" Organization ID associated with the API key
CONTINUUM_DEV_MODE boolean false Enable development API keys (DO NOT use in production)

Example Environment Configuration

# Basic configuration
export CONTINUUM_BIND_ADDRESS="0.0.0.0:9000"
export CONTINUUM_BACKEND_URLS="http://localhost:11434,http://localhost:1234"
export CONTINUUM_LOG_LEVEL="debug"

# Advanced configuration
export CONTINUUM_CONNECTION_POOL_SIZE="200"
export CONTINUUM_HEALTH_CHECK_INTERVAL="60s"
export CONTINUUM_MODEL_CACHE_TTL="600s"
export CONTINUUM_ENABLE_DEDUPLICATION="true"

# Start the router
continuum-router

Command Line Arguments

Command-line arguments provide the highest priority configuration method and are useful for testing and temporary overrides.

Core Options

continuum-router --help

Argument Type Description
-c, --config <FILE> path Configuration file path
--generate-config flag Generate sample config and exit
--model-metadata <FILE> path Path to model metadata YAML file (overrides config)

Backend Configuration

Argument Type Description
--backends <URLs> string Comma-separated backend URLs
--backend-url <URL> string Single backend URL (deprecated)

Server Configuration

Argument Type Description
--bind <ADDRESS> string Server bind address
--connection-pool-size <SIZE> integer HTTP connection pool size

Health Check Configuration

Argument Type Description
--disable-health-checks flag Disable health monitoring
--health-check-interval <SECONDS> integer Health check interval
--health-check-timeout <SECONDS> integer Health check timeout
--unhealthy-threshold <COUNT> integer Failures before unhealthy
--healthy-threshold <COUNT> integer Successes before healthy

Example CLI Usage

# Use config file with overrides
continuum-router --config config.yaml --bind "0.0.0.0:9000"

# Override backends temporarily
continuum-router --config config.yaml --backends "http://localhost:11434"

# Use custom model metadata file
continuum-router --config config.yaml --model-metadata /path/to/custom-metadata.yaml

# Use model metadata with tilde expansion
continuum-router --model-metadata ~/configs/model-metadata.yaml

# Adjust health check settings for testing
continuum-router --config config.yaml --health-check-interval 10

# Generate sample configuration
continuum-router --generate-config > my-config.yaml

Configuration Sections

Server Section

Controls the HTTP server behavior:

server:
  bind_address: "0.0.0.0:8080"    # Host and port to bind
  workers: 4                       # Worker threads (0 = auto)
  connection_pool_size: 100        # HTTP connection pool size

Multiple Bind Addresses and Unix Sockets

The server supports binding to multiple addresses simultaneously, including Unix domain sockets (on Unix-like systems). This enables flexible deployment scenarios such as:

  • Listening on both IPv4 and IPv6 addresses
  • Exposing a TCP port for external clients while using a Unix socket for local services
  • Running behind a reverse proxy via Unix socket for better security

Single Address (Backward Compatible):

server:
  bind_address: "0.0.0.0:8080"

Multiple Addresses:

server:
  bind_address:
    - "127.0.0.1:8080"           # IPv4 localhost
    - "[::1]:8080"               # IPv6 localhost
    - "0.0.0.0:9090"             # All interfaces on port 9090

Unix Socket Binding (Unix/Linux/macOS only):

server:
  bind_address:
    - "0.0.0.0:8080"             # TCP for external access
    - "unix:/var/run/continuum-router.sock"  # Unix socket for local services
  socket_mode: 0o660              # Optional: file permissions for Unix sockets (octal)

Configuration Options:

Option Type Default Description
bind_address string or array "0.0.0.0:8080" Address(es) to bind. TCP format: host:port. Unix socket format: unix:/path/to/socket
socket_mode integer (octal) null File permissions for Unix sockets (e.g., 0o660 for owner/group read-write)

Unix Socket Notes:

  • Unix socket addresses must start with unix: prefix
  • Existing socket files are automatically removed before binding
  • Socket files are cleaned up on graceful shutdown
  • On non-Unix platforms, unix: addresses log a warning and are skipped
  • Unix socket connections bypass IP-based authentication checks (client IP reported as "unix")

Nginx Reverse Proxy Example:

upstream continuum {
    server unix:/var/run/continuum-router.sock;
}

server {
    listen 443 ssl;
    location /v1/ {
        proxy_pass http://continuum;
    }
}

Performance Tuning:

  • workers: Set to 0 for auto-detection, or match CPU cores
  • connection_pool_size: Increase for high-load scenarios (200-500)
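
For illustration, a high-load deployment might combine these along the following lines (the numbers are examples, not recommendations):

server:
  workers: 0                       # Auto-detect worker threads from CPU cores
  connection_pool_size: 300        # Larger idle-connection pool for high request volume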

CORS Configuration

CORS (Cross-Origin Resource Sharing) allows the router to accept requests from web browsers running on different origins. This is essential for embedding continuum-router in:

  • Tauri apps: WebView using origins like tauri://localhost
  • Electron apps: Custom protocols
  • Separate web frontends: Development servers on different ports

server:
  bind_address: "0.0.0.0:8080"
  cors:
    enabled: true
    allow_origins:
      - "tauri://localhost"
      - "http://localhost:*"        # Wildcard port matching
      - "https://example.com"
    allow_methods:
      - "GET"
      - "POST"
      - "PUT"
      - "DELETE"
      - "OPTIONS"
      - "PATCH"
    allow_headers:
      - "Content-Type"
      - "Authorization"
      - "X-Request-ID"
      - "X-Trace-ID"
    expose_headers:
      - "X-Request-ID"
      - "X-Fallback-Used"
    allow_credentials: false
    max_age: 3600                   # Preflight cache duration in seconds

CORS Configuration Options:

Option Type Default Description
enabled boolean false Enable/disable CORS middleware
allow_origins array [] Allowed origins (supports * for any, port wildcards like http://localhost:*)
allow_methods array ["GET", "POST", "PUT", "DELETE", "OPTIONS", "PATCH"] Allowed HTTP methods
allow_headers array ["Content-Type", "Authorization", "X-Request-ID", "X-Trace-ID"] Allowed request headers
expose_headers array [] Headers exposed to the client JavaScript
allow_credentials boolean false Allow cookies and authorization headers
max_age integer 3600 Preflight response cache duration in seconds

Origin Pattern Matching:

Pattern Example Description
* * Matches any origin (not compatible with allow_credentials: true)
Exact URL https://example.com Exact match
Custom scheme tauri://localhost Custom protocols (Tauri, Electron)
Port wildcard http://localhost:* Matches any port on localhost

Security Considerations:

  • Using * for origins allows any website to make requests - only use for public APIs
  • When allow_credentials is true, you cannot use * for origins - specify exact origins
  • For development, use port wildcards like http://localhost:* for flexibility
  • In production, always specify exact origins for security

Hot Reload: CORS configuration supports immediate hot reload - changes apply to new requests instantly without server restart.

Backends Section

Defines the LLM backends to route requests to:

backends:
    - name: "unique-identifier"        # Must be unique across all backends
    type: "generic"                  # Backend type (optional, defaults to "generic")
    url: "http://backend:port"       # Base URL for the backend
    weight: 1                        # Load balancing weight (1-100)
    api_key: "${API_KEY}"            # API key (optional, supports env var references)
    org_id: "${ORG_ID}"              # Organization ID (optional, for OpenAI)
    models: ["model1", "model2"]     # Optional: explicit model list
    retry_override:                  # Optional: backend-specific retry settings
      max_attempts: 5
      base_delay: "200ms"

Starting Without Backends

The router can start with an empty backends list (backends: []), which is useful for:

  • Infrastructure bootstrapping: Start the router first, then add backends dynamically via the Admin API
  • Container orchestration: Router container can be ready before backend services
  • Development workflows: Test admin endpoints before backends are provisioned
  • Gradual rollout: Start with zero backends and add them progressively

When running with no backends:

  • /v1/models returns {"object": "list", "data": []}
  • /v1/chat/completions and other routing endpoints return 503 "No backends available"
  • /health returns healthy status (the router itself is operational)
  • Backends can be added via POST /admin/backends

Example minimal configuration for dynamic backend management:

server:
  bind_address: "0.0.0.0:8080"

backends: []  # Start with no backends - add via Admin API later

admin:
  auth:
    method: bearer
    token: "${ADMIN_TOKEN}"

Backend Types Supported:

Type Description Default URL
generic OpenAI-compatible API (default) Must be specified
openai Native OpenAI API with built-in configuration https://api.openai.com/v1
gemini Google Gemini API (OpenAI-compatible endpoint) https://generativelanguage.googleapis.com/v1beta/openai
azure Azure OpenAI Service Must be specified
vllm vLLM server Must be specified
ollama Ollama local server http://localhost:11434
llamacpp llama.cpp llama-server (GGUF models) http://localhost:8080
anthropic Anthropic Claude API (native, with request/response translation) https://api.anthropic.com

Native OpenAI Backend

When using type: openai, the router provides:

  • Default URL: https://api.openai.com/v1 (can be overridden for proxies)
  • Built-in model metadata: Automatic pricing, context windows, and capabilities
  • Environment variable support: Automatically loads from CONTINUUM_OPENAI_API_KEY and CONTINUUM_OPENAI_ORG_ID

Minimal OpenAI configuration:

backends:
    - name: "openai"
    type: openai
    models:
      - gpt-4o
      - gpt-4o-mini
      - o3-mini

Full OpenAI configuration with explicit API key:

backends:
    - name: "openai-primary"
    type: openai
    api_key: "${CONTINUUM_OPENAI_API_KEY}"
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"     # Optional
    models:
      - gpt-4o
      - gpt-4o-mini
      - o1
      - o1-mini
      - o3-mini
      - text-embedding-3-large

Using OpenAI with a proxy:

backends:
    - name: "openai-proxy"
    type: openai
    url: "https://my-proxy.example.com/v1"   # Override default URL
    api_key: "${PROXY_API_KEY}"
    models:
      - gpt-4o

Environment Variables for OpenAI

Variable Description
CONTINUUM_OPENAI_API_KEY OpenAI API key (automatically loaded for type: openai backends)
CONTINUUM_OPENAI_ORG_ID OpenAI Organization ID (optional)

Model Auto-Discovery:

When models is not specified or is empty, backends automatically discover available models from their /v1/models API endpoint during initialization. This feature reduces configuration maintenance and ensures all backend-reported models are routable.

Backend Type Auto-Discovery Support Fallback Models
openai ✅ Yes gpt-4o, gpt-4o-mini, o3-mini
gemini ✅ Yes gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash
vllm ✅ Yes vicuna-7b-v1.5, llama-2-7b-chat, mistral-7b-instruct
ollama ✅ Yes Uses vLLM discovery mechanism
llamacpp ✅ Yes Auto-discovers from /v1/models endpoint
anthropic ❌ No (no API) Hardcoded Claude models
generic ❌ No All models supported (supports_model() returns true)

Discovery Behavior:

  • Timeout: 10-second timeout prevents blocking startup
  • Fallback: If discovery fails (timeout, network error, invalid response), fallback models are used
  • Logging: Discovered models are logged at INFO level; fallback usage logged at WARN level

Model Resolution Priority:

  1. Explicit models list from config (highest priority)
  2. Models from model_configs field
  3. Auto-discovered models from backend API
  4. Hardcoded fallback models (lowest priority)

  • Explicit model lists improve startup time and reduce backend queries (see the example below)
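
For example, the same vLLM backend could rely on auto-discovery or pin an explicit model list (URLs and model names below are illustrative):

backends:
  - name: "vllm-auto"
    type: vllm
    url: "http://localhost:8000"
    # No models listed: the router discovers them from /v1/models at startup
    # (10-second timeout, hardcoded fallback models on failure)

  - name: "vllm-pinned"
    type: vllm
    url: "http://localhost:8001"
    models:                        # Explicit list has highest priority; no discovery query
      - "mistral-7b-instruct"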

Native Gemini Backend

When using type: gemini, the router provides:

  • Default URL: https://generativelanguage.googleapis.com/v1beta/openai (OpenAI-compatible endpoint)
  • Built-in model metadata: Automatic context windows and capabilities for Gemini models
  • Environment variable support: Automatically loads from CONTINUUM_GEMINI_API_KEY
  • Extended streaming timeout: 300s timeout for thinking models (gemini-2.5-pro, gemini-3-pro)
  • Automatic max_tokens adjustment: For thinking models (see below)

Minimal Gemini configuration:

backends:
    - name: "gemini"
    type: gemini
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

Full Gemini configuration with API Key:

backends:
    - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    weight: 2
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

Gemini Authentication Methods

The Gemini backend supports two authentication methods:

API Key Authentication (Default)

The simplest authentication method using a Google AI Studio API key:

backends:
    - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    models:
      - gemini-2.5-pro

Service Account Authentication

For enterprise environments and Google Cloud Platform (GCP) deployments, you can use Service Account authentication with automatic OAuth2 token management:

backends:
    - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "/path/to/service-account.json"
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash

Using environment variable for key file path:

backends:
    - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "${GOOGLE_APPLICATION_CREDENTIALS}"
    models:
      - gemini-2.5-pro

Service Account Authentication Features:

Feature Description
Automatic Token Refresh OAuth2 tokens are automatically refreshed 5 minutes before expiration
Token Caching Tokens are cached in memory to minimize authentication overhead
Thread-Safe Concurrent requests safely share token refresh operations
Environment Variable Expansion Key file paths support ${VAR} and ~ expansion

Creating a Service Account Key:

  1. Go to Google Cloud Console
  2. Navigate to IAM & Admin > Service Accounts
  3. Create a new service account or select an existing one
  4. Click Keys > Add Key > Create new key
  5. Choose JSON format and download the key file
  6. Store the key file securely and reference it in your configuration

Required Permissions:

The service account needs the following roles for Gemini API access:

  • roles/aiplatform.user - For Vertex AI Gemini endpoints
  • Or appropriate Google AI Studio permissions for generativelanguage.googleapis.com

Authentication Priority

When multiple authentication methods are configured:

Priority Method Condition
1 (Highest) auth block If auth.type is specified
2 api_key field If no auth block is present
3 Environment variable Falls back to CONTINUUM_GEMINI_API_KEY

If both api_key and auth are specified, the auth block takes precedence and a warning is logged.
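
For example, in the (not recommended) case below, the auth block is used and the api_key field is ignored with a warning:

backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"        # Ignored: auth block takes precedence
    auth:
      type: service_account
      key_file: "${GOOGLE_APPLICATION_CREDENTIALS}"
    models:
      - gemini-2.5-pro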

Gemini Thinking Models: Automatic max_tokens Adjustment

Gemini "thinking" models (gemini-2.5-pro, gemini-3-pro, and models with -pro-preview suffix) perform extended reasoning before generating responses. To prevent response truncation, the router automatically adjusts max_tokens:

Condition Behavior
max_tokens not specified Automatically set to 16384
max_tokens < 4096 Automatically increased to 16384
max_tokens >= 4096 Client value preserved

This ensures thinking models can generate complete responses without truncation due to low default values from client libraries.

Environment Variables for Gemini

Variable Description
CONTINUUM_GEMINI_API_KEY Google Gemini API key (automatically loaded for type: gemini backends)
GOOGLE_APPLICATION_CREDENTIALS Path to service account JSON key file (standard GCP environment variable)

Native Anthropic Backend

When using type: anthropic, the router provides:

  • Default URL: https://api.anthropic.com (can be overridden for proxies)
  • Native API translation: Automatically converts OpenAI format requests to Anthropic Messages API format and vice versa
  • Anthropic-specific headers: Automatically adds x-api-key and anthropic-version headers
  • Environment variable support: Automatically loads from CONTINUUM_ANTHROPIC_API_KEY
  • Extended streaming timeout: 600s timeout for extended thinking models (Claude Opus, Sonnet 4)

Minimal Anthropic configuration:

backends:
    - name: "anthropic"
    type: anthropic
    models:
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022

Full Anthropic configuration:

backends:
    - name: "anthropic"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    weight: 2
    models:
      - claude-opus-4-5-20250514
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022

Anthropic API Translation

The router automatically handles the translation between OpenAI and Anthropic API formats:

OpenAI Format Anthropic Format
messages array with role: "system" Separate system parameter
Authorization: Bearer <key> x-api-key: <key> header
Optional max_tokens Required max_tokens (auto-filled if missing)
choices[0].message.content content[0].text
finish_reason: "stop" stop_reason: "end_turn"
usage.prompt_tokens usage.input_tokens

Example Request Translation:

OpenAI format (incoming from client):

{
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}

Anthropic format (sent to API):

{
  "model": "claude-sonnet-4-20250514",
  "system": "You are helpful.",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}

Anthropic Native API Endpoints

In addition to routing OpenAI-format requests to Anthropic backends, the router also provides native Anthropic API endpoints:

Endpoint Description
POST /anthropic/v1/messages Native Anthropic Messages API
POST /anthropic/v1/messages/count_tokens Token counting with tiered backend support
GET /anthropic/v1/models Model listing in Anthropic format

These endpoints allow clients that use Anthropic's native API format (such as Claude Code) to connect directly without any request/response transformation overhead.

Claude Code Compatibility

The Anthropic Native API endpoints include full compatibility with Claude Code and other advanced Anthropic API clients:

Prompt Caching Support:

The router preserves cache_control fields throughout the request/response pipeline:

  • System prompt text blocks
  • User message content blocks (text, image, document)
  • Tool definitions
  • Tool use and tool result blocks

Header Forwarding:

Header Behavior
anthropic-version Forwarded to native Anthropic backends
anthropic-beta Forwarded to enable beta features (e.g., prompt-caching-2024-07-31, interleaved-thinking-2025-05-14)
x-request-id Forwarded for request tracing

Cache Usage Reporting:

Streaming responses from native Anthropic backends include cache usage information:

{
  "usage": {
    "input_tokens": 2159,
    "cache_creation_input_tokens": 2048,
    "cache_read_input_tokens": 0
  }
}

Anthropic Extended Thinking Models

Models supporting extended thinking (Claude Opus, Sonnet 4) may require longer response times. The router automatically:

  • Sets a higher default max_tokens (16384) for thinking models
  • Uses an extended streaming timeout (600s) for these models

OpenAI ↔ Claude Reasoning Parameter Conversion

The router automatically converts between OpenAI's reasoning parameters and Claude's thinking parameter, enabling seamless cross-provider reasoning requests.

Supported OpenAI Formats:

Format API Example
reasoning_effort (flat) Chat Completions API "reasoning_effort": "high"
reasoning.effort (nested) Responses API "reasoning": {"effort": "high"}

When both formats are present, reasoning_effort (flat) takes precedence.

Effort Level to Budget Tokens Mapping:

Effort Level Claude thinking.budget_tokens
none (thinking disabled)
minimal 1,024
low 4,096
medium 10,240
high 32,768

Example Request - Chat Completions API (flat format):

// Client sends OpenAI Chat Completions API request
{
  "model": "claude-sonnet-4-5-20250929",
  "reasoning_effort": "high",
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-5-20250929",
  "thinking": {"type": "enabled", "budget_tokens": 32768},
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

Example Request - Responses API (nested format):

// Client sends OpenAI Responses API request
{
  "model": "claude-sonnet-4-5-20250929",
  "reasoning": {"effort": "medium"},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-5-20250929",
  "thinking": {"type": "enabled", "budget_tokens": 10240},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

Response with Reasoning Content:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The final answer is...",
      "reasoning_content": "Let me analyze this step by step..."
    }
  }]
}

Important Notes:

  • If thinking parameter is explicitly provided, it takes precedence over reasoning_effort and reasoning.effort
  • reasoning_effort (flat) takes precedence over reasoning.effort (nested) when both are present
  • Only models supporting extended thinking (Opus 4.x, Sonnet 4.x) will have reasoning enabled
  • When reasoning is enabled, the temperature parameter is automatically removed (Claude API requirement)
  • For streaming responses, thinking content is returned as reasoning_content delta events

Environment Variables for Anthropic

Variable Description
CONTINUUM_ANTHROPIC_API_KEY Anthropic API key (automatically loaded for type: anthropic backends)

Native llama.cpp Backend

When using type: llamacpp, the router provides native support for llama.cpp llama-server:

  • Default URL: http://localhost:8080 (llama-server default port)
  • Health Check: Uses /health endpoint (with fallback to /v1/models)
  • Model Discovery: Parses llama-server's hybrid /v1/models response format
  • Rich Metadata: Extracts context window, parameter count, and model size from response

Minimal llama.cpp configuration:

backends:
  - name: "local-llama"
    type: llamacpp
    # No URL needed if using default http://localhost:8080
    # No API key required for local server

Full llama.cpp configuration:

backends:
  - name: "local-llama"
    type: llamacpp
    url: "http://192.168.1.100:8080"  # Custom URL if needed
    weight: 2
    # Models are auto-discovered from /v1/models endpoint

llama.cpp Features

Feature Description
GGUF Models Native support for GGUF quantized models
Local Inference No cloud API dependencies
Hardware Support CPU, NVIDIA, AMD, Apple Silicon
Streaming Full SSE streaming support
Embeddings Supports /v1/embeddings endpoint
Tool Calling Detection Auto-detects tool calling support via /props endpoint

Tool Calling Auto-Detection

The router automatically detects tool calling capability for llama.cpp backends by querying the /props endpoint during model discovery. This enables automatic function calling support without manual configuration.

How it works:

  1. When a llama.cpp backend is discovered, the router fetches the /props endpoint
  2. The chat_template field is analyzed using precise Jinja2 pattern matching to detect tool-related syntax
  3. If tool calling patterns are detected, the model's function_calling capability is automatically enabled
  4. Detection results are stored for reference (including a hash of the chat template)

Detection Patterns:

The router uses precise pattern matching to reduce false positives:

  • Role-based patterns: message['role'] == 'tool', message.role == "tool"
  • Tool iteration: for tool in tools, for function in functions
  • Tool calls access: .tool_calls, ['tool_calls'], message.tool_call
  • Jinja2 blocks with tool keywords: {% if tools %}, {% for tool_call in ... %}

Example /props response analyzed:

{
  "chat_template": "{% for message in messages %}{% if message['role'] == 'tool' %}...",
  "default_generation_settings": { ... },
  "total_slots": 1
}

Fallback Behavior:

  • If /props is unavailable: Tool calling is assumed to be supported (optimistic fallback for modern llama.cpp versions)
  • If /props returns an error: Tool calling is assumed to be supported (ensures compatibility with newer models)
  • If chat template exceeds 64KB: Detection is skipped and defaults to supported
  • Detection is case-insensitive for maximum compatibility
  • Results are merged with any existing model metadata from model-metadata.yaml
  • Detected capabilities appear in the features field of the /v1/models/{model_id} response

Model Metadata Extraction

The router extracts rich metadata from llama-server responses:

Field Source Description
Context Window meta.n_ctx_train Training context window size
Parameter Count meta.n_params Model parameters (e.g., "4B")
Model Size meta.size File size in bytes
Capabilities models[].capabilities Model capabilities array

Starting llama-server

# Basic startup
./llama-server -m model.gguf --port 8080

# With GPU layers
./llama-server -m model.gguf --port 8080 -ngl 35

# With custom context size
./llama-server -m model.gguf --port 8080 --ctx-size 8192

Auto-Detection of llama.cpp Backends

When a backend is added without a type specified (defaults to generic), the router automatically probes the /v1/models endpoint to detect the backend type. llama.cpp backends are identified by:

  1. owned_by: "llamacpp" in the response
  2. Presence of llama.cpp-specific metadata fields (n_ctx_train, n_params, vocab_type)
  3. Hybrid response format with both models[] and data[] arrays

This auto-detection works for:

  • Hot-reload configuration changes
  • Backends added via Admin API without explicit type
  • Configuration files with type: generic or no type specified

Example: Auto-detected backend via Admin API:

# Add backend without specifying type - auto-detects llama.cpp
curl -X POST http://localhost:8080/admin/backends \
  -H "Content-Type: application/json" \
  -d '{
    "name": "local-llm",
    "url": "http://localhost:8080"
  }'

Unix Domain Socket Backends

Continuum Router supports Unix Domain Sockets (UDS) as an alternative transport to TCP for local LLM backends. Unix sockets provide:

  • Enhanced Security: No TCP port exposure - communication happens through the file system
  • Lower Latency: No network stack overhead for local communication
  • Better Performance: Reduced context switching and memory copies
  • Simple Access Control: Uses standard Unix file permissions

URL Format:

unix:///path/to/socket.sock

Platform Support:

Platform Support
Linux Full support via AF_UNIX
macOS Full support via AF_UNIX
Windows Not currently supported (planned for future releases)

Configuration Examples:

backends:
  # llama-server with Unix socket
  - name: "llama-socket"
    type: llamacpp
    url: "unix:///var/run/llama-server.sock"
    weight: 2
    models:
      - llama-3.2-3b
      - qwen3-4b

  # Ollama with Unix socket
  - name: "ollama-socket"
    type: ollama
    url: "unix:///var/run/ollama.sock"
    weight: 1
    models:
      - llama3.2
      - mistral

  # vLLM with Unix socket
  - name: "vllm-socket"
    type: vllm
    url: "unix:///tmp/vllm.sock"
    weight: 3
    models:
      - meta-llama/Llama-3.1-8B-Instruct

Starting Backends with Unix Sockets:

# llama-server
./llama-server -m model.gguf --unix /var/run/llama.sock

# Ollama
OLLAMA_HOST="unix:///var/run/ollama.sock" ollama serve

# vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B \
  --unix-socket /tmp/vllm.sock

Socket Path Conventions:

Path Use Case
/var/run/*.sock System services (requires root)
/tmp/*.sock Temporary, user-accessible
~/.local/share/continuum/*.sock Per-user persistent sockets

Health Checks: The router automatically performs health checks on Unix socket backends using the same endpoints (/health, /v1/models) as TCP backends.

Current Limitations:

  • Streaming (SSE) not supported: Unix socket backends do not currently support Server-Sent Events (SSE) streaming. Use TCP backends for streaming chat completions.
  • Windows platform: Unix sockets are not currently supported on Windows (planned for future releases).
  • Max response size: Response bodies are limited to 100MB by default to prevent memory exhaustion.

Troubleshooting:

Error Cause Solution
"Socket file not found" Server not running Start the backend server
"Permission denied" File permissions chmod 660 socket.sock
"Connection timeout" Server not accepting connections Verify server is listening
"Response body exceeds maximum size" Response too large Increase maxresponsesize or use streaming with TCP backend

Health Checks Section

Configures backend health monitoring:

health_checks:
  enabled: true                    # Enable/disable health monitoring
  interval: "30s"                  # Check frequency
  timeout: "10s"                   # Request timeout
  unhealthy_threshold: 3           # Failures before marking unhealthy
  healthy_threshold: 2             # Successes before marking healthy
  endpoint: "/v1/models"           # Endpoint to check
  warmup_check_interval: "1s"      # Accelerated interval during warmup
  max_warmup_duration: "300s"      # Maximum warmup detection duration

Health Check Process:

  1. Router queries the health endpoint on each backend
  2. Successful responses increment success counter
  3. Failed responses increment failure counter
  4. Backends marked unhealthy after reaching failure threshold
  5. Backends marked healthy after reaching success threshold
  6. Only healthy backends receive traffic

Accelerated Warmup Health Checks

The router supports accelerated health checks during backend warmup, which is particularly useful for backends like llama.cpp that return HTTP 503 while loading models.

Backend States:

State HTTP Response Behavior
ready 200 OK Normal interval checks
warming_up 503 Service Unavailable Accelerated interval checks
down Connection failure Normal interval checks
unknown Initial state First check determines state

Warmup Configuration:

Option Default Description
warmup_check_interval 1s Accelerated check interval during warmup
max_warmup_duration 300s Maximum time to stay in accelerated mode

How it works:

  1. When a backend returns HTTP 503, it enters the warming_up state
  2. Health checks switch to the accelerated interval (default: 1 second)
  3. Once the backend returns HTTP 200, it becomes ready and returns to normal interval
  4. If warmup exceeds max_warmup_duration, the backend is marked as unhealthy

This reduces model availability detection latency from up to 30 seconds (worst case) to approximately 1 second.

Per-Backend Health Check Configuration

Each backend type has sensible default health check endpoints. You can override these defaults with a custom health_check configuration per backend.

Default Health Check Endpoints by Backend Type:

Backend Type Primary Endpoint Fallback Endpoints Method Notes
openai /v1/models - GET Standard OpenAI endpoint
vllm /health /v1/models GET /health available after model load
ollama /api/tags / GET Ollama-specific endpoint
llamacpp /health /v1/models GET llama-server endpoint
anthropic /v1/messages - POST Accepts 200, 400, 401, 429 as healthy
gemini /models /v1beta/models GET Native Gemini endpoint
azure /health /v1/models GET Azure OpenAI endpoint
generic /health /v1/models GET Generic fallback

Fallback Behavior:

When the primary health check endpoint returns HTTP 404, the router automatically tries the fallback endpoints in order. This ensures compatibility with backends that may not implement all standard endpoints.

Custom Health Check Configuration:

backends:
  - name: vllm-custom
    type: vllm
    url: http://localhost:8000
    models:
        - my-model
    health_check:
        endpoint: /custom-health          # Primary endpoint
        fallback_endpoints:               # Tried if primary returns 404
            - /health
            - /v1/models
        method: GET                       # HTTP method: GET, POST, or HEAD
        timeout: 10s                      # Override global health check timeout
        accept_status:                    # Status codes indicating healthy
            - 200
            - 204
        warmup_status:                    # Status codes indicating model loading
            - 503

Health Check Configuration Options:

Option Type Default Description
endpoint string Backend-type specific Primary health check endpoint path
fallback_endpoints array Backend-type specific Endpoints to try if primary returns 404
method string GET HTTP method: GET, POST, or HEAD
body object null JSON body for POST requests
accept_status array [200] Status codes indicating the backend is healthy
warmup_status array [503] Status codes indicating the backend is warming up
timeout string Global timeout Override the global health check timeout

Example: Anthropic-Style Health Check:

For backends that use POST requests or accept error codes as healthy indicators:

backends:
  - name: custom-api
    type: generic
    url: http://localhost:9000
    models:
        - custom-model
    health_check:
        endpoint: /api/v1/health
        method: POST
        body:
            check: true
        accept_status:
            - 200
            - 400    # Bad request means server is up
            - 401    # Unauthorized means server is up
            - 429    # Rate limited means server is up

Request Section

Controls request handling behavior:

request:
  timeout: "300s"                  # Maximum request duration
  max_retries: 3                   # Retry attempts for failed requests
  retry_delay: "1s"                # Initial delay between retries

Timeout Considerations:

  • Long timeouts (300s) accommodate slow model inference
  • Streaming requests may take longer than non-streaming
  • Balance between user experience and resource usage

Retry Section

Global retry configuration for resilience:

retry:
  max_attempts: 3                  # Maximum retry attempts
  base_delay: "100ms"              # Base delay between retries
  max_delay: "30s"                 # Cap on retry delays
  exponential_backoff: true        # Use exponential backoff
  jitter: true                     # Add random jitter

Retry Strategy:

  • Exponential backoff: delays increase exponentially (100ms, 200ms, 400ms...)
  • Jitter: adds randomness to prevent thundering herd
  • Max delay: prevents extremely long waits

Cache Section

Controls caching and optimization:

cache:
  model_cache_ttl: "300s"         # How long to cache model lists
  deduplication_ttl: "60s"        # How long to cache identical requests
  enable_deduplication: true      # Enable request deduplication

Cache Stampede Prevention

The router implements three strategies to prevent cache stampede (thundering herd problem):

  1. Singleflight Pattern: Only one aggregation request runs at a time
  2. Stale-While-Revalidate: Return stale data while refreshing in background
  3. Background Refresh: Proactive cache updates before expiration

Advanced cache configuration:

model_aggregation:
  cache_ttl: 60                     # Cache TTL in seconds (default: 60)
  soft_ttl_ratio: 0.8               # When to trigger background refresh (default: 0.8 = 80%)
  empty_response_base_ttl_seconds: 5   # Base TTL for empty responses
  empty_response_max_ttl_seconds: 60   # Max TTL with exponential backoff
  max_cache_entries: 100            # Maximum cache entries
  background_refresh:
    enabled: true                   # Enable background refresh
    check_interval: 10s             # Check interval

Option Default Description
cache_ttl 60s Hard TTL - cache expires after this time
soft_ttl_ratio 0.8 Soft TTL = cache_ttl * soft_ttl_ratio. Cache is stale but usable between soft and hard TTL
empty_response_base_ttl_seconds 5 Base TTL for empty responses (prevents DoS)
empty_response_max_ttl_seconds 60 Maximum TTL with exponential backoff (base * 2^n)
max_cache_entries 100 Maximum number of cache entries
background_refresh.enabled true Enable proactive cache refresh
background_refresh.check_interval 10s How often to check cache freshness

Cache Benefits:

  • Model caching reduces backend queries
  • Deduplication prevents duplicate processing
  • TTL prevents stale data issues
  • Stampede prevention avoids thundering herd
  • Background refresh ensures cache is always fresh

Logging Section

Configures logging output:

logging:
  level: "info"                   # trace, debug, info, warn, error
  format: "json"                  # json, pretty
  enable_colors: false            # Colored output (pretty format only)

Log Levels:

  • trace: Extremely verbose, includes all details
  • debug: Detailed debugging information
  • info: General operational information
  • warn: Warning messages and potential issues
  • error: Error conditions only

Log Formats:

  • json: Structured JSON logging (recommended for production)
  • pretty: Human-readable format (good for development)

API Keys Section

API keys control client access to the router's endpoints. Keys can be configured through multiple sources.

Authentication Mode

The mode setting controls whether API authentication is required for API endpoints:

Mode Behavior
permissive (default) Allow requests without API key. Requests with valid API keys are authenticated.
blocking Only process requests that pass API key authentication. Unauthenticated requests receive 401.

Target Endpoints (when mode is blocking):

  • /v1/chat/completions
  • /v1/completions
  • /v1/responses
  • /v1/images/generations
  • /v1/images/edits
  • /v1/images/variations
  • /v1/models

Note: Admin, Files, and Metrics endpoints have separate authentication mechanisms and are not affected by this setting.
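
As a sketch, a blocking-mode setup that keeps its key definitions in an external file could look like this:

api_keys:
  mode: blocking                                        # Unauthenticated requests to the endpoints above receive 401
  api_keys_file: "/etc/continuum-router/api-keys.yaml"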

Section Configuration Properties:

Property Type Required Default Description
mode string No permissive Authentication mode: permissive or blocking
api_keys array No [] Inline API key definitions
api_keys_file string No - Path to external API keys file

api_keys:
  # Authentication mode: "permissive" (default) or "blocking"
  mode: permissive

  # Inline API key definitions
  api_keys:
        - key: "${API_KEY_1}"              # Environment variable substitution
      id: "key-production-1"           # Unique identifier
      user_id: "user-admin"            # Associated user
      organization_id: "org-main"      # Associated organization
      name: "Production Admin Key"     # Human-readable name
      scopes:                          # Permissions
        - read
        - write
        - files
        - admin
      rate_limit: 1000                 # Requests per minute (optional)
      enabled: true                    # Active status
      expires_at: "2025-12-31T23:59:59Z"  # Optional expiration (ISO 8601)

        - key: "${API_KEY_2}"
      id: "key-service-1"
      user_id: "service-bot"
      organization_id: "org-main"
      name: "Service Account"
      scopes: [read, write, files]
      rate_limit: 500
      enabled: true

  # External key file for better security
  api_keys_file: "/etc/continuum-router/api-keys.yaml"

Key Properties:

Property Type Required Description
key string Yes The API key value (supports ${ENV_VAR} substitution)
id string Yes Unique identifier for admin operations
user_id string Yes User associated with this key
organization_id string Yes Organization the user belongs to
name string No Human-readable name
description string No Notes about the key
scopes array Yes Permissions: read, write, files, admin
rate_limit integer No Maximum requests per minute
enabled boolean No Active status (default: true)
expires_at string No ISO 8601 expiration timestamp

External Key File Format:

# /etc/continuum-router/api-keys.yaml
keys:
    - key: "sk-prod-xxxxxxxxxxxxxxxxxxxxx"
    id: "key-external-1"
    user_id: "external-user"
    organization_id: "external-org"
    scopes: [read, write, files]
    enabled: true

Security Features:

  • Key Masking: Full keys are never logged (displayed as sk-***last4)
  • Expiration Enforcement: Expired keys are automatically rejected
  • Hot Reload: Update keys without server restart
  • Audit Logging: All key management operations are logged
  • Constant-Time Validation: Prevents timing attacks
  • Max Key Limit: 10,000 keys maximum to prevent DoS

Admin API Endpoints (require admin authentication):

Endpoint Method Description
/admin/api-keys GET List all keys (masked)
/admin/api-keys/:id GET Get key details
/admin/api-keys POST Create new key
/admin/api-keys/:id PUT Update key properties
/admin/api-keys/:id DELETE Delete key
/admin/api-keys/:id/rotate POST Generate new key value
/admin/api-keys/:id/enable POST Enable key
/admin/api-keys/:id/disable POST Disable key

Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.
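
For illustration, assuming the inline key at the backend level is named prompt (as the sentence above implies), the file would win in a configuration like this:

global_prompts:
  backends:
    anthropic:
      prompt: "Short inline policy text"       # Ignored because prompt_file is also set
      prompt_file: "anthropic-system.md"       # Takes precedence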

Merge Strategies

Strategy Behavior
prepend Global prompt added before user's system prompt (default)
append Global prompt added after user's system prompt
replace Global prompt replaces user's system prompt entirely

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
    - id: "gpt-4"
      aliases:                    # Alternative IDs that share this metadata
        - "gpt-4-0125-preview"
        - "gpt-4-turbo-preview"
        - "gpt-4-vision-preview"
      metadata:
        display_name: "GPT-4"
        summary: "Most capable GPT-4 model for complex tasks"
        capabilities: ["text", "image", "function_calling"]
        knowledge_cutoff: "2024-04"
        pricing:
          input_tokens: 0.03   # Per 1000 tokens
          output_tokens: 0.06  # Per 1000 tokens
        limits:
          context_window: 128000
          max_output: 4096

    - id: "llama-3-70b"
      aliases:                    # Different quantizations of the same model
        - "llama-3-70b-instruct"
        - "llama-3-70b-chat"
        - "llama-3-70b-q4"
        - "llama-3-70b-q8"
      metadata:
        display_name: "Llama 3 70B"
        summary: "Open-source model with strong performance"
        capabilities: ["text", "code"]
        knowledge_cutoff: "2023-12"
        pricing:
          input_tokens: 0.001
          output_tokens: 0.002
        limits:
          context_window: 8192
          max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Thinking Pattern Configuration

Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.

Pattern Types:

Pattern Description Example Model
none No thinking pattern (default) Most models
standard Explicit start/end tags (<think>...</think>) Custom reasoning models
unterminated_start No start tag, only end tag nemotron-3-nano

Configuration Example:

models:
    - id: nemotron-3-nano
      metadata:
        display_name: "Nemotron 3 Nano"
        capabilities: ["chat", "reasoning"]
        # Thinking pattern configuration
        thinking:
          pattern: unterminated_start
          end_marker: "</think>"
          assume_reasoning_first: true

Thinking Pattern Fields:

Field Type Description
pattern string Pattern type: none, standard, or unterminated_start
start_marker string Start marker for standard pattern (e.g., <think>)
end_marker string End marker (e.g., </think>)
assume_reasoning_first boolean If true, treat first tokens as reasoning until end marker

How It Works:

When a model has a thinking pattern configured:

  1. Streaming responses are intercepted and transformed
  2. Content before end_marker is sent as reasoning_content field
  3. Content after end_marker is sent as content field
  4. The output follows OpenAI's reasoning_content format for compatibility

Example Output:

// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}

Namespace-Aware Matching

The router intelligently handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.

Metadata Priority and Alias Resolution

When looking up metadata for a model, the router uses the following priority chain:

  1. Exact model ID match
  2. Exact alias match
  3. Date suffix normalization (automatic, zero-config)
  4. Wildcard pattern alias match
  5. Base model name fallback (namespace stripping)

This matching order applies within each metadata source. The sources themselves are consulted in the following order:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Built-in metadata (for OpenAI and Gemini backends)

Automatic Date Suffix Handling

LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:

Supported date patterns:

  • -YYYYMMDD (e.g., claude-opus-4-5-20251130)
  • -YYYY-MM-DD (e.g., gpt-4o-2024-08-06)
  • -YYMM (e.g., o1-mini-2409)
  • @YYYYMMDD (e.g., model@20251130)

How it works:

Request: claude-opus-4-5-20251215
         ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101  (existing metadata entry)
         ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata

This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.

Wildcard Pattern Matching

Aliases support glob-style wildcard patterns using the * character:

  • Prefix matching: claude-* matches claude-opus, claude-sonnet, etc.
  • Suffix matching: *-preview matches gpt-4o-preview, o1-preview, etc.
  • Infix matching: gpt-*-turbo matches gpt-4-turbo, gpt-3.5-turbo, etc.

Example configuration with wildcard patterns:

models:
    - id: "claude-opus-4-5-20251101"
      aliases:
        - "claude-opus-4-5"     # Exact match for base name
        - "claude-opus-*"       # Wildcard for any claude-opus variant
      metadata:
        display_name: "Claude Opus 4.5"
        # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

    - id: "gpt-4o"
      aliases:
        - "gpt-4o-*-preview"    # Matches preview versions
        - "*-4o-turbo"          # Suffix matching
      metadata:
        display_name: "GPT-4o"

Priority note: Exact aliases are always matched before wildcard patterns, ensuring predictable behavior when both could match.

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model
  • Dated versions: claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241201 → share metadata (automatic with date suffix handling)

Example configuration with aliases:

model_configs:
    - id: "qwen3"
      aliases:
        - "qwen3-32b-i1"     # 32B with 1-bit quantization
        - "qwen3-23b-i4"     # 23B with 4-bit quantization
        - "qwen3-16b-q8"     # 16B with 8-bit quantization
        - "qwen3-*"          # Wildcard for any other qwen3 variant
      metadata:
        display_name: "Qwen 3"
        summary: "Alibaba's Qwen model family"
        # ... rest of metadata

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}
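
For example, you can list models and pull out selected metadata fields on the command line (this sketch assumes jq is installed; models without metadata simply return null for those fields):

# Show id, display name, and context window for each model
curl -s http://localhost:8080/v1/models | \
  jq '.data[] | {id, display_name: .metadata.display_name, context_window: .metadata.limits.context_window}'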

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload

Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
    - name: "ollama"               # ✅ Gradual: New requests use updated backend pool
    url: "http://localhost:11434"
    weight: 2                    # ✅ Gradual: Load balancing updates for new requests
    models: ["llama3.2"]         # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings

Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP/Unix socket binding
  # bind_address:                 # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660              # ❌ Restart required: Socket permissions
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:
     • HealthChecker updates check intervals and thresholds
     • RateLimitStore updates rate limiting rules
     • CircuitBreaker updates failure thresholds and timeouts
     • BackendPool updates backend configuration
  8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
  9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained
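
As a quick end-to-end check, edit the configuration file in place and confirm the router picked up the change via the admin API (a sketch; the sed invocation assumes GNU sed and a config.yaml in the current directory):

# Change the log level in place; the file watcher detects the edit automatically
sed -i 's/level: "info"/level: "debug"/' config.yaml

# Confirm the new configuration is active
curl http://localhost:8080/admin/config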

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"

Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)

# Before
backends:
    - name: "ollama"
    url: "http://localhost:11434"

# After
backends:
    - name: "ollama"
    url: "http://localhost:11434"
    - name: "lmstudio"
    url: "http://localhost:1234"
Result: New backend added to pool with immediate health check triggered. The new backend becomes available within 1-2 seconds (instead of waiting up to 30 seconds for the next periodic health check). Existing requests continue to current backends. New requests can route to lmstudio once health check passes.

Example 2b: Removing a Backend (Graceful Draining)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
Result: Backend "lmstudio" enters draining state. New requests are not routed to it, but existing in-flight requests (including streaming) continue until completion. After all references are released (or after 5 minutes timeout), the backend is fully removed from memory.

Backend State Lifecycle

When a backend is removed from configuration, it goes through a graceful shutdown process:

  1. Active → Draining: Backend is marked as draining. New requests skip this backend.
  2. In-flight Completion: Existing requests/streams continue uninterrupted.
  3. Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.

This ensures zero impact on ongoing connections during configuration changes.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"

Result: Warning logged. Change does not take effect. Restart required to bind to new port.

Distributed Tracing

Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.

Configuration

tracing:
  enabled: true                         # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true               # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID (default)
    request_id: "X-Request-ID"          # Header name for request ID (default)
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID (default)

How It Works

  1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:
     • W3C traceparent header (if W3C support enabled)
     • Configured trace_id header (X-Trace-ID)
     • Configured request_id header (X-Request-ID)
     • Configured correlation_id header (X-Correlation-ID)

  2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.

  3. Header Propagation: The trace ID is propagated to backend services via multiple headers:
     • X-Request-ID: For broad compatibility
     • X-Trace-ID: Primary trace identifier
     • X-Correlation-ID: For correlation tracking
     • traceparent: W3C Trace Context (if enabled)
     • tracestate: W3C Trace State (if present in original request)

  4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.
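
As a concrete example, a request that supplies its own trace ID (or a W3C traceparent) should surface that same ID in the router's structured logs and in the headers forwarded to the backend. This is a sketch assuming the standard OpenAI-compatible chat completions endpoint; the trace ID values are arbitrary.

# Send a request with an explicit trace ID
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Trace-ID: 0af7651916cd43dd8448eb211c80319c" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'

# Or supply a W3C traceparent header instead
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'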

Structured Logging

When tracing is enabled, all log messages include the trace_id field:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}

W3C Trace Context

When w3c_trace_context is enabled, the router supports the W3C Trace Context standard:

  • Incoming: Parses traceparent header (format: 00-{trace_id}-{span_id}-{flags})
  • Outgoing: Generates new traceparent header with preserved trace ID and new span ID
  • State: Forwards tracestate header if present in original request

Example traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Disabling Tracing

To disable distributed tracing:

tracing:
  enabled: false

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)

Per-Backend Retry Configuration

backends:
    - name: "slow-backend"
    url: "http://slow.example.com"
    retry_override:               # Override global retry settings
      max_attempts: 5             # More attempts for slower backends
      base_delay: "500ms"         # Longer delays
      max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

Condition Description
error_codes HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504)
timeout Request timeout
connection_error TCP connection failures
model_not_found Model not available on backend
circuit_breaker_open Backend circuit breaker is open

Response Headers

When fallback is used, the following headers are added to the response:

Header Description Example
X-Fallback-Used Indicates fallback was used true
X-Original-Model Originally requested model gpt-4o
X-Fallback-Model Model that served the request gpt-4-turbo
X-Fallback-Reason Why fallback was triggered error_code_429
X-Fallback-Attempts Number of fallback attempts 2
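
To inspect these headers during a fallback, print the response headers and filter for them (a sketch; whether fallback actually triggers depends on backend health at the time):

# Show response headers and highlight the fallback metadata
curl -sD - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]}' \
  | grep -i '^x-fallback\|^x-original-model'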

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

OpenAI Parameter Anthropic Parameter Notes
max_tokens max_tokens Auto-filled if missing (required by Anthropic)
temperature temperature Direct mapping
top_p top_p Direct mapping
stop stop_sequences Array conversion

Provider-specific parameters are automatically removed or converted during cross-provider fallback.

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers

# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available

How It Works

  1. Client Identification: Each request is associated with a client using:
     • API key from the Authorization: Bearer <token> header (preferred)
     • IP address from proxy headers (fallback)

  2. Dual-Window Approach:
     • Sustained limit: Prevents excessive usage over time
     • Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:
     • Client A with API key abc123...: 100 req/min
     • Client B with API key def456...: 100 req/min
     • Client C from IP 192.168.1.1: 100 req/min

Response Headers

When rate limited, the response includes:

  • Status Code: 429 Too Many Requests
  • Error Message: Indicates whether the burst or sustained limit was exceeded
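
A quick way to see the burst limit in action is to fire more than 20 requests at /v1/models within a few seconds and print only the status codes (a sketch assuming the default limits above; the final requests should return 429):

# Send 25 rapid requests; the last few should be rejected once the burst window is exceeded
for i in $(seq 1 25); do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/v1/models
done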

Cache TTL Optimization

To prevent cache poisoning attacks:

  • Empty model lists: Cached for 5 seconds only
  • Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.

Monitoring

Rate limit violations are tracked in metrics:

  • rate_limit_violations: Total rejected requests
  • empty_responses_returned: Empty model lists served
  • Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  • Configurable rate limits via YAML/environment variables
  • Per-endpoint rate limiting
  • Custom rate limits per API key
  • Redis-backed distributed rate limiting

Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
    - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
    - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
    - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
    - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
    - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
    - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers

Examples

Multi-Backend Setup

# Enterprise multi-backend configuration
server:
  bind_address: "0.0.0.0:8080"
  workers: 8
  connection_pool_size: 400

backends:
  # Primary OpenAI GPT models
    - name: "openai-primary"
    url: "https://api.openai.com"
    weight: 5
    models: ["gpt-4", "gpt-3.5-turbo"]
    retry_override:
      max_attempts: 3
      base_delay: "500ms"

  # Secondary Azure OpenAI
    - name: "azure-openai"  
    url: "https://your-resource.openai.azure.com"
    weight: 3
    models: ["gpt-4", "gpt-35-turbo"]

  # Local Ollama for open models
    - name: "local-ollama"
    url: "http://ollama:11434"
    weight: 2
    models: ["llama2", "mistral", "codellama"]

  # vLLM deployment
    - name: "vllm-cluster"
    url: "http://vllm-service:8000"
    weight: 4
    models: ["meta-llama/Llama-2-7b-chat-hf"]

health_checks:
  enabled: true
  interval: "45s"
  timeout: "15s"
  unhealthy_threshold: 3
  healthy_threshold: 2

request:
  timeout: "180s"
  max_retries: 4

cache:
  model_cache_ttl: "600s"        # 10-minute cache
  deduplication_ttl: "120s"      # 2-minute deduplication
  enable_deduplication: true

logging:
  level: "info"
  format: "json"

High-Performance Configuration

# Optimized for high-throughput scenarios
server:
  bind_address: "0.0.0.0:8080"
  workers: 16                     # High worker count
  connection_pool_size: 1000      # Large connection pool

backends:
    - name: "fast-backend-1"
    url: "http://backend1:8000"
    weight: 1
    - name: "fast-backend-2" 
    url: "http://backend2:8000"
    weight: 1
    - name: "fast-backend-3"
    url: "http://backend3:8000"
    weight: 1

health_checks:
  enabled: true
  interval: "30s"
  timeout: "5s"                   # Fast timeout
  unhealthy_threshold: 2          # Fail fast
  healthy_threshold: 1            # Recover quickly

request:
  timeout: "60s"                  # Shorter timeout for high throughput
  max_retries: 2                  # Fewer retries

retry:
  max_attempts: 2
  base_delay: "50ms"              # Fast retries
  max_delay: "5s"
  exponential_backoff: true
  jitter: true

cache:
  model_cache_ttl: "300s"
  deduplication_ttl: "30s"        # Shorter deduplication window
  enable_deduplication: true

logging:
  level: "warn"                   # Minimal logging for performance
  format: "json"

Development Configuration

# Developer-friendly configuration
server:
  bind_address: "127.0.0.1:8080"  # Localhost only
  workers: 2                      # Fewer workers for development
  connection_pool_size: 20        # Small pool

backends:
    - name: "local-ollama"
    url: "http://localhost:11434"
    weight: 1

health_checks:
  enabled: true  
  interval: "10s"                 # Frequent checks for quick feedback
  timeout: "3s"
  unhealthy_threshold: 2
  healthy_threshold: 1

request:
  timeout: "300s"                 # Long timeout for debugging
  max_retries: 1                  # Minimal retries for debugging

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true             # Colored output

cache:
  model_cache_ttl: "60s"          # Short cache for quick testing
  deduplication_ttl: "10s"        # Short deduplication
  enable_deduplication: false     # Disable for testing

Migration Guide

From Command-Line Arguments

If you're currently using command-line arguments, migrate to configuration files:

Before:

continuum-router --backends "http://localhost:11434,http://localhost:1234" --bind "0.0.0.0:9000"

After:

  1. Generate a configuration file:

continuum-router --generate-config > config.yaml

  2. Edit the configuration:

    server:
      bind_address: "0.0.0.0:9000"
    
    backends:
        - name: "ollama"
        url: "http://localhost:11434"
        - name: "lm-studio"
        url: "http://localhost:1234"
    

  3. Use the configuration file:

    continuum-router --config config.yaml
    

From Environment Variables

You can keep your base settings in a configuration file and continue using environment variables as overrides:

Configuration file (config.yaml):

server:
  bind_address: "0.0.0.0:8080"

backends:
    - name: "default"
    url: "http://localhost:11434"

Environment override:

export CONTINUUM_BIND_ADDRESS="0.0.0.0:9000"
export CONTINUUM_BACKEND_URLS="http://localhost:11434,http://localhost:1234"
continuum-router --config config.yaml

Configuration Validation

To validate your configuration without starting the server:

# Test configuration loading
continuum-router --config config.yaml --help

# Check configuration with dry-run (future feature)
continuum-router --config config.yaml --dry-run

This configuration guide provides comprehensive coverage of all configuration options available in Continuum Router. The flexible configuration system allows you to adapt the router to any deployment scenario while maintaining clear precedence rules and validation.