Configuration Guide

This guide provides comprehensive documentation for configuring Continuum Router. The router supports multiple configuration methods with a clear priority system to provide maximum flexibility for different deployment scenarios.

Configuration Methods

Continuum Router supports three configuration methods:

  1. Configuration File (YAML) - Recommended for production
  2. Environment Variables - Ideal for containerized deployments
  3. Command Line Arguments - Useful for testing and overrides

Configuration Discovery

The router automatically searches for configuration files in these locations (in order):

  1. Path specified by --config flag
  2. ./config.yaml (current directory)
  3. ./config.yml
  4. /etc/continuum-router/config.yaml
  5. /etc/continuum-router/config.yml
  6. ~/.config/continuum-router/config.yaml
  7. ~/.config/continuum-router/config.yml

Configuration Priority

Configuration is applied in the following priority order (highest to lowest):

  1. Command-line arguments (highest priority)
  2. Environment variables
  3. Configuration file
  4. Default values (lowest priority)

This allows you to:

  • Set base configuration in a file
  • Override specific settings via environment variables in containers
  • Make temporary adjustments using command-line arguments
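
For example, the following shell session (illustrative values only) shows each layer overriding the one below it: the config file sets the log level, the environment variable overrides it, and the --bind flag overrides the bind address from both the file and any CONTINUUM_BIND_ADDRESS setting.

# config.yaml sets logging.level: "info" and server.bind_address: "0.0.0.0:8080"
export CONTINUUM_LOG_LEVEL="debug"                  # overrides logging.level from the file

# --bind takes precedence over both the file and the environment
continuum-router --config config.yaml --bind "0.0.0.0:9000"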

Configuration File Format

Complete Configuration Example

# Continuum Router Configuration
# This example shows all available configuration options with their default values

# Server configuration
server:
  bind_address: "0.0.0.0:8080"          # Address to bind the server to
  workers: 4                             # Number of worker threads (0 = auto-detect)
  connection_pool_size: 100              # Max idle connections per backend

# Model metadata configuration (optional)
model_metadata_file: "model-metadata.yaml"  # Path to external model metadata file

# Backend configuration
backends:
  # Native OpenAI API with built-in configuration
    - name: "openai"
    type: openai                         # Use native OpenAI backend
    api_key: "${CONTINUUM_OPENAI_API_KEY}"  # Loaded from environment
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"    # Optional organization ID
    weight: 3
    models:                              # Specify which models to use
      - gpt-4o
      - gpt-4o-mini
      - o3-mini
      - text-embedding-3-large
    retry_override:                      # Backend-specific retry settings (optional)
      max_attempts: 5
      base_delay: "200ms"
      max_delay: "30s"
      exponential_backoff: true
      jitter: true

  # Generic OpenAI-compatible backend with custom metadata
    - name: "openai-compatible"
    url: "https://custom-llm.example.com"
    weight: 1
    models:
      - "gpt-4"
      - "gpt-3.5-turbo"
    model_configs:                       # Enhanced model configuration with metadata
      - id: "gpt-4"
        aliases:                         # Alternative IDs that share this metadata (optional)
          - "gpt-4-0125-preview"
          - "gpt-4-turbo-preview"
        metadata:
          display_name: "GPT-4"
          summary: "Most capable GPT-4 model for complex tasks"
          capabilities: ["text", "image", "function_calling"]
          knowledge_cutoff: "2024-04"
          pricing:
            input_tokens: 0.03
            output_tokens: 0.06
          limits:
            context_window: 128000
            max_output: 4096

  # Ollama local server with automatic URL detection
    - name: "local-ollama"
    type: ollama                         # Defaults to http://localhost:11434
    weight: 2
    models:
      - "llama2"
      - "mistral"
      - "codellama"

  # vLLM server
    - name: "vllm-server"
    type: vllm
    url: "http://localhost:8000"
    weight: 1
    # Models will be discovered automatically if not specified
    # Models with namespace prefixes (e.g., "custom/gpt-4") will automatically
    # match metadata for base names (e.g., "gpt-4")

  # Google Gemini API (native backend)
    - name: "gemini"
    type: gemini                           # Use native Gemini backend
    api_key: "${CONTINUUM_GEMINI_API_KEY}" # Loaded from environment
    weight: 2
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

# Health monitoring configuration
health_checks:
  enabled: true                          # Enable/disable health checks
  interval: "30s"                        # How often to check backend health
  timeout: "10s"                         # Timeout for health check requests
  unhealthy_threshold: 3                 # Failures before marking unhealthy
  healthy_threshold: 2                   # Successes before marking healthy
  endpoint: "/v1/models"                 # Endpoint used for health checks

# Request handling and timeout configuration
timeouts:
  connection: "10s"                      # TCP connection establishment timeout
  request:
    standard:                            # Non-streaming requests
      first_byte: "30s"                  # Time to receive first byte
      total: "180s"                      # Total request timeout (3 minutes)
    streaming:                           # Streaming (SSE) requests
      first_byte: "60s"                  # Time to first SSE chunk
      chunk_interval: "30s"              # Max time between chunks
      total: "600s"                      # Total streaming timeout (10 minutes)
    image_generation:                    # Image generation requests (DALL-E, etc.)
      first_byte: "60s"                  # Time to receive first byte
      total: "180s"                      # Total timeout (3 minutes default)
    model_overrides:                     # Model-specific timeout overrides
      gpt-5-latest:
        streaming:
          total: "1200s"                 # 20 minutes for GPT-5
      gpt-4o:
        streaming:
          total: "900s"                  # 15 minutes for GPT-4o
  health_check:
    timeout: "5s"                        # Health check timeout
    interval: "30s"                      # Health check interval

request:
  max_retries: 3                         # Maximum retry attempts for requests
  retry_delay: "1s"                      # Initial delay between retries

# Global retry and resilience configuration
retry:
  max_attempts: 3                        # Maximum retry attempts
  base_delay: "100ms"                    # Base delay between retries
  max_delay: "30s"                       # Maximum delay between retries
  exponential_backoff: true              # Use exponential backoff
  jitter: true                          # Add random jitter to delays

# Caching and optimization configuration
cache:
  model_cache_ttl: "300s"               # Cache model lists for 5 minutes
  deduplication_ttl: "60s"              # Deduplicate requests for 1 minute
  enable_deduplication: true            # Enable request deduplication

# Logging configuration
logging:
  level: "info"                         # Log level: trace, debug, info, warn, error
  format: "json"                        # Log format: json, pretty
  enable_colors: false                  # Enable colored output (for pretty format)

# Files API configuration
files:
  enabled: true                         # Enable/disable Files API endpoints
  max_file_size: 536870912              # Maximum file size in bytes (default: 512MB)
  storage_path: "./data/files"          # Storage path for uploaded files (supports ~)
  retention_days: 0                     # File retention in days (0 = keep forever)
  metadata_storage: persistent          # Metadata backend: "memory" or "persistent" (default)
  cleanup_orphans_on_startup: false     # Auto-cleanup orphaned files on startup

  # Authentication and authorization
  auth:
    method: api_key                     # "none" or "api_key" (default)
    required_scope: files               # API key scope required for access
    enforce_ownership: true             # Users can only access their own files
    admin_can_access_all: true          # Admin scope grants access to all files

# Load balancing configuration
load_balancer:
  strategy: "round_robin"               # Strategy: round_robin, weighted, random
  health_aware: true                    # Only route to healthy backends

# Circuit breaker configuration (future feature)
circuit_breaker:
  enabled: false                        # Enable circuit breaker
  failure_threshold: 5                  # Failures to open circuit
  recovery_timeout: "60s"               # Time before attempting recovery
  half_open_retries: 3                  # Retries in half-open state

# Rate limiting configuration (future feature)
rate_limiting:
  enabled: false                        # Enable rate limiting
  requests_per_second: 100              # Global requests per second
  burst_size: 200                       # Burst capacity

# Metrics and monitoring configuration (future feature)
metrics:
  enabled: false                        # Enable metrics collection
  endpoint: "/metrics"                  # Metrics endpoint path
  include_labels: true                  # Include detailed labels

Minimal Configuration

# Minimal configuration - other settings will use defaults
server:
  bind_address: "0.0.0.0:8080"

backends:
    - name: "ollama"
    url: "http://localhost:11434"
    - name: "lm-studio"  
    url: "http://localhost:1234"

Environment Variables

All configuration options can be overridden using environment variables with the CONTINUUM_ prefix:

Server Configuration

Variable Type Default Description
CONTINUUM_BIND_ADDRESS string "0.0.0.0:8080" Server bind address
CONTINUUM_WORKERS integer 4 Number of worker threads
CONTINUUM_CONNECTION_POOL_SIZE integer 100 HTTP connection pool size

Backend Configuration

Variable Type Default Description
CONTINUUM_BACKEND_URLS string - Comma-separated backend URLs
CONTINUUM_BACKEND_WEIGHTS string - Comma-separated weights (must match URLs)

Health Check Configuration

Variable Type Default Description
CONTINUUM_HEALTH_CHECKS_ENABLED boolean true Enable health checks
CONTINUUM_HEALTH_CHECK_INTERVAL string "30s" Health check interval
CONTINUUM_HEALTH_CHECK_TIMEOUT string "10s" Health check timeout
CONTINUUM_UNHEALTHY_THRESHOLD integer 3 Failures before unhealthy
CONTINUUM_HEALTHY_THRESHOLD integer 2 Successes before healthy

Request Configuration

Variable Type Default Description
CONTINUUM_REQUEST_TIMEOUT string "300s" Maximum request timeout
CONTINUUM_MAX_RETRIES integer 3 Maximum retry attempts
CONTINUUM_RETRY_DELAY string "1s" Initial retry delay

Logging Configuration

Variable Type Default Description
CONTINUUM_LOG_LEVEL string "info" Log level
CONTINUUM_LOG_FORMAT string "json" Log format
CONTINUUM_LOG_COLORS boolean false Enable colored output
RUST_LOG string - Rust-specific logging configuration

Cache Configuration

Variable Type Default Description
CONTINUUM_MODEL_CACHE_TTL string "300s" Model cache TTL
CONTINUUM_DEDUPLICATION_TTL string "60s" Deduplication TTL
CONTINUUM_ENABLE_DEDUPLICATION boolean true Enable deduplication

Files API Configuration

Variable Type Default Description
CONTINUUM_FILES_ENABLED boolean true Enable/disable Files API
CONTINUUM_FILES_MAX_SIZE integer 536870912 Maximum file size in bytes (512MB)
CONTINUUM_FILES_STORAGE_PATH string "./data/files" Storage path for uploaded files
CONTINUUM_FILES_RETENTION_DAYS integer 0 File retention in days (0 = forever)
CONTINUUM_FILES_METADATA_STORAGE string "persistent" Metadata backend: "memory" or "persistent"
CONTINUUM_FILES_CLEANUP_ORPHANS boolean false Auto-cleanup orphaned files on startup
CONTINUUM_FILES_AUTH_METHOD string "api_key" Authentication method: "none" or "api_key"
CONTINUUM_FILES_AUTH_SCOPE string "files" Required API key scope for Files API access
CONTINUUM_FILES_ENFORCE_OWNERSHIP boolean true Users can only access their own files
CONTINUUM_FILES_ADMIN_ACCESS_ALL boolean true Admin scope grants access to all files
CONTINUUM_DEV_MODE boolean false Enable development API keys (DO NOT use in production)

API Key Management Configuration

Variable Type Default Description
CONTINUUM_API_KEY string - Single API key for simple deployments
CONTINUUM_API_KEY_SCOPES string "read,write" Comma-separated scopes for the API key
CONTINUUM_API_KEY_USER_ID string "admin" User ID associated with the API key
CONTINUUM_API_KEY_ORG_ID string "default" Organization ID associated with the API key
CONTINUUM_DEV_MODE boolean false Enable development API keys (DO NOT use in production)
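
As a sketch, a single-key deployment can be configured entirely from the environment using the variables above; the key value and identifiers below are placeholders, not real credentials.

export CONTINUUM_API_KEY="sk-example-placeholder-key"   # placeholder value
export CONTINUUM_API_KEY_SCOPES="read,write,files"
export CONTINUUM_API_KEY_USER_ID="ops-admin"
export CONTINUUM_API_KEY_ORG_ID="org-main"
continuum-router --config config.yaml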

Example Environment Configuration

# Basic configuration
export CONTINUUM_BIND_ADDRESS="0.0.0.0:9000"
export CONTINUUM_BACKEND_URLS="http://localhost:11434,http://localhost:1234"
export CONTINUUM_LOG_LEVEL="debug"

# Advanced configuration
export CONTINUUM_CONNECTION_POOL_SIZE="200"
export CONTINUUM_HEALTH_CHECK_INTERVAL="60s"
export CONTINUUM_MODEL_CACHE_TTL="600s"
export CONTINUUM_ENABLE_DEDUPLICATION="true"

# Start the router
continuum-router

Command Line Arguments

Command-line arguments provide the highest priority configuration method and are useful for testing and temporary overrides.

Core Options

continuum-router --help

Argument Type Description
-c, --config <FILE> path Configuration file path
--generate-config flag Generate sample config and exit

Backend Configuration

Argument Type Description
--backends <URLs> string Comma-separated backend URLs
--backend-url <URL> string Single backend URL (deprecated)

Server Configuration

Argument Type Description
--bind <ADDRESS> string Server bind address
--connection-pool-size <SIZE> integer HTTP connection pool size

Health Check Configuration

Argument Type Description
--disable-health-checks flag Disable health monitoring
--health-check-interval <SECONDS> integer Health check interval
--health-check-timeout <SECONDS> integer Health check timeout
--unhealthy-threshold <COUNT> integer Failures before unhealthy
--healthy-threshold <COUNT> integer Successes before healthy

Example CLI Usage

# Use config file with overrides
continuum-router --config config.yaml --bind "0.0.0.0:9000"

# Override backends temporarily
continuum-router --config config.yaml --backends "http://localhost:11434"

# Adjust health check settings for testing
continuum-router --config config.yaml --health-check-interval 10

# Generate sample configuration
continuum-router --generate-config > my-config.yaml

Configuration Sections

Server Section

Controls the HTTP server behavior:

server:
  bind_address: "0.0.0.0:8080"    # Host and port to bind
  workers: 4                       # Worker threads (0 = auto)
  connection_pool_size: 100        # HTTP connection pool size

Performance Tuning:

  • workers: Set to 0 for auto-detection, or match CPU cores
  • connection_pool_size: Increase for high-load scenarios (200-500)

Backends Section

Defines the LLM backends to route requests to:

backends:
    - name: "unique-identifier"        # Must be unique across all backends
    type: "generic"                  # Backend type (optional, defaults to "generic")
    url: "http://backend:port"       # Base URL for the backend
    weight: 1                        # Load balancing weight (1-100)
    api_key: "${API_KEY}"            # API key (optional, supports env var references)
    org_id: "${ORG_ID}"              # Organization ID (optional, for OpenAI)
    models: ["model1", "model2"]     # Optional: explicit model list
    retry_override:                  # Optional: backend-specific retry settings
      max_attempts: 5
      base_delay: "200ms"

Backend Types Supported:

Type Description Default URL
generic OpenAI-compatible API (default) Must be specified
openai Native OpenAI API with built-in configuration https://api.openai.com/v1
gemini Google Gemini API (OpenAI-compatible endpoint) https://generativelanguage.googleapis.com/v1beta/openai
azure Azure OpenAI Service Must be specified
vllm vLLM server Must be specified
ollama Ollama local server http://localhost:11434
anthropic Anthropic Claude API (native, with request/response translation) https://api.anthropic.com

Native OpenAI Backend

When using type: openai, the router provides:

  • Default URL: https://api.openai.com/v1 (can be overridden for proxies)
  • Built-in model metadata: Automatic pricing, context windows, and capabilities
  • Environment variable support: Automatically loads from CONTINUUM_OPENAI_API_KEY and CONTINUUM_OPENAI_ORG_ID

Minimal OpenAI configuration:

backends:
    - name: "openai"
    type: openai
    models:
      - gpt-4o
      - gpt-4o-mini
      - o3-mini

Full OpenAI configuration with explicit API key:

backends:
    - name: "openai-primary"
    type: openai
    api_key: "${CONTINUUM_OPENAI_API_KEY}"
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"     # Optional
    models:
      - gpt-4o
      - gpt-4o-mini
      - o1
      - o1-mini
      - o3-mini
      - text-embedding-3-large

Using OpenAI with a proxy:

backends:
    - name: "openai-proxy"
    type: openai
    url: "https://my-proxy.example.com/v1"   # Override default URL
    api_key: "${PROXY_API_KEY}"
    models:
      - gpt-4o

Environment Variables for OpenAI

Variable Description
CONTINUUM_OPENAI_API_KEY OpenAI API key (automatically loaded for type: openai backends)
CONTINUUM_OPENAI_ORG_ID OpenAI Organization ID (optional)

Model Auto-Discovery:

When models is not specified or is empty, backends automatically discover available models from their /v1/models API endpoint during initialization. This feature reduces configuration maintenance and ensures all backend-reported models are routable.

Backend Type Auto-Discovery Support Fallback Models
openai ✅ Yes gpt-4o, gpt-4o-mini, o3-mini
gemini ✅ Yes gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash
vllm ✅ Yes vicuna-7b-v1.5, llama-2-7b-chat, mistral-7b-instruct
ollama ✅ Yes Uses vLLM discovery mechanism
anthropic ❌ No (no API) Hardcoded Claude models
generic ❌ No All models supported (supports_model() returns true)

Discovery Behavior:

  • Timeout: 10-second timeout prevents blocking startup
  • Fallback: If discovery fails (timeout, network error, invalid response), fallback models are used
  • Logging: Discovered models are logged at INFO level; fallback usage logged at WARN level

Model Resolution Priority:

  1. Explicit models list from config (highest priority)
  2. Models from model_configs field
  3. Auto-discovered models from backend API
  4. Hardcoded fallback models (lowest priority)

  • Explicit model lists improve startup time and reduce backend queries
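
To preview what auto-discovery would find, you can query a backend's /v1/models endpoint directly; the URL below assumes a local vLLM server on port 8000, and jq is used only to extract the model IDs.

curl -s http://localhost:8000/v1/models | jq -r '.data[].id'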

Native Gemini Backend

When using type: gemini, the router provides:

  • Default URL: https://generativelanguage.googleapis.com/v1beta/openai (OpenAI-compatible endpoint)
  • Built-in model metadata: Automatic context windows and capabilities for Gemini models
  • Environment variable support: Automatically loads from CONTINUUM_GEMINI_API_KEY
  • Extended streaming timeout: 300s timeout for thinking models (gemini-2.5-pro, gemini-3-pro)
  • Automatic max_tokens adjustment: For thinking models, see below

Minimal Gemini configuration:

backends:
    - name: "gemini"
    type: gemini
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

Full Gemini configuration:

backends:
    - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    weight: 2
    models:
      - gemini-2.5-pro
      - gemini-2.5-flash
      - gemini-2.0-flash

Gemini Thinking Models: Automatic max_tokens Adjustment

Gemini "thinking" models (gemini-2.5-pro, gemini-3-pro, and models with -pro-preview suffix) perform extended reasoning before generating responses. To prevent response truncation, the router automatically adjusts max_tokens:

Condition Behavior
max_tokens not specified Automatically set to 16384
max_tokens < 4096 Automatically increased to 16384
max_tokens >= 4096 Client value preserved

This ensures thinking models can generate complete responses without truncation due to low default values from client libraries.
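
For example, the request below omits max_tokens entirely, so the router forwards it to the Gemini backend with max_tokens raised to 16384; the localhost:8080 address assumes the default bind address.

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-2.5-pro", "messages": [{"role": "user", "content": "Summarize this design"}]}'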

Environment Variables for Gemini

Variable Description
CONTINUUM_GEMINI_API_KEY Google Gemini API key (automatically loaded for type: gemini backends)

Native Anthropic Backend

When using type: anthropic, the router provides:

  • Default URL: https://api.anthropic.com (can be overridden for proxies)
  • Native API translation: Automatically converts OpenAI format requests to Anthropic Messages API format and vice versa
  • Anthropic-specific headers: Automatically adds x-api-key and anthropic-version headers
  • Environment variable support: Automatically loads from CONTINUUM_ANTHROPIC_API_KEY
  • Extended streaming timeout: 600s timeout for extended thinking models (Claude Opus, Sonnet 4)

Minimal Anthropic configuration:

backends:
    - name: "anthropic"
    type: anthropic
    models:
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022

Full Anthropic configuration:

backends:
    - name: "anthropic"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    weight: 2
    models:
      - claude-opus-4-5-20250514
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022

Anthropic API Translation

The router automatically handles the translation between OpenAI and Anthropic API formats:

OpenAI Format Anthropic Format
messages array with role: "system" Separate system parameter
Authorization: Bearer <key> x-api-key: <key> header
Optional max_tokens Required max_tokens (auto-filled if missing)
choices[0].message.content content[0].text
finish_reason: "stop" stop_reason: "end_turn"
usage.prompt_tokens usage.input_tokens

Example Request Translation:

OpenAI format (incoming from client):

{
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}

Anthropic format (sent to API):

{
  "model": "claude-sonnet-4-20250514",
  "system": "You are helpful.",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}

Anthropic Extended Thinking Models

Models supporting extended thinking (Claude Opus, Sonnet 4) may require longer response times. The router automatically:

  • Sets a higher default max_tokens (16384) for thinking models
  • Uses the extended streaming timeout (600s) for these models

OpenAI ↔ Claude Reasoning Parameter Conversion

The router automatically converts between OpenAI's reasoning parameters and Claude's thinking parameter, enabling seamless cross-provider reasoning requests.

Supported OpenAI Formats:

Format API Example
reasoning_effort (flat) Chat Completions API "reasoning_effort": "high"
reasoning.effort (nested) Responses API "reasoning": {"effort": "high"}

When both formats are present, reasoning_effort (flat) takes precedence.

Effort Level to Budget Tokens Mapping:

Effort Level Claude thinking.budget_tokens
none (thinking disabled)
minimal 1,024
low 4,096
medium 10,240
high 32,768

Example Request - Chat Completions API (flat format):

// Client sends OpenAI Chat Completions API request
{
  "model": "claude-sonnet-4-5-20250929",
  "reasoning_effort": "high",
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-5-20250929",
  "thinking": {"type": "enabled", "budget_tokens": 32768},
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

Example Request - Responses API (nested format):

// Client sends OpenAI Responses API request
{
  "model": "claude-sonnet-4-5-20250929",
  "reasoning": {"effort": "medium"},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-5-20250929",
  "thinking": {"type": "enabled", "budget_tokens": 10240},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

Response with Reasoning Content:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The final answer is...",
      "reasoning_content": "Let me analyze this step by step..."
    }
  }]
}

Important Notes:

  • If the thinking parameter is explicitly provided, it takes precedence over reasoning_effort and reasoning.effort
  • reasoning_effort (flat) takes precedence over reasoning.effort (nested) when both are present
  • Only models supporting extended thinking (Opus 4.x, Sonnet 4.x) will have reasoning enabled
  • When reasoning is enabled, the temperature parameter is automatically removed (Claude API requirement)
  • For streaming responses, thinking content is returned as reasoning_content delta events

Environment Variables for Anthropic

Variable Description
CONTINUUM_ANTHROPIC_API_KEY Anthropic API key (automatically loaded for type: anthropic backends)

Health Checks Section

Configures backend health monitoring:

health_checks:
  enabled: true                    # Enable/disable health monitoring
  interval: "30s"                  # Check frequency
  timeout: "10s"                   # Request timeout
  unhealthy_threshold: 3           # Failures before marking unhealthy
  healthy_threshold: 2             # Successes before marking healthy
  endpoint: "/v1/models"           # Endpoint to check

Health Check Process:

  1. Router queries the health endpoint on each backend
  2. Successful responses increment the success counter
  3. Failed responses increment the failure counter
  4. Backends are marked unhealthy after reaching the failure threshold
  5. Backends are marked healthy after reaching the success threshold
  6. Only healthy backends receive traffic
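
You can run the same probe by hand to see what the router sees; the example below checks a local Ollama backend against the default /v1/models endpoint with the default 10-second timeout.

curl -sf --max-time 10 http://localhost:11434/v1/models > /dev/null \
  && echo "backend healthy" \
  || echo "backend unhealthy"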

Request Section

Controls request handling behavior:

request:
  timeout: "300s"                  # Maximum request duration
  max_retries: 3                   # Retry attempts for failed requests
  retry_delay: "1s"                # Initial delay between retries

Timeout Considerations:

  • Long timeouts (300s) accommodate slow model inference
  • Streaming requests may take longer than non-streaming
  • Balance between user experience and resource usage

Retry Section

Global retry configuration for resilience:

retry:
  max_attempts: 3                  # Maximum retry attempts
  base_delay: "100ms"              # Base delay between retries
  max_delay: "30s"                 # Cap on retry delays
  exponential_backoff: true        # Use exponential backoff
  jitter: true                     # Add random jitter

Retry Strategy:

  • Exponential backoff: delays increase exponentially (100ms, 200ms, 400ms...)
  • Jitter: adds randomness to prevent thundering herd
  • Max delay: prevents extremely long waits

Cache Section

Controls caching and optimization:

cache:
  model_cache_ttl: "300s"         # How long to cache model lists
  deduplication_ttl: "60s"        # How long to cache identical requests
  enable_deduplication: true      # Enable request deduplication

Cache Benefits:

  • Model caching reduces backend queries
  • Deduplication prevents duplicate processing
  • TTL prevents stale data issues

Logging Section

Configures logging output:

logging:
  level: "info"                   # trace, debug, info, warn, error
  format: "json"                  # json, pretty
  enable_colors: false            # Colored output (pretty format only)

Log Levels:

  • trace: Extremely verbose, includes all details
  • debug: Detailed debugging information
  • info: General operational information
  • warn: Warning messages and potential issues
  • error: Error conditions only

Log Formats:

  • json: Structured JSON logging (recommended for production)
  • pretty: Human-readable format (good for development)

API Keys Section

API keys control client access to the router's endpoints. Keys can be configured through multiple sources.

Authentication Mode

The mode setting controls whether API authentication is required for API endpoints:

Mode Behavior
permissive (default) Allow requests without API key. Requests with valid API keys are authenticated.
blocking Only process requests that pass API key authentication. Unauthenticated requests receive 401.

Target Endpoints (when mode is blocking):

  • /v1/chat/completions
  • /v1/completions
  • /v1/responses
  • /v1/images/generations
  • /v1/images/edits
  • /v1/images/variations
  • /v1/models

Note: Admin, Files, and Metrics endpoints have separate authentication mechanisms and are not affected by this setting.
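
As an illustration of blocking mode, the first request below carries no key and receives 401, while the second presents a configured key as a Bearer token (the $API_KEY value is assumed to match a key in your configuration).

# Rejected in blocking mode: no API key
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/v1/models

# Accepted: valid API key presented as a Bearer token
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/v1/models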

Section Configuration Properties:

Property Type Required Default Description
mode string No permissive Authentication mode: permissive or blocking
api_keys array No [] Inline API key definitions
api_keys_file string No - Path to external API keys file

api_keys:
  # Authentication mode: "permissive" (default) or "blocking"
  mode: permissive

  # Inline API key definitions
  api_keys:
        - key: "${API_KEY_1}"              # Environment variable substitution
      id: "key-production-1"           # Unique identifier
      user_id: "user-admin"            # Associated user
      organization_id: "org-main"      # Associated organization
      name: "Production Admin Key"     # Human-readable name
      scopes:                          # Permissions
        - read
        - write
        - files
        - admin
      rate_limit: 1000                 # Requests per minute (optional)
      enabled: true                    # Active status
      expires_at: "2025-12-31T23:59:59Z"  # Optional expiration (ISO 8601)

        - key: "${API_KEY_2}"
      id: "key-service-1"
      user_id: "service-bot"
      organization_id: "org-main"
      name: "Service Account"
      scopes: [read, write, files]
      rate_limit: 500
      enabled: true

  # External key file for better security
  api_keys_file: "/etc/continuum-router/api-keys.yaml"

Key Properties:

Property Type Required Description
key string Yes The API key value (supports ${ENV_VAR} substitution)
id string Yes Unique identifier for admin operations
user_id string Yes User associated with this key
organization_id string Yes Organization the user belongs to
name string No Human-readable name
description string No Notes about the key
scopes array Yes Permissions: read, write, files, admin
rate_limit integer No Maximum requests per minute
enabled boolean No Active status (default: true)
expires_at string No ISO 8601 expiration timestamp

External Key File Format:

# /etc/continuum-router/api-keys.yaml
keys:
    - key: "sk-prod-xxxxxxxxxxxxxxxxxxxxx"
    id: "key-external-1"
    user_id: "external-user"
    organization_id: "external-org"
    scopes: [read, write, files]
    enabled: true

Security Features:

  • Key Masking: Full keys are never logged (displayed as sk-***last4)
  • Expiration Enforcement: Expired keys are automatically rejected
  • Hot Reload: Update keys without server restart
  • Audit Logging: All key management operations are logged
  • Constant-Time Validation: Prevents timing attacks
  • Max Key Limit: 10,000 keys maximum to prevent DoS

Admin API Endpoints (require admin authentication):

Endpoint Method Description
/admin/api-keys GET List all keys (masked)
/admin/api-keys/:id GET Get key details
/admin/api-keys POST Create new key
/admin/api-keys/:id PUT Update key properties
/admin/api-keys/:id DELETE Delete key
/admin/api-keys/:id/rotate POST Generate new key value
/admin/api-keys/:id/enable POST Enable key
/admin/api-keys/:id/disable POST Disable key
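
For example, key management can be scripted against these endpoints; the calls below assume admin authentication is supplied as a Bearer token holding an admin-scoped key ($ADMIN_KEY) and that a key with id key-service-1 exists.

# List all keys (key values are returned masked)
curl -s -H "Authorization: Bearer $ADMIN_KEY" http://localhost:8080/admin/api-keys

# Disable a specific key by id
curl -s -X POST -H "Authorization: Bearer $ADMIN_KEY" \
  http://localhost:8080/admin/api-keys/key-service-1/disable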

Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.

Merge Strategies

Strategy Behavior
prepend Global prompt added before user's system prompt (default)
append Global prompt added after user's system prompt
replace Global prompt replaces user's system prompt entirely

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
    - id: "gpt-4"
    aliases:                    # Alternative IDs that share this metadata
      - "gpt-4-0125-preview"
      - "gpt-4-turbo-preview"
      - "gpt-4-vision-preview"
    metadata:
      display_name: "GPT-4"
      summary: "Most capable GPT-4 model for complex tasks"
      capabilities: ["text", "image", "function_calling"]
      knowledge_cutoff: "2024-04"
      pricing:
        input_tokens: 0.03   # Per 1000 tokens
        output_tokens: 0.06  # Per 1000 tokens
      limits:
        context_window: 128000
        max_output: 4096

    - id: "llama-3-70b"
    aliases:                    # Different quantizations of the same model
      - "llama-3-70b-instruct"
      - "llama-3-70b-chat"
      - "llama-3-70b-q4"
      - "llama-3-70b-q8"
    metadata:
      display_name: "Llama 3 70B"
      summary: "Open-source model with strong performance"
      capabilities: ["text", "code"]
      knowledge_cutoff: "2023-12"
      pricing:
        input_tokens: 0.001
        output_tokens: 0.002
      limits:
        context_window: 8192
        max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Namespace-Aware Matching

The router intelligently handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.

Metadata Priority and Alias Resolution

When multiple metadata sources exist for a model:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Alias matching (when exact ID not found)

     • If requesting "gpt-4-turbo" and it is listed as an alias, metadata from the parent model configuration is returned

  4. Base model fallback (for namespace matching)

     • If "custom/gpt-4" has no exact match, it falls back to metadata for "gpt-4"

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model

Example configuration with aliases:

model_configs:
    - id: "qwen3"
    aliases:
      - "qwen3-32b-i1"     # 32B with 1-bit quantization
      - "qwen3-23b-i4"     # 23B with 4-bit quantization
      - "qwen3-16b-q8"     # 16B with 8-bit quantization
    metadata:
      display_name: "Qwen 3"
      summary: "Alibaba's Qwen model family"
      # ... rest of metadata

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload
Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
    - name: "ollama"               # ✅ Gradual: New requests use updated backend pool
    url: "http://localhost:11434"
    weight: 2                    # ✅ Gradual: Load balancing updates for new requests
    models: ["llama3.2"]         # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings
Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP socket binding
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:
     • HealthChecker updates check intervals and thresholds
     • RateLimitStore updates rate limiting rules
     • CircuitBreaker updates failure thresholds and timeouts
     • BackendPool updates backend configuration
  8. Error Handling - If invalid, error is logged and old configuration retained

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"
Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual)

# Before
backends:
  - name: "ollama"
    url: "http://localhost:11434"

# After
backends:
  - name: "ollama"
    url: "http://localhost:11434"
  - name: "lmstudio"
    url: "http://localhost:1234"

Result: New backend added to pool. Existing requests continue to current backends. New requests can route to lmstudio.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"
Result: Warning logged. Change does not take effect. Restart required to bind to new port.

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)

Per-Backend Retry Configuration

backends:
    - name: "slow-backend"
    url: "http://slow.example.com"
    retry_override:               # Override global retry settings
      max_attempts: 5             # More attempts for slower backends
      base_delay: "500ms"         # Longer delays
      max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

Condition Description
error_codes HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504)
timeout Request timeout
connection_error TCP connection failures
model_not_found Model not available on backend
circuit_breaker_open Backend circuit breaker is open

Response Headers

When fallback is used, the following headers are added to the response:

Header Description Example
X-Fallback-Used Indicates fallback was used true
X-Original-Model Originally requested model gpt-4o
X-Fallback-Model Model that served the request gpt-4-turbo
X-Fallback-Reason Why fallback was triggered error_code_429
X-Fallback-Attempts Number of fallback attempts 2
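
A quick way to confirm that a request was served by a fallback model is to inspect these headers on the response; the request below is illustrative and assumes gpt-4o has a configured fallback chain.

curl -si http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}' \
  | grep -iE '^x-(fallback|original-model)'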

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

OpenAI Parameter Anthropic Parameter Notes
max_tokens max_tokens Auto-filled if missing (required by Anthropic)
temperature temperature Direct mapping
top_p top_p Direct mapping
stop stop_sequences Array conversion

Provider-specific parameters are automatically removed or converted during cross-provider fallback.

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers

# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available

How It Works

  1. Client Identification: Each request is associated with a client using:
     • API key from Authorization: Bearer <token> header (preferred)
     • IP address from proxy headers (fallback)

  2. Dual-Window Approach:
     • Sustained limit: Prevents excessive usage over time
     • Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:
     • Client A with API key abc123...: 100 req/min
     • Client B with API key def456...: 100 req/min
     • Client C from IP 192.168.1.1: 100 req/min
Response Headers

When rate limited, the response includes:

  • Status Code: 429 Too Many Requests
  • Error Message: Indicates whether the burst or sustained limit was exceeded
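
You can observe the burst limit by sending more than 20 requests to /v1/models within a five-second window; the loop below prints each HTTP status code, and requests beyond the burst allowance should return 429.

for i in $(seq 1 25); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/v1/models
done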

Cache TTL Optimization

To prevent cache poisoning attacks:

  • Empty model lists: Cached for 5 seconds only
  • Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.

Monitoring

Rate limit violations are tracked in metrics:

  • rate_limit_violations: Total rejected requests
  • empty_responses_returned: Empty model lists served
  • Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  • Configurable rate limits via YAML/environment variables
  • Per-endpoint rate limiting
  • Custom rate limits per API key
  • Redis-backed distributed rate limiting

Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
    - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
    - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
    - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers

Examples

Multi-Backend Setup

# Enterprise multi-backend configuration
server:
  bind_address: "0.0.0.0:8080"
  workers: 8
  connection_pool_size: 400

backends:
  # Primary OpenAI GPT models
    - name: "openai-primary"
    url: "https://api.openai.com"
    weight: 5
    models: ["gpt-4", "gpt-3.5-turbo"]
    retry_override:
      max_attempts: 3
      base_delay: "500ms"

  # Secondary Azure OpenAI
    - name: "azure-openai"  
    url: "https://your-resource.openai.azure.com"
    weight: 3
    models: ["gpt-4", "gpt-35-turbo"]

  # Local Ollama for open models
    - name: "local-ollama"
    url: "http://ollama:11434"
    weight: 2
    models: ["llama2", "mistral", "codellama"]

  # vLLM deployment
    - name: "vllm-cluster"
    url: "http://vllm-service:8000"
    weight: 4
    models: ["meta-llama/Llama-2-7b-chat-hf"]

health_checks:
  enabled: true
  interval: "45s"
  timeout: "15s"
  unhealthy_threshold: 3
  healthy_threshold: 2

request:
  timeout: "180s"
  max_retries: 4

cache:
  model_cache_ttl: "600s"        # 10-minute cache
  deduplication_ttl: "120s"      # 2-minute deduplication
  enable_deduplication: true

logging:
  level: "info"
  format: "json"

High-Performance Configuration

# Optimized for high-throughput scenarios
server:
  bind_address: "0.0.0.0:8080"
  workers: 16                     # High worker count
  connection_pool_size: 1000      # Large connection pool

backends:
    - name: "fast-backend-1"
    url: "http://backend1:8000"
    weight: 1
    - name: "fast-backend-2" 
    url: "http://backend2:8000"
    weight: 1
    - name: "fast-backend-3"
    url: "http://backend3:8000"
    weight: 1

health_checks:
  enabled: true
  interval: "30s"
  timeout: "5s"                   # Fast timeout
  unhealthy_threshold: 2          # Fail fast
  healthy_threshold: 1            # Recover quickly

request:
  timeout: "60s"                  # Shorter timeout for high throughput
  max_retries: 2                  # Fewer retries

retry:
  max_attempts: 2
  base_delay: "50ms"              # Fast retries
  max_delay: "5s"
  exponential_backoff: true
  jitter: true

cache:
  model_cache_ttl: "300s"
  deduplication_ttl: "30s"        # Shorter deduplication window
  enable_deduplication: true

logging:
  level: "warn"                   # Minimal logging for performance
  format: "json"

Development Configuration

# Developer-friendly configuration
server:
  bind_address: "127.0.0.1:8080"  # Localhost only
  workers: 2                      # Fewer workers for development
  connection_pool_size: 20        # Small pool

backends:
    - name: "local-ollama"
    url: "http://localhost:11434"
    weight: 1

health_checks:
  enabled: true  
  interval: "10s"                 # Frequent checks for quick feedback
  timeout: "3s"
  unhealthy_threshold: 2
  healthy_threshold: 1

request:
  timeout: "300s"                 # Long timeout for debugging
  max_retries: 1                  # Minimal retries for debugging

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true             # Colored output

cache:
  model_cache_ttl: "60s"          # Short cache for quick testing
  deduplication_ttl: "10s"        # Short deduplication
  enable_deduplication: false     # Disable for testing

Migration Guide

From Command-Line Arguments

If you're currently using command-line arguments, migrate to configuration files:

Before:

continuum-router --backends "http://localhost:11434,http://localhost:1234" --bind "0.0.0.0:9000"

After:

  1. Generate a configuration file:

continuum-router --generate-config > config.yaml

  2. Edit the configuration:

    server:
      bind_address: "0.0.0.0:9000"
    
    backends:
        - name: "ollama"
        url: "http://localhost:11434"
        - name: "lm-studio"
        url: "http://localhost:1234"
    

  3. Use the configuration file:

    continuum-router --config config.yaml
    

From Environment Variables

You can continue using environment variables with configuration files as overrides:

Configuration file (config.yaml):

server:
  bind_address: "0.0.0.0:8080"

backends:
    - name: "default"
    url: "http://localhost:11434"

Environment override:

export CONTINUUM_BIND_ADDRESS="0.0.0.0:9000"
export CONTINUUM_BACKEND_URLS="http://localhost:11434,http://localhost:1234"
continuum-router --config config.yaml

Configuration Validation

To validate your configuration without starting the server:

# Test configuration loading
continuum-router --config config.yaml --help

# Check configuration with dry-run (future feature)
continuum-router --config config.yaml --dry-run

This configuration guide provides comprehensive coverage of all configuration options available in Continuum Router. The flexible configuration system allows you to adapt the router to any deployment scenario while maintaining clear precedence rules and validation.