
Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.

Merge Strategies

| Strategy | Behavior |
|----------|----------|
| prepend | Global prompt added before the user's system prompt (default) |
| append | Global prompt added after the user's system prompt |
| replace | Global prompt replaces the user's system prompt entirely |

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
  - id: "gpt-4"
    aliases:                    # Alternative IDs that share this metadata
      - "gpt-4-0125-preview"
      - "gpt-4-turbo-preview"
      - "gpt-4-vision-preview"
    metadata:
      display_name: "GPT-4"
      summary: "Most capable GPT-4 model for complex tasks"
      capabilities: ["text", "image", "function_calling"]
      knowledge_cutoff: "2024-04"
      pricing:
        input_tokens: 0.03   # Per 1000 tokens
        output_tokens: 0.06  # Per 1000 tokens
      limits:
        context_window: 128000
        max_output: 4096

  - id: "llama-3-70b"
    aliases:                    # Different quantizations of the same model
      - "llama-3-70b-instruct"
      - "llama-3-70b-chat"
      - "llama-3-70b-q4"
      - "llama-3-70b-q8"
    metadata:
      display_name: "Llama 3 70B"
      summary: "Open-source model with strong performance"
      capabilities: ["text", "code"]
      knowledge_cutoff: "2023-12"
      pricing:
        input_tokens: 0.001
        output_tokens: 0.002
      limits:
        context_window: 8192
        max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Thinking Pattern Configuration

Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.

Pattern Types:

| Pattern | Description | Example Model |
|---------|-------------|---------------|
| none | No thinking pattern (default) | Most models |
| standard | Explicit start/end tags (<think>...</think>) | Custom reasoning models |
| unterminated_start | No start tag, only end tag | nemotron-3-nano |

Configuration Example:

models:
  - id: nemotron-3-nano
    metadata:
      display_name: "Nemotron 3 Nano"
      capabilities: ["chat", "reasoning"]
      # Thinking pattern configuration
      thinking:
        pattern: unterminated_start
        end_marker: "</think>"
        assume_reasoning_first: true

Thinking Pattern Fields:

| Field | Type | Description |
|-------|------|-------------|
| pattern | string | Pattern type: none, standard, or unterminated_start |
| start_marker | string | Start marker for the standard pattern (e.g., <think>) |
| end_marker | string | End marker (e.g., </think>) |
| assume_reasoning_first | boolean | If true, treat first tokens as reasoning until the end marker appears |

How It Works:

When a model has a thinking pattern configured:

  1. Streaming responses are intercepted and transformed
  2. Content before end_marker is sent as reasoning_content field
  3. Content after end_marker is sent as content field
  4. The output follows OpenAI's reasoning_content format for compatibility
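The transformation can be sketched as a generator that routes streamed text into the two fields. This is an illustrative sketch of the unterminated_start case, not the router's code; it buffers just enough text to handle an end marker that is split across chunks:

```python
def split_thinking(chunks, end_marker="</think>"):
    """Route streamed text into reasoning vs. regular content for an
    unterminated_start pattern: everything before end_marker is reasoning."""
    buffer, in_reasoning = "", True
    for chunk in chunks:
        if not in_reasoning:
            yield {"content": chunk}
            continue
        buffer += chunk
        if end_marker in buffer:
            reasoning, _, rest = buffer.partition(end_marker)
            if reasoning:
                yield {"reasoning_content": reasoning}
            if rest:
                yield {"content": rest}
            in_reasoning, buffer = False, ""
        else:
            # Flush only text that cannot still be a prefix of a split end marker
            safe = len(buffer) - len(end_marker) + 1
            if safe > 0:
                yield {"reasoning_content": buffer[:safe]}
                buffer = buffer[safe:]
    if in_reasoning and buffer:
        yield {"reasoning_content": buffer}  # stream ended without end marker

deltas = list(split_thinking(["Let me think...", "</th", "ink>The answer is 42."]))
# The reasoning text may arrive split across several deltas; joined, it reads
# "Let me think...", and the content after the marker is "The answer is 42."
```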

Example Output:

// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}

Namespace-Aware Matching

The router intelligently handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.
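A sketch of namespace-aware lookup (illustrative; an exact match is tried before stripping the namespace prefix):

```python
def lookup_with_namespace(metadata: dict, model_id: str):
    """Match a possibly namespaced model ID ("custom/gpt-4") against
    metadata keyed by base name ("gpt-4")."""
    if model_id in metadata:            # exact match wins
        return metadata[model_id]
    base = model_id.rsplit("/", 1)[-1]  # strip "namespace/" prefix
    return metadata.get(base)

meta = {"gpt-4": {"display_name": "GPT-4"}}
assert lookup_with_namespace(meta, "custom/gpt-4") == {"display_name": "GPT-4"}
assert lookup_with_namespace(meta, "gpt-4") == {"display_name": "GPT-4"}
assert lookup_with_namespace(meta, "openai/gpt-5") is None
```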

Metadata Priority and Alias Resolution

When looking up metadata for a model, the router uses the following priority chain:

  1. Exact model ID match
  2. Exact alias match
  3. Date suffix normalization (automatic, zero-config)
  4. Wildcard pattern alias match
  5. Base model name fallback (namespace stripping)

This resolution chain applies within each metadata source. Across sources, lookup proceeds in the following priority order:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Built-in metadata (for OpenAI and Gemini backends)

Automatic Date Suffix Handling

LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:

Supported date patterns:

  • -YYYYMMDD (e.g., claude-opus-4-5-20251130)
  • -YYYY-MM-DD (e.g., gpt-4o-2024-08-06)
  • -YYMM (e.g., o1-mini-2409)
  • @YYYYMMDD (e.g., model@20251130)

How it works:

Request: claude-opus-4-5-20251215
         ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101  (existing metadata entry)
         ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata

This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.
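The normalization step can be approximated with a few regular expressions derived from the patterns listed above (illustrative; the router's actual matching logic may differ):

```python
import re

# One pattern per supported suffix style; $ anchors at the end of the model ID.
DATE_SUFFIXES = [
    re.compile(r"-\d{8}$"),              # -YYYYMMDD
    re.compile(r"-\d{4}-\d{2}-\d{2}$"),  # -YYYY-MM-DD
    re.compile(r"-\d{4}$"),              # -YYMM
    re.compile(r"@\d{8}$"),              # @YYYYMMDD
]

def strip_date_suffix(model_id: str) -> str:
    """Return the base model name with any recognized date suffix removed."""
    for pat in DATE_SUFFIXES:
        if pat.search(model_id):
            return pat.sub("", model_id)
    return model_id

assert strip_date_suffix("claude-opus-4-5-20251130") == "claude-opus-4-5"
assert strip_date_suffix("gpt-4o-2024-08-06") == "gpt-4o"
assert strip_date_suffix("o1-mini-2409") == "o1-mini"
assert strip_date_suffix("model@20251130") == "model"
```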

Wildcard Pattern Matching

Aliases support glob-style wildcard patterns using the * character:

  • Prefix matching: claude-* matches claude-opus, claude-sonnet, etc.
  • Suffix matching: *-preview matches gpt-4o-preview, o1-preview, etc.
  • Infix matching: gpt-*-turbo matches gpt-4-turbo, gpt-3.5-turbo, etc.
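Python's fnmatch module implements the same glob semantics, so alias matching can be sketched as follows (illustrative; checking exact aliases before wildcards mirrors the documented priority):

```python
from fnmatch import fnmatchcase

def match_alias(model_id: str, aliases: list[str]) -> bool:
    """Glob-style alias matching; exact aliases are checked before wildcards."""
    exact = [a for a in aliases if "*" not in a]
    wildcard = [a for a in aliases if "*" in a]
    return model_id in exact or any(fnmatchcase(model_id, p) for p in wildcard)

assert match_alias("claude-opus", ["claude-*"])       # prefix matching
assert match_alias("o1-preview", ["*-preview"])       # suffix matching
assert match_alias("gpt-3.5-turbo", ["gpt-*-turbo"])  # infix matching
assert not match_alias("gpt-4o", ["claude-*"])
```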

Example configuration with wildcard patterns:

models:
  - id: "claude-opus-4-5-20251101"
    aliases:
      - "claude-opus-4-5"     # Exact match for base name
      - "claude-opus-*"       # Wildcard for any claude-opus variant
    metadata:
      display_name: "Claude Opus 4.5"
      # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

  - id: "gpt-4o"
    aliases:
      - "gpt-4o-*-preview"    # Matches preview versions
      - "*-4o-turbo"          # Suffix matching
    metadata:
      display_name: "GPT-4o"
Priority note: Exact aliases are always matched before wildcard patterns, ensuring predictable behavior when both could match.

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model
  • Dated versions: claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241201 → share metadata (automatic with date suffix handling)

Example configuration with aliases:

model_configs:
  - id: "qwen3"
    aliases:
      - "qwen3-32b-i1"     # 32B with 1-bit quantization
      - "qwen3-23b-i4"     # 23B with 4-bit quantization
      - "qwen3-16b-q8"     # 16B with 8-bit quantization
      - "qwen3-*"          # Wildcard for any other qwen3 variant
    metadata:
      display_name: "Qwen 3"
      summary: "Alibaba's Qwen model family"
      # ... rest of metadata

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload

# Admin statistics
admin:
  stats:
    retention_window: "24h"        # ✅ Immediate: Retention window updates instantly
    token_tracking: true           # ✅ Immediate: Token tracking toggle applies immediately

Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
  - name: "ollama"               # ✅ Gradual: New requests use updated backend pool
    url: "http://localhost:11434"
    weight: 2                    # ✅ Gradual: Load balancing updates for new requests
    models: ["llama3.2"]         # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings

Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP/Unix socket binding
  # bind_address:                 # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660              # ❌ Restart required: Socket permissions
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:
     • HealthChecker updates check intervals and thresholds
     • RateLimitStore updates rate limiting rules
     • CircuitBreaker updates failure thresholds and timeouts
     • BackendPool updates backend configuration
  8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
  9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"
Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)

# Before
backends:
  - name: "ollama"
    url: "http://localhost:11434"

# After
backends:
  - name: "ollama"
    url: "http://localhost:11434"
  - name: "lmstudio"
    url: "http://localhost:1234"
Result: New backend added to pool with immediate health check triggered. The new backend becomes available within 1-2 seconds (instead of waiting up to 30 seconds for the next periodic health check). Existing requests continue to current backends. New requests can route to lmstudio once health check passes.

Example 2b: Removing a Backend (Graceful Draining)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
Result: Backend "lmstudio" enters draining state. New requests are not routed to it, but existing in-flight requests (including streaming) continue until completion. After all references are released (or after 5 minutes timeout), the backend is fully removed from memory.

Backend State Lifecycle

When a backend is removed from configuration, it goes through a graceful shutdown process:

  1. Active → Draining: Backend is marked as draining. New requests skip this backend.
  2. In-flight Completion: Existing requests/streams continue uninterrupted.
  3. Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.

This ensures zero impact on ongoing connections during configuration changes.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"
Result: Warning logged. Change does not take effect. Restart required to bind to new port.

Distributed Tracing

Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.

Configuration

tracing:
  enabled: true                         # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true               # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID (default)
    request_id: "X-Request-ID"          # Header name for request ID (default)
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID (default)

How It Works

  1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:
     • W3C traceparent header (if W3C support enabled)
     • Configured trace_id header (X-Trace-ID)
     • Configured request_id header (X-Request-ID)
     • Configured correlation_id header (X-Correlation-ID)

  2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.

  3. Header Propagation: The trace ID is propagated to backend services via multiple headers:
     • X-Request-ID: For broad compatibility
     • X-Trace-ID: Primary trace identifier
     • X-Correlation-ID: For correlation tracking
     • traceparent: W3C Trace Context (if enabled)
     • tracestate: W3C Trace State (if present in original request)

  4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.

Structured Logging

When tracing is enabled, all log messages include the trace_id field:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}

W3C Trace Context

When w3c_trace_context is enabled, the router supports the W3C Trace Context standard:

  • Incoming: Parses traceparent header (format: 00-{trace_id}-{span_id}-{flags})
  • Outgoing: Generates new traceparent header with preserved trace ID and new span ID
  • State: Forwards tracestate header if present in original request

Example traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Disabling Tracing

To disable distributed tracing:

tracing:
  enabled: false

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)
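A sketch of health-aware selection for the weighted and random strategies (illustrative; round_robin needs a shared counter across requests, so it is omitted here, and the field names follow the config above):

```python
import random

def pick_backend(backends: list[dict], strategy: str = "weighted") -> dict:
    """Select a healthy backend according to the configured strategy."""
    healthy = [b for b in backends if b.get("healthy", True)]  # health_aware
    if strategy == "weighted":
        weights = [b.get("weight", 1) for b in healthy]
        return random.choices(healthy, weights=weights, k=1)[0]
    return random.choice(healthy)  # random strategy

pool = [
    {"name": "ollama", "weight": 2},
    {"name": "lmstudio", "weight": 1, "healthy": False},
]
assert pick_backend(pool)["name"] == "ollama"  # only healthy backend remains
```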

Per-Backend Retry Configuration

backends:
  - name: "slow-backend"
    url: "http://slow.example.com"
    retry_override:               # Override global retry settings
      max_attempts: 5             # More attempts for slower backends
      base_delay: "500ms"         # Longer delays
      max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Pre-Stream vs. Mid-Stream Fallback

The router provides two independent fallback mechanisms:

| Mechanism | When it activates | Config section | Default |
|-----------|-------------------|----------------|---------|
| Pre-stream fallback | Before or at the start of a response: connection errors, timeouts, trigger error codes, unhealthy backend at routing time | fallback | Enabled when fallback.enabled: true |
| Mid-stream fallback | After streaming has started and the backend fails mid-response | fallback + streaming.mid_stream_fallback | Activates when fallback.enabled: true and a fallback chain is configured. Continuation mode is enabled by default. |

When fallback.enabled: true and a fallback chain is configured for the requested model, mid-stream connection drops are suppressed and the router transparently switches to the next backend — even if streaming.mid_stream_fallback.enabled is false.

streaming.mid_stream_fallback.enabled controls continuation behavior only: whether the fallback backend receives a continuation prompt (using accumulated partial response) or a full restart of the original request. The default is true (continuation mode), which provides seamless output for the client. Setting it to false forces restart mode, which may cause duplicate or incoherent content if partial output was already sent to the client.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

| Condition | Description |
|-----------|-------------|
| error_codes | HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504) |
| timeout | Request timeout |
| connection_error | TCP connection failures |
| model_not_found | Model not available on the backend |
| circuit_breaker_open | Backend circuit breaker is open |

Response Headers

When fallback is used, the following headers are added to the response:

| Header | Description | Example |
|--------|-------------|---------|
| X-Fallback-Used | Indicates fallback was used | true |
| X-Original-Model | Originally requested model | gpt-4o |
| X-Fallback-Model | Model that served the request | gpt-4-turbo |
| X-Fallback-Reason | Why fallback was triggered | error_code_429 |
| X-Fallback-Attempts | Number of fallback attempts | 2 |

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

| OpenAI Parameter | Anthropic Parameter | Notes |
|------------------|---------------------|-------|
| max_tokens | max_tokens | Auto-filled if missing (required by Anthropic) |
| temperature | temperature | Direct mapping |
| top_p | top_p | Direct mapping |
| stop | stop_sequences | Array conversion |

Provider-specific parameters are automatically removed or converted during cross-provider fallback.
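A sketch of the translation for the parameters in the table above (illustrative; the exact list of dropped OpenAI-only fields and the default max_tokens value are assumptions, not the router's actual values):

```python
def translate_openai_to_anthropic(params: dict) -> dict:
    """Map OpenAI-style request parameters to Anthropic equivalents."""
    out = dict(params)
    if "stop" in out:
        # OpenAI `stop` may be a string or a list; Anthropic requires a list
        stop = out.pop("stop")
        out["stop_sequences"] = stop if isinstance(stop, list) else [stop]
    # max_tokens is required by Anthropic; the default here is an assumption
    out.setdefault("max_tokens", 4096)
    # Drop OpenAI-only fields (assumed list for illustration)
    for key in ("frequency_penalty", "presence_penalty", "logit_bias", "n"):
        out.pop(key, None)
    return out

req = {"temperature": 0.7, "stop": "END", "presence_penalty": 0.5}
translated = translate_openai_to_anthropic(req)
assert translated["stop_sequences"] == ["END"]
assert translated["max_tokens"] == 4096
assert "presence_penalty" not in translated
```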

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers

# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Mid-Stream Fallback

Mid-stream fallback allows the router to transparently continue an active SSE stream on a fallback backend when the primary backend fails mid-response. The client's connection remains open and sees a seamless response with only a brief pause during the switchover.

Mid-stream fallback activates automatically when fallback.enabled: true and a fallback chain is configured for the requested model. The streaming.mid_stream_fallback section controls how the fallback backend is invoked (continuation vs restart mode), not whether fallback happens.

Configuration

fallback:
  enabled: true  # Required: enables mid-stream fallback path
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

streaming:
  mid_stream_fallback:
    # Enable continuation mode (default: true).
    # When true, accumulated partial response is used to build a continuation prompt,
    # producing seamless output for the client.
    # When false, the fallback backend restarts the request from scratch, which may
    # cause duplicate or incoherent content if partial output was already sent.
    enabled: true

    # Minimum estimated tokens accumulated before using continuation mode (default: 50)
    # Below this threshold the request is restarted from scratch on the fallback backend
    # instead of appending a continuation prompt.
    min_accumulated_tokens: 50

    # Maximum fallback attempts per streaming request (default: 2, max: 10)
    max_fallback_attempts: 2

    # Prompt appended as a user message after the partial assistant response
    continuation_prompt: "Continue from where you left off exactly. Do not repeat any previously generated content."

How It Works

  1. The client sends a streaming chat completion request.
  2. The router begins streaming from the primary backend, accumulating response content.
  3. If the backend fails mid-stream (connection drop, timeout, error event):

    • The error is NOT forwarded to the client.
    • The accumulated partial response is captured.
    • The next healthy backend in the fallback chain is selected (unhealthy backends are skipped).
    • A continuation or restart request is sent to the fallback backend.
    • Streaming resumes on the fallback backend without closing the client connection.
  4. The client receives a seamless response with only a brief pause during the switchover.

Continuation vs. Restart Mode

The min_accumulated_tokens threshold controls which recovery mode is used:

| Condition | Mode | Behavior |
|-----------|------|----------|
| enabled: true (default), tokens ≥ min_accumulated_tokens, not truncated | Continuation | Original messages + partial assistant response + continuation prompt |
| enabled: true (default), tokens < min_accumulated_tokens | Restart | Original request replayed (not enough context to continue) |
| enabled: true (default), content truncated (> 100 KB) | Restart | Forced restart to avoid incoherent context |
| mid_stream_fallback.enabled: false | Restart | Original request replayed on the fallback backend from scratch |

Continuation mode (the default) produces seamless output for the client. Restart mode is used automatically when there is too little context to continue meaningfully, or when the accumulated response is too long to include safely. Explicitly setting enabled: false forces restart mode unconditionally, which may cause duplicate or incoherent content visible to the client.
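The decision logic in the table reduces to a small function (illustrative; the threshold default follows the documented configuration):

```python
def recovery_mode(enabled: bool, tokens: int, truncated: bool,
                  min_tokens: int = 50) -> str:
    """Decide how to resume a failed stream on the fallback backend."""
    if not enabled:
        return "restart"       # explicit opt-out forces restart mode
    if truncated:
        return "restart"       # >100 KB accumulated: context unsafe to replay
    if tokens < min_tokens:
        return "restart"       # too little context to continue meaningfully
    return "continuation"

assert recovery_mode(True, 120, False) == "continuation"
assert recovery_mode(True, 10, False) == "restart"
assert recovery_mode(True, 500, True) == "restart"
assert recovery_mode(False, 500, False) == "restart"
```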

Edge Case Handling

The mid-stream fallback path addresses several edge cases automatically:

  • Global timeout budget: All fallback attempts share the original request start time. Each attempt checks remaining budget before sending, preventing indefinite timeout accumulation across the chain.
  • Cross-provider parameter translation: When the fallback model is on a different provider (e.g., OpenAI → Anthropic), request parameters are automatically translated — provider-specific fields removed and parameter names mapped.
  • Concurrent request storms: A global semaphore (50 permits) limits simultaneous fallback attempts. Requests that cannot acquire a permit within 5 seconds are rejected gracefully.
  • Accumulator truncation: When accumulated response content exceeds 100 KB, the continuation mode is forced to restart to avoid sending incoherent context to the fallback backend.
  • Health re-check: Backend health is re-verified before each fallback attempt in the chain. Unhealthy backends are skipped to the next entry.
  • Missing [DONE] marker: Streams ending without [DONE] but with finish_reason: "stop" are treated as completed successfully, preventing unnecessary fallback.

Metrics

Three Prometheus metrics track mid-stream fallback activity. See Mid-Stream Fallback Metrics for details.

Minimizing Failover Latency

When a backend goes down during streaming, the time until the fallback backend takes over depends on several configuration parameters across different subsystems. Below is a tuning guide for minimizing this switchover delay.

How failover delay is composed

The total time a client waits during a mid-stream failover is roughly:

failover_delay ≈ failure_detection_time + health_recheck_time + fallback_connection_time

Each component maps to specific configuration:

| Component | What determines it | Default | Tuning target |
|-----------|--------------------|---------|---------------|
| Failure detection | Stream inactivity timeout (hardcoded 60 s), TCP read error (immediate), or chunk_interval timeout | 30–60 s | Lower chunk_interval |
| Health re-check | Health check before each fallback attempt | timeout: 5s | Keep low |
| Fallback connection | TCP connect + TLS handshake to the fallback backend | connection: 10s | Lower connection |

# 1. Timeouts — the most impactful settings for failover speed
timeouts:
  connection: 5s               # Faster TCP connect timeout (default: 10s)
  request:
    streaming:
      first_byte: 30s          # How long to wait for the first token (default: 60s)
      chunk_interval: 10s      # Max silence between chunks before treating as failure (default: 30s)
      total: 600s              # Total streaming budget (keep generous)

# 2. Health checks — detect backend failures proactively
health_checks:
  interval: 10s                # Check every 10s instead of 30s (default: 30s)
  timeout: 3s                  # Fail health checks faster (default: 5s)
  unhealthy_threshold: 2       # Mark unhealthy after 2 failures (default: 3)
  healthy_threshold: 1         # Recover after 1 success (default: 2)
  warmup_check_interval: 1s   # Fast checks during backend startup

# 3. Circuit breaker — stop routing to a failed backend immediately
circuit_breaker:
  enabled: true
  failure_threshold: 3         # Open circuit after 3 failures (default: 5)
  timeout: 30s                 # Try recovery after 30s (default: 60s)
  half_open_max_requests: 2
  half_open_success_threshold: 1
  timeout_as_failure: true     # Count timeouts toward circuit breaker

# 4. Fallback chain — must be configured for mid-stream fallback to activate
fallback:
  enabled: true
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"
  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      circuit_breaker_open: true

# 5. Mid-stream fallback — continuation mode (default: enabled)
streaming:
  mid_stream_fallback:
    enabled: true              # Use continuation mode (default)
    max_fallback_attempts: 3   # Allow more retries for resilience (default: 2)
    min_accumulated_tokens: 30 # Lower threshold for continuation vs restart (default: 50)
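To make the role of `chunk_interval` concrete, here is a minimal stall-detection sketch. It is not the router's actual implementation; it assumes chunks arrive on a queue and uses a `None` sentinel for end-of-stream.

```python
import queue

def relay_with_watchdog(source_q: queue.Queue, chunk_interval: float):
    """Yield chunks from a queue; treat silence longer than
    chunk_interval as a stalled stream and signal fallback."""
    while True:
        try:
            chunk = source_q.get(timeout=chunk_interval)
        except queue.Empty:
            # No chunk within the inactivity window: trigger fallback.
            raise TimeoutError("no chunk within chunk_interval")
        if chunk is None:        # sentinel: stream finished normally
            return
        yield chunk
```

Lowering `chunk_interval` shrinks the `get(timeout=...)` window, which is exactly why it is the most impactful failover setting: a stalled backend is abandoned sooner.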

Parameter impact summary

| Parameter | Effect on failover speed | Trade-off |
|---|---|---|
| `timeouts.request.streaming.chunk_interval` | High — directly controls how quickly a stalled stream is detected | Too low may cause false positives on slow models (e.g., reasoning models with long thinking phases) |
| `timeouts.connection` | Medium — limits TCP connect delay to fallback backend | Too low may fail on high-latency networks |
| `health_checks.interval` | Medium — faster detection means the circuit breaker opens sooner, preventing requests from reaching a dead backend | More frequent checks increase backend load |
| `health_checks.unhealthy_threshold` | Medium — fewer failures needed to mark backend unhealthy | Lower values increase sensitivity to transient errors |
| `circuit_breaker.failure_threshold` | Medium — fewer failures to open circuit | Too aggressive may open circuit on temporary spikes |
| `circuit_breaker.timeout` | Low — affects recovery time, not failover speed | Shorter means faster recovery but more probing of unhealthy backends |
| `mid_stream_fallback.max_fallback_attempts` | Low — more attempts increase resilience but not speed of individual switchover | More attempts consume more of the global timeout budget |

Failure detection scenarios

Different failure types are detected at different speeds:

| Failure type | Detection time | Mechanism |
|---|---|---|
| TCP connection reset / backend crash | Immediate (< 1 s) | Stream read error triggers instant fallback |
| Backend returns 5xx error | Immediate (< 1 s) | HTTP status check before streaming begins |
| Backend becomes unresponsive (stall) | `chunk_interval` (default 30 s) | Inactivity timeout on the stream |
| Backend sends error SSE events | After 5 errors | Error count threshold in stream processing |
| Backend process killed mid-response | Immediate (< 1 s) | TCP FIN/RST detected as stream read error |

The most common scenario in production — a backend becoming unresponsive — is governed by chunk_interval. For latency-sensitive applications, lowering this to 10–15 seconds is recommended, with model-specific overrides for slow models:

timeouts:
  request:
    streaming:
      chunk_interval: 10s      # Fast detection for most models
    model_overrides:
      gemini-2.5-pro:          # Reasoning models need longer intervals
        streaming:
          chunk_interval: 30s
          first_byte: 120s
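The override lookup amounts to a simple fallback chain. The helper below is a sketch, assuming the config has been loaded as nested dictionaries; names mirror the YAML above.

```python
def effective_streaming_timeout(config: dict, model: str, key: str) -> str:
    """Return the model-specific streaming timeout when an override
    exists, otherwise the global streaming default."""
    overrides = config.get("model_overrides", {})
    model_cfg = overrides.get(model, {}).get("streaming", {})
    if key in model_cfg:
        return model_cfg[key]
    return config["streaming"][key]

cfg = {
    "streaming": {"chunk_interval": "10s", "first_byte": "30s"},
    "model_overrides": {
        "gemini-2.5-pro": {
            "streaming": {"chunk_interval": "30s", "first_byte": "120s"},
        },
    },
}
```

With this shape, `gemini-2.5-pro` resolves to the relaxed 30 s interval while every other model falls back to the fast 10 s default.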

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available
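The identification priority can be sketched as follows. This is a hypothetical helper, not the router's code; it assumes lower-cased header names and uses the first 16 characters of a bearer token, as noted above.

```python
def identify_client(headers: dict) -> str:
    """Resolve a rate-limit client ID using the documented priority:
    API key, then proxy IP headers, then a shared fallback bucket."""
    auth = headers.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[7:23]                  # first 16 chars of the token
    for header in ("x-forwarded-for", "x-real-ip"):
        if header in headers:
            # X-Forwarded-For may list several hops; take the client's.
            return headers[header].split(",")[0].strip()
    return "unknown"
```

Requests with no identifier all share the `"unknown"` bucket, so anonymous traffic competes for a single quota rather than bypassing the limiter.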

How It Works

  1. Client Identification: Each request is associated with a client using:
     - API key from Authorization: Bearer <token> header (preferred)
     - IP address from proxy headers (fallback)

  2. Dual-Window Approach:
     - Sustained limit: Prevents excessive usage over time
     - Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:
     - Client A with API key abc123...: 100 req/min
     - Client B with API key def456...: 100 req/min
     - Client C from IP 192.168.1.1: 100 req/min
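A minimal sketch of the dual-window check (illustrative only; the router's internal implementation may differ). Both windows slide over per-client request timestamps.

```python
import time
from collections import deque

class DualWindowLimiter:
    """Sliding-window limiter with a sustained and a burst window."""

    def __init__(self, sustained_limit=100, window=60.0,
                 burst_limit=20, burst_window=5.0):
        self.sustained_limit, self.window = sustained_limit, window
        self.burst_limit, self.burst_window = burst_limit, burst_window
        self.hits = {}                           # client -> deque of timestamps

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(client, deque())
        while q and now - q[0] > self.window:    # drop expired timestamps
            q.popleft()
        recent = sum(1 for t in q if now - t <= self.burst_window)
        if len(q) >= self.sustained_limit or recent >= self.burst_limit:
            return False                         # reject with 429
        q.append(now)
        return True
```

Because each client has its own deque, quotas stay independent: one client exhausting its budget never affects another's.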

Response Headers

When rate limited, the response includes:

  - Status Code: 429 Too Many Requests
  - Error Message: Indicates whether the burst or sustained limit was exceeded

Cache TTL Optimization

To prevent cache poisoning attacks:

  - Empty model lists: Cached for 5 seconds only
  - Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.
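The TTL choice reduces to a one-line rule; the helper name below is hypothetical.

```python
def cache_ttl(models: list) -> int:
    """Short TTL for empty model lists so a backend outage cannot
    poison the cache for long; normal TTL otherwise."""
    return 5 if not models else 60
```

An attacker who triggers an empty response during an outage only poisons the cache for 5 seconds, after which the router re-queries the backends.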

Monitoring

Rate limit violations are tracked in metrics:

  - rate_limit_violations: Total rejected requests
  - empty_responses_returned: Empty model lists served
  - Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  - Configurable rate limits via YAML/environment variables
  - Per-endpoint rate limiting
  - Custom rate limits per API key
  - Redis-backed distributed rate limiting

Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
  - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
  - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
  - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers
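The `${VAR}` placeholders can be resolved when the config is loaded. The sketch below uses Python's standard `string.Template`; the router's own substitution mechanism may differ.

```python
import os
import string

def substitute_env(raw: str, env=None) -> str:
    """Replace ${VAR} placeholders in a config string with values
    from the environment, leaving unknown variables untouched."""
    env = dict(os.environ) if env is None else env
    return string.Template(raw).safe_substitute(env)
```

Using `safe_substitute` (rather than `substitute`) means a missing variable such as an unset `LOG_LEVEL` is left in place instead of raising an error, which makes misconfiguration visible in the loaded config rather than crashing startup.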