
Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.

Merge Strategies

| Strategy | Behavior |
|----------|----------|
| prepend | Global prompt added before the user's system prompt (default) |
| append | Global prompt added after the user's system prompt |
| replace | Global prompt replaces the user's system prompt entirely |

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
  - id: "gpt-4"
    aliases:                    # Alternative IDs that share this metadata
      - "gpt-4-0125-preview"
      - "gpt-4-turbo-preview"
      - "gpt-4-vision-preview"
    metadata:
      display_name: "GPT-4"
      summary: "Most capable GPT-4 model for complex tasks"
      capabilities: ["text", "image", "function_calling"]
      knowledge_cutoff: "2024-04"
      pricing:
        input_tokens: 0.03   # Per 1000 tokens
        output_tokens: 0.06  # Per 1000 tokens
      limits:
        context_window: 128000
        max_output: 4096

  - id: "llama-3-70b"
    aliases:                    # Different quantizations of the same model
      - "llama-3-70b-instruct"
      - "llama-3-70b-chat"
      - "llama-3-70b-q4"
      - "llama-3-70b-q8"
    metadata:
      display_name: "Llama 3 70B"
      summary: "Open-source model with strong performance"
      capabilities: ["text", "code"]
      knowledge_cutoff: "2023-12"
      pricing:
        input_tokens: 0.001
        output_tokens: 0.002
      limits:
        context_window: 8192
        max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Thinking Pattern Configuration

Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.

Pattern Types:

| Pattern | Description | Example Model |
|---------|-------------|---------------|
| none | No thinking pattern (default) | Most models |
| standard | Explicit start/end tags (<think>...</think>) | Custom reasoning models |
| unterminated_start | No start tag, only end tag | nemotron-3-nano |

Configuration Example:

models:
  - id: nemotron-3-nano
    metadata:
      display_name: "Nemotron 3 Nano"
      capabilities: ["chat", "reasoning"]
      # Thinking pattern configuration
      thinking:
        pattern: unterminated_start
        end_marker: "</think>"
        assume_reasoning_first: true

Thinking Pattern Fields:

| Field | Type | Description |
|-------|------|-------------|
| pattern | string | Pattern type: none, standard, or unterminated_start |
| start_marker | string | Start marker for the standard pattern (e.g., <think>) |
| end_marker | string | End marker (e.g., </think>) |
| assume_reasoning_first | boolean | If true, treat first tokens as reasoning until the end marker appears |

How It Works:

When a model has a thinking pattern configured:

  1. Streaming responses are intercepted and transformed
  2. Content before end_marker is sent as reasoning_content field
  3. Content after end_marker is sent as content field
  4. The output follows OpenAI's reasoning_content format for compatibility
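The transformation can be sketched as a generator that routes streamed text into the two fields. This is an illustrative sketch of the unterminated_start case, not the router's code; it buffers just enough text to handle an end marker that is split across chunks:

```python
def split_thinking(chunks, end_marker="</think>"):
    """Route streamed text into reasoning vs. regular content for an
    unterminated_start pattern: everything before end_marker is reasoning."""
    buffer, in_reasoning = "", True
    for chunk in chunks:
        if not in_reasoning:
            yield {"content": chunk}
            continue
        buffer += chunk
        if end_marker in buffer:
            reasoning, _, rest = buffer.partition(end_marker)
            if reasoning:
                yield {"reasoning_content": reasoning}
            if rest:
                yield {"content": rest}
            in_reasoning, buffer = False, ""
        else:
            # Flush only text that cannot still be a prefix of a split end marker
            safe = len(buffer) - len(end_marker) + 1
            if safe > 0:
                yield {"reasoning_content": buffer[:safe]}
                buffer = buffer[safe:]
    if in_reasoning and buffer:
        yield {"reasoning_content": buffer}  # stream ended without end marker

deltas = list(split_thinking(["Let me think...", "</th", "ink>The answer is 42."]))
# The reasoning text may arrive split across several deltas; joined, it reads
# "Let me think...", and the content after the marker is "The answer is 42."
```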

Example Output:

// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}

Namespace-Aware Matching

The router intelligently handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.
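A sketch of namespace-aware lookup (illustrative; an exact match is tried before stripping the namespace prefix):

```python
def lookup_with_namespace(metadata: dict, model_id: str):
    """Match a possibly namespaced model ID ("custom/gpt-4") against
    metadata keyed by base name ("gpt-4")."""
    if model_id in metadata:            # exact match wins
        return metadata[model_id]
    base = model_id.rsplit("/", 1)[-1]  # strip "namespace/" prefix
    return metadata.get(base)

meta = {"gpt-4": {"display_name": "GPT-4"}}
assert lookup_with_namespace(meta, "custom/gpt-4") == {"display_name": "GPT-4"}
assert lookup_with_namespace(meta, "gpt-4") == {"display_name": "GPT-4"}
assert lookup_with_namespace(meta, "openai/gpt-5") is None
```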

Metadata Priority and Alias Resolution

When looking up metadata for a model, the router uses the following priority chain:

  1. Exact model ID match
  2. Exact alias match
  3. Date suffix normalization (automatic, zero-config)
  4. Wildcard pattern alias match
  5. Base model name fallback (namespace stripping)

This resolution chain applies within each metadata source. Across sources, lookup proceeds in the following priority order:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Built-in metadata (for OpenAI and Gemini backends)

Automatic Date Suffix Handling

LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:

Supported date patterns:

  • -YYYYMMDD (e.g., claude-opus-4-5-20251130)
  • -YYYY-MM-DD (e.g., gpt-4o-2024-08-06)
  • -YYMM (e.g., o1-mini-2409)
  • @YYYYMMDD (e.g., model@20251130)

How it works:

Request: claude-opus-4-5-20251215
         ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101  (existing metadata entry)
         ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata

This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.
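The normalization step can be approximated with a few regular expressions derived from the patterns listed above (illustrative; the router's actual matching logic may differ):

```python
import re

# One pattern per supported suffix style; $ anchors at the end of the model ID.
DATE_SUFFIXES = [
    re.compile(r"-\d{8}$"),              # -YYYYMMDD
    re.compile(r"-\d{4}-\d{2}-\d{2}$"),  # -YYYY-MM-DD
    re.compile(r"-\d{4}$"),              # -YYMM
    re.compile(r"@\d{8}$"),              # @YYYYMMDD
]

def strip_date_suffix(model_id: str) -> str:
    """Return the base model name with any recognized date suffix removed."""
    for pat in DATE_SUFFIXES:
        if pat.search(model_id):
            return pat.sub("", model_id)
    return model_id

assert strip_date_suffix("claude-opus-4-5-20251130") == "claude-opus-4-5"
assert strip_date_suffix("gpt-4o-2024-08-06") == "gpt-4o"
assert strip_date_suffix("o1-mini-2409") == "o1-mini"
assert strip_date_suffix("model@20251130") == "model"
```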

Wildcard Pattern Matching

Aliases support glob-style wildcard patterns using the * character:

  • Prefix matching: claude-* matches claude-opus, claude-sonnet, etc.
  • Suffix matching: *-preview matches gpt-4o-preview, o1-preview, etc.
  • Infix matching: gpt-*-turbo matches gpt-4-turbo, gpt-3.5-turbo, etc.
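Python's fnmatch module implements the same glob semantics, so alias matching can be sketched as follows (illustrative; checking exact aliases before wildcards mirrors the documented priority):

```python
from fnmatch import fnmatchcase

def match_alias(model_id: str, aliases: list[str]) -> bool:
    """Glob-style alias matching; exact aliases are checked before wildcards."""
    exact = [a for a in aliases if "*" not in a]
    wildcard = [a for a in aliases if "*" in a]
    return model_id in exact or any(fnmatchcase(model_id, p) for p in wildcard)

assert match_alias("claude-opus", ["claude-*"])       # prefix matching
assert match_alias("o1-preview", ["*-preview"])       # suffix matching
assert match_alias("gpt-3.5-turbo", ["gpt-*-turbo"])  # infix matching
assert not match_alias("gpt-4o", ["claude-*"])
```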

Example configuration with wildcard patterns:

models:
  - id: "claude-opus-4-5-20251101"
    aliases:
      - "claude-opus-4-5"     # Exact match for base name
      - "claude-opus-*"       # Wildcard for any claude-opus variant
    metadata:
      display_name: "Claude Opus 4.5"
      # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

  - id: "gpt-4o"
    aliases:
      - "gpt-4o-*-preview"    # Matches preview versions
      - "*-4o-turbo"          # Suffix matching
    metadata:
      display_name: "GPT-4o"
Priority note: Exact aliases are always matched before wildcard patterns, ensuring predictable behavior when both could match.

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model
  • Dated versions: claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241201 → share metadata (automatic with date suffix handling)

Example configuration with aliases:

model_configs:
  - id: "qwen3"
    aliases:
      - "qwen3-32b-i1"     # 32B with 1-bit quantization
      - "qwen3-23b-i4"     # 23B with 4-bit quantization
      - "qwen3-16b-q8"     # 16B with 8-bit quantization
      - "qwen3-*"          # Wildcard for any other qwen3 variant
    metadata:
      display_name: "Qwen 3"
      summary: "Alibaba's Qwen model family"
      # ... rest of metadata

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload

# Admin statistics
admin:
  stats:
    retention_window: "24h"        # ✅ Immediate: Retention window updates instantly
    token_tracking: true           # ✅ Immediate: Token tracking toggle applies immediately

Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
  - name: "ollama"               # ✅ Gradual: New requests use updated backend pool
    url: "http://localhost:11434"
    weight: 2                    # ✅ Gradual: Load balancing updates for new requests
    models: ["llama3.2"]         # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings

Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP/Unix socket binding
  # bind_address:                 # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660              # ❌ Restart required: Socket permissions
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:
     • HealthChecker updates check intervals and thresholds
     • RateLimitStore updates rate limiting rules
     • CircuitBreaker updates failure thresholds and timeouts
     • BackendPool updates backend configuration
  8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
  9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"
Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)

# Before
backends:
  - name: "ollama"
    url: "http://localhost:11434"

# After
backends:
  - name: "ollama"
    url: "http://localhost:11434"
  - name: "lmstudio"
    url: "http://localhost:1234"
Result: New backend added to pool with immediate health check triggered. The new backend becomes available within 1-2 seconds (instead of waiting up to 30 seconds for the next periodic health check). Existing requests continue to current backends. New requests can route to lmstudio once health check passes.

Example 2b: Removing a Backend (Graceful Draining)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
Result: Backend "lmstudio" enters draining state. New requests are not routed to it, but existing in-flight requests (including streaming) continue until completion. After all references are released (or after 5 minutes timeout), the backend is fully removed from memory.

Backend State Lifecycle

When a backend is removed from configuration, it goes through a graceful shutdown process:

  1. Active → Draining: Backend is marked as draining. New requests skip this backend.
  2. In-flight Completion: Existing requests/streams continue uninterrupted.
  3. Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.

This ensures zero impact on ongoing connections during configuration changes.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"
Result: Warning logged. Change does not take effect. Restart required to bind to new port.

Distributed Tracing

Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.

Configuration

tracing:
  enabled: true                         # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true               # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID (default)
    request_id: "X-Request-ID"          # Header name for request ID (default)
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID (default)

How It Works

  1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:
     • W3C traceparent header (if W3C support enabled)
     • Configured trace_id header (X-Trace-ID)
     • Configured request_id header (X-Request-ID)
     • Configured correlation_id header (X-Correlation-ID)

  2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.

  3. Header Propagation: The trace ID is propagated to backend services via multiple headers:
     • X-Request-ID: For broad compatibility
     • X-Trace-ID: Primary trace identifier
     • X-Correlation-ID: For correlation tracking
     • traceparent: W3C Trace Context (if enabled)
     • tracestate: W3C Trace State (if present in original request)

  4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.

Structured Logging

When tracing is enabled, all log messages include the trace_id field:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}

W3C Trace Context

When w3c_trace_context is enabled, the router supports the W3C Trace Context standard:

  • Incoming: Parses traceparent header (format: 00-{trace_id}-{span_id}-{flags})
  • Outgoing: Generates new traceparent header with preserved trace ID and new span ID
  • State: Forwards tracestate header if present in original request

Example traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Disabling Tracing

To disable distributed tracing:

tracing:
  enabled: false

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)
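A sketch of health-aware selection for the weighted and random strategies (illustrative; round_robin needs a shared counter across requests, so it is omitted here, and the field names follow the config above):

```python
import random

def pick_backend(backends: list[dict], strategy: str = "weighted") -> dict:
    """Select a healthy backend according to the configured strategy."""
    healthy = [b for b in backends if b.get("healthy", True)]  # health_aware
    if strategy == "weighted":
        weights = [b.get("weight", 1) for b in healthy]
        return random.choices(healthy, weights=weights, k=1)[0]
    return random.choice(healthy)  # random strategy

pool = [
    {"name": "ollama", "weight": 2},
    {"name": "lmstudio", "weight": 1, "healthy": False},
]
assert pick_backend(pool)["name"] == "ollama"  # only healthy backend remains
```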

Per-Backend Retry Configuration

backends:
  - name: "slow-backend"
    url: "http://slow.example.com"
    retry_override:               # Override global retry settings
      max_attempts: 5             # More attempts for slower backends
      base_delay: "500ms"         # Longer delays
      max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Pre-Stream vs. Mid-Stream Fallback

The router provides two independent fallback mechanisms:

| Mechanism | When it activates | Config section | Default |
|-----------|-------------------|----------------|---------|
| Pre-stream fallback | Before or at the start of a response: connection errors, timeouts, trigger error codes, unhealthy backend at routing time | fallback | Enabled when fallback.enabled: true |
| Mid-stream fallback | After streaming has started and the backend fails mid-response | fallback + streaming.mid_stream_fallback | Activates when fallback.enabled: true and a fallback chain is configured. Continuation mode is enabled by default. |

When fallback.enabled: true and a fallback chain is configured for the requested model, mid-stream connection drops are suppressed and the router transparently switches to the next backend — even if streaming.mid_stream_fallback.enabled is false.

streaming.mid_stream_fallback.enabled controls continuation behavior only: whether the fallback backend receives a continuation prompt (using accumulated partial response) or a full restart of the original request. The default is true (continuation mode), which provides seamless output for the client. Setting it to false forces restart mode, which may cause duplicate or incoherent content if partial output was already sent to the client.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

| Condition | Description |
|-----------|-------------|
| error_codes | HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504) |
| timeout | Request timeout |
| connection_error | TCP connection failures |
| model_not_found | Model not available on the backend |
| circuit_breaker_open | Backend circuit breaker is open |

Response Headers

When fallback is used, the following headers are added to the response:

| Header | Description | Example |
|--------|-------------|---------|
| X-Fallback-Used | Indicates fallback was used | true |
| X-Original-Model | Originally requested model | gpt-4o |
| X-Fallback-Model | Model that served the request | gpt-4-turbo |
| X-Fallback-Reason | Why fallback was triggered | error_code_429 |
| X-Fallback-Attempts | Number of fallback attempts | 2 |

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

| OpenAI Parameter | Anthropic Parameter | Notes |
|------------------|---------------------|-------|
| max_tokens | max_tokens | Auto-filled if missing (required by Anthropic) |
| temperature | temperature | Direct mapping |
| top_p | top_p | Direct mapping |
| stop | stop_sequences | Array conversion |

Provider-specific parameters are automatically removed or converted during cross-provider fallback.
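A sketch of the translation for the parameters in the table above (illustrative; the exact list of dropped OpenAI-only fields and the default max_tokens value are assumptions, not the router's actual values):

```python
def translate_openai_to_anthropic(params: dict) -> dict:
    """Map OpenAI-style request parameters to Anthropic equivalents."""
    out = dict(params)
    if "stop" in out:
        # OpenAI `stop` may be a string or a list; Anthropic requires a list
        stop = out.pop("stop")
        out["stop_sequences"] = stop if isinstance(stop, list) else [stop]
    # max_tokens is required by Anthropic; the default here is an assumption
    out.setdefault("max_tokens", 4096)
    # Drop OpenAI-only fields (assumed list for illustration)
    for key in ("frequency_penalty", "presence_penalty", "logit_bias", "n"):
        out.pop(key, None)
    return out

req = {"temperature": 0.7, "stop": "END", "presence_penalty": 0.5}
translated = translate_openai_to_anthropic(req)
assert translated["stop_sequences"] == ["END"]
assert translated["max_tokens"] == 4096
assert "presence_penalty" not in translated
```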

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers

# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Mid-Stream Fallback

Mid-stream fallback allows the router to transparently continue an active SSE stream on a fallback backend when the primary backend fails mid-response. The client's connection remains open and sees a seamless response with only a brief pause during the switchover.

Mid-stream fallback activates automatically when fallback.enabled: true and a fallback chain is configured for the requested model. The streaming.mid_stream_fallback section controls how the fallback backend is invoked (continuation vs restart mode), not whether fallback happens.

Configuration

fallback:
  enabled: true  # Required: enables mid-stream fallback path
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

streaming:
  mid_stream_fallback:
    # Enable continuation mode (default: true).
    # When true, accumulated partial response is used to build a continuation prompt,
    # producing seamless output for the client.
    # When false, the fallback backend restarts the request from scratch, which may
    # cause duplicate or incoherent content if partial output was already sent.
    enabled: true

    # Minimum estimated tokens accumulated before using continuation mode (default: 50)
    # Below this threshold the request is restarted from scratch on the fallback backend
    # instead of appending a continuation prompt.
    min_accumulated_tokens: 50

    # Maximum fallback attempts per streaming request (default: 2, max: 10)
    max_fallback_attempts: 2

    # Prompt appended as a user message after the partial assistant response
    continuation_prompt: "Continue from where you left off exactly. Do not repeat any previously generated content."

How It Works

  1. The client sends a streaming chat completion request.
  2. The router begins streaming from the primary backend, accumulating response content.
  3. If the backend fails mid-stream (connection drop, timeout, error event):

    • The error is NOT forwarded to the client.
    • The accumulated partial response is captured.
    • The next healthy backend in the fallback chain is selected (unhealthy backends are skipped).
    • A continuation or restart request is sent to the fallback backend.
    • Streaming resumes on the fallback backend without closing the client connection.
  4. The client receives a seamless response with only a brief pause during the switchover.

Continuation vs. Restart Mode

The min_accumulated_tokens threshold controls which recovery mode is used:

| Condition | Mode | Behavior |
|-----------|------|----------|
| enabled: true (default), tokens ≥ min_accumulated_tokens, not truncated | Continuation | Original messages + partial assistant response + continuation prompt |
| enabled: true (default), tokens < min_accumulated_tokens | Restart | Original request replayed (not enough context to continue) |
| enabled: true (default), content truncated (> 100 KB) | Restart | Forced restart to avoid incoherent context |
| mid_stream_fallback.enabled: false | Restart | Original request replayed on the fallback backend from scratch |

Continuation mode (the default) produces seamless output for the client. Restart mode is used automatically when there is too little context to continue meaningfully, or when the accumulated response is too long to include safely. Explicitly setting enabled: false forces restart mode unconditionally, which may cause duplicate or incoherent content visible to the client.
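The decision logic in the table reduces to a small function (illustrative; the threshold default follows the documented configuration):

```python
def recovery_mode(enabled: bool, tokens: int, truncated: bool,
                  min_tokens: int = 50) -> str:
    """Decide how to resume a failed stream on the fallback backend."""
    if not enabled:
        return "restart"       # explicit opt-out forces restart mode
    if truncated:
        return "restart"       # >100 KB accumulated: context unsafe to replay
    if tokens < min_tokens:
        return "restart"       # too little context to continue meaningfully
    return "continuation"

assert recovery_mode(True, 120, False) == "continuation"
assert recovery_mode(True, 10, False) == "restart"
assert recovery_mode(True, 500, True) == "restart"
assert recovery_mode(False, 500, False) == "restart"
```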

Edge Case Handling

The mid-stream fallback path addresses several edge cases automatically:

  • Global timeout budget: All fallback attempts share the original request start time. Each attempt checks remaining budget before sending, preventing indefinite timeout accumulation across the chain.
  • Cross-provider parameter translation: When the fallback model is on a different provider (e.g., OpenAI → Anthropic), request parameters are automatically translated — provider-specific fields removed and parameter names mapped.
  • Concurrent request storms: A global semaphore (50 permits) limits simultaneous fallback attempts. Requests that cannot acquire a permit within 5 seconds are rejected gracefully.
  • Accumulator truncation: When accumulated response content exceeds 100 KB, the continuation mode is forced to restart to avoid sending incoherent context to the fallback backend.
  • Health re-check: Backend health is re-verified before each fallback attempt in the chain. Unhealthy backends are skipped to the next entry.
  • Missing [DONE] marker: Streams ending without [DONE] but with finish_reason: "stop" are treated as completed successfully, preventing unnecessary fallback.

Metrics

Three Prometheus metrics track mid-stream fallback activity. See Mid-Stream Fallback Metrics for details.

Minimizing Failover Latency

When a backend goes down during streaming, the time until the fallback backend takes over depends on several configuration parameters across different subsystems. Below is a tuning guide for minimizing this switchover delay.

How failover delay is composed

The total time a client waits during a mid-stream failover is roughly:

failover_delay ≈ failure_detection_time + health_recheck_time + fallback_connection_time

Each component maps to specific configuration:

| Component | What determines it | Default | Tuning target |
|-----------|--------------------|---------|---------------|
| Failure detection | Stream inactivity timeout (hardcoded 60 s), TCP read error (immediate), or chunk_interval timeout | 30–60 s | Lower chunk_interval |
| Health re-check | Health check before each fallback attempt | timeout: 5s | Keep low |
| Fallback connection | TCP connect + TLS handshake to the fallback backend | connection: 10s | Lower connection |

# 1. Timeouts — the most impactful settings for failover speed
timeouts:
  connection: 5s               # Faster TCP connect timeout (default: 10s)
  request:
    streaming:
      first_byte: 30s          # How long to wait for the first token (default: 60s)
      chunk_interval: 10s      # Max silence between chunks before treating as failure (default: 30s)
      total: 600s              # Total streaming budget (keep generous)

# 2. Health checks — detect backend failures proactively
health_checks:
  interval: 10s                # Check every 10s instead of 30s (default: 30s)
  timeout: 3s                  # Fail health checks faster (default: 5s)
  unhealthy_threshold: 2       # Mark unhealthy after 2 failures (default: 3)
  healthy_threshold: 1         # Recover after 1 success (default: 2)
  warmup_check_interval: 1s   # Fast checks during backend startup

# 3. Circuit breaker — stop routing to a failed backend immediately
circuit_breaker:
  enabled: true
  failure_threshold: 3         # Open circuit after 3 failures (default: 5)
  timeout: 30s                 # Try recovery after 30s (default: 60s)
  half_open_max_requests: 2
  half_open_success_threshold: 1
  timeout_as_failure: true     # Count timeouts toward circuit breaker

# 4. Fallback chain — must be configured for mid-stream fallback to activate
fallback:
  enabled: true
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"
  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      circuit_breaker_open: true

# 5. Mid-stream fallback — continuation mode (default: enabled)
streaming:
  mid_stream_fallback:
    enabled: true              # Use continuation mode (default)
    max_fallback_attempts: 3   # Allow more retries for resilience (default: 2)
    min_accumulated_tokens: 30 # Lower threshold for continuation vs restart (default: 50)
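To make the role of `chunk_interval` concrete, here is a minimal stall-detection sketch. It is not the router's actual implementation; it assumes chunks arrive on a queue and uses a `None` sentinel for end-of-stream.

```python
import queue

def relay_with_watchdog(source_q: queue.Queue, chunk_interval: float):
    """Yield chunks from a queue; treat silence longer than
    chunk_interval as a stalled stream and signal fallback."""
    while True:
        try:
            chunk = source_q.get(timeout=chunk_interval)
        except queue.Empty:
            # No chunk within the inactivity window: trigger fallback.
            raise TimeoutError("no chunk within chunk_interval")
        if chunk is None:        # sentinel: stream finished normally
            return
        yield chunk
```

Lowering `chunk_interval` shrinks the `get(timeout=...)` window, which is exactly why it is the most impactful failover setting: a stalled backend is abandoned sooner.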

Parameter impact summary

| Parameter | Effect on failover speed | Trade-off |
|---|---|---|
| `timeouts.request.streaming.chunk_interval` | High — directly controls how quickly a stalled stream is detected | Too low may cause false positives on slow models (e.g., reasoning models with long thinking phases) |
| `timeouts.connection` | Medium — limits TCP connect delay to fallback backend | Too low may fail on high-latency networks |
| `health_checks.interval` | Medium — faster detection means the circuit breaker opens sooner, preventing requests from reaching a dead backend | More frequent checks increase backend load |
| `health_checks.unhealthy_threshold` | Medium — fewer failures needed to mark backend unhealthy | Lower values increase sensitivity to transient errors |
| `circuit_breaker.failure_threshold` | Medium — fewer failures to open circuit | Too aggressive may open circuit on temporary spikes |
| `circuit_breaker.timeout` | Low — affects recovery time, not failover speed | Shorter means faster recovery but more probing of unhealthy backends |
| `mid_stream_fallback.max_fallback_attempts` | Low — more attempts increase resilience but not speed of individual switchover | More attempts consume more of the global timeout budget |

Failure detection scenarios

Different failure types are detected at different speeds:

| Failure type | Detection time | Mechanism |
|---|---|---|
| TCP connection reset / backend crash | Immediate (< 1 s) | Stream read error triggers instant fallback |
| Backend returns 5xx error | Immediate (< 1 s) | HTTP status check before streaming begins |
| Backend becomes unresponsive (stall) | `chunk_interval` (default 30 s) | Inactivity timeout on the stream |
| Backend sends error SSE events | After 5 errors | Error count threshold in stream processing |
| Backend process killed mid-response | Immediate (< 1 s) | TCP FIN/RST detected as stream read error |

The most common scenario in production — a backend becoming unresponsive — is governed by chunk_interval. For latency-sensitive applications, lowering this to 10–15 seconds is recommended, with model-specific overrides for slow models:

timeouts:
  request:
    streaming:
      chunk_interval: 10s      # Fast detection for most models
    model_overrides:
      gemini-2.5-pro:          # Reasoning models need longer intervals
        streaming:
          chunk_interval: 30s
          first_byte: 120s
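The override lookup amounts to a simple fallback chain. The helper below is a sketch, assuming the config has been loaded as nested dictionaries; names mirror the YAML above.

```python
def effective_streaming_timeout(config: dict, model: str, key: str) -> str:
    """Return the model-specific streaming timeout when an override
    exists, otherwise the global streaming default."""
    overrides = config.get("model_overrides", {})
    model_cfg = overrides.get(model, {}).get("streaming", {})
    if key in model_cfg:
        return model_cfg[key]
    return config["streaming"][key]

cfg = {
    "streaming": {"chunk_interval": "10s", "first_byte": "30s"},
    "model_overrides": {
        "gemini-2.5-pro": {
            "streaming": {"chunk_interval": "30s", "first_byte": "120s"},
        },
    },
}
```

With this shape, `gemini-2.5-pro` resolves to the relaxed 30 s interval while every other model falls back to the fast 10 s default.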

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available
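The identification priority can be sketched as follows. This is a hypothetical helper, not the router's code; it assumes lower-cased header names and uses the first 16 characters of a bearer token, as noted above.

```python
def identify_client(headers: dict) -> str:
    """Resolve a rate-limit client ID using the documented priority:
    API key, then proxy IP headers, then a shared fallback bucket."""
    auth = headers.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[7:23]                  # first 16 chars of the token
    for header in ("x-forwarded-for", "x-real-ip"):
        if header in headers:
            # X-Forwarded-For may list several hops; take the client's.
            return headers[header].split(",")[0].strip()
    return "unknown"
```

Requests with no identifier all share the `"unknown"` bucket, so anonymous traffic competes for a single quota rather than bypassing the limiter.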

How It Works

  1. Client Identification: Each request is associated with a client using:
     - API key from Authorization: Bearer <token> header (preferred)
     - IP address from proxy headers (fallback)

  2. Dual-Window Approach:
     - Sustained limit: Prevents excessive usage over time
     - Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:
     - Client A with API key abc123...: 100 req/min
     - Client B with API key def456...: 100 req/min
     - Client C from IP 192.168.1.1: 100 req/min
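A minimal sketch of the dual-window check (illustrative only; the router's internal implementation may differ). Both windows slide over per-client request timestamps.

```python
import time
from collections import deque

class DualWindowLimiter:
    """Sliding-window limiter with a sustained and a burst window."""

    def __init__(self, sustained_limit=100, window=60.0,
                 burst_limit=20, burst_window=5.0):
        self.sustained_limit, self.window = sustained_limit, window
        self.burst_limit, self.burst_window = burst_limit, burst_window
        self.hits = {}                           # client -> deque of timestamps

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits.setdefault(client, deque())
        while q and now - q[0] > self.window:    # drop expired timestamps
            q.popleft()
        recent = sum(1 for t in q if now - t <= self.burst_window)
        if len(q) >= self.sustained_limit or recent >= self.burst_limit:
            return False                         # reject with 429
        q.append(now)
        return True
```

Because each client has its own deque, quotas stay independent: one client exhausting its budget never affects another's.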

Response Headers

When rate limited, the response includes:

  - Status Code: 429 Too Many Requests
  - Error Message: Indicates whether the burst or sustained limit was exceeded

Cache TTL Optimization

To prevent cache poisoning attacks:

  - Empty model lists: Cached for 5 seconds only
  - Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.
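The TTL choice reduces to a one-line rule; the helper name below is hypothetical.

```python
def cache_ttl(models: list) -> int:
    """Short TTL for empty model lists so a backend outage cannot
    poison the cache for long; normal TTL otherwise."""
    return 5 if not models else 60
```

An attacker who triggers an empty response during an outage only poisons the cache for 5 seconds, after which the router re-queries the backends.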

Monitoring

Rate limit violations are tracked in metrics:

  - rate_limit_violations: Total rejected requests
  - empty_responses_returned: Empty model lists served
  - Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  - Configurable rate limits via YAML/environment variables
  - Per-endpoint rate limiting
  - Custom rate limits per API key
  - Redis-backed distributed rate limiting

Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
  - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
  - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
  - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers
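The `${VAR}` placeholders can be resolved when the config is loaded. The sketch below uses Python's standard `string.Template`; the router's own substitution mechanism may differ.

```python
import os
import string

def substitute_env(raw: str, env=None) -> str:
    """Replace ${VAR} placeholders in a config string with values
    from the environment, leaving unknown variables untouched."""
    env = dict(os.environ) if env is None else env
    return string.Template(raw).safe_substitute(env)
```

Using `safe_substitute` (rather than `substitute`) means a missing variable such as an unset `LOG_LEVEL` is left in place instead of raising an error, which makes misconfiguration visible in the loaded config rather than crashing startup.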