Advanced Configuration¶
Global Prompts¶
Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.
Basic Configuration¶
```yaml
global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"
```
External Prompt Files¶
For complex prompts, you can load content from external Markdown files. This provides:

- Better editing experience with syntax highlighting
- Version control without config file noise
- Hot-reload support for prompt updates
```yaml
global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend
```
Prompt Resolution Priority¶
When determining which prompt to use for a request:
1. Model-specific prompt (highest priority): `global_prompts.models.<model-id>`
2. Backend-specific prompt: `global_prompts.backends.<backend-name>`
3. Default prompt: `global_prompts.default` or `global_prompts.default_file`

At each level, if both the inline prompt and `prompt_file` are specified, `prompt_file` takes precedence.
Merge Strategies¶
| Strategy | Behavior |
|---|---|
| `prepend` | Global prompt added before user's system prompt (default) |
| `append` | Global prompt added after user's system prompt |
| `replace` | Global prompt replaces user's system prompt entirely |
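The three strategies can be sketched in Python. This is an illustrative model of the merge logic, not the router's actual code; the `merge_prompts` helper and its signature are hypothetical:

```python
def merge_prompts(global_prompt: str, user_prompt: str,
                  strategy: str = "prepend",
                  separator: str = "\n\n---\n\n") -> str:
    """Illustrative model of global-prompt merging (hypothetical helper)."""
    if not user_prompt:
        # No user system prompt: the global prompt stands alone
        return global_prompt
    if strategy == "prepend":
        return global_prompt + separator + user_prompt
    if strategy == "append":
        return user_prompt + separator + global_prompt
    if strategy == "replace":
        return global_prompt
    raise ValueError(f"unknown merge strategy: {strategy}")

print(merge_prompts("Follow policy.", "Be terse.", "prepend", " | "))
# Follow policy. | Be terse.
```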
REST API Management¶
Prompt files can be managed at runtime via the Admin API:
```bash
# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload
```
See Admin REST API Reference for complete API documentation.
Security Considerations¶
- Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
- File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
- Relative Paths Only: Prompt files must be within the configured `prompts_dir` or config directory
- Sandboxed Access: Files outside the allowed directory are rejected
Hot Reload¶
Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.
Model Metadata¶
Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in `/v1/models` API responses and can be used by clients to make informed model selection decisions.
Metadata Sources¶
Model metadata can be configured in three ways (in priority order):
1. Backend-specific `model_configs` (highest priority)
2. External metadata file (`model-metadata.yaml`)
3. No metadata (models work without metadata)
External Metadata File¶
Create a `model-metadata.yaml` file:

```yaml
models:
  - id: "gpt-4"
    aliases: # Alternative IDs that share this metadata
      - "gpt-4-0125-preview"
      - "gpt-4-turbo-preview"
      - "gpt-4-vision-preview"
    metadata:
      display_name: "GPT-4"
      summary: "Most capable GPT-4 model for complex tasks"
      capabilities: ["text", "image", "function_calling"]
      knowledge_cutoff: "2024-04"
      pricing:
        input_tokens: 0.03  # Per 1000 tokens
        output_tokens: 0.06 # Per 1000 tokens
      limits:
        context_window: 128000
        max_output: 4096

  - id: "llama-3-70b"
    aliases: # Different quantizations of the same model
      - "llama-3-70b-instruct"
      - "llama-3-70b-chat"
      - "llama-3-70b-q4"
      - "llama-3-70b-q8"
    metadata:
      display_name: "Llama 3 70B"
      summary: "Open-source model with strong performance"
      capabilities: ["text", "code"]
      knowledge_cutoff: "2023-12"
      pricing:
        input_tokens: 0.001
        output_tokens: 0.002
      limits:
        context_window: 8192
        max_output: 2048
```
Reference it in your config:
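A minimal sketch of the reference, assuming the top-level key is named `model_metadata_file` (a hypothetical name; verify against the configuration reference):

```yaml
# Hypothetical key name, shown for illustration only
model_metadata_file: "./model-metadata.yaml"
```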
Thinking Pattern Configuration¶
Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.
Pattern Types:
| Pattern | Description | Example Model |
|---|---|---|
| `none` | No thinking pattern (default) | Most models |
| `standard` | Explicit start/end tags (`<think>...</think>`) | Custom reasoning models |
| `unterminated_start` | No start tag, only end tag | nemotron-3-nano |
Configuration Example:
```yaml
models:
  - id: nemotron-3-nano
    metadata:
      display_name: "Nemotron 3 Nano"
      capabilities: ["chat", "reasoning"]
      # Thinking pattern configuration
      thinking:
        pattern: unterminated_start
        end_marker: "</think>"
        assume_reasoning_first: true
```
Thinking Pattern Fields:
| Field | Type | Description |
|---|---|---|
| `pattern` | string | Pattern type: `none`, `standard`, or `unterminated_start` |
| `start_marker` | string | Start marker for the `standard` pattern (e.g., `<think>`) |
| `end_marker` | string | End marker (e.g., `</think>`) |
| `assume_reasoning_first` | boolean | If true, treat first tokens as reasoning until the end marker |
How It Works:
When a model has a thinking pattern configured:
- Streaming responses are intercepted and transformed
- Content before `end_marker` is sent as the `reasoning_content` field
- Content after `end_marker` is sent as the `content` field
- The output follows OpenAI's `reasoning_content` format for compatibility
Example Output:
```json
// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}
```
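The split for the `unterminated_start` pattern can be sketched as follows. This is an illustrative model over accumulated text, not the router's streaming code; the `split_thinking` helper is hypothetical:

```python
def split_thinking(text: str, end_marker: str = "</think>",
                   assume_reasoning_first: bool = True):
    """Split accumulated output into (reasoning_content, content).

    Models with the unterminated_start pattern emit reasoning first,
    terminated only by the end marker (no start tag).
    """
    if assume_reasoning_first and end_marker in text:
        reasoning, _, content = text.partition(end_marker)
        return reasoning, content
    if assume_reasoning_first:
        # No end marker seen yet: everything so far is reasoning
        return text, ""
    return "", text

print(split_thinking("Let me analyze...</think>The answer is 42."))
# ('Let me analyze...', 'The answer is 42.')
```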
Namespace-Aware Matching¶
The router intelligently handles model IDs with namespace prefixes. For example:
- Backend returns: `"custom/gpt-4"`, `"openai/gpt-4"`, `"optimized/gpt-4"`
- Metadata defined for: `"gpt-4"`
- Result: All variants match and receive the same metadata
This allows different backends to use their own naming conventions while sharing common metadata definitions.
Metadata Priority and Alias Resolution¶
When looking up metadata for a model, the router uses the following priority chain:
1. Exact model ID match
2. Exact alias match
3. Date suffix normalization (automatic, zero-config)
4. Wildcard pattern alias match
5. Base model name fallback (namespace stripping)
Within each source (backend config, metadata file, built-in), the same priority applies:
1. Backend-specific `model_configs` (highest priority)
2. External metadata file (second priority)
3. Built-in metadata (for OpenAI and Gemini backends)
Automatic Date Suffix Handling¶
LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:
Supported date patterns:

- `YYYYMMDD` (e.g., `claude-opus-4-5-20251130`)
- `YYYY-MM-DD` (e.g., `gpt-4o-2024-08-06`)
- `YYMM` (e.g., `o1-mini-2409`)
- `@YYYYMMDD` (e.g., `model@20251130`)
How it works:
```text
Request: claude-opus-4-5-20251215
    ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101 (existing metadata entry)
    ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata
```
This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.
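The normalization step can be sketched with a regular expression over the four documented suffix shapes. This is an illustrative model, not the router's code; `strip_date_suffix` is a hypothetical helper, and a bare 4-digit suffix (`YYMM`) is inherently ambiguous with numeric version suffixes, so the real implementation presumably applies extra validation:

```python
import re

# Date-suffix shapes from the docs: @YYYYMMDD, -YYYY-MM-DD, -YYYYMMDD, -YYMM
DATE_SUFFIX = re.compile(r"(?:@\d{8}|-\d{4}-\d{2}-\d{2}|-\d{8}|-\d{4})$")

def strip_date_suffix(model_id: str) -> str:
    """Return the base model name with any trailing date suffix removed."""
    return DATE_SUFFIX.sub("", model_id)

print(strip_date_suffix("claude-opus-4-5-20251215"))  # claude-opus-4-5
print(strip_date_suffix("gpt-4o-2024-08-06"))         # gpt-4o
print(strip_date_suffix("o1-mini-2409"))              # o1-mini
print(strip_date_suffix("model@20251130"))            # model
```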
Wildcard Pattern Matching¶
Aliases support glob-style wildcard patterns using the `*` character:

- Prefix matching: `claude-*` matches `claude-opus`, `claude-sonnet`, etc.
- Suffix matching: `*-preview` matches `gpt-4o-preview`, `o1-preview`, etc.
- Infix matching: `gpt-*-turbo` matches `gpt-4-turbo`, `gpt-3.5-turbo`, etc.
Example configuration with wildcard patterns:
```yaml
models:
  - id: "claude-opus-4-5-20251101"
    aliases:
      - "claude-opus-4-5" # Exact match for base name
      - "claude-opus-*"   # Wildcard for any claude-opus variant
    metadata:
      display_name: "Claude Opus 4.5"
    # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

  - id: "gpt-4o"
    aliases:
      - "gpt-4o-*-preview" # Matches preview versions
      - "*-4o-turbo"       # Suffix matching
    metadata:
      display_name: "GPT-4o"
```
Priority note: Exact aliases are always matched before wildcard patterns, ensuring predictable behavior when both could match.
Using Aliases for Model Variants¶
Aliases are particularly useful for:
- Different quantizations: `qwen3-32b-i1`, `qwen3-23b-i4` → all use `qwen3` metadata
- Version variations: `gpt-4-0125-preview`, `gpt-4-turbo` → share `gpt-4` metadata
- Deployment variations: `llama-3-70b-instruct`, `llama-3-70b-chat` → same base model
- Dated versions: `claude-3-5-sonnet-20241022`, `claude-3-5-sonnet-20241201` → share metadata (automatic with date suffix handling)
Example configuration with aliases:
```yaml
model_configs:
  - id: "qwen3"
    aliases:
      - "qwen3-32b-i1" # 32B with 1-bit quantization
      - "qwen3-23b-i4" # 23B with 4-bit quantization
      - "qwen3-16b-q8" # 16B with 8-bit quantization
      - "qwen3-*"      # Wildcard for any other qwen3 variant
    metadata:
      display_name: "Qwen 3"
      summary: "Alibaba's Qwen model family"
      # ... rest of metadata
```
API Response¶
The `/v1/models` endpoint returns enriched model information:

```json
{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}
```
Hot Reload¶
Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.
Configuration Item Classification¶
Configuration items are classified into three categories based on their hot reload capability:
Immediate Update (No Service Interruption)¶
These settings update immediately without any service disruption:
```yaml
# Logging configuration
logging:
  level: "info"  # ✅ Immediate: Log level changes apply instantly
  format: "json" # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10 # ✅ Immediate: New limits apply immediately
      burst_capacity: 20      # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true        # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5 # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60  # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3           # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"       # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"        # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md" # ✅ Immediate: File-based prompts also hot-reload

# Admin statistics
admin:
  stats:
    retention_window: "24h" # ✅ Immediate: Retention window updates instantly
    token_tracking: true    # ✅ Immediate: Token tracking toggle applies immediately
```
Gradual Update (Existing Connections Maintained)¶
These settings apply to new connections while maintaining existing ones:
```yaml
# Backend configuration
backends:
  - name: "ollama" # ✅ Gradual: New requests use updated backend pool
    url: "http://localhost:11434"
    weight: 2            # ✅ Gradual: Load balancing updates for new requests
    models: ["llama3.2"] # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"        # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"         # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3 # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2   # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s" # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s" # ✅ Gradual: Applies to new requests
      total: "180s"     # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s" # ✅ Gradual: New streams use updated settings
```
Requires Restart (Hot Reload Not Possible)¶
These settings require a server restart to take effect. Changes are logged as warnings:
```yaml
server:
  bind_address: "0.0.0.0:8080" # ❌ Restart required: TCP/Unix socket binding
  # bind_address:              # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660 # ❌ Restart required: Socket permissions
  workers: 4         # ❌ Restart required: Worker thread pool size
```
When these settings are changed, the router will log a warning like:
```text
WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect
```
Hot Reload Process¶
1. File System Watcher - Detects configuration file changes automatically
2. Configuration Loading - New configuration is loaded and parsed
3. Validation - New configuration is validated against the schema
4. Change Detection - ConfigDiff computation identifies what changed
5. Classification - Changes are classified (immediate/gradual/restart)
6. Atomic Update - The validated configuration is applied atomically
7. Component Propagation - Updates are propagated to affected components:
   - HealthChecker updates check intervals and thresholds
   - RateLimitStore updates rate limiting rules
   - CircuitBreaker updates failure thresholds and timeouts
   - BackendPool updates backend configuration
8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained
Checking Hot Reload Status¶
Use the admin API to check hot reload status and capabilities:
```bash
# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config
```
Hot Reload Behavior Examples¶
Example 1: Changing Log Level (Immediate)
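As an illustrative before/after, using the `logging.level` key from the immediate-update list above:

```yaml
# Before
logging:
  level: "info"

# After (applied on file save; no restart)
logging:
  level: "debug"
```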
Result: Log level changes immediately. No restart needed. Ongoing requests continue; new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)
```yaml
# Before
backends:
  - name: "ollama"
    url: "http://localhost:11434"

# After
backends:
  - name: "ollama"
    url: "http://localhost:11434"
  - name: "lmstudio"
    url: "http://localhost:1234"
```
Example 2b: Removing a Backend (Graceful Draining)
```yaml
# Before
backends:
  - name: "ollama"
    url: "http://localhost:11434"
  - name: "lmstudio"
    url: "http://localhost:1234"

# After
backends:
  - name: "ollama"
    url: "http://localhost:11434"
```
Backend State Lifecycle¶
When a backend is removed from configuration, it goes through a graceful shutdown process:
- Active → Draining: Backend is marked as draining. New requests skip this backend.
- In-flight Completion: Existing requests/streams continue uninterrupted.
- Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.
This ensures zero impact on ongoing connections during configuration changes.
Example 3: Changing Bind Address (Requires Restart)
Result: Warning logged. Change does not take effect. Restart required to bind to the new port.

Distributed Tracing¶
Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.
Configuration¶
```yaml
tracing:
  enabled: true           # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"             # Header name for trace ID (default)
    request_id: "X-Request-ID"         # Header name for request ID (default)
    correlation_id: "X-Correlation-ID" # Header name for correlation ID (default)
```
How It Works¶
1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:
   - W3C `traceparent` header (if W3C support enabled)
   - Configured `trace_id` header (`X-Trace-ID`)
   - Configured `request_id` header (`X-Request-ID`)
   - Configured `correlation_id` header (`X-Correlation-ID`)
2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.
3. Header Propagation: The trace ID is propagated to backend services via multiple headers:
   - `X-Request-ID`: For broad compatibility
   - `X-Trace-ID`: Primary trace identifier
   - `X-Correlation-ID`: For correlation tracking
   - `traceparent`: W3C Trace Context (if enabled)
   - `tracestate`: W3C Trace State (if present in original request)
4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.
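The extraction order can be sketched as follows. This is an illustrative model, not the router's code; the `extract_trace_id` helper and the plain header-dict interface are assumptions:

```python
import uuid

def extract_trace_id(headers: dict[str, str],
                     w3c_enabled: bool = True) -> str:
    """Pick a trace ID using the documented priority order."""
    tp = headers.get("traceparent", "")
    if w3c_enabled and tp:
        # traceparent format: 00-{trace_id}-{span_id}-{flags}
        parts = tp.split("-")
        if len(parts) == 4 and len(parts[1]) == 32:
            return parts[1]
    for name in ("X-Trace-ID", "X-Request-ID", "X-Correlation-ID"):
        if headers.get(name):
            return headers[name]
    # Nothing supplied: generate a fresh ID
    return uuid.uuid4().hex

print(extract_trace_id({"X-Request-ID": "req-123"}))  # req-123
```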
Structured Logging¶
When tracing is enabled, all log messages include the `trace_id` field:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}
```
W3C Trace Context¶
When `w3c_trace_context` is enabled, the router supports the W3C Trace Context standard:

- Incoming: Parses the `traceparent` header (format: `00-{trace_id}-{span_id}-{flags}`)
- Outgoing: Generates a new `traceparent` header with the preserved trace ID and a new span ID
- State: Forwards the `tracestate` header if present in the original request

Example `traceparent`: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`
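The outgoing-header behavior (same trace ID, fresh span ID) can be sketched per the W3C format. Illustrative only; `propagate_traceparent` is a hypothetical helper:

```python
import re
import secrets

TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def propagate_traceparent(incoming: str) -> str:
    """Keep the trace ID, mint a new span ID, preserve the flags."""
    m = TRACEPARENT.match(incoming)
    if not m:
        raise ValueError("malformed traceparent")
    trace_id, _old_span, flags = m.groups()
    new_span = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{new_span}-{flags}"

out = propagate_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(out)  # same trace ID, fresh span ID, flags preserved
```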
Disabling Tracing¶
To disable distributed tracing:
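Using the `tracing.enabled` flag documented above:

```yaml
tracing:
  enabled: false
```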
Load Balancing Strategies¶
```yaml
load_balancer:
  strategy: "round_robin" # round_robin, weighted, random
  health_aware: true      # Only use healthy backends
```
Strategies:
- `round_robin`: Equal distribution across backends
- `weighted`: Distribution based on backend weights
- `random`: Random selection (good for avoiding patterns)
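A health-aware weighted pick can be sketched as follows. Illustrative only; the backend data shape and `pick_weighted` helper are assumptions, not the router's internals:

```python
import itertools
import random

BACKENDS = [
    {"name": "primary", "weight": 3, "healthy": True},
    {"name": "secondary", "weight": 1, "healthy": True},
    {"name": "down", "weight": 5, "healthy": False},
]

def pick_weighted(backends: list[dict], health_aware: bool = True) -> dict:
    """Weighted random selection over (optionally) healthy backends."""
    pool = [b for b in backends if b["healthy"]] if health_aware else backends
    if not pool:
        raise RuntimeError("no healthy backends")
    # random.choices performs weighted sampling
    return random.choices(pool, weights=[b["weight"] for b in pool])[0]

# Round-robin, by contrast, is just a cycling iterator over the pool:
rr = itertools.cycle([b["name"] for b in BACKENDS if b["healthy"]])
print(next(rr), next(rr), next(rr))  # primary secondary primary
```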
Per-Backend Retry Configuration¶
```yaml
backends:
  - name: "slow-backend"
    url: "http://slow.example.com"
    retry_override:       # Override global retry settings
      max_attempts: 5     # More attempts for slower backends
      base_delay: "500ms" # Longer delays
      max_delay: "60s"
```
Model Fallback¶
Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.
Pre-Stream vs. Mid-Stream Fallback¶
The router provides two independent fallback mechanisms:
| Mechanism | When it activates | Config section | Default |
|---|---|---|---|
| Pre-stream fallback | Before or at the start of a response: connection errors, timeouts, trigger error codes, unhealthy backend at routing time | fallback | Enabled when fallback.enabled: true |
| Mid-stream fallback | After streaming has started and the backend fails mid-response | fallback + streaming.mid_stream_fallback | Activates when fallback.enabled: true and a fallback chain is configured. Continuation mode is enabled by default. |
When fallback.enabled: true and a fallback chain is configured for the requested model, mid-stream connection drops are suppressed and the router transparently switches to the next backend — even if streaming.mid_stream_fallback.enabled is false.
streaming.mid_stream_fallback.enabled controls continuation behavior only: whether the fallback backend receives a continuation prompt (using accumulated partial response) or a full restart of the original request. The default is true (continuation mode), which provides seamless output for the client. Setting it to false forces restart mode, which may cause duplicate or incoherent content if partial output was already sent to the client.
Configuration¶
```yaml
fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"
    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"
    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o" # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true
    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true
```
Trigger Conditions¶
| Condition | Description |
|---|---|
| `error_codes` | HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504) |
| `timeout` | Request timeout |
| `connection_error` | TCP connection failures |
| `model_not_found` | Model not available on backend |
| `circuit_breaker_open` | Backend circuit breaker is open |
Response Headers¶
When fallback is used, the following headers are added to the response:
| Header | Description | Example |
|---|---|---|
| `X-Fallback-Used` | Indicates fallback was used | `true` |
| `X-Original-Model` | Originally requested model | `gpt-4o` |
| `X-Fallback-Model` | Model that served the request | `gpt-4-turbo` |
| `X-Fallback-Reason` | Why fallback was triggered | `error_code_429` |
| `X-Fallback-Attempts` | Number of fallback attempts | `2` |
Cross-Provider Parameter Translation¶
When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:
| OpenAI Parameter | Anthropic Parameter | Notes |
|---|---|---|
| `max_tokens` | `max_tokens` | Auto-filled if missing (required by Anthropic) |
| `temperature` | `temperature` | Direct mapping |
| `top_p` | `top_p` | Direct mapping |
| `stop` | `stop_sequences` | Array conversion |
Provider-specific parameters are automatically removed or converted during cross-provider fallback.
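The mapping in the table above can be sketched in Python. The default `max_tokens` value and the function name are assumptions for illustration, not the router's actual behavior:

```python
# The default max_tokens value below is an assumption, not the router's default
DEFAULT_MAX_TOKENS = 4096

def translate_openai_to_anthropic(params: dict) -> dict:
    """Apply the parameter mapping from the table above (illustrative)."""
    out = {}
    for key, value in params.items():
        if key == "stop":
            # Anthropic expects a list of stop sequences
            out["stop_sequences"] = value if isinstance(value, list) else [value]
        elif key in ("max_tokens", "temperature", "top_p"):
            out[key] = value
        # Other provider-specific parameters are dropped
    out.setdefault("max_tokens", DEFAULT_MAX_TOKENS)  # required by Anthropic
    return out

print(translate_openai_to_anthropic(
    {"temperature": 0.7, "stop": "END", "logit_bias": {"50256": -100}}))
# {'temperature': 0.7, 'stop_sequences': ['END'], 'max_tokens': 4096}
```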
Integration with Circuit Breaker¶
The fallback system works in conjunction with the circuit breaker:
- Circuit Breaker detects failures and opens when threshold is exceeded
- Fallback chain activates when circuit breaker is open
- Requests route to fallback models based on configured chains
- Circuit breaker tests recovery and closes when backend recovers
```yaml
# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true # Link to circuit breaker
```
Mid-Stream Fallback¶
Mid-stream fallback allows the router to transparently continue an active SSE stream on a fallback backend when the primary backend fails mid-response. The client's connection remains open and sees a seamless response with only a brief pause during the switchover.
Mid-stream fallback activates automatically when fallback.enabled: true and a fallback chain is configured for the requested model. The streaming.mid_stream_fallback section controls how the fallback backend is invoked (continuation vs restart mode), not whether fallback happens.
Configuration¶
```yaml
fallback:
  enabled: true # Required: enables mid-stream fallback path
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

streaming:
  mid_stream_fallback:
    # Enable continuation mode (default: true).
    # When true, the accumulated partial response is used to build a continuation
    # prompt, producing seamless output for the client.
    # When false, the fallback backend restarts the request from scratch, which may
    # cause duplicate or incoherent content if partial output was already sent.
    enabled: true

    # Minimum estimated tokens accumulated before using continuation mode (default: 50).
    # Below this threshold the request is restarted from scratch on the fallback
    # backend instead of appending a continuation prompt.
    min_accumulated_tokens: 50

    # Maximum fallback attempts per streaming request (default: 2, max: 10)
    max_fallback_attempts: 2

    # Prompt appended as a user message after the partial assistant response
    continuation_prompt: "Continue from where you left off exactly. Do not repeat any previously generated content."
```
How It Works¶
1. The client sends a streaming chat completion request.
2. The router begins streaming from the primary backend, accumulating response content.
3. If the backend fails mid-stream (connection drop, timeout, error event):
   - The error is NOT forwarded to the client.
   - The accumulated partial response is captured.
   - The next healthy backend in the fallback chain is selected (unhealthy backends are skipped).
   - A continuation or restart request is sent to the fallback backend.
   - Streaming resumes on the fallback backend without closing the client connection.
4. The client receives a seamless response with only a brief pause during the switchover.
Continuation vs. Restart Mode¶
The `min_accumulated_tokens` threshold controls which recovery mode is used:

| Condition | Mode | Behavior |
|---|---|---|
| `enabled: true` (default), tokens ≥ `min_accumulated_tokens`, not truncated | Continuation | Original messages + partial assistant response + continuation prompt |
| `enabled: true` (default), tokens < `min_accumulated_tokens` | Restart | Original request replayed (not enough context to continue) |
| `enabled: true` (default), content truncated (> 100 KB) | Restart | Forced restart to avoid incoherent context |
| `mid_stream_fallback.enabled: false` | Restart | Original request replayed on fallback backend from scratch |

Continuation mode (the default) produces seamless output for the client. Restart mode is used automatically when there is too little context to continue meaningfully, or when the accumulated response is too long to include safely. Explicitly setting `enabled: false` forces restart mode unconditionally, which may cause duplicate or incoherent content visible to the client.
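The decision table above can be sketched as a single function. Illustrative only; the function name and the token/byte accounting are assumptions:

```python
TRUNCATION_LIMIT_BYTES = 100 * 1024  # 100 KB, per the docs

def choose_recovery_mode(enabled: bool, accumulated_tokens: int,
                         accumulated_bytes: int,
                         min_accumulated_tokens: int = 50) -> str:
    """Return 'continuation' or 'restart' per the decision table above."""
    if not enabled:
        return "restart"  # explicit opt-out forces restart unconditionally
    if accumulated_bytes > TRUNCATION_LIMIT_BYTES:
        return "restart"  # too much context to replay safely
    if accumulated_tokens < min_accumulated_tokens:
        return "restart"  # not enough context to continue meaningfully
    return "continuation"

print(choose_recovery_mode(True, 120, 4_000))   # continuation
print(choose_recovery_mode(True, 10, 400))      # restart
print(choose_recovery_mode(False, 120, 4_000))  # restart
```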
Edge Case Handling¶
The mid-stream fallback path addresses several edge cases automatically:
- Global timeout budget: All fallback attempts share the original request start time. Each attempt checks the remaining budget before sending, preventing indefinite timeout accumulation across the chain.
- Cross-provider parameter translation: When the fallback model is on a different provider (e.g., OpenAI → Anthropic), request parameters are automatically translated: provider-specific fields removed and parameter names mapped.
- Concurrent request storms: A global semaphore (50 permits) limits simultaneous fallback attempts. Requests that cannot acquire a permit within 5 seconds are rejected gracefully.
- Accumulator truncation: When accumulated response content exceeds 100 KB, continuation mode is forced to restart to avoid sending incoherent context to the fallback backend.
- Health re-check: Backend health is re-verified before each fallback attempt in the chain. Unhealthy backends are skipped to the next entry.
- Missing `[DONE]` marker: Streams ending without `[DONE]` but with `finish_reason: "stop"` are treated as completed successfully, preventing unnecessary fallback.
Metrics¶
Three Prometheus metrics track mid-stream fallback activity. See Mid-Stream Fallback Metrics for details.
Minimizing Failover Latency¶
When a backend goes down during streaming, the time until the fallback backend takes over depends on several configuration parameters across different subsystems. Below is a tuning guide for minimizing this switchover delay.
How failover delay is composed¶
The total time a client waits during a mid-stream failover is roughly:
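As a rough decomposition, the switchover delay is the sum of the components in the table below:

```text
total switchover delay ≈ failure detection time
                       + health re-check time
                       + fallback connection (TCP + TLS) time
```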
Each component maps to specific configuration:
| Component | What determines it | Default | Tuning target |
|---|---|---|---|
| Failure detection | Stream inactivity timeout (hardcoded 60 s) or TCP read error (immediate) or `chunk_interval` timeout | 30–60 s | Lower `chunk_interval` |
| Health re-check | Health check before fallback attempt | `timeout: 5s` | Keep low |
| Fallback connection | TCP connect + TLS handshake to fallback backend | `connection: 10s` | Lower `connection` |
Recommended configuration for fast failover¶
```yaml
# 1. Timeouts — the most impactful settings for failover speed
timeouts:
  connection: 5s # Faster TCP connect timeout (default: 10s)
  request:
    streaming:
      first_byte: 30s     # How long to wait for the first token (default: 60s)
      chunk_interval: 10s # Max silence between chunks before treating as failure (default: 30s)
      total: 600s         # Total streaming budget (keep generous)

# 2. Health checks — detect backend failures proactively
health_checks:
  interval: 10s             # Check every 10s instead of 30s (default: 30s)
  timeout: 3s               # Fail health checks faster (default: 5s)
  unhealthy_threshold: 2    # Mark unhealthy after 2 failures (default: 3)
  healthy_threshold: 1      # Recover after 1 success (default: 2)
  warmup_check_interval: 1s # Fast checks during backend startup

# 3. Circuit breaker — stop routing to a failed backend immediately
circuit_breaker:
  enabled: true
  failure_threshold: 3 # Open circuit after 3 failures (default: 5)
  timeout: 30s         # Try recovery after 30s (default: 60s)
  half_open_max_requests: 2
  half_open_success_threshold: 1
  timeout_as_failure: true # Count timeouts toward circuit breaker

# 4. Fallback chain — must be configured for mid-stream fallback to activate
fallback:
  enabled: true
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"
  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      circuit_breaker_open: true

# 5. Mid-stream fallback — continuation mode (default: enabled)
streaming:
  mid_stream_fallback:
    enabled: true              # Use continuation mode (default)
    max_fallback_attempts: 3   # Allow more retries for resilience (default: 2)
    min_accumulated_tokens: 30 # Lower threshold for continuation vs restart (default: 50)
```
Parameter impact summary¶
| Parameter | Effect on failover speed | Trade-off |
|---|---|---|
| `timeouts.request.streaming.chunk_interval` | High: directly controls how quickly a stalled stream is detected | Too low may cause false positives on slow models (e.g., reasoning models with long thinking phases) |
| `timeouts.connection` | Medium: limits TCP connect delay to the fallback backend | Too low may fail on high-latency networks |
| `health_checks.interval` | Medium: faster detection means the circuit breaker opens sooner, preventing requests from reaching a dead backend | More frequent checks increase backend load |
| `health_checks.unhealthy_threshold` | Medium: fewer failures needed to mark a backend unhealthy | Lower values increase sensitivity to transient errors |
| `circuit_breaker.failure_threshold` | Medium: fewer failures to open the circuit | Too aggressive may open the circuit on temporary spikes |
| `circuit_breaker.timeout` | Low: affects recovery time, not failover speed | Shorter means faster recovery but more probing of unhealthy backends |
| `mid_stream_fallback.max_fallback_attempts` | Low: more attempts increase resilience but not the speed of an individual switchover | More attempts consume more of the global timeout budget |
Failure detection scenarios¶
Different failure types are detected at different speeds:
| Failure type | Detection time | Mechanism |
|---|---|---|
| TCP connection reset / backend crash | Immediate (< 1 s) | Stream read error triggers instant fallback |
| Backend returns 5xx error | Immediate (< 1 s) | HTTP status check before streaming begins |
| Backend becomes unresponsive (stall) | `chunk_interval` (default 30 s) | Inactivity timeout on the stream |
| Backend sends error SSE events | After 5 errors | Error count threshold in stream processing |
| Backend process killed mid-response | Immediate (< 1 s) | TCP FIN/RST detected as stream read error |
The most common scenario in production, a backend becoming unresponsive, is governed by `chunk_interval`. For latency-sensitive applications, lowering this to 10–15 seconds is recommended, with model-specific overrides for slow models:
```yaml
timeouts:
  request:
    streaming:
      chunk_interval: 10s # Fast detection for most models
    model_overrides:
      gemini-2.5-pro: # Reasoning models need longer intervals
        streaming:
          chunk_interval: 30s
          first_byte: 120s
```
Rate Limiting¶
Continuum Router includes built-in rate limiting for the `/v1/models` endpoint to prevent abuse and ensure fair resource allocation.
Current Configuration¶
Rate limiting is currently configured with the following default values:
```yaml
# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100 # Maximum requests per minute
    burst_limit: 20      # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s # Sliding window for sustained limit
    burst_window: 5s     # Window for burst detection

    # Client identification priority
    identification:
      - api_key         # Bearer token (first 16 chars used as ID)
      - x_forwarded_for # Proxy/load balancer header
      - x_real_ip       # Alternative IP header
      - fallback: "unknown" # When no identifier available
```
How It Works¶
1. Client Identification: Each request is associated with a client using:
   - API key from the `Authorization: Bearer <token>` header (preferred)
   - IP address from proxy headers (fallback)
2. Dual-Window Approach:
   - Sustained limit: Prevents excessive usage over time
   - Burst protection: Catches rapid-fire requests
3. Independent Quotas: Each client has separate rate limits:
   - Client A with API key `abc123...`: 100 req/min
   - Client B with API key `def456...`: 100 req/min
   - Client C from IP `192.168.1.1`: 100 req/min
Response Headers¶
When rate limited, the response includes:

- Status Code: 429 Too Many Requests
- Error Message: Indicates whether the burst or sustained limit was exceeded
Cache TTL Optimization¶
To prevent cache poisoning attacks:

- Empty model lists: Cached for 5 seconds only
- Normal responses: Cached for 60 seconds
This prevents attackers from forcing the router to cache empty responses during backend outages.
Monitoring¶
Rate limit violations are tracked in metrics:

- `rate_limit_violations`: Total rejected requests
- `empty_responses_returned`: Empty model lists served
- Per-client violation tracking for identifying problematic clients
Future Enhancements¶
Future versions may support:

- Configurable rate limits via YAML/environment variables
- Per-endpoint rate limiting
- Custom rate limits per API key
- Redis-backed distributed rate limiting
Environment-Specific Configurations¶
Development Configuration¶
```yaml
# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
  - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s" # More frequent checks
  timeout: "5s"

logging:
  level: "debug"   # Verbose logging
  format: "pretty" # Human-readable
  enable_colors: true
```
Production Configuration¶
```yaml
# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                # More workers for production
  connection_pool_size: 300 # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"        # Less frequent checks
  timeout: "15s"         # Longer timeout for network latency
  unhealthy_threshold: 5 # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s" # Shorter timeout for production
  max_retries: 5  # More retries

logging:
  level: "warn"  # Less verbose logging
  format: "json" # Structured logging
```
Container Configuration¶
```yaml
# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0 # Auto-detect based on container limits

backends:
  - name: "backend-1"
    url: "${BACKEND_1_URL}" # Environment variable substitution
  - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}" # Configurable via environment
  format: "json"        # Always JSON in containers
```