Server & Backends
Server Section
Controls the HTTP server behavior:
server:
  bind_address: "0.0.0.0:8080" # Host and port to bind
  workers: 4 # Worker threads (0 = auto)
  connection_pool_size: 100 # HTTP connection pool size
Multiple Bind Addresses and Unix Sockets
The server supports binding to multiple addresses simultaneously, including Unix domain sockets (on Unix-like systems and Windows 10 1809+). This enables flexible deployment scenarios such as:
- Listening on both IPv4 and IPv6 addresses
- Exposing a TCP port for external clients while using a Unix socket for local services
- Running behind a reverse proxy via Unix socket for better security
Single Address (Backward Compatible):
server:
  bind_address: "0.0.0.0:8080"
Multiple Addresses:
server:
  bind_address:
    - "127.0.0.1:8080" # IPv4 localhost
    - "[::1]:8080" # IPv6 localhost
    - "0.0.0.0:9090" # All interfaces on port 9090
Unix Socket Binding (Linux, macOS, and Windows 10 1809+):
server:
  bind_address:
    - "0.0.0.0:8080" # TCP for external access
    - "unix:/var/run/continuum-router.sock" # Unix socket for local services
  socket_mode: 0o660 # Optional: file permissions for Unix sockets (octal)
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| bind_address | string or array | "0.0.0.0:8080" | Address(es) to bind. TCP format: host:port. Unix socket format: unix:/path/to/socket |
| socket_mode | integer (octal) | null | File permissions for Unix sockets (e.g., 0o660 for owner/group read-write) |
Unix Socket Notes:
- Unix socket addresses must start with the unix: prefix
- Existing socket files are automatically removed before binding
- Socket files are cleaned up on graceful shutdown
- On Windows 10 1809+ (Build 17063+), Unix sockets are fully supported via the socket2 crate
- On other non-Unix platforms, unix: addresses log a warning and are skipped
- Windows does not support Unix file permission modes; the socket_mode option is accepted but ignored
- Unix socket connections bypass IP-based authentication checks (client IP reported as "unix")
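The two address formats above can be told apart with a small classifier. The following is a minimal Python sketch, not the router's actual parser; BindTarget and parse_bind_address are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class BindTarget(NamedTuple):
    kind: str             # "tcp" or "unix"
    host: Optional[str]   # set for TCP targets
    port: Optional[int]   # set for TCP targets
    path: Optional[str]   # set for Unix socket targets


def parse_bind_address(addr: str) -> BindTarget:
    """Classify one bind_address entry as a TCP or Unix socket target."""
    if addr.startswith("unix:"):
        path = addr[len("unix:"):]
        if not path:
            raise ValueError("unix: address needs a socket path")
        return BindTarget("unix", None, None, path)
    # rpartition splits on the last colon, so "[::1]:8080" keeps its
    # bracketed IPv6 host in one piece.
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return BindTarget("tcp", host.strip("[]"), int(port), None)
```

A list-valued bind_address would simply map this function over each entry.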
Nginx Reverse Proxy Example:
upstream continuum {
    server unix:/var/run/continuum-router.sock;
}
server {
    listen 443 ssl;
    location /v1/ {
        proxy_pass http://continuum;
    }
}
Performance Tuning:
- workers: Set to 0 for auto-detection, or match CPU cores
- connection_pool_size: Increase for high-load scenarios (200-500)
CORS Configuration
CORS (Cross-Origin Resource Sharing) allows the router to accept requests from web browsers running on different origins. This is essential for embedding continuum-router in:
- Tauri apps: WebView using origins like tauri://localhost
- Electron apps: Custom protocols
- Separate web frontends: Development servers on different ports
server:
  bind_address: "0.0.0.0:8080"
  cors:
    enabled: true
    allow_origins:
      - "tauri://localhost"
      - "http://localhost:*" # Wildcard port matching
      - "https://example.com"
    allow_methods:
      - "GET"
      - "POST"
      - "PUT"
      - "DELETE"
      - "OPTIONS"
      - "PATCH"
    allow_headers:
      - "Content-Type"
      - "Authorization"
      - "X-Request-ID"
      - "X-Trace-ID"
    expose_headers:
      - "X-Request-ID"
      - "X-Fallback-Used"
    allow_credentials: false
    max_age: 3600 # Preflight cache duration in seconds
CORS Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable/disable CORS middleware |
| allow_origins | array | [] | Allowed origins (supports * for any, port wildcards like http://localhost:*) |
| allow_methods | array | ["GET", "POST", "PUT", "DELETE", "OPTIONS", "PATCH"] | Allowed HTTP methods |
| allow_headers | array | ["Content-Type", "Authorization", "X-Request-ID", "X-Trace-ID"] | Allowed request headers |
| expose_headers | array | [] | Headers exposed to the client JavaScript |
| allow_credentials | boolean | false | Allow cookies and authorization headers |
| max_age | integer | 3600 | Preflight response cache duration in seconds |
Origin Pattern Matching:
| Pattern | Example | Description |
|---|---|---|
| * | * | Matches any origin (not compatible with allow_credentials: true) |
| Exact URL | https://example.com | Exact match |
| Custom scheme | tauri://localhost | Custom protocols (Tauri, Electron) |
| Port wildcard | http://localhost:* | Matches any port on localhost |
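The pattern rules in the table can be sketched as a small matcher. This is an illustrative Python sketch rather than the router's implementation; origin_allowed is a hypothetical name, and the sketch assumes a port wildcard requires an explicit numeric port in the origin.

```python
from typing import Iterable


def origin_allowed(origin: str, allow_origins: Iterable[str]) -> bool:
    """Return True when origin matches any configured pattern."""
    for pattern in allow_origins:
        # "*" matches everything; any other plain pattern must match exactly,
        # which also covers custom schemes such as tauri://localhost.
        if pattern == "*" or pattern == origin:
            return True
        if pattern.endswith(":*"):
            prefix = pattern[:-1]  # keep the colon, e.g. "http://localhost:"
            # Assumption: the remainder must be a numeric port.
            if origin.startswith(prefix) and origin[len(prefix):].isdigit():
                return True
    return False
```

Keeping the trailing colon in the prefix means a look-alike host such as http://localhost.evil.com:80 does not match http://localhost:*.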
Security Considerations:
- Using * for origins allows any website to make requests; only use it for public APIs
- When allow_credentials is true, you cannot use * for origins; specify exact origins
- For development, use port wildcards like http://localhost:* for flexibility
- In production, always specify exact origins for security
Hot Reload: CORS configuration supports immediate hot reload - changes apply to new requests instantly without server restart.
Backends Section
Defines the LLM backends to route requests to:
backends:
  - name: "unique-identifier" # Must be unique across all backends
    type: "generic" # Backend type (optional, defaults to "generic")
    url: "http://backend:port" # Base URL for the backend
    weight: 1 # Load balancing weight (1-100)
    api_key: "${API_KEY}" # API key (optional, supports env var references)
    org_id: "${ORG_ID}" # Organization ID (optional, for OpenAI)
    models: ["model1", "model2"] # Optional: explicit model list
    retry_override: # Optional: backend-specific retry settings
      max_attempts: 5
      base_delay: "200ms"
Starting Without Backends
The router can start with an empty backends list (backends: []), which is useful for:
- Infrastructure bootstrapping: Start the router first, then add backends dynamically via the Admin API
- Container orchestration: Router container can be ready before backend services
- Development workflows: Test admin endpoints before backends are provisioned
- Gradual rollout: Start with zero backends and add them progressively
When running with no backends:
- /v1/models returns {"object": "list", "data": []}
- /v1/chat/completions and other routing endpoints return 503 "No backends available"
- /health returns healthy status (the router itself is operational)
- Backends can be added via POST /admin/backends
Example minimal configuration for dynamic backend management:
server:
  bind_address: "0.0.0.0:8080"
backends: [] # Start with no backends - add via Admin API later
admin:
  auth:
    method: bearer
    token: "${ADMIN_TOKEN}"
Backend Types Supported:
| Type | Description | Default URL |
|---|---|---|
| generic | OpenAI-compatible API (default) | Must be specified |
| openai | Native OpenAI API with built-in configuration | https://api.openai.com/v1 |
| gemini | Google Gemini API (OpenAI-compatible endpoint) | https://generativelanguage.googleapis.com/v1beta/openai |
| azure | Azure OpenAI Service | Must be specified |
| vllm | vLLM server | Must be specified |
| ollama | Ollama local server | http://localhost:11434 |
| llamacpp | llama.cpp llama-server (GGUF models) | http://localhost:8080 |
| mlxcel | MLxcel server (MLX-based, llama-server compatible, macOS only) | http://localhost:8080 |
| lmstudio | LM Studio local server | http://localhost:1234 |
| anthropic | Anthropic Claude API (native, with request/response translation) | https://api.anthropic.com |
| bedrock | Amazon Bedrock Claude (mantle + runtime; runtime requires --features bedrock-sigv4) | https://bedrock-mantle.{region}.api.aws or https://bedrock-runtime.{region}.amazonaws.com (templated) |
| continuum-router | Remote Continuum Router or Backend.AI GO instance (federated routing) | Must be specified |
Native OpenAI Backend
When using type: openai, the router provides:
- Default URL: https://api.openai.com/v1 (can be overridden for proxies)
- Built-in model metadata: Automatic pricing, context windows, and capabilities
- Environment variable support: Automatically loads from CONTINUUM_OPENAI_API_KEY and CONTINUUM_OPENAI_ORG_ID
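Minimal OpenAI configuration (the API key is loaded from CONTINUUM_OPENAI_API_KEY and models are auto-discovered):
backends:
  - name: "openai"
    type: openai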
Full OpenAI configuration with explicit API key:
backends:
  - name: "openai-primary"
    type: openai
    api_key: "${CONTINUUM_OPENAI_API_KEY}"
    org_id: "${CONTINUUM_OPENAI_ORG_ID}" # Optional
    models:
      - gpt-4o
      - gpt-4o-mini
      - o1
      - o1-mini
      - o3-mini
      - text-embedding-3-large
Using OpenAI with a proxy:
backends:
  - name: "openai-proxy"
    type: openai
    url: "https://my-proxy.example.com/v1" # Override default URL
    api_key: "${PROXY_API_KEY}"
    models:
      - gpt-4o
ChatGPT subscription / Codex headless login (OAuth device flow)
Continuum Router can authenticate against the OpenAI Codex backend
(https://chatgpt.com/backend-api/codex) using a ChatGPT Plus / Pro /
Enterprise subscription rather than a paid OpenAI API key.
The OpenAI provider does not implement standards-compliant RFC 8628 device flow; instead, the router uses OpenAI's custom Codex headless device-code flow, which is what the official Codex CLI uses for "login on headless devices." The flow has three steps:
1. Request a one-time user_code from auth.openai.com/api/accounts/deviceauth/usercode.
2. Poll auth.openai.com/api/accounts/deviceauth/token until the user approves the code in their browser.
3. Exchange the resulting authorization code for access / refresh tokens via PKCE at auth.openai.com/oauth/token.
Every request carries an originator: codex_cli_rs header so that the
Cloudflare layer in front of auth.openai.com admits the traffic.
One-time login
Run the device-flow login (continuum-router auth login --backend <name>) from any machine that can open the OpenAI verification URL in a browser. The router prints a verification URL and a short user code, then polls the token endpoint until the device is approved.
Tokens are written atomically to the configured token_store with mode
0600 on Unix. After login, the router uses these tokens automatically and
refreshes them transparently before they expire (60-second clock-skew
margin). A 401 response from the backend triggers one forced refresh and
a single retry before the error surfaces to the client.
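The refresh timing and the single 401 retry described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the router's code: needs_refresh and send_with_single_retry are hypothetical names, and send and force_refresh stand in for the real HTTP call and token refresh.

```python
import time
from typing import Callable, Optional

CLOCK_SKEW_MARGIN_SECS = 60  # margin stated in the text above


def needs_refresh(expires_at: float, now: Optional[float] = None) -> bool:
    """True once the token is within the skew margin of its expiry."""
    current = time.time() if now is None else now
    return current >= expires_at - CLOCK_SKEW_MARGIN_SECS


def send_with_single_retry(send: Callable[[], int],
                           force_refresh: Callable[[], None]) -> int:
    """Call send(); on a 401, refresh once and retry exactly once."""
    status = send()
    if status == 401:
        force_refresh()
        status = send()  # a second 401 is returned to the caller as-is
    return status
```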
Backend configuration
backends:
  - name: openai-chatgpt
    type: openai
    url: https://chatgpt.com/backend-api/codex
    auth:
      type: oauth
      oauth:
        provider: openai
        token_store: ~/.continuum-router/auth/openai.json
    # Codex backends enumerate their real models from the live endpoint (see
    # "Dynamic model enumeration" below). A non-empty list selects which
    # enumerated models to expose and is the fallback when enumeration fails or
    # returns nothing; leave it empty to expose every enumerated model.
    models:
      - gpt-5
      - codex-mini
client_id and scope default to the public Codex CLI values that
auth.openai.com accepts; you only need to override them for a custom
OAuth client registration.
Configuration reference
| Field | Required | Description |
|---|---|---|
| auth.type | yes | Must be oauth to enable device-flow authentication. |
| auth.oauth.provider | yes | OAuth provider. Currently only openai is supported. |
| auth.oauth.client_id | no | Public OAuth client ID. Defaults to the Codex CLI's public client_id, which is what auth.openai.com accepts for ChatGPT-subscription headless login. Override only if you have your own OAuth client registered with the provider. |
| auth.oauth.scope | no | Space-separated scope string requested during device authorization. Defaults to "openid profile email offline_access". |
| auth.oauth.token_store | yes | Path to the JSON token store (e.g. ~/.continuum-router/auth/openai.json). Tilde and ${ENV_VAR} are expanded. |
| auth.oauth.device_code_endpoint | no | Override the device-authorization (user-code) endpoint. Defaults to the provider's well-known URL. |
| auth.oauth.token_poll_endpoint | no | Override the token-poll endpoint used during device flow (Codex-specific; distinct from the standard token_endpoint). Defaults to the provider's well-known URL. |
| auth.oauth.token_endpoint | no | Override the token endpoint used for the PKCE exchange and refresh. Defaults to the provider's well-known URL. |
| auth.oauth.verification_url | no | Override the user-facing verification URL printed during auth login. Defaults to the provider's well-known URL (https://auth.openai.com/codex/device for openai). |
| auth.oauth.redirect_uri | no | Override the redirect URI used by the PKCE exchange. Defaults to the provider's well-known URL. |
| auth.oauth.originator | no | Override the originator request header. Defaults to codex_cli_rs for provider: openai, which auth.openai.com's Cloudflare front allowlists. Override only if your environment requires a different value. |
| auth.oauth.user_agent | no | Override the User-Agent header sent on device-flow and refresh requests. Defaults to a Codex-CLI-compatible value for provider: openai because auth.openai.com is Cloudflare-fronted and rejects reqwest's default UA with a JS challenge. |
Dynamic model enumeration
Unlike static OpenAI API backends, the ChatGPT Codex backend does not expose a standard /v1/models endpoint. Instead it serves a model list at GET <base>/models?client_version=<ver> that is gated by the account's subscription plan. The router queries this endpoint with the loaded OAuth token during model discovery (and on the normal refresh cadence) and uses the result to populate /v1/models and routing:
- The request carries Authorization: Bearer <access_token>, the originator: codex_cli_rs header, and a chatgpt-account-id header derived from the id_token.
- Only user-facing models survive: an entry is kept when its visibility is list and its available_in_plans either is empty or contains the account plan (read from the id_token's chatgpt_plan_type claim). Internal entries such as codex-auto-review (visibility hide) are never exposed.
- The configured models: selection is applied to the enumerated set, exactly like every other backend. A non-empty list intersects the enumerated models down to the operator-selected subset; an empty list exposes the full enumerated set (e.g. gpt-5.5, gpt-5.4, gpt-5.4-mini). Enumerated models carry the clean owner openai instead of the raw backend name.
- On any failure (network error, non-2xx, empty list), the router falls back to the configured models: list so routing degrades gracefully. Dynamic enumeration applies only to Codex OAuth backends; other OAuth and static-key backends are unchanged.
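The filtering, selection, and fallback rules above can be sketched in a few lines of Python. This is an illustration only: select_codex_models is a hypothetical name, and the "id" key used for the model identifier is an assumption, since the text names only the visibility and available_in_plans fields.

```python
from typing import Dict, List, Optional


def select_codex_models(entries: Optional[List[Dict]],
                        account_plan: str,
                        configured: List[str]) -> List[str]:
    """Apply visibility/plan filtering, then the configured selection."""
    enumerated = [
        entry["id"]  # hypothetical field name for the model identifier
        for entry in (entries or [])  # None models a failed enumeration
        if entry.get("visibility") == "list"
        and (not entry.get("available_in_plans")
             or account_plan in entry["available_in_plans"])
    ]
    if not enumerated:
        return list(configured)  # fallback on failure or empty result
    if configured:
        return [m for m in enumerated if m in configured]  # intersection
    return enumerated  # empty selection exposes everything enumerated
```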
Request handling
The ChatGPT Codex backend serves a single inference endpoint, <base>/responses (https://chatgpt.com/backend-api/codex/responses for the default base), which implements a stricter subset of the public OpenAI Responses API. The router converts /v1/chat/completions requests routed to a Codex OAuth backend into Responses-API form and adjusts the converted request to the shape Codex accepts.
- input is always sent as an item list; Codex rejects the bare-string shorthand that a single-message request would otherwise produce.
- max_output_tokens, temperature, top_p, presence_penalty, frequency_penalty, and stop are removed, because Codex rejects them.
- A tool_choice that forces a specific function is downgraded to "required", with a warning in the log. Codex supports only the string modes, so the model is still pushed to call a tool without naming one.
- When the request carries no system message, the instructions field is filled with a minimal default, because Codex rejects requests without instructions.
Codex accepts exactly one streaming/storage combination, stream: true with store: false, and the router forces both on every converted upstream call. A streaming client receives the converted SSE stream as usual. For a non-streaming client (stream: false), the router consumes the upstream SSE stream and folds it into a single chat-completion JSON body before responding; this detection also tolerates Codex responses that carry an SSE body without the text/event-stream content type. Because store is always false, nothing is persisted on the OpenAI side, and conversation state travels in the request as with any Chat Completions client.
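The adjustments described in this section can be sketched as a post-processing step on an already-converted Responses-API request. This is a simplified Python illustration, not the router's implementation: adjust_for_codex is a hypothetical name, DEFAULT_INSTRUCTIONS is a placeholder because the actual default text is not documented here, and treating any object-valued tool_choice as a forced function is a simplification.

```python
from typing import Any, Dict

# Parameters the text above says Codex rejects.
REJECTED_FIELDS = ("max_output_tokens", "temperature", "top_p",
                   "presence_penalty", "frequency_penalty", "stop")
# Placeholder only; the router's real default is not documented here.
DEFAULT_INSTRUCTIONS = "You are a helpful assistant."


def adjust_for_codex(request: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a converted Responses-API request into the form Codex accepts."""
    out = {k: v for k, v in request.items() if k not in REJECTED_FIELDS}
    # Bare-string input becomes a one-item list.
    if isinstance(out.get("input"), str):
        out["input"] = [{"role": "user", "content": out["input"]}]
    # An object-valued tool_choice names a tool; downgrade to the string mode.
    if isinstance(out.get("tool_choice"), dict):
        out["tool_choice"] = "required"
    if not out.get("instructions"):
        out["instructions"] = DEFAULT_INSTRUCTIONS
    # The only combination Codex accepts, regardless of what the client sent.
    out["stream"] = True
    out["store"] = False
    return out
```

Folding the upstream SSE stream back into a single JSON body for non-streaming clients is a separate step and is not shown.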
Operational notes
- Tokens never appear in logs, traces, or metrics; only a short redacted prefix is logged when a refresh occurs.
- Refreshes are single-flighted with a tokio::sync::Mutex, so concurrent in-flight requests during expiry windows do not stampede the OAuth provider.
- Static api_key configurations are unaffected; OAuth is opt-in per backend via the auth.type: oauth block.
- Re-running continuum-router auth login --backend <name> rewrites the token store atomically and is safe while the router is running.
- The token store is read leniently: expires_at accepts epoch seconds (the canonical form the router writes), a numeric string, or an RFC3339 datetime, so a token file produced by another tool loads without editing.
- The OpenAI provider's auth endpoint is Cloudflare-protected; the default user_agent and originator: codex_cli_rs header mirror the official Codex CLI so the device flow reaches the OAuth endpoint instead of the bot-challenge page. Set auth.oauth.user_agent and/or auth.oauth.originator to custom values only if your environment specifically requires them.
- OAuth is rendered as oauth in YAML; the legacy o_auth rendering produced by serde's default snake_case derivation is also accepted as an alias for backward compatibility.
Environment Variables for OpenAI
| Variable | Description |
|---|---|
| CONTINUUM_OPENAI_API_KEY | OpenAI API key (automatically loaded for type: openai backends) |
| CONTINUUM_OPENAI_ORG_ID | OpenAI Organization ID (optional) |
Model Auto-Discovery:
When models is not specified or is empty, backends automatically discover available models from their /v1/models API endpoint during initialization. This feature reduces configuration maintenance and ensures all backend-reported models are routable.
| Backend Type | Auto-Discovery Support | Fallback Models |
|---|---|---|
| openai | ✅ Yes | gpt-4o, gpt-4o-mini, o3-mini |
| gemini | ✅ Yes | gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash |
| vllm | ✅ Yes | vicuna-7b-v1.5, llama-2-7b-chat, mistral-7b-instruct |
| ollama | ✅ Yes | Uses vLLM discovery mechanism |
| llamacpp | ✅ Yes | Auto-discovers from /v1/models endpoint |
| mlxcel | ✅ Yes | Auto-discovers from /v1/models endpoint |
| lmstudio | ✅ Yes | Auto-discovers from /v1/models endpoint |
| continuum-router | ✅ Yes | Auto-discovers from remote /v1/models endpoint |
| anthropic | ❌ No (no API) | Hardcoded Claude models |
| generic | ❌ No | All models supported (supports_model() returns true) |
Discovery Behavior:
- Timeout: 10-second timeout prevents blocking startup
- Fallback: If discovery fails (timeout, network error, invalid response), fallback models are used
- Logging: Discovered models are logged at INFO level; fallback usage logged at WARN level
Model Resolution Priority:
1. Explicit models list from config (highest priority)
2. Models from model_configs field
3. Auto-discovered models from backend API
4. Hardcoded fallback models (lowest priority)
- Explicit model lists improve startup time and reduce backend queries
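The resolution order above amounts to "first non-empty source wins". A minimal Python sketch, with resolve_models as a hypothetical name and each argument standing in for one of the four sources:

```python
from typing import List, Optional, Sequence


def resolve_models(explicit: Optional[Sequence[str]],
                   model_configs: Optional[Sequence[str]],
                   discovered: Optional[Sequence[str]],
                   fallback: Optional[Sequence[str]]) -> List[str]:
    """Return the first non-empty model source, in priority order."""
    for candidate in (explicit, model_configs, discovered, fallback):
        if candidate:
            return list(candidate)
    return []
```

In this sketch an explicit list short-circuits the other sources, which is consistent with the note that explicit lists reduce backend queries.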
Native Gemini Backend
When using type: gemini, the router provides:
- Default URL: https://generativelanguage.googleapis.com/v1beta/openai (OpenAI-compatible endpoint)
- Built-in model metadata: Automatic context windows and capabilities for Gemini models
- Environment variable support: Automatically loads from CONTINUUM_GEMINI_API_KEY
- Extended streaming timeout: 300s timeout for thinking models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro)
- Automatic max_tokens adjustment: For thinking models, see below
Minimal Gemini configuration:
backends:
  - name: "gemini"
    type: gemini
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Full Gemini configuration with API Key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    weight: 2
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Gemini Authentication Methods
The Gemini backend supports two authentication methods:
API Key Authentication (Default)
The simplest authentication method using a Google AI Studio API key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    models:
      - gemini-3.1-pro-preview
Service Account Authentication
For enterprise environments and Google Cloud Platform (GCP) deployments, you can use Service Account authentication with automatic OAuth2 token management:
backends:
  - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "/path/to/service-account.json"
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
Using environment variable for key file path:
backends:
  - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "${GOOGLE_APPLICATION_CREDENTIALS}"
    models:
      - gemini-3.1-pro-preview
Service Account Authentication Features:
| Feature | Description |
|---|---|
| Automatic Token Refresh | OAuth2 tokens are automatically refreshed 5 minutes before expiration |
| Token Caching | Tokens are cached in memory to minimize authentication overhead |
| Thread-Safe | Concurrent requests safely share token refresh operations |
| Environment Variable Expansion | Key file paths support ${VAR} and ~ expansion |
Creating a Service Account Key:
- Go to Google Cloud Console
- Navigate to IAM & Admin > Service Accounts
- Create a new service account or select an existing one
- Click Keys > Add Key > Create new key
- Choose JSON format and download the key file
- Store the key file securely and reference it in your configuration
Required Permissions:
The service account needs the following roles for Gemini API access:
- roles/aiplatform.user - For Vertex AI Gemini endpoints
- Or appropriate Google AI Studio permissions for generativelanguage.googleapis.com
Authentication Priority
When multiple authentication methods are configured:
| Priority | Method | Condition |
|---|---|---|
| 1 (Highest) | auth block | If auth.type is specified |
| 2 | api_key field | If no auth block is present |
| 3 | Environment variable | Falls back to CONTINUUM_GEMINI_API_KEY |
If both api_key and auth are specified, the auth block takes precedence and a warning is logged.
Gemini Thinking Models: Automatic max_tokens Adjustment
Gemini "thinking" models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro, and models with -pro-preview suffix) perform extended reasoning before generating responses. To prevent response truncation, the router automatically adjusts max_tokens:
| Condition | Behavior |
|---|---|
| max_tokens not specified | Automatically set to 16384 |
| max_tokens < 4096 | Automatically increased to 16384 |
| max_tokens >= 4096 | Client value preserved |
This ensures thinking models can generate complete responses without truncation due to low default values from client libraries.
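The table above maps directly onto a small function. A minimal Python sketch, assuming the caller has already decided whether the target is a thinking model; adjust_max_tokens and the constant names are hypothetical.

```python
from typing import Optional

THINKING_MIN_TOKENS = 4096       # threshold from the table above
THINKING_DEFAULT_TOKENS = 16384  # value applied when missing or too low


def adjust_max_tokens(max_tokens: Optional[int],
                      is_thinking_model: bool) -> Optional[int]:
    """Raise missing or low max_tokens for thinking models only."""
    if not is_thinking_model:
        return max_tokens  # other models keep the client value untouched
    if max_tokens is None or max_tokens < THINKING_MIN_TOKENS:
        return THINKING_DEFAULT_TOKENS
    return max_tokens
```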
Environment Variables for Gemini
| Variable | Description |
|---|---|
| CONTINUUM_GEMINI_API_KEY | Google Gemini API key (automatically loaded for type: gemini backends) |
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account JSON key file (standard GCP environment variable) |
Native Anthropic Backend
When using type: anthropic, the router provides:
- Default URL: https://api.anthropic.com (can be overridden for proxies)
- Native API translation: Automatically converts OpenAI format requests to Anthropic Messages API format and vice versa
- Anthropic-specific headers: Automatically adds x-api-key and anthropic-version headers
- Environment variable support: Automatically loads from CONTINUUM_ANTHROPIC_API_KEY
- Extended streaming timeout: 600s timeout for extended thinking models (Claude Opus, Sonnet 4)
Minimal Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    models:
      - claude-sonnet-4-20250514
      - claude-3-5-haiku-20241022
Full Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    weight: 2
    anthropic_fast_mode: false # opt-in: enable fast mode for eligible models (default false)
    models:
      - claude-fable-5
      - claude-mythos-5 # limited release (Project Glasswing); needs approved access
      - claude-opus-4-8
      - claude-opus-4-7
      - claude-opus-4-6
      - claude-sonnet-4-6
      - claude-haiku-4-5
Anthropic API Translation
The router automatically handles the translation between OpenAI and Anthropic API formats:
| OpenAI Format | Anthropic Format |
|---|---|
| messages array with role: "system" | Separate system parameter |
| Authorization: Bearer <key> | x-api-key: <key> header |
| Optional max_tokens | Required max_tokens (auto-filled if missing) |
| choices[0].message.content | content[0].text |
| finish_reason: "stop" | stop_reason: "end_turn" |
| finish_reason: "content_filter" | stop_reason: "refusal" |
| usage.prompt_tokens | usage.input_tokens |
When Anthropic returns stop_reason: "refusal", the router maps it to the OpenAI-compatible finish_reason: "content_filter". The upstream stop_details object (containing the refusal category) is forwarded on the response choice under stop_details and omitted when absent.
Example Request Translation:
OpenAI format (incoming from client):
{
"model": "claude-sonnet-4-20250514",
"messages": [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello"}
],
"max_tokens": 1024
}
Anthropic format (sent to API):
{
"model": "claude-sonnet-4-20250514",
"system": "You are helpful.",
"messages": [
{"role": "user", "content": "Hello"}
],
"max_tokens": 1024
}
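The request translation shown in the two JSON bodies above can be sketched as follows. This is a simplified Python illustration for the flattening case only, not the router's translator: to_anthropic_request is a hypothetical name, the default_max_tokens value is an assumption (the text says only that the field is auto-filled), and joining multiple system messages with a blank line is also an assumption.

```python
from typing import Any, Dict

# Response-side mapping taken from the translation table above.
STOP_REASON_TO_FINISH_REASON = {"end_turn": "stop", "refusal": "content_filter"}


def to_anthropic_request(openai_req: Dict[str, Any],
                         default_max_tokens: int = 4096) -> Dict[str, Any]:
    """Move system messages to the top-level system field."""
    messages = openai_req["messages"]
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    body: Dict[str, Any] = {
        "model": openai_req["model"],
        "messages": [m for m in messages if m["role"] != "system"],
        # Anthropic requires max_tokens; the fill-in value is an assumption.
        "max_tokens": openai_req.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    return body
```

Header translation (Authorization: Bearer to x-api-key) happens at the HTTP layer and is not part of this sketch.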
Anthropic Native API Endpoints
In addition to routing OpenAI-format requests to Anthropic backends, the router also provides native Anthropic API endpoints:
| Endpoint | Description |
|---|---|
| POST /anthropic/v1/messages | Native Anthropic Messages API |
| POST /anthropic/v1/messages/count_tokens | Token counting with tiered backend support |
| GET /anthropic/v1/models | Model listing in Anthropic format |
These endpoints allow clients that use Anthropic's native API format (such as Claude Code) to connect directly without any request/response transformation overhead.
Claude Code Compatibility
The Anthropic Native API endpoints include full compatibility with Claude Code and other advanced Anthropic API clients:
Prompt Caching Support:
The router preserves cache_control fields throughout the request/response pipeline:
- System prompt text blocks
- User message content blocks (text, image, document)
- Tool definitions
- Tool use and tool result blocks
Header Forwarding:
| Header | Behavior |
|---|---|
| anthropic-version | Forwarded to native Anthropic backends |
| anthropic-beta | Forwarded to enable beta features (e.g., prompt-caching-2024-07-31, interleaved-thinking-2025-05-14) |
| x-request-id | Forwarded for request tracing |
Cache Usage Reporting:
Streaming responses from native Anthropic backends include cache usage information:
{
"usage": {
"input_tokens": 2159,
"cache_creation_input_tokens": 2048,
"cache_read_input_tokens": 0
}
}
Anthropic Extended Thinking Models
Models supporting extended thinking (Claude Opus, Sonnet 4, Claude Opus 4.7, and Claude Opus 4.8) may require longer response times. The router automatically:
- Sets higher default max_tokens (16384) for thinking models
- Uses extended streaming timeout (600s) for these models
Claude Opus 4.7/4.8 and the Mythos-class models (Fable 5, Mythos 5) require the adaptive thinking API (thinking.type == "adaptive" + output_config.effort) and reject the legacy budget_tokens shape. The router normalizes explicit legacy thinking.type == "enabled" requests for these models to adaptive thinking. These models also do not accept temperature, top_p, or top_k; the router drops these parameters automatically. Fable 5 and Mythos 5 additionally reject an explicit thinking.type == "disabled" (HTTP 400); the router omits the thinking parameter entirely for claude-fable-5-* and claude-mythos-5-* in that case.
Claude Fable 5
Claude Fable 5 (claude-fable-5-*, alias claude-fable-5-latest) is Anthropic's most capable model, a Mythos-class flagship positioned a tier above Opus 4.8. Key characteristics:
- Context window: 1M tokens (input)
- Max output: 128K tokens
- Pricing: $10 / $50 per million tokens (input / output)
- Thinking: Adaptive only (thinking.type: "adaptive" + output_config.effort). The legacy enabled + budget_tokens shape returns HTTP 400, and an explicit thinking.type: "disabled" also returns HTTP 400 (the router omits the thinking parameter instead).
- Effort: Supports low, medium, high, and max (reasoning_effort: "xhigh" maps to max).
- Sampling parameters: temperature, top_p, top_k are not accepted. The router drops them automatically before forwarding.
Claude Mythos 5 (claude-mythos-5-*, alias claude-mythos-5-latest) is the same underlying model as Fable 5 with the safety classifiers lifted, available only through Anthropic's limited Project Glasswing release. It shares every characteristic above (context window, max output, pricing, thinking, effort, and sampling-parameter handling); the router treats both ids identically.
Claude Opus 4.8
Claude Opus 4.8 (claude-opus-4-8-*, alias claude-opus-4-8-latest) is the flagship Claude 4.8 model. Key characteristics:
- Context window: 1M tokens (input)
- Max output: 128K tokens
- Pricing: $5 / $25 per million tokens (input / output)
- Knowledge cutoff: January 2026
- Thinking: Adaptive only (thinking.type: "adaptive" + output_config.effort). The legacy enabled + budget_tokens shape returns HTTP 400.
- Sampling parameters: temperature, top_p, top_k are not accepted. The router drops them automatically before forwarding.
- Effort default: high. When reasoning_effort is omitted or set to auto, Anthropic applies high effort unless output_config.effort is specified.
Anthropic Fast Mode
Fast mode reduces latency for eligible Claude models by routing requests through Anthropic's accelerated inference path. It is gated behind a per-backend opt-in configuration flag and only applies to native Anthropic backends (never Bedrock or Vertex AI).
Enabling Fast Mode
Set anthropic_fast_mode: true on the backend configuration:
backends:
  - name: "anthropic-fast"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    anthropic_fast_mode: true
    models:
      - claude-opus-4-8
      - claude-opus-4-7
      - claude-opus-4-6
When anthropic_fast_mode is enabled, the router adds the anthropic-beta: fast-mode-2026-02-01 header to eligible requests.
Eligible Models
Fast mode applies to Opus 4.6, 4.7, and 4.8 models on native Anthropic backends. Requests to Bedrock or Vertex AI backends ignore this flag even if set.
Speed Field Passthrough
Clients can also request fast mode explicitly by setting the speed field in the request body. The field is forwarded on the OpenAI-compatible path, and the response echoes the resolved speed in usage.speed ("fast" or "standard").
Pricing
Fast mode requests are billed at premium rates. Check Anthropic's current pricing for fast-mode-specific cost information.
Backend restriction
Fast mode is available on native Anthropic API backends only. Bedrock and Vertex AI backends do not support the anthropic-beta: fast-mode-2026-02-01 header.
Mid-Conversation System Messages (Claude Opus 4.8+)
Starting with Claude Opus 4.8, the Anthropic API accepts role: "system" messages at any position in the messages array, not only at the start of the conversation. The router enables this for models that support it.
Behavior by Model Version
| Model family | Mid-conversation role: "system" support |
|---|---|
| Claude Fable 5 / Mythos 5 (claude-fable-5-*, claude-mythos-5-*) | Preserved in-array at any position |
| Claude Opus 4.8+ (claude-opus-4-8-*) | Preserved in-array at any position |
| Claude 4.7 and earlier (and all Sonnet/Haiku) | Flattened: all system messages, including any after the first user turn, are merged into the top-level system field (none are preserved in-array) |
How It Works
When the router receives a messages array that contains role: "system" entries after a user turn, and the target model is Opus 4.8 or later, those entries are kept in place within the messages array. Leading system or developer messages before the first user turn still fill the top-level system field, because Anthropic requires at least one user message after the top-level system prompt.
For Claude 4.7 and earlier models, the prior behavior applies: all system messages are extracted and combined into the top-level system field. Mid-conversation system messages are merged into that top-level field as well, rather than preserved as in-array entries.
Example request with mid-conversation system message:
{
"model": "claude-opus-4-8",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris."},
{"role": "system", "content": "Now respond only in French."},
{"role": "user", "content": "And Germany?"}
]
}
For Opus 4.8+, the router sends the leading system message as the top-level system field and preserves the mid-conversation role: "system" entry in the messages array as sent to Anthropic.
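The split described above can be sketched in a few lines of Python. This is an illustrative sketch, not the router's implementation: the function names, the model-ID regular expression, and the "\n\n" separator used when merging system messages are all assumptions made for the example.

```python
import re
from typing import List, Optional, Tuple

def supports_mid_conversation_system(model: str) -> bool:
    """Hypothetical check based on the model families in the table above."""
    if model.startswith(("claude-fable-5", "claude-mythos-5")):
        return True
    # Assumes IDs of the form claude-opus-<major>-<minor>[-suffix]; a date
    # suffix such as claude-opus-4-20250514 is deliberately not read as a minor version.
    m = re.match(r"^claude-opus-(\d+)-(\d{1,2})(?:-|$)", model)
    return bool(m) and (int(m.group(1)), int(m.group(2))) >= (4, 8)

def split_system_messages(model: str, messages: List[dict]) -> Tuple[Optional[str], List[dict]]:
    """Return (top-level system text, messages array to forward)."""
    leading = []
    i = 0
    # Leading system/developer messages always fill the top-level system field.
    while i < len(messages) and messages[i]["role"] in ("system", "developer"):
        leading.append(messages[i]["content"])
        i += 1
    rest = messages[i:]
    if not supports_mid_conversation_system(model):
        # Older models: flatten every later system message into the top-level field.
        leading += [m["content"] for m in rest if m["role"] == "system"]
        rest = [m for m in rest if m["role"] != "system"]
    system = "\n\n".join(leading) if leading else None  # separator is an assumption
    return system, rest
```

With the example request above, an Opus 4.8 model keeps the "Now respond only in French." entry in the array, while an Opus 4.7 model would receive it merged into the top-level system text.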
OpenAI ↔ Claude Reasoning Parameter Conversion¶
The router automatically converts between OpenAI's reasoning parameters and Claude's thinking parameter, enabling cross-provider reasoning requests without client changes.
Supported OpenAI Formats:
| Format | API | Example |
|---|---|---|
| reasoning_effort (flat) | Chat Completions API | "reasoning_effort": "high" |
| reasoning.effort (nested) | Responses API | "reasoning": {"effort": "high"} |
When both formats are present, reasoning_effort (flat) takes precedence.
Effort Level to Budget Tokens Mapping:
| Effort Level | Claude thinking.budget_tokens |
|---|---|
| none | (thinking disabled) |
| minimal | 1,024 |
| low | 4,096 |
| medium | 10,240 |
| high | 32,768 |
Example Request - Chat Completions API (flat format):
// Client sends OpenAI Chat Completions API request
{
"model": "claude-sonnet-4-6",
"reasoning_effort": "high",
"messages": [{"role": "user", "content": "Solve this complex problem"}]
}
// Router converts to Claude format
{
"model": "claude-sonnet-4-6",
"thinking": {"type": "enabled", "budget_tokens": 32768},
"messages": [{"role": "user", "content": "Solve this complex problem"}]
}
Example Request - Responses API (nested format):
// Client sends OpenAI Responses API request
{
"model": "claude-sonnet-4-6",
"reasoning": {"effort": "medium"},
"messages": [{"role": "user", "content": "Analyze this data"}]
}
// Router converts to Claude format
{
"model": "claude-sonnet-4-6",
"thinking": {"type": "enabled", "budget_tokens": 10240},
"messages": [{"role": "user", "content": "Analyze this data"}]
}
Response with Reasoning Content:
{
"choices": [{
"message": {
"role": "assistant",
"content": "The final answer is...",
"reasoning_content": "Let me analyze this step by step..."
}
}]
}
Important Notes:
- If the thinking parameter is explicitly provided, it takes precedence over reasoning_effort and reasoning.effort
- reasoning_effort (flat) takes precedence over reasoning.effort (nested) when both are present
- Only models supporting extended thinking (Opus 4.x, Sonnet 4.x, Opus 4.7, Opus 4.8) will have reasoning enabled
- When reasoning is enabled, the temperature parameter is automatically removed (Claude API requirement)
- For Claude Opus 4.7 and 4.8, temperature, top_p, and top_k are always dropped regardless of thinking state
- For streaming responses, thinking content is returned as reasoning_content delta events
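The precedence rules and the effort-to-budget table can be condensed into a short Python sketch. The function names are hypothetical, the per-model capability check is omitted, and the choice to express "none" by omitting the thinking field is an assumption; the budget values come from the table above.

```python
from typing import Optional

# Budget values taken from the effort-level table above.
EFFORT_TO_BUDGET = {"minimal": 1024, "low": 4096, "medium": 10240, "high": 32768}

def resolve_thinking(request: dict) -> Optional[dict]:
    """Pick the Claude thinking parameter, or None when no thinking is requested."""
    if "thinking" in request:                      # explicit thinking always wins
        return request["thinking"]
    effort = request.get("reasoning_effort")       # flat format takes precedence
    if effort is None:
        effort = (request.get("reasoning") or {}).get("effort")  # nested format
    if effort is None or effort == "none":
        return None
    # An unrecognized effort level raises KeyError in this sketch.
    return {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}

def convert_request(request: dict) -> dict:
    """Return a copy of the request with OpenAI reasoning fields replaced by thinking."""
    out = {k: v for k, v in request.items()
           if k not in ("reasoning_effort", "reasoning", "thinking")}
    thinking = resolve_thinking(request)
    if thinking is not None:
        out["thinking"] = thinking
        if thinking.get("type") == "enabled":
            out.pop("temperature", None)           # removed when reasoning is enabled
    return out
```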
Environment Variables for Anthropic¶
| Variable | Description |
|---|---|
| CONTINUUM_ANTHROPIC_API_KEY | Anthropic API key (automatically loaded for type: anthropic backends) |
Amazon Bedrock Claude Backend¶
When using type: bedrock, the router routes Claude requests through Amazon Bedrock. There are two distinct entry points:
| Aspect | bedrock-mantle (Phase 1) | bedrock-runtime (Phase 2) |
|---|---|---|
| URL | https://bedrock-mantle.{region}.api.aws/anthropic/v1/messages | https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/invoke[-with-response-stream] |
| Request body | Identical to native Anthropic Messages API | Same shape, but adds "anthropic_version": "bedrock-2023-05-31" and model moves to the URL path |
| Auth | Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK | AWS SigV4 signing |
| Streaming | Standard text/event-stream | AWS binary event-stream (application/vnd.amazon.eventstream) |
| Headers | No anthropic-version, no x-api-key | Content-Type: application/json, Accept: application/json or application/vnd.amazon.eventstream |
| Cargo feature | Always available | Requires --features bedrock-sigv4 at build time |
Both modes share the models: configuration surface and the same BackendTypeConfig::Bedrock variant. The split between mantle and runtime is invisible to clients, which continue to call /v1/chat/completions or /anthropic/v1/messages while the proxy adapts the body and headers. The endpoint_type field is the only knob that picks between them, with mantle as the default.
Phase 1 Configuration¶
backends:
- name: bedrock
type: bedrock # aliases: aws-bedrock, bedrock-anthropic, AmazonBedrock
endpoint_type: mantle # default; the Phase 1 implementation
region: us-east-1 # required; templated into the URL
api_key: ${AWS_BEARER_TOKEN_BEDROCK}
weight: 2
models:
- anthropic.claude-opus-4-7
- us.anthropic.claude-sonnet-4-5
- anthropic.claude-haiku-4-5
# global.anthropic.<family> uses AWS's cheapest cross-region tier.
# eu.<family>, jp.<family>, au.<family> route within the named geography.
Region Selection¶
The router builds the upstream URL by templating region into https://bedrock-mantle.{region}.api.aws. Any non-empty lowercase region identifier is accepted because AWS adds regions regularly — us-east-1, us-west-2, eu-west-1, ap-northeast-1, and so on. Empty or uppercase region values are rejected at configuration load time.
For Bedrock-specific forward-proxy deployments, an explicit url: field on the backend overrides the region template. Most operators should leave url unset and let the region drive the URL.
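As a rough illustration of the region rules, the sketch below templates the region into the mantle base URL and rejects empty or uppercase values. The function name and the exact error messages are hypothetical; only the URL template and the validation rules come from the text above.

```python
from typing import Optional

def mantle_base_url(region: str, url_override: Optional[str] = None) -> str:
    """Build the bedrock-mantle base URL, honoring an explicit url: override."""
    if url_override:
        # An explicit url: on the backend takes priority over the region template.
        return url_override
    if not region:
        raise ValueError("region must not be empty")
    if region != region.lower():
        raise ValueError("region must be lowercase: %r" % region)
    return "https://bedrock-mantle.%s.api.aws" % region
```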
Model ID Format¶
Bedrock model identifiers come in four shapes; the router recognizes all of them and forwards them unchanged to the upstream:
| Shape | Example | Behavior |
|---|---|---|
| Plain Anthropic | anthropic.claude-opus-4-7 | Routes to the backend's configured region. |
| Geographic profile | us.anthropic.claude-sonnet-4-5, eu.anthropic.claude-opus-4-7, jp.anthropic.claude-haiku-4-5, au.anthropic.claude-opus-4-7 | AWS routes within the named geography. Pick this when data-residency commitments require keeping inference inside a region group. |
| Global profile | global.anthropic.claude-opus-4-7 | AWS picks the lowest-latency region globally. Cheapest tier for inference profiles. |
| Full ARN | arn:aws:bedrock:us-east-1:123456789012:inference-profile/... | Customer-managed inference profiles or cross-account references. |
Model IDs are listed explicitly in models: — there is no automatic alias mapping from native Anthropic IDs to Bedrock IDs. The router intentionally avoids hiding the geo-prefix decision behind a mapping table, since the prefix carries real billing and residency consequences.
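To make the four shapes concrete, here is a small classifier sketch. The function name and the returned labels are invented for illustration; the router itself forwards IDs unchanged, so a classification like this would only be useful for validation or logging.

```python
GEO_PREFIXES = ("us", "eu", "jp", "au")  # geographies named in the table above

def classify_model_id(model_id: str) -> str:
    """Return a label for the shape of a Bedrock model identifier."""
    if model_id.startswith("arn:"):
        return "full_arn"
    prefix, _, rest = model_id.partition(".")
    if prefix == "anthropic":
        return "plain"
    if prefix == "global" and rest.startswith("anthropic."):
        return "global_profile"
    if prefix in GEO_PREFIXES and rest.startswith("anthropic."):
        return "geo_profile"
    return "unknown"
```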
Supported Features¶
The bedrock-mantle path inherits everything the native Anthropic backend supports:
- Streaming SSE responses, with the Anthropic SSE → OpenAI SSE transformer reused unchanged
- System prompts (translated from OpenAI's messages[role=system] to Anthropic's separate system field)
- Tool calling and tool-result round-trips
- Vision (image inputs as base64 or URLs)
- Extended thinking on Claude 4-series models, including Opus 4.7's adaptive thinking API
Authentication¶
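api_key: holds the Bedrock Bearer token. In production configs, set it via an environment variable reference (for example api_key: ${AWS_BEARER_TOKEN_BEDROCK}, as in the Phase 1 configuration above) rather than a literal string.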
The router sends Authorization: Bearer ${AWS_BEARER_TOKEN_BEDROCK} on every request and strips any client-supplied x-api-key or anthropic-version header before forwarding — Bedrock returns HTTP 400 if those Anthropic-specific headers are present.
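The header handling can be pictured with the following sketch. The function name is hypothetical, and dropping a client-supplied Authorization header before setting the Bearer token is an assumption of this example rather than documented behavior.

```python
STRIPPED_HEADERS = {"x-api-key", "anthropic-version"}

def mantle_headers(client_headers: dict, bearer_token: str) -> dict:
    """Remove Anthropic-specific headers and attach the Bedrock Bearer token."""
    out = {
        name: value
        for name, value in client_headers.items()
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() not in STRIPPED_HEADERS and name.lower() != "authorization"
    }
    out["Authorization"] = "Bearer " + bearer_token
    return out
```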
Phase 2 Configuration (bedrock-runtime)¶
Phase 2 hits the AWS-native Invoke API at https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/invoke[-with-response-stream]. The router signs each request with SigV4 and decodes the binary application/vnd.amazon.eventstream streaming response back into OpenAI-shape SSE for clients.
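The body and URL adaptation for the runtime path might look like the sketch below. It is a simplified illustration: the function name is hypothetical, SigV4 signing is left out entirely, and percent-encoding the model ID with an empty safe set is an assumption about how ARNs are placed in the path.

```python
from urllib.parse import quote

def runtime_request(region: str, body: dict, stream: bool = False):
    """Return (url, body) for the Bedrock Invoke API from an Anthropic-shaped body."""
    body = dict(body)                      # avoid mutating the caller's dict
    model_id = body.pop("model")           # the model moves into the URL path
    body["anthropic_version"] = "bedrock-2023-05-31"
    action = "invoke-with-response-stream" if stream else "invoke"
    # Encode ':' and '/' so that full ARNs stay inside a single path segment.
    url = "https://bedrock-runtime.%s.amazonaws.com/model/%s/%s" % (
        region, quote(model_id, safe=""), action)
    return url, body
```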
Build requirement¶
The runtime path lives behind the bedrock-sigv4 Cargo feature, which pulls in the small slice of the AWS SDK needed for signing and event-stream parsing (aws-sigv4, aws-smithy-eventstream, aws-credential-types, aws-config). The default build does not include these crates. Build with the feature enabled (pass --features bedrock-sigv4 to Cargo) before running a runtime backend.
Without the feature, configuring endpoint_type: runtime returns a clear error at startup pointing at the rebuild flag. The default Phase 1 mantle path keeps working with or without the feature.
Example configuration¶
backends:
- name: bedrock-iam
type: bedrock
endpoint_type: runtime # selects the SigV4 path
region: us-east-1
weight: 1
auth:
type: sigv4 # required for runtime
# Pick at most one of the credential overrides below. When none
# is set, the standard AWS chain (env, shared config, IMDS,
# IRSA, ECS) resolves credentials.
# aws:
# profile: my-bedrock-profile
# aws:
# access_key_id: ${AWS_ACCESS_KEY_ID}
# secret_access_key: ${AWS_SECRET_ACCESS_KEY}
# session_token: ${AWS_SESSION_TOKEN}
models:
- anthropic.claude-opus-4-7
- us.anthropic.claude-sonnet-4-5
- global.anthropic.claude-haiku-4-5
# Full ARNs work too (provisioned throughput, custom inference profiles):
# - arn:aws:bedrock:us-east-1:123456789012:inference-profile/anthropic.claude-opus-4-7
Credential resolution order¶
The runtime backend resolves credentials in this order:
- Inline static credentials under auth.aws.access_key_id + auth.aws.secret_access_key. An optional session_token covers STS-issued temporary credentials.
- A named profile from ~/.aws/credentials and ~/.aws/config via auth.aws.profile.
- The standard AWS chain: environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN), shared config, IMDS (EC2), IRSA / EKS pod identity, and ECS task role.
The chain resolves on every request, but the underlying AWS providers cache credentials with their own TTLs, so the per-request cost is normally a hash-map lookup rather than a network call.
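The resolution order can be expressed as a small decision function. This sketch only decides which source would be used; it does not load any credentials, and the function name and returned labels are hypothetical. The dictionary keys mirror the auth.aws fields from the example configuration.

```python
from typing import Optional

def pick_credential_source(aws_cfg: Optional[dict]) -> str:
    """Return which credential source applies for an auth.aws block."""
    aws_cfg = aws_cfg or {}
    if aws_cfg.get("access_key_id") and aws_cfg.get("secret_access_key"):
        return "static"          # inline keys (session_token is optional)
    if aws_cfg.get("profile"):
        return "profile"         # named profile from the shared AWS files
    return "default_chain"       # env vars, shared config, IMDS, IRSA, ECS
```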
Required IAM permissions¶
Attach a policy that allows bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on the model ARNs you intend to use. A minimal example:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-opus-4-7",
"arn:aws:bedrock:us-east-1:*:inference-profile/us.anthropic.claude-sonnet-4-5"
]
}
]
}
For geo and global inference profiles, the router uses the profile ID as the URL path; AWS expands it server-side. Make sure your IAM resource list covers both the underlying foundation-model ARN and the inference-profile ARN you reference from models:.
Geo vs global profiles¶
The four model-ID shapes apply identically to the runtime path. Billing and residency consequences are the same as on mantle; the only runtime-specific difference is the IAM check above. Pick the prefix that matches your data-residency and cost requirements:
| Prefix | Where AWS routes | Billing tier |
|---|---|---|
| anthropic.<family> | The backend's configured region. | Per-region rate. |
| us.<family>, eu.<family>, jp.<family>, au.<family> | Anywhere inside the named geography. | Per-region rate; potentially lower latency than a single fixed region. |
| global.<family> | Anywhere AWS deems lowest-latency at request time. | Cheapest tier. |
| Full ARN | Whatever the inference profile resolves to. | Whatever the ARN's underlying profile bills at. |
Streaming details¶
Runtime streaming responses arrive as application/vnd.amazon.eventstream frames. Each chunk frame contains a base64-encoded JSON object that, once decoded, is one Anthropic SSE event (message_start, content_block_delta, etc.). The router translates this back into the OpenAI-shape SSE that clients expect, so applications that already work against the mantle path keep working unchanged. AWS-specific error frames (ThrottlingException, ValidationException, ...) are surfaced as synthetic event: error SSE chunks.
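To illustrate the decode step for a single chunk, the sketch below takes the payload of one already-parsed event-stream frame and emits the Anthropic event in SSE form. It assumes the payload is a JSON object whose "bytes" field carries the base64 data; binary frame parsing (prelude, headers, checksums) and the later translation to OpenAI-shape SSE are not shown, and the function name is hypothetical.

```python
import base64
import json

def chunk_to_sse(chunk_payload: bytes) -> str:
    """Decode one chunk payload into an Anthropic-style SSE event string."""
    wrapper = json.loads(chunk_payload)
    # Assumed wrapper shape: {"bytes": "<base64-encoded Anthropic event JSON>"}
    event = json.loads(base64.b64decode(wrapper["bytes"]))
    data = json.dumps(event, separators=(",", ":"))
    return "event: %s\ndata: %s\n\n" % (event["type"], data)
```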
Limitations¶
- Converse / ConverseStream APIs are not used; the runtime path deliberately scopes to the Invoke API, so multi-provider Bedrock models (Nova, Llama, Mistral, etc.) served only via Converse are not reachable.
- The inline X-Amzn-Bedrock-GuardrailIdentifier header on Invoke is not wired through this backend. To enforce a Bedrock guardrail, use the standalone bedrock_guardrail provider in the guardrails configuration, which calls the ApplyGuardrail API independently of the model invocation and therefore works for any backend.
- Bedrock Prompt management (prompt-router ARNs) is not in scope.
- Provisioned Throughput and Application Inference Profile ARNs should work via the same URL-encoding path used for other ARNs, but no automated coverage is claimed beyond Foundation Model and inference-profile ARNs.
Native llama.cpp Backend¶
When using type: llamacpp, the router provides native support for llama.cpp llama-server:
- Default URL: http://localhost:8080 (llama-server default port)
- Health Check: Uses /health endpoint (with fallback to /v1/models)
- Model Discovery: Parses llama-server's hybrid /v1/models response format
- Rich Metadata: Extracts context window, parameter count, and model size from response
Minimal llama.cpp configuration:
backends:
- name: "local-llama"
type: llamacpp
# No URL needed if using default http://localhost:8080
# No API key required for local server
Full llama.cpp configuration:
backends:
- name: "local-llama"
type: llamacpp
url: "http://192.168.1.100:8080" # Custom URL if needed
weight: 2
# Models are auto-discovered from /v1/models endpoint
llama.cpp Features¶
| Feature | Description |
|---|---|
| GGUF Models | Native support for GGUF quantized models |
| Local Inference | No cloud API dependencies |
| Hardware Support | CPU, NVIDIA, AMD, Apple Silicon |
| Streaming | Full SSE streaming support |
| Embeddings | Supports /v1/embeddings endpoint |
| Tool Calling Detection | Auto-detects tool calling support via /props endpoint |
Tool Calling Auto-Detection¶
The router automatically detects tool calling capability for llama.cpp backends by querying the /props endpoint during model discovery. This enables automatic function calling support without manual configuration.
How it works:
- When a llama.cpp backend is discovered, the router fetches the /props endpoint
- The chat_template field is analyzed using precise Jinja2 pattern matching to detect tool-related syntax
- If tool calling patterns are detected, the model's function_calling capability is automatically enabled
- Detection results are stored for reference (including a hash of the chat template)
Detection Patterns:
The router uses precise pattern matching to reduce false positives:
- Role-based patterns: message['role'] == 'tool', message.role == "tool"
- Tool iteration: for tool in tools, for function in functions
- Tool calls access: .tool_calls, ['tool_calls'], message.tool_call
- Jinja2 blocks with tool keywords: {% if tools %}, {% for tool_call in ... %}
Example /props response analyzed:
{
"chat_template": "{% for message in messages %}{% if message['role'] == 'tool' %}...",
"default_generation_settings": { ... },
"total_slots": 1
}
Fallback Behavior:
- If /props is unavailable: Tool calling is assumed to be supported (optimistic fallback for modern llama.cpp versions)
- If /props returns an error: Tool calling is assumed to be supported (ensures compatibility with newer models)
- If chat template exceeds 64KB: Detection is skipped and defaults to supported
- Detection is case-insensitive for maximum compatibility
- Results are merged with any existing model metadata from model-metadata.yaml
- Detected capabilities appear in the features field of the /v1/models/{model_id} response
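A rough Python approximation of the detection and fallback rules is shown below. The regular expressions are written for this example from the pattern list above and are not the router's actual patterns; the function name is hypothetical, and treating a missing or empty chat_template as "not detected" is an assumption.

```python
import re
from typing import Optional

MAX_TEMPLATE_BYTES = 64 * 1024  # templates above this size skip detection

# Illustrative patterns derived from the documented categories.
TOOL_PATTERNS = [
    r"""message\[['"]role['"]\]\s*==\s*['"]tool['"]""",  # message['role'] == 'tool'
    r"""message\.role\s*==\s*['"]tool['"]""",            # message.role == "tool"
    r"\bfor\s+tool\s+in\s+tools\b",
    r"\bfor\s+function\s+in\s+functions\b",
    r"\.tool_calls\b",
    r"""\[['"]tool_calls['"]\]""",
    r"\bmessage\.tool_call\b",
    r"\{%-?\s*if\s+tools\b",
    r"\{%-?\s*for\s+tool_call\s+in\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in TOOL_PATTERNS]

def detect_tool_calling(props: Optional[dict]) -> bool:
    """props is the parsed /props response, or None if the call failed."""
    if props is None:
        return True                       # optimistic fallback
    template = props.get("chat_template") or ""
    if len(template.encode("utf-8")) > MAX_TEMPLATE_BYTES:
        return True                       # detection skipped, defaults to supported
    return any(p.search(template) for p in _COMPILED)
```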
Model Metadata Extraction¶
The router extracts rich metadata from llama-server responses:
| Field | Source | Description |
|---|---|---|
| Context Window | meta.n_ctx_train | Training context window size |
| Parameter Count | meta.n_params | Model parameters (e.g., "4B") |
| Model Size | meta.size | File size in bytes |
| Capabilities | models[].capabilities | Model capabilities array |
Starting llama-server¶
# Basic startup
./llama-server -m model.gguf --port 8080
# With GPU layers
./llama-server -m model.gguf --port 8080 -ngl 35
# With custom context size
./llama-server -m model.gguf --port 8080 --ctx-size 8192
Auto-Detection of llama.cpp Backends¶
When a backend is added without a type specified (defaults to generic), the router automatically probes the /v1/models endpoint to detect the backend type. llama.cpp backends are identified by:
- owned_by: "llamacpp" in the response
- Presence of llama.cpp-specific metadata fields (n_ctx_train, n_params, vocab_type)
- Hybrid response format with both models[] and data[] arrays
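The three signals could be checked roughly as follows. This is a sketch under assumptions: it treats any single signal as sufficient, and it assumes the metadata fields live under a meta object on each data[] entry. Neither detail is confirmed by the text above, and the function name is hypothetical.

```python
LLAMACPP_META_FIELDS = {"n_ctx_train", "n_params", "vocab_type"}

def looks_like_llamacpp(resp: dict) -> bool:
    """Heuristically decide whether a /v1/models response came from llama-server."""
    data = resp.get("data")
    entries = data if isinstance(data, list) else []
    if any(e.get("owned_by") == "llamacpp" for e in entries):
        return True
    if any(LLAMACPP_META_FIELDS & set(e.get("meta") or {}) for e in entries):
        return True
    # Hybrid format: both models[] and data[] arrays present.
    return isinstance(resp.get("models"), list) and isinstance(data, list)
```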
This auto-detection works for:
- Hot-reload configuration changes
- Backends added via Admin API without explicit type
- Configuration files with type: generic or no type specified
Example: Auto-detected backend via Admin API:
# Add backend without specifying type - auto-detects llama.cpp
curl -X POST http://localhost:8080/admin/backends \
-H "Content-Type: application/json" \
-d '{
"name": "local-llm",
"url": "http://localhost:8080"
}'
Native MLxcel Backend¶
When using type: mlxcel, the router provides native support for MLxcel, an MLX-based model serving backend for macOS with Apple Silicon:
- Default URL: http://localhost:8080 (same as llama-server)
- API Compatibility: Fully compatible with llama-server (llama.cpp) API
- Model Format: Serves SafeTensor format models via Apple's MLX framework
- Health Check: Uses /health as primary, with /v1/models as fallback
- Platform: macOS with Apple Silicon only
Minimal MLxcel configuration:
backends:
- name: "mlxcel-local"
type: mlxcel
# No URL needed if using default http://localhost:8080
Full MLxcel configuration:
backends:
- name: "mlxcel-local"
type: mlxcel
url: "http://192.168.1.100:8080" # Custom URL if needed
weight: 2
models:
- mlx-community/Qwen3-4B-4bit
Auto-detection not supported
MLxcel cannot be auto-detected from the /v1/models response because it returns
the same response format as llama.cpp (including owned_by: "llamacpp"). You must
explicitly set type: mlxcel in the configuration. This ensures proper owned_by
metadata (mlxcel) is used for model identification.
Native LM Studio Backend¶
When using type: lmstudio, the router provides native support for LM Studio local server:
- Default URL: http://localhost:1234 (LM Studio default port)
- Health Check: Uses /v1/models (OpenAI-compatible) as primary, with /api/v1/models (native API) as fallback
- Model Discovery: Auto-discovers models from /v1/models endpoint
- owned_by Attribution: Reports "lmstudio" for proper model attribution
Minimal LM Studio configuration:
backends:
- name: "lmstudio"
type: lmstudio
# No URL needed if using default http://localhost:1234
# No API key required for local server
Full LM Studio configuration:
backends:
- name: "lmstudio"
type: lmstudio
url: "http://192.168.1.100:1234" # Custom URL if needed
weight: 2
api_key: "${LM_API_TOKEN}" # Optional: LM Studio API token (v0.4.0+)
# Models are auto-discovered from /v1/models endpoint
LM Studio Features¶
| Feature | Description |
|---|---|
| OpenAI-Compatible API | Full /v1/chat/completions, /v1/completions, /v1/embeddings support |
| Native REST API | Additional /api/v1/* endpoints for model management |
| Local Inference | No cloud API dependencies |
| Auto-Discovery | Models automatically detected from /v1/models |
| Optional Authentication | Supports API token via Authorization: Bearer header (v0.4.0+) |
Native Continuum Router / Backend.AI GO Backend¶
When using type: continuum-router, the router connects to a remote Continuum Router instance or Backend.AI GO deployment for federated LLM routing. Supported aliases include: continuum-router, continuum_router, ContinuumRouter, backendai, backend-ai, backend_ai.
- Health Check: Uses /health as primary, with /v1/models as fallback
- Model Discovery: Auto-discovers models from the remote instance's /v1/models endpoint
- Authentication: Bearer token via Authorization: Bearer <key> header
- Request Passthrough: Requests are forwarded with no transformation (both systems use OpenAI-compatible APIs)
- owned_by Attribution: Reports "continuum-router" for discovered models
- Transport: Supports both HTTP and Unix Domain Socket transports
Minimal configuration:
backends:
- name: "remote-cr"
type: continuum-router
url: "https://remote.example.com"
api_key: "${REMOTE_API_KEY}"
# Models are auto-discovered from remote /v1/models endpoint
Full configuration with explicit models:
backends:
- name: "remote-backendai"
type: continuum-router
url: "https://remote-backend-ai.example.com"
api_key: "${REMOTE_BACKEND_AI_API_KEY}"
weight: 2
models:
- gpt-4o
- claude-sonnet-4-20250514
Use cases:
- Multi-region deployment: geo-route requests across Continuum Router instances
- Federated routing: connect multiple independent CR or Backend.AI GO deployments
- Tiered access: route through a central Backend.AI GO instance for quota management
- High availability: configure multiple Backend.AI GO instances for failover
Continuum Router Backend Features¶
| Feature | Description |
|---|---|
| Federated Routing | Forward requests to remote Continuum Router or Backend.AI GO instances |
| Auto-Discovery | Models automatically discovered from remote /v1/models |
| Bearer Auth | API key forwarded as Authorization: Bearer header |
| SSE Streaming | Full streaming support for chat completions |
| No Transformation | Requests passed through as-is (OpenAI-compatible on both ends) |
| Unix Socket Support | Supports unix:///path/to/socket.sock transport URLs |
Unix Domain Socket Backends¶
Continuum Router supports Unix Domain Sockets (UDS) as an alternative transport to TCP for local LLM backends. Unix sockets provide:
- Enhanced Security: No TCP port exposure - communication happens through the file system
- Lower Latency: No network stack overhead for local communication
- Better Performance: Reduced context switching and memory copies
- Simple Access Control: Uses standard Unix file permissions (on Linux/macOS; Windows does not support Unix file modes)
URL Format: unix:///path/to/socket.sock (the unix:// scheme followed by the absolute path of the socket file)
Platform Support:
| Platform | Support |
|---|---|
| Linux | Full support via native AF_UNIX |
| macOS | Full support via native AF_UNIX |
| Windows | Full support via socket2 crate (Windows 10 1809+ / Build 17063+) |
| Other | Not supported; addresses are skipped with a warning |
Configuration Examples:
On Windows, use drive-letter paths (e.g., unix://C:/temp/llama.sock).
On Linux/macOS, use standard absolute paths (e.g., unix:///var/run/llama.sock).
backends:
# llama-server with Unix socket (Linux/macOS)
- name: "llama-socket"
type: llamacpp
url: "unix:///var/run/llama-server.sock"
weight: 2
models:
- llama-3.2-3b
- qwen3-4b
# Ollama with Unix socket
- name: "ollama-socket"
type: ollama
url: "unix:///var/run/ollama.sock"
weight: 1
models:
- llama3.2
- mistral
# vLLM with Unix socket
- name: "vllm-socket"
type: vllm
url: "unix:///tmp/vllm.sock"
weight: 3
models:
- meta-llama/Llama-3.1-8B-Instruct
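As an illustration of the URL format, the sketch below extracts the socket path from a unix:// URL and accepts both POSIX absolute paths and Windows drive-letter paths. The function name is hypothetical, and the sketch does no percent-decoding, which is an assumption about how paths with spaces are written.

```python
import re
from typing import Optional

_DRIVE_PATH = re.compile(r"^[A-Za-z]:[/\\]")  # e.g. C:/temp/llama.sock

def socket_path_from_url(url: str) -> Optional[str]:
    """Return the socket path for unix:// URLs, or None for other schemes."""
    prefix = "unix://"
    if not url.startswith(prefix):
        return None                       # not a Unix socket backend
    path = url[len(prefix):]
    if not (path.startswith("/") or _DRIVE_PATH.match(path)):
        raise ValueError("socket path must be absolute: %r" % path)
    return path
```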
Starting Backends with Unix Sockets:
# llama-server
# (llama-server binds a Unix socket when the --host value ends with .sock)
./llama-server -m model.gguf --host /var/run/llama-server.sock
# Ollama
OLLAMA_HOST="unix:///var/run/ollama.sock" ollama serve
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--unix-socket /tmp/vllm.sock
Socket Path Conventions:
| Path | Use Case |
|---|---|
| /var/run/*.sock | System services (requires root) |
| /tmp/*.sock | Temporary, user-accessible |
| ~/.local/share/continuum/*.sock | Per-user persistent sockets |
| ~/Library/Application Support/*.sock | macOS application data (paths with spaces are supported) |
Health Checks: The router automatically performs health checks on Unix socket backends using the same endpoints (/health, /v1/models) as TCP backends.
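For manual debugging, a comparable probe can be written with Python's standard library. This is a standalone sketch, not the router's health checker: the class and function names are hypothetical, and it simply tries /health and then /v1/models, reporting the first endpoint that answers with HTTP 200.

```python
import http.client
import socket
from typing import Optional

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock

def probe_health(socket_path: str, endpoints=("/health", "/v1/models")) -> Optional[str]:
    """Return the first endpoint that responds with 200, or None if none do."""
    for endpoint in endpoints:
        conn = UnixHTTPConnection(socket_path)
        try:
            conn.request("GET", endpoint)
            resp = conn.getresponse()
            resp.read()
            if resp.status == 200:
                return endpoint
        except (OSError, http.client.HTTPException):
            continue                      # socket missing, refused, or bad reply
        finally:
            conn.close()
    return None
```

A None result corresponds to the "Socket file not found" or "Connection timeout" situations: either the backend is not running or it is not listening on that path.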
Platform support and limits:
- Streaming: Server-Sent Events (SSE) streaming works over Unix socket backends, for both chat completions and the Anthropic Messages surface.
- Windows: AF_UNIX sockets are supported on Windows 10 1809+ (build 17063+) via the afunix.sys kernel driver; earlier Windows versions return a clear error at connect time.
- Max response size: Response bodies are limited to 100MB by default to prevent memory exhaustion.
Troubleshooting:
| Error | Cause | Solution |
|---|---|---|
| "Socket file not found" | Server not running | Start the backend server |
| "Permission denied" | File permissions | chmod 660 socket.sock |
| "Connection timeout" | Server not accepting connections | Verify server is listening |
| "Response body exceeds maximum size" | Response too large | Increase max_response_size or use streaming with TCP backend |