Server & Backends¶
Server Section¶
Controls the HTTP server behavior:
server:
  bind_address: "0.0.0.0:8080"  # Host and port to bind
  workers: 4                    # Worker threads (0 = auto)
  connection_pool_size: 100     # HTTP connection pool size
Multiple Bind Addresses and Unix Sockets¶
The server supports binding to multiple addresses simultaneously, including Unix domain sockets (on Unix-like systems and Windows 10 1809+). This enables flexible deployment scenarios such as:
- Listening on both IPv4 and IPv6 addresses
- Exposing a TCP port for external clients while using a Unix socket for local services
- Running behind a reverse proxy via Unix socket for better security
Single Address (Backward Compatible):
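As a backward-compatible form, `bind_address` also accepts a plain string, matching the Server Section example above:

```yaml
server:
  bind_address: "0.0.0.0:8080"  # single string, backward compatible
```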
Multiple Addresses:
server:
  bind_address:
    - "127.0.0.1:8080"  # IPv4 localhost
    - "[::1]:8080"      # IPv6 localhost
    - "0.0.0.0:9090"    # All interfaces on port 9090
Unix Socket Binding (Linux, macOS, and Windows 10 1809+):
server:
  bind_address:
    - "0.0.0.0:8080"                         # TCP for external access
    - "unix:/var/run/continuum-router.sock"  # Unix socket for local services
  socket_mode: 0o660                         # Optional: file permissions for Unix sockets (octal)
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `bind_address` | string or array | `"0.0.0.0:8080"` | Address(es) to bind. TCP format: `host:port`. Unix socket format: `unix:/path/to/socket` |
| `socket_mode` | integer (octal) | null | File permissions for Unix sockets (e.g., `0o660` for owner/group read-write) |
Unix Socket Notes:
- Unix socket addresses must start with the `unix:` prefix
- Existing socket files are automatically removed before binding
- Socket files are cleaned up on graceful shutdown
- On Windows 10 1809+ (Build 17063+), Unix sockets are fully supported via the `socket2` crate
- On other non-Unix platforms, `unix:` addresses log a warning and are skipped
- Windows does not support Unix file permission modes; the `socket_mode` option is accepted but ignored
- Unix socket connections bypass IP-based authentication checks (client IP reported as "unix")
Nginx Reverse Proxy Example:
upstream continuum {
    server unix:/var/run/continuum-router.sock;
}

server {
    listen 443 ssl;

    location /v1/ {
        proxy_pass http://continuum;
    }
}
Performance Tuning:
- `workers`: Set to `0` for auto-detection, or match CPU cores
- `connection_pool_size`: Increase for high-load scenarios (200-500)
CORS Configuration¶
CORS (Cross-Origin Resource Sharing) allows the router to accept requests from web browsers running on different origins. This is essential for embedding continuum-router in:
- Tauri apps: WebViews using origins like `tauri://localhost`
- Electron apps: Custom protocols
- Separate web frontends: Development servers on different ports
server:
  bind_address: "0.0.0.0:8080"
  cors:
    enabled: true
    allow_origins:
      - "tauri://localhost"
      - "http://localhost:*"  # Wildcard port matching
      - "https://example.com"
    allow_methods:
      - "GET"
      - "POST"
      - "PUT"
      - "DELETE"
      - "OPTIONS"
      - "PATCH"
    allow_headers:
      - "Content-Type"
      - "Authorization"
      - "X-Request-ID"
      - "X-Trace-ID"
    expose_headers:
      - "X-Request-ID"
      - "X-Fallback-Used"
    allow_credentials: false
    max_age: 3600  # Preflight cache duration in seconds
CORS Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable/disable CORS middleware |
| `allow_origins` | array | `[]` | Allowed origins (supports `*` for any, port wildcards like `http://localhost:*`) |
| `allow_methods` | array | `["GET", "POST", "PUT", "DELETE", "OPTIONS", "PATCH"]` | Allowed HTTP methods |
| `allow_headers` | array | `["Content-Type", "Authorization", "X-Request-ID", "X-Trace-ID"]` | Allowed request headers |
| `expose_headers` | array | `[]` | Headers exposed to client JavaScript |
| `allow_credentials` | boolean | `false` | Allow cookies and authorization headers |
| `max_age` | integer | `3600` | Preflight response cache duration in seconds |
Origin Pattern Matching:
| Pattern | Example | Description |
|---|---|---|
| `*` | `*` | Matches any origin (not compatible with `allow_credentials: true`) |
| Exact URL | `https://example.com` | Exact match |
| Custom scheme | `tauri://localhost` | Custom protocols (Tauri, Electron) |
| Port wildcard | `http://localhost:*` | Matches any port on localhost |
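The matching rules above can be sketched as follows. This is an illustrative simplification, not the router's actual implementation:

```python
def origin_allowed(origin: str, patterns: list[str]) -> bool:
    """Check an Origin header against configured CORS patterns (sketch)."""
    for pattern in patterns:
        if pattern == "*":                 # any origin
            return True
        if pattern.endswith(":*"):         # port wildcard, e.g. http://localhost:*
            base = pattern[:-2]            # strip the ":*" suffix
            if origin == base or origin.startswith(base + ":"):
                return True
        elif origin == pattern:            # exact match (also covers custom schemes)
            return True
    return False

print(origin_allowed("http://localhost:5173", ["http://localhost:*"]))  # True
print(origin_allowed("tauri://localhost", ["tauri://localhost"]))       # True
print(origin_allowed("https://evil.com", ["https://example.com"]))      # False
```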
Security Considerations:
- Using `*` for origins allows any website to make requests - only use for public APIs
- When `allow_credentials` is `true`, you cannot use `*` for origins - specify exact origins
- For development, use port wildcards like `http://localhost:*` for flexibility
- In production, always specify exact origins for security
Hot Reload: CORS configuration supports immediate hot reload - changes apply to new requests instantly without server restart.
Backends Section¶
Defines the LLM backends to route requests to:
backends:
  - name: "unique-identifier"    # Must be unique across all backends
    type: "generic"              # Backend type (optional, defaults to "generic")
    url: "http://backend:port"   # Base URL for the backend
    weight: 1                    # Load balancing weight (1-100)
    api_key: "${API_KEY}"        # API key (optional, supports env var references)
    org_id: "${ORG_ID}"          # Organization ID (optional, for OpenAI)
    models: ["model1", "model2"] # Optional: explicit model list
    retry_override:              # Optional: backend-specific retry settings
      max_attempts: 5
      base_delay: "200ms"
Starting Without Backends¶
The router can start with an empty backends list (`backends: []`), which is useful for:
- Infrastructure bootstrapping: Start the router first, then add backends dynamically via the Admin API
- Container orchestration: Router container can be ready before backend services
- Development workflows: Test admin endpoints before backends are provisioned
- Gradual rollout: Start with zero backends and add them progressively
When running with no backends:
- `/v1/models` returns `{"object": "list", "data": []}`
- `/v1/chat/completions` and other routing endpoints return 503 "No backends available"
- `/health` returns healthy status (the router itself is operational)
- Backends can be added via `POST /admin/backends`
Example minimal configuration for dynamic backend management:
server:
  bind_address: "0.0.0.0:8080"

backends: []  # Start with no backends - add via Admin API later

admin:
  auth:
    method: bearer
    token: "${ADMIN_TOKEN}"
Backend Types Supported:
| Type | Description | Default URL |
|---|---|---|
| `generic` | OpenAI-compatible API (default) | Must be specified |
| `openai` | Native OpenAI API with built-in configuration | `https://api.openai.com/v1` |
| `gemini` | Google Gemini API (OpenAI-compatible endpoint) | `https://generativelanguage.googleapis.com/v1beta/openai` |
| `azure` | Azure OpenAI Service | Must be specified |
| `vllm` | vLLM server | Must be specified |
| `ollama` | Ollama local server | `http://localhost:11434` |
| `llamacpp` | llama.cpp llama-server (GGUF models) | `http://localhost:8080` |
| `mlxcel` | MLxcel server (MLX-based, llama-server compatible, macOS only) | `http://localhost:8080` |
| `lmstudio` | LM Studio local server | `http://localhost:1234` |
| `anthropic` | Anthropic Claude API (native, with request/response translation) | `https://api.anthropic.com` |
| `continuum-router` | Remote Continuum Router or Backend.AI GO instance (federated routing) | Must be specified |
Native OpenAI Backend¶
When using `type: openai`, the router provides:
- Default URL: `https://api.openai.com/v1` (can be overridden for proxies)
- Built-in model metadata: Automatic pricing, context windows, and capabilities
- Environment variable support: Automatically loads from `CONTINUUM_OPENAI_API_KEY` and `CONTINUUM_OPENAI_ORG_ID`
Minimal OpenAI configuration:
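A minimal sketch, assuming the API key is supplied via the `CONTINUUM_OPENAI_API_KEY` environment variable:

```yaml
backends:
  - name: "openai"
    type: openai
    # api_key omitted: loaded automatically from CONTINUUM_OPENAI_API_KEY
    # models omitted: auto-discovered from /v1/models
```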
Full OpenAI configuration with explicit API key:
backends:
  - name: "openai-primary"
    type: openai
    api_key: "${CONTINUUM_OPENAI_API_KEY}"
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"  # Optional
    models:
      - gpt-4o
      - gpt-4o-mini
      - o1
      - o1-mini
      - o3-mini
      - text-embedding-3-large
Using OpenAI with a proxy:
backends:
  - name: "openai-proxy"
    type: openai
    url: "https://my-proxy.example.com/v1"  # Override default URL
    api_key: "${PROXY_API_KEY}"
    models:
      - gpt-4o
Environment Variables for OpenAI¶
| Variable | Description |
|---|---|
| `CONTINUUM_OPENAI_API_KEY` | OpenAI API key (automatically loaded for `type: openai` backends) |
| `CONTINUUM_OPENAI_ORG_ID` | OpenAI Organization ID (optional) |
Model Auto-Discovery:
When `models` is not specified or is empty, backends automatically discover available models from their `/v1/models` API endpoint during initialization. This feature reduces configuration maintenance and ensures all backend-reported models are routable.
| Backend Type | Auto-Discovery Support | Fallback Models |
|---|---|---|
| `openai` | ✅ Yes | gpt-4o, gpt-4o-mini, o3-mini |
| `gemini` | ✅ Yes | gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash |
| `vllm` | ✅ Yes | vicuna-7b-v1.5, llama-2-7b-chat, mistral-7b-instruct |
| `ollama` | ✅ Yes | Uses vLLM discovery mechanism |
| `llamacpp` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `mlxcel` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `lmstudio` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `continuum-router` | ✅ Yes | Auto-discovers from remote `/v1/models` endpoint |
| `anthropic` | ❌ No (no API) | Hardcoded Claude models |
| `generic` | ❌ No | All models supported (`supports_model()` returns `true`) |
Discovery Behavior:
- Timeout: 10-second timeout prevents blocking startup
- Fallback: If discovery fails (timeout, network error, invalid response), fallback models are used
- Logging: Discovered models are logged at INFO level; fallback usage logged at WARN level
Model Resolution Priority:
1. Explicit `models` list from config (highest priority)
2. Models from the `model_configs` field
3. Auto-discovered models from the backend API
4. Hardcoded fallback models (lowest priority)

Explicit model lists improve startup time and reduce backend queries.
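The priority order amounts to a first-non-empty selection. A minimal sketch (illustrative only, not the router's code):

```python
def resolve_models(explicit, model_configs, discovered, fallback):
    """Return the first non-empty model list, mirroring the priority order above."""
    for candidate in (explicit, model_configs, discovered, fallback):
        if candidate:
            return candidate
    return []

# Explicit config wins over discovery:
print(resolve_models(["gpt-4o"], [], ["gpt-4o-mini"], ["o3-mini"]))  # ['gpt-4o']
# With nothing configured, discovery is used before fallbacks:
print(resolve_models([], [], ["gpt-4o-mini"], ["o3-mini"]))          # ['gpt-4o-mini']
```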
Native Gemini Backend¶
When using `type: gemini`, the router provides:
- Default URL: `https://generativelanguage.googleapis.com/v1beta/openai` (OpenAI-compatible endpoint)
- Built-in model metadata: Automatic context windows and capabilities for Gemini models
- Environment variable support: Automatically loads from `CONTINUUM_GEMINI_API_KEY`
- Extended streaming timeout: 300s timeout for thinking models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro)
- Automatic `max_tokens` adjustment for thinking models (see below)
Minimal Gemini configuration:
backends:
  - name: "gemini"
    type: gemini
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Full Gemini configuration with API Key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    weight: 2
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Gemini Authentication Methods¶
The Gemini backend supports two authentication methods:
API Key Authentication (Default)¶
The simplest authentication method using a Google AI Studio API key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    models:
      - gemini-3.1-pro-preview
Service Account Authentication¶
For enterprise environments and Google Cloud Platform (GCP) deployments, you can use Service Account authentication with automatic OAuth2 token management:
backends:
  - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "/path/to/service-account.json"
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
Using environment variable for key file path:
backends:
- name: "gemini"
type: gemini
auth:
type: service_account
key_file: "${GOOGLE_APPLICATION_CREDENTIALS}"
models:
- gemini-3.1-pro-preview
Service Account Authentication Features:
| Feature | Description |
|---|---|
| Automatic Token Refresh | OAuth2 tokens are automatically refreshed 5 minutes before expiration |
| Token Caching | Tokens are cached in memory to minimize authentication overhead |
| Thread-Safe | Concurrent requests safely share token refresh operations |
| Environment Variable Expansion | Key file paths support ${VAR} and ~ expansion |
Creating a Service Account Key:
- Go to Google Cloud Console
- Navigate to IAM & Admin > Service Accounts
- Create a new service account or select an existing one
- Click Keys > Add Key > Create new key
- Choose JSON format and download the key file
- Store the key file securely and reference it in your configuration
Required Permissions:
The service account needs the following roles for Gemini API access:
- `roles/aiplatform.user` - For Vertex AI Gemini endpoints
- Or appropriate Google AI Studio permissions for generativelanguage.googleapis.com
Authentication Priority¶
When multiple authentication methods are configured:
| Priority | Method | Condition |
|---|---|---|
| 1 (Highest) | auth block | If auth.type is specified |
| 2 | api_key field | If no auth block is present |
| 3 | Environment variable | Falls back to CONTINUUM_GEMINI_API_KEY |
If both api_key and auth are specified, the auth block takes precedence and a warning is logged.
Gemini Thinking Models: Automatic max_tokens Adjustment¶
Gemini "thinking" models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro, and models with -pro-preview suffix) perform extended reasoning before generating responses. To prevent response truncation, the router automatically adjusts max_tokens:
| Condition | Behavior |
|---|---|
| `max_tokens` not specified | Automatically set to 16384 |
| `max_tokens` < 4096 | Automatically increased to 16384 |
| `max_tokens` >= 4096 | Client value preserved |
This ensures thinking models can generate complete responses without truncation due to low default values from client libraries.
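The adjustment table above can be expressed as a few lines of logic (a sketch for illustration, not the router's implementation):

```python
def adjust_max_tokens(max_tokens):
    """Apply the thinking-model max_tokens rules from the table above."""
    if max_tokens is None or max_tokens < 4096:
        return 16384      # unset or too low: raise to the thinking-model default
    return max_tokens     # >= 4096: preserve the client's value

print(adjust_max_tokens(None))  # 16384
print(adjust_max_tokens(1024))  # 16384
print(adjust_max_tokens(8192))  # 8192
```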
Environment Variables for Gemini¶
| Variable | Description |
|---|---|
| `CONTINUUM_GEMINI_API_KEY` | Google Gemini API key (automatically loaded for `type: gemini` backends) |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account JSON key file (standard GCP environment variable) |
Native Anthropic Backend¶
When using `type: anthropic`, the router provides:
- Default URL: `https://api.anthropic.com` (can be overridden for proxies)
- Native API translation: Automatically converts OpenAI-format requests to the Anthropic Messages API format and vice versa
- Anthropic-specific headers: Automatically adds `x-api-key` and `anthropic-version` headers
- Environment variable support: Automatically loads from `CONTINUUM_ANTHROPIC_API_KEY`
- Extended streaming timeout: 600s timeout for extended thinking models (Claude Opus, Sonnet 4)
Minimal Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    models:
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022
Full Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    weight: 2
    models:
      - claude-opus-4-6
      - claude-sonnet-4-6
      - claude-haiku-4-5
Anthropic API Translation¶
The router automatically handles the translation between OpenAI and Anthropic API formats:
| OpenAI Format | Anthropic Format |
|---|---|
| `messages` array with `role: "system"` | Separate `system` parameter |
| `Authorization: Bearer <key>` | `x-api-key: <key>` header |
| Optional `max_tokens` | Required `max_tokens` (auto-filled if missing) |
| `choices[0].message.content` | `content[0].text` |
| `finish_reason: "stop"` | `stop_reason: "end_turn"` |
| `usage.prompt_tokens` | `usage.input_tokens` |
Example Request Translation:
OpenAI format (incoming from client):
{
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}
Anthropic format (sent to API):
{
  "model": "claude-sonnet-4-20250514",
  "system": "You are helpful.",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}
Anthropic Native API Endpoints¶
In addition to routing OpenAI-format requests to Anthropic backends, the router also provides native Anthropic API endpoints:
| Endpoint | Description |
|---|---|
| `POST /anthropic/v1/messages` | Native Anthropic Messages API |
| `POST /anthropic/v1/messages/count_tokens` | Token counting with tiered backend support |
| `GET /anthropic/v1/models` | Model listing in Anthropic format |
These endpoints allow clients that use Anthropic's native API format (such as Claude Code) to connect directly without any request/response transformation overhead.
Claude Code Compatibility¶
The Anthropic Native API endpoints include full compatibility with Claude Code and other advanced Anthropic API clients:
Prompt Caching Support:
The router preserves cache_control fields throughout the request/response pipeline:
- System prompt text blocks
- User message content blocks (text, image, document)
- Tool definitions
- Tool use and tool result blocks
Header Forwarding:
| Header | Behavior |
|---|---|
| `anthropic-version` | Forwarded to native Anthropic backends |
| `anthropic-beta` | Forwarded to enable beta features (e.g., `prompt-caching-2024-07-31`, `interleaved-thinking-2025-05-14`) |
| `x-request-id` | Forwarded for request tracing |
Cache Usage Reporting:
Streaming responses from native Anthropic backends include cache usage information:
{
  "usage": {
    "input_tokens": 2159,
    "cache_creation_input_tokens": 2048,
    "cache_read_input_tokens": 0
  }
}
Anthropic Extended Thinking Models¶
Models supporting extended thinking (Claude Opus, Sonnet 4) may require longer response times. The router automatically:
- Sets a higher default `max_tokens` (16384) for thinking models
- Uses an extended streaming timeout (600s) for these models
OpenAI ↔ Claude Reasoning Parameter Conversion¶
The router automatically converts between OpenAI's reasoning parameters and Claude's thinking parameter, enabling seamless cross-provider reasoning requests.
Supported OpenAI Formats:
| Format | API | Example |
|---|---|---|
| `reasoning_effort` (flat) | Chat Completions API | `"reasoning_effort": "high"` |
| `reasoning.effort` (nested) | Responses API | `"reasoning": {"effort": "high"}` |
When both formats are present, reasoning_effort (flat) takes precedence.
Effort Level to Budget Tokens Mapping:
| Effort Level | Claude thinking.budget_tokens |
|---|---|
| `none` | (thinking disabled) |
| `minimal` | 1,024 |
| `low` | 4,096 |
| `medium` | 10,240 |
| `high` | 32,768 |
Example Request - Chat Completions API (flat format):
// Client sends OpenAI Chat Completions API request
{
  "model": "claude-sonnet-4-6",
  "reasoning_effort": "high",
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-6",
  "thinking": {"type": "enabled", "budget_tokens": 32768},
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}
Example Request - Responses API (nested format):
// Client sends OpenAI Responses API request
{
  "model": "claude-sonnet-4-6",
  "reasoning": {"effort": "medium"},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-6",
  "thinking": {"type": "enabled", "budget_tokens": 10240},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}
Response with Reasoning Content:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The final answer is...",
      "reasoning_content": "Let me analyze this step by step..."
    }
  }]
}
Important Notes:
- If the `thinking` parameter is explicitly provided, it takes precedence over `reasoning_effort` and `reasoning.effort`
- `reasoning_effort` (flat) takes precedence over `reasoning.effort` (nested) when both are present
- Only models supporting extended thinking (Opus 4.x, Sonnet 4.x) will have reasoning enabled
- When reasoning is enabled, the `temperature` parameter is automatically removed (Claude API requirement)
- For streaming responses, thinking content is returned as `reasoning_content` delta events
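Putting the effort mapping and precedence rules together, the conversion can be sketched as follows. The helper name is hypothetical; this is not the router's code:

```python
# Effort level -> Claude thinking.budget_tokens, per the mapping table above
EFFORT_TO_BUDGET = {"minimal": 1024, "low": 4096, "medium": 10240, "high": 32768}

def to_thinking(request: dict):
    """Convert OpenAI reasoning parameters to a Claude thinking block (sketch)."""
    # Flat reasoning_effort takes precedence over nested reasoning.effort
    effort = request.get("reasoning_effort") or request.get("reasoning", {}).get("effort")
    if effort is None or effort == "none":
        return None  # thinking disabled
    return {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}

print(to_thinking({"reasoning_effort": "high"}))
# {'type': 'enabled', 'budget_tokens': 32768}
print(to_thinking({"reasoning": {"effort": "medium"}}))
# {'type': 'enabled', 'budget_tokens': 10240}
print(to_thinking({"reasoning_effort": "none"}))
# None
```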
Environment Variables for Anthropic¶
| Variable | Description |
|---|---|
| `CONTINUUM_ANTHROPIC_API_KEY` | Anthropic API key (automatically loaded for `type: anthropic` backends) |
Native llama.cpp Backend¶
When using `type: llamacpp`, the router provides native support for llama.cpp llama-server:
- Default URL: `http://localhost:8080` (llama-server default port)
- Health Check: Uses the `/health` endpoint (with fallback to `/v1/models`)
- Model Discovery: Parses llama-server's hybrid `/v1/models` response format
- Rich Metadata: Extracts context window, parameter count, and model size from the response
Minimal llama.cpp configuration:
backends:
  - name: "local-llama"
    type: llamacpp
    # No URL needed if using default http://localhost:8080
    # No API key required for local server
Full llama.cpp configuration:
backends:
  - name: "local-llama"
    type: llamacpp
    url: "http://192.168.1.100:8080"  # Custom URL if needed
    weight: 2
    # Models are auto-discovered from /v1/models endpoint
llama.cpp Features¶
| Feature | Description |
|---|---|
| GGUF Models | Native support for GGUF quantized models |
| Local Inference | No cloud API dependencies |
| Hardware Support | CPU, NVIDIA, AMD, Apple Silicon |
| Streaming | Full SSE streaming support |
| Embeddings | Supports /v1/embeddings endpoint |
| Tool Calling Detection | Auto-detects tool calling support via /props endpoint |
Tool Calling Auto-Detection¶
The router automatically detects tool calling capability for llama.cpp backends by querying the /props endpoint during model discovery. This enables automatic function calling support without manual configuration.
How it works:
1. When a llama.cpp backend is discovered, the router fetches the `/props` endpoint
2. The `chat_template` field is analyzed using precise Jinja2 pattern matching to detect tool-related syntax
3. If tool calling patterns are detected, the model's `function_calling` capability is automatically enabled
4. Detection results are stored for reference (including a hash of the chat template)
Detection Patterns:
The router uses precise pattern matching to reduce false positives:
- Role-based patterns: `message['role'] == 'tool'`, `message.role == "tool"`
- Tool iteration: `for tool in tools`, `for function in functions`
- Tool calls access: `.tool_calls`, `['tool_calls']`, `message.tool_call`
- Jinja2 blocks with tool keywords: `{% if tools %}`, `{% for tool_call in ... %}`
Example /props response analyzed:
{
  "chat_template": "{% for message in messages %}{% if message['role'] == 'tool' %}...",
  "default_generation_settings": { ... },
  "total_slots": 1
}
Fallback Behavior:
- If `/props` is unavailable: tool calling is assumed to be supported (optimistic fallback for modern llama.cpp versions)
- If `/props` returns an error: tool calling is assumed to be supported (ensures compatibility with newer models)
- If the chat template exceeds 64KB: detection is skipped and defaults to supported
- Detection is case-insensitive for maximum compatibility
- Results are merged with any existing model metadata from `model-metadata.yaml`
- Detected capabilities appear in the `features` field of the `/v1/models/{model_id}` response
Model Metadata Extraction¶
The router extracts rich metadata from llama-server responses:
| Field | Source | Description |
|---|---|---|
| Context Window | meta.n_ctx_train | Training context window size |
| Parameter Count | meta.n_params | Model parameters (e.g., "4B") |
| Model Size | meta.size | File size in bytes |
| Capabilities | models[].capabilities | Model capabilities array |
Starting llama-server¶
# Basic startup
./llama-server -m model.gguf --port 8080
# With GPU layers
./llama-server -m model.gguf --port 8080 -ngl 35
# With custom context size
./llama-server -m model.gguf --port 8080 --ctx-size 8192
Auto-Detection of llama.cpp Backends¶
When a backend is added without a type specified (defaults to `generic`), the router automatically probes the `/v1/models` endpoint to detect the backend type. llama.cpp backends are identified by:
- `owned_by: "llamacpp"` in the response
- Presence of llama.cpp-specific metadata fields (`n_ctx_train`, `n_params`, `vocab_type`)
- Hybrid response format with both `models[]` and `data[]` arrays
This auto-detection works for:
- Hot-reload configuration changes
- Backends added via Admin API without explicit type
- Configuration files with `type: generic` or no type specified
Example: Auto-detected backend via Admin API:
# Add backend without specifying type - auto-detects llama.cpp
curl -X POST http://localhost:8080/admin/backends \
  -H "Content-Type: application/json" \
  -d '{
    "name": "local-llm",
    "url": "http://localhost:8080"
  }'
Native MLxcel Backend¶
When using type: mlxcel, the router provides native support for MLxcel, an MLX-based model serving backend for macOS with Apple Silicon:
- Default URL: `http://localhost:8080` (same as llama-server)
- API Compatibility: Fully compatible with the llama-server (llama.cpp) API
- Model Format: Serves SafeTensor-format models via Apple's MLX framework
- Health Check: Uses `/health` as primary, with `/v1/models` as fallback
- Platform: macOS with Apple Silicon only
Minimal MLxcel configuration:
backends:
  - name: "mlxcel-local"
    type: mlxcel
    # No URL needed if using default http://localhost:8080
Full MLxcel configuration:
backends:
  - name: "mlxcel-local"
    type: mlxcel
    url: "http://192.168.1.100:8080"  # Custom URL if needed
    weight: 2
    models:
      - mlx-community/Qwen3-4B-4bit
Auto-detection not supported
MLxcel cannot be auto-detected from the `/v1/models` response because it returns the same response format as llama.cpp (including `owned_by: "llamacpp"`). You must explicitly set `type: mlxcel` in the configuration. This ensures the proper `owned_by` metadata (`mlxcel`) is used for model identification.
Native LM Studio Backend¶
When using type: lmstudio, the router provides native support for LM Studio local server:
- Default URL: `http://localhost:1234` (LM Studio default port)
- Health Check: Uses `/v1/models` (OpenAI-compatible) as primary, with `/api/v1/models` (native API) as fallback
- Model Discovery: Auto-discovers models from the `/v1/models` endpoint
- `owned_by` Attribution: Reports `"lmstudio"` for proper model attribution
Minimal LM Studio configuration:
backends:
  - name: "lmstudio"
    type: lmstudio
    # No URL needed if using default http://localhost:1234
    # No API key required for local server
Full LM Studio configuration:
backends:
  - name: "lmstudio"
    type: lmstudio
    url: "http://192.168.1.100:1234"  # Custom URL if needed
    weight: 2
    api_key: "${LM_API_TOKEN}"  # Optional: LM Studio API token (v0.4.0+)
    # Models are auto-discovered from /v1/models endpoint
LM Studio Features¶
| Feature | Description |
|---|---|
| OpenAI-Compatible API | Full /v1/chat/completions, /v1/completions, /v1/embeddings support |
| Native REST API | Additional /api/v1/* endpoints for model management |
| Local Inference | No cloud API dependencies |
| Auto-Discovery | Models automatically detected from /v1/models |
| Optional Authentication | Supports API token via Authorization: Bearer header (v0.4.0+) |
Native Continuum Router / Backend.AI GO Backend¶
When using `type: continuum-router`, the router connects to a remote Continuum Router instance or Backend.AI GO deployment for federated LLM routing. Supported aliases include: `continuum-router`, `continuum_router`, `ContinuumRouter`, `backendai`, `backend-ai`, `backend_ai`.
- Health Check: Uses `/health` as primary, with `/v1/models` as fallback
- Model Discovery: Auto-discovers models from the remote instance's `/v1/models` endpoint
- Authentication: Bearer token via the `Authorization: Bearer <key>` header
- Request Passthrough: Requests are forwarded without transformation (both systems use OpenAI-compatible APIs)
- `owned_by` Attribution: Reports `"continuum-router"` for discovered models
- Transport: Supports both HTTP and Unix domain socket transports
Minimal configuration:
backends:
  - name: "remote-cr"
    type: continuum-router
    url: "https://remote.example.com"
    api_key: "${REMOTE_API_KEY}"
    # Models are auto-discovered from remote /v1/models endpoint
Full configuration with explicit models:
backends:
  - name: "remote-backendai"
    type: continuum-router
    url: "https://remote-backend-ai.example.com"
    api_key: "${REMOTE_BACKEND_AI_API_KEY}"
    weight: 2
    models:
      - gpt-4o
      - claude-sonnet-4-20250514
Use cases:
- Multi-region deployment: geo-route requests across Continuum Router instances
- Federated routing: connect multiple independent CR or Backend.AI GO deployments
- Tiered access: route through a central Backend.AI GO instance for quota management
- High availability: configure multiple Backend.AI GO instances for failover
Continuum Router Backend Features¶
| Feature | Description |
|---|---|
| Federated Routing | Forward requests to remote Continuum Router or Backend.AI GO instances |
| Auto-Discovery | Models automatically discovered from remote /v1/models |
| Bearer Auth | API key forwarded as Authorization: Bearer header |
| SSE Streaming | Full streaming support for chat completions |
| No Transformation | Requests passed through as-is (OpenAI-compatible on both ends) |
| Unix Socket Support | Supports unix:///path/to/socket.sock transport URLs |
Unix Domain Socket Backends¶
Continuum Router supports Unix Domain Sockets (UDS) as an alternative transport to TCP for local LLM backends. Unix sockets provide:
- Enhanced Security: No TCP port exposure - communication happens through the file system
- Lower Latency: No network stack overhead for local communication
- Better Performance: Reduced context switching and memory copies
- Simple Access Control: Uses standard Unix file permissions (on Linux/macOS; Windows does not support Unix file modes)
URL Format: `unix:///path/to/socket.sock` (the `unix://` scheme followed by an absolute path)
Platform Support:
| Platform | Support |
|---|---|
| Linux | Full support via native AF_UNIX |
| macOS | Full support via native AF_UNIX |
| Windows | Full support via socket2 crate (Windows 10 1809+ / Build 17063+) |
| Other | Not supported; addresses are skipped with a warning |
Configuration Examples:
On Windows, use drive-letter paths (e.g., unix://C:/temp/llama.sock). On Linux/macOS, use standard absolute paths (e.g., unix:///var/run/llama.sock).
backends:
  # llama-server with Unix socket (Linux/macOS)
  - name: "llama-socket"
    type: llamacpp
    url: "unix:///var/run/llama-server.sock"
    weight: 2
    models:
      - llama-3.2-3b
      - qwen3-4b

  # Ollama with Unix socket
  - name: "ollama-socket"
    type: ollama
    url: "unix:///var/run/ollama.sock"
    weight: 1
    models:
      - llama3.2
      - mistral

  # vLLM with Unix socket
  - name: "vllm-socket"
    type: vllm
    url: "unix:///tmp/vllm.sock"
    weight: 3
    models:
      - meta-llama/Llama-3.1-8B-Instruct
Starting Backends with Unix Sockets:
# llama-server
./llama-server -m model.gguf --unix /var/run/llama.sock
# Ollama
OLLAMA_HOST="unix:///var/run/ollama.sock" ollama serve
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--unix-socket /tmp/vllm.sock
Socket Path Conventions:
| Path | Use Case |
|---|---|
| `/var/run/*.sock` | System services (requires root) |
| `/tmp/*.sock` | Temporary, user-accessible |
| `~/.local/share/continuum/*.sock` | Per-user persistent sockets |
| `~/Library/Application Support/*.sock` | macOS application data (paths with spaces are supported) |
Health Checks: The router automatically performs health checks on Unix socket backends using the same endpoints (/health, /v1/models) as TCP backends.
Current Limitations:
- Streaming (SSE) not supported: Unix socket backends do not currently support Server-Sent Events (SSE) streaming. Use TCP backends for streaming chat completions.
- Max response size: Response bodies are limited to 100MB by default to prevent memory exhaustion.
Troubleshooting:
| Error | Cause | Solution |
|---|---|---|
| "Socket file not found" | Server not running | Start the backend server |
| "Permission denied" | File permissions | chmod 660 socket.sock |
| "Connection timeout" | Server not accepting connections | Verify server is listening |
| "Response body exceeds maximum size" | Response too large | Increase maxresponsesize or use streaming with TCP backend |