Server & Backends¶
Server Section¶
Controls the HTTP server behavior:
server:
  bind_address: "0.0.0.0:8080"  # Host and port to bind
  workers: 4                    # Worker threads (0 = auto)
  connection_pool_size: 100     # HTTP connection pool size
Multiple Bind Addresses and Unix Sockets¶
The server supports binding to multiple addresses simultaneously, including Unix domain sockets (on Unix-like systems and Windows 10 1809+). This enables flexible deployment scenarios such as:
- Listening on both IPv4 and IPv6 addresses
- Exposing a TCP port for external clients while using a Unix socket for local services
- Running behind a reverse proxy via Unix socket for better security
Single Address (Backward Compatible):
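As a backward-compatible form, `bind_address` also accepts a plain string, matching the Server Section example above:

```yaml
server:
  bind_address: "0.0.0.0:8080"  # single string, backward compatible
```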
Multiple Addresses:
server:
  bind_address:
    - "127.0.0.1:8080"  # IPv4 localhost
    - "[::1]:8080"      # IPv6 localhost
    - "0.0.0.0:9090"    # All interfaces on port 9090
Unix Socket Binding (Linux, macOS, and Windows 10 1809+):
server:
  bind_address:
    - "0.0.0.0:8080"                         # TCP for external access
    - "unix:/var/run/continuum-router.sock"  # Unix socket for local services
  socket_mode: 0o660                         # Optional: file permissions for Unix sockets (octal)
Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `bind_address` | string or array | `"0.0.0.0:8080"` | Address(es) to bind. TCP format: `host:port`. Unix socket format: `unix:/path/to/socket` |
| `socket_mode` | integer (octal) | null | File permissions for Unix sockets (e.g., `0o660` for owner/group read-write) |
Unix Socket Notes:
- Unix socket addresses must start with the `unix:` prefix
- Existing socket files are automatically removed before binding
- Socket files are cleaned up on graceful shutdown
- On Windows 10 1809+ (Build 17063+), Unix sockets are fully supported via the `socket2` crate
- On other non-Unix platforms, `unix:` addresses log a warning and are skipped
- Windows does not support Unix file permission modes; the `socket_mode` option is accepted but ignored
- Unix socket connections bypass IP-based authentication checks (client IP reported as "unix")
Nginx Reverse Proxy Example:
upstream continuum {
    server unix:/var/run/continuum-router.sock;
}

server {
    listen 443 ssl;

    location /v1/ {
        proxy_pass http://continuum;
    }
}
Performance Tuning:
- `workers`: Set to `0` for auto-detection, or match CPU cores
- `connection_pool_size`: Increase for high-load scenarios (200-500)
CORS Configuration¶
CORS (Cross-Origin Resource Sharing) allows the router to accept requests from web browsers running on different origins. This is essential for embedding continuum-router in:
- Tauri apps: WebViews using origins like `tauri://localhost`
- Electron apps: Custom protocols
- Separate web frontends: Development servers on different ports
server:
  bind_address: "0.0.0.0:8080"
  cors:
    enabled: true
    allow_origins:
      - "tauri://localhost"
      - "http://localhost:*"  # Wildcard port matching
      - "https://example.com"
    allow_methods:
      - "GET"
      - "POST"
      - "PUT"
      - "DELETE"
      - "OPTIONS"
      - "PATCH"
    allow_headers:
      - "Content-Type"
      - "Authorization"
      - "X-Request-ID"
      - "X-Trace-ID"
    expose_headers:
      - "X-Request-ID"
      - "X-Fallback-Used"
    allow_credentials: false
    max_age: 3600  # Preflight cache duration in seconds
CORS Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable/disable CORS middleware |
| `allow_origins` | array | `[]` | Allowed origins (supports `*` for any, port wildcards like `http://localhost:*`) |
| `allow_methods` | array | `["GET", "POST", "PUT", "DELETE", "OPTIONS", "PATCH"]` | Allowed HTTP methods |
| `allow_headers` | array | `["Content-Type", "Authorization", "X-Request-ID", "X-Trace-ID"]` | Allowed request headers |
| `expose_headers` | array | `[]` | Headers exposed to client JavaScript |
| `allow_credentials` | boolean | `false` | Allow cookies and authorization headers |
| `max_age` | integer | `3600` | Preflight response cache duration in seconds |
Origin Pattern Matching:
| Pattern | Example | Description |
|---|---|---|
| `*` | `*` | Matches any origin (not compatible with `allow_credentials: true`) |
| Exact URL | `https://example.com` | Exact match |
| Custom scheme | `tauri://localhost` | Custom protocols (Tauri, Electron) |
| Port wildcard | `http://localhost:*` | Matches any port on localhost |
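The matching rules above can be sketched as follows. This is an illustrative simplification, not the router's actual implementation:

```python
def origin_allowed(origin: str, patterns: list[str]) -> bool:
    """Check an Origin header against configured CORS patterns (sketch)."""
    for pattern in patterns:
        if pattern == "*":                 # any origin
            return True
        if pattern.endswith(":*"):         # port wildcard, e.g. http://localhost:*
            base = pattern[:-2]            # strip the ":*" suffix
            if origin == base or origin.startswith(base + ":"):
                return True
        elif origin == pattern:            # exact match (also covers custom schemes)
            return True
    return False

print(origin_allowed("http://localhost:5173", ["http://localhost:*"]))  # True
print(origin_allowed("tauri://localhost", ["tauri://localhost"]))       # True
print(origin_allowed("https://evil.com", ["https://example.com"]))      # False
```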
Security Considerations:
- Using `*` for origins allows any website to make requests - only use for public APIs
- When `allow_credentials` is `true`, you cannot use `*` for origins - specify exact origins
- For development, use port wildcards like `http://localhost:*` for flexibility
- In production, always specify exact origins for security
Hot Reload: CORS configuration supports immediate hot reload - changes apply to new requests instantly without server restart.
Backends Section¶
Defines the LLM backends to route requests to:
backends:
  - name: "unique-identifier"    # Must be unique across all backends
    type: "generic"              # Backend type (optional, defaults to "generic")
    url: "http://backend:port"   # Base URL for the backend
    weight: 1                    # Load balancing weight (1-100)
    api_key: "${API_KEY}"        # API key (optional, supports env var references)
    org_id: "${ORG_ID}"          # Organization ID (optional, for OpenAI)
    models: ["model1", "model2"] # Optional: explicit model list
    retry_override:              # Optional: backend-specific retry settings
      max_attempts: 5
      base_delay: "200ms"
Starting Without Backends¶
The router can start with an empty backends list (`backends: []`), which is useful for:
- Infrastructure bootstrapping: Start the router first, then add backends dynamically via the Admin API
- Container orchestration: Router container can be ready before backend services
- Development workflows: Test admin endpoints before backends are provisioned
- Gradual rollout: Start with zero backends and add them progressively
When running with no backends:
- `/v1/models` returns `{"object": "list", "data": []}`
- `/v1/chat/completions` and other routing endpoints return 503 "No backends available"
- `/health` returns healthy status (the router itself is operational)
- Backends can be added via `POST /admin/backends`
Example minimal configuration for dynamic backend management:
server:
  bind_address: "0.0.0.0:8080"

backends: []  # Start with no backends - add via Admin API later

admin:
  auth:
    method: bearer
    token: "${ADMIN_TOKEN}"
Backend Types Supported:
| Type | Description | Default URL |
|---|---|---|
| `generic` | OpenAI-compatible API (default) | Must be specified |
| `openai` | Native OpenAI API with built-in configuration | `https://api.openai.com/v1` |
| `gemini` | Google Gemini API (OpenAI-compatible endpoint) | `https://generativelanguage.googleapis.com/v1beta/openai` |
| `azure` | Azure OpenAI Service | Must be specified |
| `vllm` | vLLM server | Must be specified |
| `ollama` | Ollama local server | `http://localhost:11434` |
| `llamacpp` | llama.cpp llama-server (GGUF models) | `http://localhost:8080` |
| `mlxcel` | MLxcel server (MLX-based, llama-server compatible, macOS only) | `http://localhost:8080` |
| `lmstudio` | LM Studio local server | `http://localhost:1234` |
| `anthropic` | Anthropic Claude API (native, with request/response translation) | `https://api.anthropic.com` |
| `continuum-router` | Remote Continuum Router or Backend.AI GO instance (federated routing) | Must be specified |
Native OpenAI Backend¶
When using `type: openai`, the router provides:
- Default URL: `https://api.openai.com/v1` (can be overridden for proxies)
- Built-in model metadata: Automatic pricing, context windows, and capabilities
- Environment variable support: Automatically loads from `CONTINUUM_OPENAI_API_KEY` and `CONTINUUM_OPENAI_ORG_ID`
Minimal OpenAI configuration:
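A minimal sketch, assuming the API key is supplied via the `CONTINUUM_OPENAI_API_KEY` environment variable:

```yaml
backends:
  - name: "openai"
    type: openai
    # api_key omitted: loaded automatically from CONTINUUM_OPENAI_API_KEY
    # models omitted: auto-discovered from /v1/models
```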
Full OpenAI configuration with explicit API key:
backends:
  - name: "openai-primary"
    type: openai
    api_key: "${CONTINUUM_OPENAI_API_KEY}"
    org_id: "${CONTINUUM_OPENAI_ORG_ID}"  # Optional
    models:
      - gpt-4o
      - gpt-4o-mini
      - o1
      - o1-mini
      - o3-mini
      - text-embedding-3-large
Using OpenAI with a proxy:
backends:
  - name: "openai-proxy"
    type: openai
    url: "https://my-proxy.example.com/v1"  # Override default URL
    api_key: "${PROXY_API_KEY}"
    models:
      - gpt-4o
Environment Variables for OpenAI¶
| Variable | Description |
|---|---|
| `CONTINUUM_OPENAI_API_KEY` | OpenAI API key (automatically loaded for `type: openai` backends) |
| `CONTINUUM_OPENAI_ORG_ID` | OpenAI Organization ID (optional) |
Model Auto-Discovery:
When `models` is not specified or is empty, backends automatically discover available models from their `/v1/models` API endpoint during initialization. This feature reduces configuration maintenance and ensures all backend-reported models are routable.
| Backend Type | Auto-Discovery Support | Fallback Models |
|---|---|---|
| `openai` | ✅ Yes | gpt-4o, gpt-4o-mini, o3-mini |
| `gemini` | ✅ Yes | gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash |
| `vllm` | ✅ Yes | vicuna-7b-v1.5, llama-2-7b-chat, mistral-7b-instruct |
| `ollama` | ✅ Yes | Uses vLLM discovery mechanism |
| `llamacpp` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `mlxcel` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `lmstudio` | ✅ Yes | Auto-discovers from `/v1/models` endpoint |
| `continuum-router` | ✅ Yes | Auto-discovers from remote `/v1/models` endpoint |
| `anthropic` | ❌ No (no API) | Hardcoded Claude models |
| `generic` | ❌ No | All models supported (`supports_model()` returns `true`) |
Discovery Behavior:
- Timeout: 10-second timeout prevents blocking startup
- Fallback: If discovery fails (timeout, network error, invalid response), fallback models are used
- Logging: Discovered models are logged at INFO level; fallback usage logged at WARN level
Model Resolution Priority:
1. Explicit `models` list from config (highest priority)
2. Models from the `model_configs` field
3. Auto-discovered models from the backend API
4. Hardcoded fallback models (lowest priority)

Explicit model lists improve startup time and reduce backend queries.
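The priority order amounts to a first-non-empty selection. A minimal sketch (illustrative only, not the router's code):

```python
def resolve_models(explicit, model_configs, discovered, fallback):
    """Return the first non-empty model list, mirroring the priority order above."""
    for candidate in (explicit, model_configs, discovered, fallback):
        if candidate:
            return candidate
    return []

# Explicit config wins over discovery:
print(resolve_models(["gpt-4o"], [], ["gpt-4o-mini"], ["o3-mini"]))  # ['gpt-4o']
# With nothing configured, discovery is used before fallbacks:
print(resolve_models([], [], ["gpt-4o-mini"], ["o3-mini"]))          # ['gpt-4o-mini']
```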
Native Gemini Backend¶
When using `type: gemini`, the router provides:
- Default URL: `https://generativelanguage.googleapis.com/v1beta/openai` (OpenAI-compatible endpoint)
- Built-in model metadata: Automatic context windows and capabilities for Gemini models
- Environment variable support: Automatically loads from `CONTINUUM_GEMINI_API_KEY`
- Extended streaming timeout: 300s timeout for thinking models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro)
- Automatic `max_tokens` adjustment for thinking models (see below)
Minimal Gemini configuration:
backends:
  - name: "gemini"
    type: gemini
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Full Gemini configuration with API Key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    weight: 2
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
      - gemini-2.5-pro
      - gemini-2.5-flash
Gemini Authentication Methods¶
The Gemini backend supports two authentication methods:
API Key Authentication (Default)¶
The simplest authentication method using a Google AI Studio API key:
backends:
  - name: "gemini"
    type: gemini
    api_key: "${CONTINUUM_GEMINI_API_KEY}"
    models:
      - gemini-3.1-pro-preview
Service Account Authentication¶
For enterprise environments and Google Cloud Platform (GCP) deployments, you can use Service Account authentication with automatic OAuth2 token management:
backends:
  - name: "gemini"
    type: gemini
    auth:
      type: service_account
      key_file: "/path/to/service-account.json"
    models:
      - gemini-3.1-pro-preview
      - gemini-3-flash-preview
Using environment variable for key file path:
backends:
- name: "gemini"
type: gemini
auth:
type: service_account
key_file: "${GOOGLE_APPLICATION_CREDENTIALS}"
models:
- gemini-3.1-pro-preview
Service Account Authentication Features:
| Feature | Description |
|---|---|
| Automatic Token Refresh | OAuth2 tokens are automatically refreshed 5 minutes before expiration |
| Token Caching | Tokens are cached in memory to minimize authentication overhead |
| Thread-Safe | Concurrent requests safely share token refresh operations |
| Environment Variable Expansion | Key file paths support ${VAR} and ~ expansion |
Creating a Service Account Key:
- Go to Google Cloud Console
- Navigate to IAM & Admin > Service Accounts
- Create a new service account or select an existing one
- Click Keys > Add Key > Create new key
- Choose JSON format and download the key file
- Store the key file securely and reference it in your configuration
Required Permissions:
The service account needs the following roles for Gemini API access:
- `roles/aiplatform.user` - For Vertex AI Gemini endpoints
- Or appropriate Google AI Studio permissions for generativelanguage.googleapis.com
Authentication Priority¶
When multiple authentication methods are configured:
| Priority | Method | Condition |
|---|---|---|
| 1 (Highest) | auth block | If auth.type is specified |
| 2 | api_key field | If no auth block is present |
| 3 | Environment variable | Falls back to CONTINUUM_GEMINI_API_KEY |
If both api_key and auth are specified, the auth block takes precedence and a warning is logged.
Gemini Thinking Models: Automatic max_tokens Adjustment¶
Gemini "thinking" models (gemini-3.1-pro, gemini-3-flash, gemini-2.5-pro, and models with -pro-preview suffix) perform extended reasoning before generating responses. To prevent response truncation, the router automatically adjusts max_tokens:
| Condition | Behavior |
|---|---|
| `max_tokens` not specified | Automatically set to 16384 |
| `max_tokens` < 4096 | Automatically increased to 16384 |
| `max_tokens` >= 4096 | Client value preserved |
This ensures thinking models can generate complete responses without truncation due to low default values from client libraries.
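The adjustment table above can be expressed as a few lines of logic (a sketch for illustration, not the router's implementation):

```python
def adjust_max_tokens(max_tokens):
    """Apply the thinking-model max_tokens rules from the table above."""
    if max_tokens is None or max_tokens < 4096:
        return 16384      # unset or too low: raise to the thinking-model default
    return max_tokens     # >= 4096: preserve the client's value

print(adjust_max_tokens(None))  # 16384
print(adjust_max_tokens(1024))  # 16384
print(adjust_max_tokens(8192))  # 8192
```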
Environment Variables for Gemini¶
| Variable | Description |
|---|---|
| `CONTINUUM_GEMINI_API_KEY` | Google Gemini API key (automatically loaded for `type: gemini` backends) |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account JSON key file (standard GCP environment variable) |
Native Anthropic Backend¶
When using `type: anthropic`, the router provides:
- Default URL: `https://api.anthropic.com` (can be overridden for proxies)
- Native API translation: Automatically converts OpenAI-format requests to the Anthropic Messages API format and vice versa
- Anthropic-specific headers: Automatically adds `x-api-key` and `anthropic-version` headers
- Environment variable support: Automatically loads from `CONTINUUM_ANTHROPIC_API_KEY`
- Extended streaming timeout: 600s timeout for extended thinking models (Claude Opus, Sonnet 4)
Minimal Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    models:
      - claude-sonnet-4-20250514
      - claude-haiku-3-5-20241022
Full Anthropic configuration:
backends:
  - name: "anthropic"
    type: anthropic
    api_key: "${CONTINUUM_ANTHROPIC_API_KEY}"
    weight: 2
    models:
      - claude-opus-4-6
      - claude-sonnet-4-6
      - claude-haiku-4-5
Anthropic API Translation¶
The router automatically handles the translation between OpenAI and Anthropic API formats:
| OpenAI Format | Anthropic Format |
|---|---|
| `messages` array with `role: "system"` | Separate `system` parameter |
| `Authorization: Bearer <key>` | `x-api-key: <key>` header |
| Optional `max_tokens` | Required `max_tokens` (auto-filled if missing) |
| `choices[0].message.content` | `content[0].text` |
| `finish_reason: "stop"` | `stop_reason: "end_turn"` |
| `usage.prompt_tokens` | `usage.input_tokens` |
Example Request Translation:
OpenAI format (incoming from client):
{
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}
Anthropic format (sent to API):
{
  "model": "claude-sonnet-4-20250514",
  "system": "You are helpful.",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 1024
}
Anthropic Native API Endpoints¶
In addition to routing OpenAI-format requests to Anthropic backends, the router also provides native Anthropic API endpoints:
| Endpoint | Description |
|---|---|
| `POST /anthropic/v1/messages` | Native Anthropic Messages API |
| `POST /anthropic/v1/messages/count_tokens` | Token counting with tiered backend support |
| `GET /anthropic/v1/models` | Model listing in Anthropic format |
These endpoints allow clients that use Anthropic's native API format (such as Claude Code) to connect directly without any request/response transformation overhead.
Claude Code Compatibility¶
The Anthropic Native API endpoints include full compatibility with Claude Code and other advanced Anthropic API clients:
Prompt Caching Support:
The router preserves cache_control fields throughout the request/response pipeline:
- System prompt text blocks
- User message content blocks (text, image, document)
- Tool definitions
- Tool use and tool result blocks
Header Forwarding:
| Header | Behavior |
|---|---|
| `anthropic-version` | Forwarded to native Anthropic backends |
| `anthropic-beta` | Forwarded to enable beta features (e.g., `prompt-caching-2024-07-31`, `interleaved-thinking-2025-05-14`) |
| `x-request-id` | Forwarded for request tracing |
Cache Usage Reporting:
Streaming responses from native Anthropic backends include cache usage information:
{
  "usage": {
    "input_tokens": 2159,
    "cache_creation_input_tokens": 2048,
    "cache_read_input_tokens": 0
  }
}
Anthropic Extended Thinking Models¶
Models supporting extended thinking (Claude Opus, Sonnet 4) may require longer response times. The router automatically:
- Sets a higher default `max_tokens` (16384) for thinking models
- Uses an extended streaming timeout (600s) for these models
OpenAI ↔ Claude Reasoning Parameter Conversion¶
The router automatically converts between OpenAI's reasoning parameters and Claude's thinking parameter, enabling seamless cross-provider reasoning requests.
Supported OpenAI Formats:
| Format | API | Example |
|---|---|---|
| `reasoning_effort` (flat) | Chat Completions API | `"reasoning_effort": "high"` |
| `reasoning.effort` (nested) | Responses API | `"reasoning": {"effort": "high"}` |
When both formats are present, reasoning_effort (flat) takes precedence.
Effort Level to Budget Tokens Mapping:
| Effort Level | Claude thinking.budget_tokens |
|---|---|
| `none` | (thinking disabled) |
| `minimal` | 1,024 |
| `low` | 4,096 |
| `medium` | 10,240 |
| `high` | 32,768 |
Example Request - Chat Completions API (flat format):
// Client sends OpenAI Chat Completions API request
{
  "model": "claude-sonnet-4-6",
  "reasoning_effort": "high",
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-6",
  "thinking": {"type": "enabled", "budget_tokens": 32768},
  "messages": [{"role": "user", "content": "Solve this complex problem"}]
}
Example Request - Responses API (nested format):
// Client sends OpenAI Responses API request
{
  "model": "claude-sonnet-4-6",
  "reasoning": {"effort": "medium"},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}

// Router converts to Claude format
{
  "model": "claude-sonnet-4-6",
  "thinking": {"type": "enabled", "budget_tokens": 10240},
  "messages": [{"role": "user", "content": "Analyze this data"}]
}
Response with Reasoning Content:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The final answer is...",
      "reasoning_content": "Let me analyze this step by step..."
    }
  }]
}
Important Notes:
- If the `thinking` parameter is explicitly provided, it takes precedence over `reasoning_effort` and `reasoning.effort`
- `reasoning_effort` (flat) takes precedence over `reasoning.effort` (nested) when both are present
- Only models supporting extended thinking (Opus 4.x, Sonnet 4.x) will have reasoning enabled
- When reasoning is enabled, the `temperature` parameter is automatically removed (Claude API requirement)
- For streaming responses, thinking content is returned as `reasoning_content` delta events
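Putting the effort mapping and precedence rules together, the conversion can be sketched as follows. The helper name is hypothetical; this is not the router's code:

```python
# Effort level -> Claude thinking.budget_tokens, per the mapping table above
EFFORT_TO_BUDGET = {"minimal": 1024, "low": 4096, "medium": 10240, "high": 32768}

def to_thinking(request: dict):
    """Convert OpenAI reasoning parameters to a Claude thinking block (sketch)."""
    # Flat reasoning_effort takes precedence over nested reasoning.effort
    effort = request.get("reasoning_effort") or request.get("reasoning", {}).get("effort")
    if effort is None or effort == "none":
        return None  # thinking disabled
    return {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}

print(to_thinking({"reasoning_effort": "high"}))
# {'type': 'enabled', 'budget_tokens': 32768}
print(to_thinking({"reasoning": {"effort": "medium"}}))
# {'type': 'enabled', 'budget_tokens': 10240}
print(to_thinking({"reasoning_effort": "none"}))
# None
```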
Environment Variables for Anthropic¶
| Variable | Description |
|---|---|
| `CONTINUUM_ANTHROPIC_API_KEY` | Anthropic API key (automatically loaded for `type: anthropic` backends) |
Native llama.cpp Backend¶
When using `type: llamacpp`, the router provides native support for llama.cpp llama-server:
- Default URL: `http://localhost:8080` (llama-server default port)
- Health Check: Uses the `/health` endpoint (with fallback to `/v1/models`)
- Model Discovery: Parses llama-server's hybrid `/v1/models` response format
- Rich Metadata: Extracts context window, parameter count, and model size from the response
Minimal llama.cpp configuration:
backends:
  - name: "local-llama"
    type: llamacpp
    # No URL needed if using default http://localhost:8080
    # No API key required for local server
Full llama.cpp configuration:
backends:
  - name: "local-llama"
    type: llamacpp
    url: "http://192.168.1.100:8080"  # Custom URL if needed
    weight: 2
    # Models are auto-discovered from /v1/models endpoint
llama.cpp Features¶
| Feature | Description |
|---|---|
| GGUF Models | Native support for GGUF quantized models |
| Local Inference | No cloud API dependencies |
| Hardware Support | CPU, NVIDIA, AMD, Apple Silicon |
| Streaming | Full SSE streaming support |
| Embeddings | Supports /v1/embeddings endpoint |
| Tool Calling Detection | Auto-detects tool calling support via /props endpoint |
Tool Calling Auto-Detection¶
The router automatically detects tool calling capability for llama.cpp backends by querying the /props endpoint during model discovery. This enables automatic function calling support without manual configuration.
How it works:
1. When a llama.cpp backend is discovered, the router fetches the `/props` endpoint
2. The `chat_template` field is analyzed using precise Jinja2 pattern matching to detect tool-related syntax
3. If tool calling patterns are detected, the model's `function_calling` capability is automatically enabled
4. Detection results are stored for reference (including a hash of the chat template)
Detection Patterns:
The router uses precise pattern matching to reduce false positives:
- Role-based patterns: `message['role'] == 'tool'`, `message.role == "tool"`
- Tool iteration: `for tool in tools`, `for function in functions`
- Tool calls access: `.tool_calls`, `['tool_calls']`, `message.tool_call`
- Jinja2 blocks with tool keywords: `{% if tools %}`, `{% for tool_call in ... %}`
Example /props response analyzed:
{
  "chat_template": "{% for message in messages %}{% if message['role'] == 'tool' %}...",
  "default_generation_settings": { ... },
  "total_slots": 1
}
Fallback Behavior:
- If `/props` is unavailable: tool calling is assumed to be supported (optimistic fallback for modern llama.cpp versions)
- If `/props` returns an error: tool calling is assumed to be supported (ensures compatibility with newer models)
- If the chat template exceeds 64KB: detection is skipped and defaults to supported
- Detection is case-insensitive for maximum compatibility
- Results are merged with any existing model metadata from `model-metadata.yaml`
- Detected capabilities appear in the `features` field of the `/v1/models/{model_id}` response
Model Metadata Extraction¶
The router extracts rich metadata from llama-server responses:
| Field | Source | Description |
|---|---|---|
| Context Window | meta.n_ctx_train | Training context window size |
| Parameter Count | meta.n_params | Model parameters (e.g., "4B") |
| Model Size | meta.size | File size in bytes |
| Capabilities | models[].capabilities | Model capabilities array |
Starting llama-server¶
# Basic startup
./llama-server -m model.gguf --port 8080
# With GPU layers
./llama-server -m model.gguf --port 8080 -ngl 35
# With custom context size
./llama-server -m model.gguf --port 8080 --ctx-size 8192
Auto-Detection of llama.cpp Backends¶
When a backend is added without a type specified (defaults to `generic`), the router automatically probes the `/v1/models` endpoint to detect the backend type. llama.cpp backends are identified by:
- `owned_by: "llamacpp"` in the response
- Presence of llama.cpp-specific metadata fields (`n_ctx_train`, `n_params`, `vocab_type`)
- Hybrid response format with both `models[]` and `data[]` arrays
This auto-detection works for:
- Hot-reload configuration changes
- Backends added via Admin API without explicit type
- Configuration files with `type: generic` or no type specified
Example: Auto-detected backend via Admin API:
# Add backend without specifying type - auto-detects llama.cpp
curl -X POST http://localhost:8080/admin/backends \
  -H "Content-Type: application/json" \
  -d '{
    "name": "local-llm",
    "url": "http://localhost:8080"
  }'
Native MLxcel Backend¶
When using type: mlxcel, the router provides native support for MLxcel, an MLX-based model serving backend for macOS with Apple Silicon:
- Default URL: `http://localhost:8080` (same as llama-server)
- API Compatibility: Fully compatible with the llama-server (llama.cpp) API
- Model Format: Serves SafeTensor-format models via Apple's MLX framework
- Health Check: Uses `/health` as primary, with `/v1/models` as fallback
- Platform: macOS with Apple Silicon only
Minimal MLxcel configuration:
backends:
  - name: "mlxcel-local"
    type: mlxcel
    # No URL needed if using default http://localhost:8080
Full MLxcel configuration:
backends:
  - name: "mlxcel-local"
    type: mlxcel
    url: "http://192.168.1.100:8080"  # Custom URL if needed
    weight: 2
    models:
      - mlx-community/Qwen3-4B-4bit
Auto-detection not supported
MLxcel cannot be auto-detected from the `/v1/models` response because it returns the same response format as llama.cpp (including `owned_by: "llamacpp"`). You must explicitly set `type: mlxcel` in the configuration. This ensures the proper `owned_by` metadata (`mlxcel`) is used for model identification.
Native LM Studio Backend¶
When using type: lmstudio, the router provides native support for LM Studio local server:
- Default URL: `http://localhost:1234` (LM Studio default port)
- Health Check: Uses `/v1/models` (OpenAI-compatible) as primary, with `/api/v1/models` (native API) as fallback
- Model Discovery: Auto-discovers models from the `/v1/models` endpoint
- `owned_by` Attribution: Reports `"lmstudio"` for proper model attribution
Minimal LM Studio configuration:
backends:
  - name: "lmstudio"
    type: lmstudio
    # No URL needed if using default http://localhost:1234
    # No API key required for local server
Full LM Studio configuration:
backends:
  - name: "lmstudio"
    type: lmstudio
    url: "http://192.168.1.100:1234"  # Custom URL if needed
    weight: 2
    api_key: "${LM_API_TOKEN}"  # Optional: LM Studio API token (v0.4.0+)
    # Models are auto-discovered from /v1/models endpoint
LM Studio Features¶
| Feature | Description |
|---|---|
| OpenAI-Compatible API | Full /v1/chat/completions, /v1/completions, /v1/embeddings support |
| Native REST API | Additional /api/v1/* endpoints for model management |
| Local Inference | No cloud API dependencies |
| Auto-Discovery | Models automatically detected from /v1/models |
| Optional Authentication | Supports API token via Authorization: Bearer header (v0.4.0+) |
Native Continuum Router / Backend.AI GO Backend¶
When using `type: continuum-router`, the router connects to a remote Continuum Router instance or Backend.AI GO deployment for federated LLM routing. Supported aliases include: `continuum-router`, `continuum_router`, `ContinuumRouter`, `backendai`, `backend-ai`, `backend_ai`.
- Health Check: Uses `/health` as primary, with `/v1/models` as fallback
- Model Discovery: Auto-discovers models from the remote instance's `/v1/models` endpoint
- Authentication: Bearer token via the `Authorization: Bearer <key>` header
- Request Passthrough: Requests are forwarded without transformation (both systems use OpenAI-compatible APIs)
- `owned_by` Attribution: Reports `"continuum-router"` for discovered models
- Transport: Supports both HTTP and Unix domain socket transports
Minimal configuration:
backends:
  - name: "remote-cr"
    type: continuum-router
    url: "https://remote.example.com"
    api_key: "${REMOTE_API_KEY}"
    # Models are auto-discovered from remote /v1/models endpoint
Full configuration with explicit models:
backends:
  - name: "remote-backendai"
    type: continuum-router
    url: "https://remote-backend-ai.example.com"
    api_key: "${REMOTE_BACKEND_AI_API_KEY}"
    weight: 2
    models:
      - gpt-4o
      - claude-sonnet-4-20250514
Use cases:
- Multi-region deployment: geo-route requests across Continuum Router instances
- Federated routing: connect multiple independent CR or Backend.AI GO deployments
- Tiered access: route through a central Backend.AI GO instance for quota management
- High availability: configure multiple Backend.AI GO instances for failover
Continuum Router Backend Features¶
| Feature | Description |
|---|---|
| Federated Routing | Forward requests to remote Continuum Router or Backend.AI GO instances |
| Auto-Discovery | Models automatically discovered from remote /v1/models |
| Bearer Auth | API key forwarded as Authorization: Bearer header |
| SSE Streaming | Full streaming support for chat completions |
| No Transformation | Requests passed through as-is (OpenAI-compatible on both ends) |
| Unix Socket Support | Supports unix:///path/to/socket.sock transport URLs |
Unix Domain Socket Backends¶
Continuum Router supports Unix Domain Sockets (UDS) as an alternative transport to TCP for local LLM backends. Unix sockets provide:
- Enhanced Security: No TCP port exposure - communication happens through the file system
- Lower Latency: No network stack overhead for local communication
- Better Performance: Reduced context switching and memory copies
- Simple Access Control: Uses standard Unix file permissions (on Linux/macOS; Windows does not support Unix file modes)
URL Format: `unix:///path/to/socket.sock` (the `unix://` scheme followed by an absolute path)
Platform Support:
| Platform | Support |
|---|---|
| Linux | Full support via native AF_UNIX |
| macOS | Full support via native AF_UNIX |
| Windows | Full support via socket2 crate (Windows 10 1809+ / Build 17063+) |
| Other | Not supported; addresses are skipped with a warning |
Configuration Examples:
On Windows, use drive-letter paths (e.g., unix://C:/temp/llama.sock). On Linux/macOS, use standard absolute paths (e.g., unix:///var/run/llama.sock).
backends:
  # llama-server with Unix socket (Linux/macOS)
  - name: "llama-socket"
    type: llamacpp
    url: "unix:///var/run/llama-server.sock"
    weight: 2
    models:
      - llama-3.2-3b
      - qwen3-4b

  # Ollama with Unix socket
  - name: "ollama-socket"
    type: ollama
    url: "unix:///var/run/ollama.sock"
    weight: 1
    models:
      - llama3.2
      - mistral

  # vLLM with Unix socket
  - name: "vllm-socket"
    type: vllm
    url: "unix:///tmp/vllm.sock"
    weight: 3
    models:
      - meta-llama/Llama-3.1-8B-Instruct
Starting Backends with Unix Sockets:
# llama-server
./llama-server -m model.gguf --unix /var/run/llama.sock
# Ollama
OLLAMA_HOST="unix:///var/run/ollama.sock" ollama serve
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--unix-socket /tmp/vllm.sock
Socket Path Conventions:
| Path | Use Case |
|---|---|
| `/var/run/*.sock` | System services (requires root) |
| `/tmp/*.sock` | Temporary, user-accessible |
| `~/.local/share/continuum/*.sock` | Per-user persistent sockets |
| `~/Library/Application Support/*.sock` | macOS application data (paths with spaces are supported) |
Health Checks: The router automatically performs health checks on Unix socket backends using the same endpoints (/health, /v1/models) as TCP backends.
Current Limitations:
- Streaming (SSE) not supported: Unix socket backends do not currently support Server-Sent Events (SSE) streaming. Use TCP backends for streaming chat completions.
- Max response size: Response bodies are limited to 100MB by default to prevent memory exhaustion.
Troubleshooting:
| Error | Cause | Solution |
|---|---|---|
| "Socket file not found" | Server not running | Start the backend server |
| "Permission denied" | File permissions | chmod 660 socket.sock |
| "Connection timeout" | Server not accepting connections | Verify server is listening |
| "Response body exceeds maximum size" | Response too large | Increase maxresponsesize or use streaming with TCP backend |