
Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.
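
For example, a backend entry could carry both forms; under the rule above the file wins (a minimal sketch, assuming the inline key is named prompt as referenced above):

global_prompts:
  backends:
    anthropic:
      prompt: "Inline fallback prompt."        # ignored because prompt_file is also set
      prompt_file: "anthropic-system.md"       # takes precedence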

Merge Strategies

| Strategy | Behavior |
| --- | --- |
| prepend | Global prompt added before user's system prompt (default) |
| append | Global prompt added after user's system prompt |
| replace | Global prompt replaces user's system prompt entirely |
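
As an illustration, with merge_strategy: prepend the user's own system prompt is kept and the global prompt is placed in front of it, joined by the configured separator (a sketch using the values from the basic configuration above):

global_prompts:
  default: "You must follow company security policies."
  merge_strategy: prepend
  separator: "\n\n---\n\n"

# For a request whose own system prompt is "You are a SQL expert.",
# the effective system prompt becomes:
#
#   You must follow company security policies.
#
#   ---
#
#   You are a SQL expert.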

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
    - id: "gpt-4"
      aliases:                    # Alternative IDs that share this metadata
        - "gpt-4-0125-preview"
        - "gpt-4-turbo-preview"
        - "gpt-4-vision-preview"
      metadata:
        display_name: "GPT-4"
        summary: "Most capable GPT-4 model for complex tasks"
        capabilities: ["text", "image", "function_calling"]
        knowledge_cutoff: "2024-04"
        pricing:
          input_tokens: 0.03   # Per 1000 tokens
          output_tokens: 0.06  # Per 1000 tokens
        limits:
          context_window: 128000
          max_output: 4096

    - id: "llama-3-70b"
      aliases:                    # Different quantizations of the same model
        - "llama-3-70b-instruct"
        - "llama-3-70b-chat"
        - "llama-3-70b-q4"
        - "llama-3-70b-q8"
      metadata:
        display_name: "Llama 3 70B"
        summary: "Open-source model with strong performance"
        capabilities: ["text", "code"]
        knowledge_cutoff: "2023-12"
        pricing:
          input_tokens: 0.001
          output_tokens: 0.002
        limits:
          context_window: 8192
          max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Thinking Pattern Configuration

Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.

Pattern Types:

| Pattern | Description | Example Model |
| --- | --- | --- |
| none | No thinking pattern (default) | Most models |
| standard | Explicit start/end tags (<think>...</think>) | Custom reasoning models |
| unterminated_start | No start tag, only end tag | nemotron-3-nano |

Configuration Example:

models:
    - id: nemotron-3-nano
      metadata:
        display_name: "Nemotron 3 Nano"
        capabilities: ["chat", "reasoning"]
        # Thinking pattern configuration
        thinking:
          pattern: unterminated_start
          end_marker: "</think>"
          assume_reasoning_first: true

Thinking Pattern Fields:

| Field | Type | Description |
| --- | --- | --- |
| pattern | string | Pattern type: none, standard, or unterminated_start |
| start_marker | string | Start marker for standard pattern (e.g., <think>) |
| end_marker | string | End marker (e.g., </think>) |
| assume_reasoning_first | boolean | If true, treat first tokens as reasoning until end marker |

How It Works:

When a model has a thinking pattern configured:

  1. Streaming responses are intercepted and transformed
  2. Content before end_marker is sent as reasoning_content field
  3. Content after end_marker is sent as content field
  4. The output follows OpenAI's reasoning_content format for compatibility

Example Output:

// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}

Responses-API-only Models

OpenAI exposes some models exclusively via the Responses API (/v1/responses). These models are not reachable through /v1/chat/completions, so a request that targets them on a Chat Completions endpoint returns a 404 not_found from upstream.

The responses_only capability flag marks such models so the router can dispatch them to the Responses API surface instead. The flag defaults to false, so existing model entries do not need to be touched.

Configuration Example:

models:
    - id: gpt-5.4-pro
      metadata:
        display_name: "GPT-5.4 Pro"
        capabilities: ["chat", "vision", "code", "reasoning", "tool"]
        # Served only on /v1/responses; not available on /v1/chat/completions.
        responses_only: true
        limits:
          context_window: 1050000
          max_output: 128000

Models marked Responses-API-only out of the box

The list below is kept in sync with model-metadata.yaml and the built-in OpenAI registry (src/infrastructure/backends/openai/models/gpt5_family.rs). When a new Responses-API-only model is added upstream, both files should be updated together.

| Model ID | Source | Notes |
| --- | --- | --- |
| gpt-5.2-pro | Built-in OpenAI metadata + model-metadata.yaml | Smartest model for difficult questions; xhigh reasoning effort |
| gpt-5.4-pro | model-metadata.yaml | Frontier-class deep reasoning; supports medium, high, xhigh |
| gpt-5.5-pro | model-metadata.yaml | High-capability variant of GPT-5.5 for high-stakes workloads |

The flag follows the same lookup priority chain as the rest of the metadata (backend model_configs > model-metadata.yaml > built-in OpenAI metadata), so an operator-supplied entry can override the default for any model.

Marking a new model as Responses-API-only

To mark an additional model as Responses-API-only, add responses_only: true to the model entry's metadata block in any of the supported sources. Use the lookup priority that fits the deployment scope:

  • model-metadata.yaml for a router-wide default that applies to every backend. Add the flag alongside the existing capability metadata; no other field needs to change. This is the recommended location for newly-released Pro models that are uniformly Responses-API-only across providers.
  • Backend model_configs in config.yaml for a backend-specific override (for example, when a self-hosted clone of a Pro model is exposed on a Chat Completions endpoint and should not be dispatched to /v1/responses). A backend-level responses_only: false overrides the metadata-file default for that backend only.
  • Built-in OpenAI registry in src/infrastructure/backends/openai/models/gpt5_family.rs for models that ship with the binary. New entries here should also be reflected in model-metadata.yaml so externally-loaded metadata stays consistent.

After updating any of these sources, restart the router or trigger a hot reload so the new flag takes effect on subsequent requests.
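
For example, the backend-level override described above might look like this (a sketch; the backend name and URL are illustrative):

backends:
    - name: "self-hosted-clone"          # hypothetical backend serving a clone of the model
      url: "http://localhost:9000"
      model_configs:
        - id: "gpt-5.4-pro"
          metadata:
            # The clone is exposed on a Chat Completions endpoint, so override the
            # metadata-file default and keep dispatch on /v1/chat/completions.
            responses_only: false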

Dispatch behavior

The router honors responses_only=true on every public surface that would otherwise hit /v1/chat/completions:

  • /v1/chat/completions: requests transparently forward to the upstream /v1/responses endpoint and the response is translated back into a strict-mode chat.completion (or chat.completion.chunk for streaming) envelope.
  • /anthropic/v1/messages: the Anthropic-formatted request is converted to the Responses API shape, dispatched to /v1/responses, and the upstream response is translated back into Anthropic Messages JSON (or the Anthropic SSE event sequence for streaming). Tool-call round-trips, web-search emulation, and Unix-socket transports all branch on the flag.

In both cases the dispatch is transparent to the client: the request and response shapes match the surface the client called, so no client-side changes are required to use a responses_only model.

Backend-type constraint

Only OpenAI and Azure OpenAI backends serve /v1/responses. When a responses_only model is paired with a backend whose type is not OpenAI or Azure OpenAI, the router rejects the request with a 400 invalid_request_error (Anthropic-shaped on /anthropic/v1/messages, OpenAI-shaped on /v1/chat/completions) before any upstream dispatch. The message names both the model and the configured backend type so the misconfiguration is visible from the client log.

The first dispatch per (backend, model) pair logs at info level so operators can confirm Responses-API routing without enabling debug logs.

Namespace-Aware Matching

The router handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.
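
A single metadata entry therefore covers all namespaced variants; a minimal sketch:

models:
    - id: "gpt-4"
      metadata:
        display_name: "GPT-4"
        # "custom/gpt-4", "openai/gpt-4", and "optimized/gpt-4" all resolve to this
        # entry via namespace stripping; no extra aliases are needed.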

Metadata Priority and Alias Resolution

When looking up metadata for a model, the router uses the following priority chain:

  1. Exact model ID match
  2. Exact alias match
  3. Date suffix normalization (automatic, zero-config)
  4. Quantization / format suffix normalization (automatic, zero-config; see below)
  5. Combined date + format suffix normalization
  6. Wildcard pattern alias match
  7. Base model name fallback (namespace stripping)

Within each source (backend config, metadata file, built-in), the same priority applies:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Built-in metadata (for OpenAI and Gemini backends)

Automatic Date Suffix Handling

LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:

Supported date patterns:

  • -YYYYMMDD (e.g., claude-opus-4-5-20251130)
  • -YYYY-MM-DD (e.g., gpt-4o-2024-08-06)
  • -YYMM (e.g., o1-mini-2409)
  • @YYYYMMDD (e.g., model@20251130)

How it works:

Request: claude-opus-4-5-20251215
         ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101  (existing metadata entry)
         ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata

This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.

Automatic Quantization and Format Suffix Handling

Real-world model IDs seen by /v1/models, the routing logic, and backend metadata enrichment frequently combine a canonical base ID with one or more trailing quantization, format, or flavor tokens. The router strips an allowlisted set of such tokens iteratively and retries exact-id, exact-alias, and date-suffix matching after each peel, so you only need to configure metadata for the canonical base ID.

Token Categories

The following trailing tokens are detected and stripped (case-insensitive):

| Category | Examples |
| --- | --- |
| Bit-width | -2bit, -3bit, -4bit, -5bit, -6bit, -8bit, -16bit |
| GGUF / llama.cpp quants | -Q4_K_M, -Q4_K_S, -Q5_K_M, -Q6_K, -Q8_0, -Q2_K, -IQ2_XS, -IQ3_XXS, -IQ4_XS, -F16, -F32, -BF16 |
| FP formats | -FP4, -FP8, -FP16, -FP32, -NVFP4, -MXFP4 |
| INT formats | -INT2, -INT4, -INT8 |
| Library tags | -AWQ, -GPTQ, -BNB, -HQQ, -EXL2, -EXL3, -MLX |
| Imatrix / abbreviated | -i1 through -i8, -q2 through -q8 |
| Unsloth dynamic | -UD-Q*, -UD-IQ* |
| Container formats | -GGUF, -GGML, -SAFETENSORS |
| Flavors | -it, -instruct, -chat, -base, -thinking, -qat |

Parameter-Count Suffixes are Preserved

Tokens that look like parameter counts are never stripped, even when they share a trailing b:

  • Kept: -32b, -70b, -8b, -4b, -a3b, -a22b, -0.6b, -1.7b, -e4b
  • Stripped: -4bit, -8bit, -16bit (the literal bit suffix marks quantization)

This discrimination ensures that a parameter-count variant like qwen3-32b resolves only to explicit qwen3-32b metadata, never to a generic qwen3 entry via accidental stripping.
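
In practice, a parameter-count variant that should carry metadata needs its own entry (or an explicit alias); a sketch:

models:
    - id: "qwen3-32b"          # parameter-count variant keeps its own entry
      metadata:
        display_name: "Qwen 3 32B"
    - id: "qwen3"              # generic entry; never reached by stripping -32b
      metadata:
        display_name: "Qwen 3"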

Layered Peeling

Tokens are stripped one at a time. After each peel, the router re-runs exact-id, exact-alias, and date-suffix match before attempting another peel. This lets alias configurations like gemma-3-4b-it-qat still win even when the request is gemma-3-4b-it-qat-4bit:

Request: gemma-3-4b-it-qat-4bit
         ↓ (peel -4bit)
Try:     gemma-3-4b-it-qat
         ↓ (matches alias of gemma-3-4b-qat)
Result:  Uses gemma-3-4b-qat metadata

Priority Note

Stripping runs after exact-id and exact-alias match. A canonical base ID that happens to end in an allowlisted token (for example gemma-3-12b-qat) wins before the peel phase runs, so existing configurations remain stable.

Suffix-Order Ambiguity

Both -qat-4bit and -4bit-qat orderings appear in real-world model IDs. Peeling removes one token at a time from the right, so the intermediate form mirrors the order in which tokens appeared in the input. The match sequence for gemma-3-12b-qat-4bit is gemma-3-12b-qat-4bit → gemma-3-12b-qat → gemma-3-12b, while gemma-3-12b-4bit-qat goes gemma-3-12b-4bit-qat → gemma-3-12b-4bit → gemma-3-12b. If both suffix orderings need to resolve to the same QAT-variant metadata, configure the canonical QAT base ID (gemma-3-12b-qat) with the matching metadata and let the non-QAT form (gemma-3-12b) carry its own entry; the deepest successful match wins at each peel depth. When the QAT and non-QAT variants need distinct tier or capability metadata, prefer aliases that enumerate the reorderings over relying on the peel order alone.

Length Bounds

The layered peel phase caps input length at 256 characters and iteration count at 8 peels as defense-in-depth against pathological inputs. Matching still runs (the exact-id and exact-alias phases remain in effect), but the peel phase short-circuits instead of walking a long allowlist-token chain. Request handlers enforce the same 256-character cap on the model field for every chat / completion / embedding endpoint, so normal traffic never hits the internal cap.

Case Insensitivity

Stripping is case-insensitive, so Qwen3.5-4B-4bit, QWEN3.5-4B-4BIT, and qwen3.5-4b-4bit all resolve to the same qwen3.5-4b metadata entry. Exact-id and exact-alias match phases (1 and 2) remain case-sensitive, so HuggingFace-style aliases like BAAI/bge-m3 keep their original behavior.

Wildcard Pattern Matching

Aliases support glob-style wildcard patterns using the * character:

  • Prefix matching: claude-* matches claude-opus, claude-sonnet, etc.
  • Suffix matching: *-preview matches gpt-4o-preview, o1-preview, etc.
  • Infix matching: gpt-*-turbo matches gpt-4-turbo, gpt-3.5-turbo, etc.

Example configuration with wildcard patterns:

models:
    - id: "claude-opus-4-5-20251101"
      aliases:
        - "claude-opus-4-5"     # Exact match for base name
        - "claude-opus-*"       # Wildcard for any claude-opus variant
      metadata:
        display_name: "Claude Opus 4.5"
        # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

    - id: "gpt-4o"
      aliases:
        - "gpt-4o-*-preview"    # Matches preview versions
        - "*-4o-turbo"          # Suffix matching
      metadata:
        display_name: "GPT-4o"

Priority note: Exact aliases are always matched before wildcard patterns. When both could match, the exact alias wins.

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model
  • Dated versions: claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241201 → share metadata (automatic with date suffix handling)

Example configuration with aliases:

model_configs:
    - id: "qwen3"
      aliases:
        - "qwen3-32b-i1"     # 32B with 1-bit quantization
        - "qwen3-23b-i4"     # 23B with 4-bit quantization
        - "qwen3-16b-q8"     # 16B with 8-bit quantization
        - "qwen3-*"          # Wildcard for any other qwen3 variant
      metadata:
        display_name: "Qwen 3"
        summary: "Alibaba's Qwen model family"
        # ... rest of metadata

Aliases vs. suffix normalization: when to use which

Two coverage layers resolve non-canonical model ids to their owning metadata entry: explicit YAML aliases, and the layered suffix-peel allowlist in src/models/pattern_matching.rs. They are complementary, not redundant. This section explains how to choose between them when adding a new entry.

Matching phase order

The pipeline runs in this order, and a successful match in an earlier phase short-circuits the later ones:

  1. Exact model id (case-sensitive).
  2. Exact alias (case-sensitive).
  3. Date-suffix normalization (-YYYYMMDD, -YYYY-MM-DD, -YYMM, @YYYYMMDD).
  4. Layered quantization / format / flavor peel (case-insensitive; after each peel, exact-id + exact-alias + date-suffix phases re-run; combined date + format handled in the same loop).
  5. HuggingFace repo-prefix stripping (vendor/repo -> repo) with re-entry into phases 1-4 on the stripped residual. Single-hop re-entry; phase 5 does not recurse.
  6. Wildcard alias (glob-style * patterns).

A retained explicit alias runs in phase 2, strictly before the peel (phase 4) and before the prefix-strip layer (phase 5). When a retained alias and a peel-or-strip path would resolve to different metadata, the alias wins deterministically. Aliases are therefore a stronger intent signal than peel-or-strip coverage, not a weaker one.

The three alias classes

Every alias in model-metadata.yaml falls into one of three classes.

peel-coverable

Normalization reaches the same owner id without the alias, and the target metadata is the correct one. These are deletion candidates. Example: qwen3.6-35b-a3b-instruct as an alias of qwen3.6-35b-a3b. Phase 4 peels the FLAVOR token -instruct and lands on the base id directly, so the explicit alias adds no coverage. The 64 aliases removed in issue #557 were all in this class, and each one has a regression assert in tests/format_suffix_normalization_test.rs::real_metadata_removed_aliases_still_resolve.

vendor-prefix

The alias carries a vendor or repo prefix that suffix peel cannot strip, because peel only removes right-side tokens from a closed allowlist. Historically (pre-#555) such aliases were strictly load-bearing because the old namespace-fallback phase was case-sensitive; after #555 introduced phase 5 (HuggingFace prefix stripping with re-entry into phases 1-4, where phase 4 is case-insensitive), the mixed-case HF form resolves without the alias. Example: Qwen/Qwen3.6-35B-A3B as an alias of qwen3.6-35b-a3b. Phase 2 still wins on the explicit alias today, but phase 5 would also reach the base id via Qwen/ -> residual Qwen3.6-35B-A3B -> phase 4 case-insensitive match. These aliases are now peel-coverable-adjacent. Retroactive removal is deferred to a follow-up audit; keep for now, with a YAML comment noting the covering phase.

intentional-override

The alias deliberately routes a differently-weighted model under another entry's metadata, as an operator decision. Keep. Example: smoothie-qwen3-32b-i1 as an alias of smoothie-qwen3. The smoothie-qwen3-32b-i1 fine-tune has its own weights; the operator has chosen to surface it under the umbrella smoothie-qwen3 metadata rather than give it a dedicated entry. Peel must not infer this equivalence on its own. When an alias sits in this class, the YAML comment on the line must note that the underlying weights differ from the owner id, so a future reader or auditor can tell an intentional override apart from a mechanical normalization gap.

Guidance for adding a new alias

Before adding a line to model-metadata.yaml, ask whether peel already covers it.

  • If the new id is a canonical base with a trailing quantization, format, or flavor token already in the allowlist, and the weights are equivalent to the base metadata, do not add the alias. The peel handles it, and adding the alias would be dead code.
  • If the new id shares weights with the base but ends in a token class that peel does not yet handle (for example, a novel fine-tune label like -abliterated or a new quantization format like -nf4), prefer extending the peel allowlist in src/models/pattern_matching.rs. This is a code change with test coverage, and it lifts an entire class of future variants in one move.
  • If the new id has a vendor prefix, a repo namespace that normalization would not case-match, a parameter-count token blocking the peel chain (-Nb, -aNb, -eNb), or intentionally-different weights, add the alias with a YAML comment that states the reason. If weights differ from the owner id, say so in the comment.
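
For example, an intentional-override alias recorded with its reason (a sketch; the display name is illustrative):

models:
    - id: "smoothie-qwen3"
      aliases:
        # Intentional override: the smoothie-qwen3-32b-i1 fine-tune has different
        # weights from the owner id; the operator surfaces it under this entry.
        - "smoothie-qwen3-32b-i1"
      metadata:
        display_name: "Smoothie Qwen 3"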

Surface distinction: code-gated vs. YAML-gated

| Change site | Gate | Release cadence | Use for |
| --- | --- | --- | --- |
| Peel allowlist in src/models/pattern_matching.rs | Code review + Rust release | Ships with the next router release | Strategic normalization that covers a whole token class across all models. |
| Aliases in model-metadata.yaml | YAML review + hot reload | Same-day reload via admin API | Individual overrides, vendor-prefix fixes, weight-differing overrides, and emergency coverage for novel tokens before they earn a peel allowlist entry. |

The peel allowlist is the strategic layer. Aliases are the tactical override and emergency channel.

Token categories already on the peel allowlist

The allowlist in src/models/pattern_matching.rs currently covers:

  • BIT_WIDTH: -2bit, -3bit, -4bit, -5bit, -6bit, -8bit, -16bit
  • GGUF_QUANT: -Q4_K_M, -Q4_K_S, -Q5_K_M, -Q6_K, -Q8_0, -Q2_K, -IQ2_XS, -IQ3_XXS, -IQ4_XS, -F16, -F32, -BF16
  • FP_FORMAT: -FP4, -FP8, -FP16, -FP32, -NVFP4, -MXFP4
  • INT_FORMAT: -INT2, -INT4, -INT8
  • LIBRARY: -AWQ, -GPTQ, -BNB, -HQQ, -EXL2, -EXL3, -MLX
  • IMATRIX: -i1 through -i8, -q2 through -q8
  • UNSLOTH: -UD-Q<digit>_<KIND>, -UD-IQ<digit>_<KIND>
  • CONTAINER: -GGUF, -GGML, -SAFETENSORS
  • FLAVOR: -it, -instruct, -chat, -base, -thinking, -qat

Parameter-count suffixes (-Nb, -aNb, -eNb, -0.6b, -1.7b) are never peeled. They are part of canonical model identity and terminate the peel chain. This is why qwen3-32b-i1 must be kept as an explicit alias of qwen3: phase 4 strips -i1 and then halts at -32b, so without the alias the chain exhausts before reaching the base id.

HuggingFace repo-prefix stripping (phase 5)

Phase 5 normalizes HuggingFace-style vendor/repo prefixes off the left side of a model id, complementing the right-side suffix peel. It was added in issue #555 to resolve the common HF-GGUF class where a user submits an id like unsloth/Qwen3.6-35B-A3B-GGUF and expects it to route to the canonical qwen3.6-35b-a3b metadata without an explicit alias for every vendor x base x quant combination.

How phase 5 runs

  1. The input is inspected for a / separator. No /, no-op.
  2. Total segments (count of / plus one) must be at most MAX_PREFIX_SEGMENTS (3). org/team/repo is permitted; a/b/c/d/model is rejected outright.
  3. All segments must be non-empty and free of ASCII whitespace. Malformed inputs like /repo, vendor/, vendor//repo, or vendor /repo are rejected.
  4. On success, the residual is the substring after the last /. This residual is fed back into phases 1-4 with the re-entry gate closed. Phase 5 does not recurse: the inner call cannot trigger phase 5 again, so the recursion depth is exactly 1 by construction.

Composition with suffix peel

The re-entry runs through phase 4, so prefix stripping composes with suffix peel in a single lookup. unsloth/Qwen3.6-35B-A3B-GGUF strips to Qwen3.6-35B-A3B-GGUF, phase 4 peels -GGUF, case-insensitively matches qwen3.6-35b-a3b. This is the motivating case for the phase and covers HuggingFace GGUF forks without requiring hand-enumerated aliases.

Registered-alias precedence

Operators who explicitly register a vendor/repo form as a YAML alias keep deterministic control. Phase 2 runs before phase 5, so the exact alias wins before the stripping layer ever considers the input. Use this when the prefixed form must route to a different metadata entry than the canonical base id would.
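
A sketch of such a registration (the dedicated entry and its metadata are hypothetical):

models:
    - id: "qwen3.6-35b-a3b-custom"
      aliases:
        # Exact alias: phase 2 wins before phase 5 prefix stripping, so this
        # prefixed form never falls through to the canonical qwen3.6-35b-a3b entry.
        - "Qwen/Qwen3.6-35B-A3B"
      metadata:
        display_name: "Qwen 3.6 35B A3B (custom build)"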

Out of scope

  • Hyphen-delimited vendor prefixes (e.g., smoothie-qwen/smoothie-qwen3-32b-i1). Different semantic class, different detection difficulty, and often represents different weights where silent base-metadata routing is the wrong call. A future issue may revisit if demand materializes.
  • Automatic vendor discovery from the HuggingFace API. The layer is purely syntactic.
  • Extending the suffix peel allowlist. Orthogonal change; follow the peel-extension path for novel token classes.

Security bounds

Parallels the suffix peel:

| Bound | Value | Effect |
| --- | --- | --- |
| MAX_PREFIX_SEGMENTS | 3 | Inputs with more segments are rejected before any scan. |
| MAX_MODEL_ID_LEN | 256 | Oversized inputs skip phase 5 just like phase 4. |
| Re-entry depth | 1 | Structurally enforced via a recursion gate, not a counter. |

Phase 5 is constant-time on adversarial input: after the segment-count, emptiness, whitespace, and length guards, the work reduces to a single slice lookup plus one additional pass through phases 1-4.

Audit procedure

To re-audit the YAML, run:

cargo test --test alias_audit_helper -- --ignored --nocapture audit_metadata_aliases

The helper prints every alias with its classification (REDUNDANT, LOAD-BEARING-DRIFT, LOAD-BEARING-LOSS, or WILDCARD) and the post-removal resolution target. The current snapshot is captured in docs/reports/alias-audit-2026-04.md.

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload

# Admin statistics
admin:
  stats:
    retention_window: "24h"        # ✅ Immediate: Retention window updates instantly
    token_tracking: true           # ✅ Immediate: Token tracking toggle applies immediately

Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
    - name: "ollama"             # ✅ Gradual: New requests use updated backend pool
      url: "http://localhost:11434"
      weight: 2                  # ✅ Gradual: Load balancing updates for new requests
      models: ["llama3.2"]       # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings

Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP/Unix socket binding
  # bind_address:                 # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660              # ❌ Restart required: Socket permissions
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:

    • HealthChecker updates check intervals and thresholds
    • RateLimitStore updates rate limiting rules
    • CircuitBreaker updates failure thresholds and timeouts
    • BackendPool updates backend configuration

  8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
  9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"
Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

Result: New backend added to pool with immediate health check triggered. The new backend becomes available within 1-2 seconds (instead of waiting up to 30 seconds for the next periodic health check). Existing requests continue to current backends. New requests can route to lmstudio once health check passes.

Example 2b: Removing a Backend (Graceful Draining)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
Result: Backend "lmstudio" enters draining state. New requests are not routed to it, but existing in-flight requests (including streaming) continue until completion. After all references are released (or after 5 minutes timeout), the backend is fully removed from memory.

Backend State Lifecycle

When a backend is removed from configuration, it goes through a graceful shutdown process:

  1. Active → Draining: Backend is marked as draining. New requests skip this backend.
  2. In-flight Completion: Existing requests/streams continue uninterrupted.
  3. Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.

This ensures zero impact on ongoing connections during configuration changes.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"

Result: Warning logged. Change does not take effect. Restart required to bind to the new port.

Distributed Tracing

Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.

Configuration

tracing:
  enabled: true                         # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true               # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID (default)
    request_id: "X-Request-ID"          # Header name for request ID (default)
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID (default)

How It Works

  1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:

    • W3C traceparent header (if W3C support enabled)
    • Configured trace_id header (X-Trace-ID)
    • Configured request_id header (X-Request-ID)
    • Configured correlation_id header (X-Correlation-ID)

  2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.

  3. Header Propagation: The trace ID is propagated to backend services via multiple headers:

    • X-Request-ID: For broad compatibility
    • X-Trace-ID: Primary trace identifier
    • X-Correlation-ID: For correlation tracking
    • traceparent: W3C Trace Context (if enabled)
    • tracestate: W3C Trace State (if present in original request)

  4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.

Structured Logging

When tracing is enabled, all log messages include the trace_id field:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}

W3C Trace Context

When w3c_trace_context is enabled, the router supports the W3C Trace Context standard:

  • Incoming: Parses traceparent header (format: 00-{trace_id}-{span_id}-{flags})
  • Outgoing: Generates new traceparent header with preserved trace ID and new span ID
  • State: Forwards tracestate header if present in original request

Example traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Disabling Tracing

To disable distributed tracing:

tracing:
  enabled: false

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)
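
A sketch of a weighted setup (backend names, URLs, and weights are illustrative):

load_balancer:
  strategy: "weighted"
  health_aware: true

backends:
    - name: "primary-gpu"
      url: "http://10.0.0.10:8000"
      weight: 3                  # receives roughly three times the traffic of a weight-1 backend
    - name: "spare-gpu"
      url: "http://10.0.0.11:8000"
      weight: 1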

Per-Backend Retry Configuration

backends:
    - name: "slow-backend"
      url: "http://slow.example.com"
      retry_override:             # Override global retry settings
        max_attempts: 5           # More attempts for slower backends
        base_delay: "500ms"       # Longer delays
        max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Pre-Stream vs. Mid-Stream Fallback

The router provides two independent fallback mechanisms:

| Mechanism | When it activates | Config section | Default |
| --- | --- | --- | --- |
| Pre-stream fallback | Before or at the start of a response: connection errors, timeouts, trigger error codes, unhealthy backend at routing time | fallback | Enabled when fallback.enabled: true |
| Mid-stream fallback | After streaming has started and the backend fails mid-response | fallback + streaming.mid_stream_fallback | Activates when fallback.enabled: true and a fallback chain is configured. Continuation mode is enabled by default. |

When fallback.enabled: true and a fallback chain is configured for the requested model, mid-stream connection drops are suppressed and the router transparently switches to the next backend, even if streaming.mid_stream_fallback.enabled is false.

streaming.mid_stream_fallback.enabled controls continuation behavior only: whether the fallback backend receives a continuation prompt (using accumulated partial response) or a full restart of the original request. The default is true (continuation mode), which produces uninterrupted output for the client. Setting it to false forces restart mode, which may cause duplicate or incoherent content if partial output was already sent to the client.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

| Condition | Description |
| --- | --- |
| error_codes | HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504) |
| timeout | Request timeout |
| connection_error | TCP connection failures |
| model_not_found | Model not available on backend |
| circuit_breaker_open | Backend circuit breaker is open |

Response Headers

When fallback is used, the following headers are added to the response:

| Header | Description | Example |
| --- | --- | --- |
| X-Fallback-Used | Indicates fallback was used | true |
| X-Original-Model | Originally requested model | gpt-4o |
| X-Fallback-Model | Model that served the request | gpt-4-turbo |
| X-Fallback-Reason | Why fallback was triggered | error_code_429 |
| X-Fallback-Attempts | Number of fallback attempts | 2 |

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

| OpenAI Parameter | Anthropic Parameter | Notes |
| --- | --- | --- |
| max_tokens | max_tokens | Auto-filled if missing (required by Anthropic) |
| temperature | temperature | Direct mapping |
| top_p | top_p | Direct mapping |
| stop | stop_sequences | Array conversion |

Provider-specific parameters are automatically removed or converted during cross-provider fallback.
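
Illustratively (a sketch of the mapping, not an exact wire format; the values are made up):

# Parameters on the original OpenAI-style request
original:
  temperature: 0.7
  top_p: 0.9
  stop: ["END"]
  # max_tokens omitted by the client

# Parameters after translation for the Anthropic fallback
translated:
  temperature: 0.7            # direct mapping
  top_p: 0.9                  # direct mapping
  stop_sequences: ["END"]     # stop converted to stop_sequences
  max_tokens: 4096            # auto-filled because Anthropic requires it (value illustrative)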

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers
# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Mid-Stream Fallback

Mid-stream fallback allows the router to transparently continue an active SSE stream on a fallback backend when the primary backend fails mid-response. The client's connection remains open and sees an uninterrupted response with only a brief pause during the switchover.

Mid-stream fallback activates automatically when fallback.enabled: true and a fallback chain is configured for the requested model. The streaming.mid_stream_fallback section controls how the fallback backend is invoked (continuation vs restart mode), not whether fallback happens.

Configuration

fallback:
  enabled: true  # Required: enables mid-stream fallback path
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

streaming:
  mid_stream_fallback:
    # Enable continuation mode (default: true).
    # When true, accumulated partial response is used to build a continuation prompt,
    # producing uninterrupted output for the client.
    # When false, the fallback backend restarts the request from scratch, which may
    # cause duplicate or incoherent content if partial output was already sent.
    enabled: true

    # Minimum estimated tokens accumulated before using continuation mode (default: 50)
    # Below this threshold the request is restarted from scratch on the fallback backend
    # instead of appending a continuation prompt.
    min_accumulated_tokens: 50

    # Maximum fallback attempts per streaming request (default: 2, max: 10)
    max_fallback_attempts: 2

    # Prompt appended as a user message after the partial assistant response
    continuation_prompt: "Continue from where you left off exactly. Do not repeat any previously generated content."

How It Works

  1. The client sends a streaming chat completion request.
  2. The router begins streaming from the primary backend, accumulating response content.
  3. If the backend fails mid-stream (connection drop, timeout, error event):

    • The error is NOT forwarded to the client.
    • The accumulated partial response is captured.
    • The next healthy backend in the fallback chain is selected (unhealthy backends are skipped).
    • A continuation or restart request is sent to the fallback backend.
    • Streaming resumes on the fallback backend without closing the client connection.
  4. The client receives an uninterrupted response with only a brief pause during the switchover.

Continuation vs. Restart Mode

The min_accumulated_tokens threshold controls which recovery mode is used:

| Condition | Mode | Behavior |
| --- | --- | --- |
| enabled: true (default), tokens ≥ min_accumulated_tokens, and not truncated | Continuation | Original messages + partial assistant response + continuation prompt |
| enabled: true (default) and tokens < min_accumulated_tokens | Restart | Original request replayed (not enough context to continue) |
| enabled: true (default) and content truncated (> 100 KB) | Restart | Forced restart to avoid incoherent context |
| mid_stream_fallback.enabled: false | Restart | Original request replayed on fallback backend from scratch |

Continuation mode (the default) produces uninterrupted output for the client. Restart mode is used automatically when there is too little context to continue meaningfully, or when the accumulated response is too long to include safely. Explicitly setting enabled: false forces restart mode unconditionally, which may cause duplicate or incoherent content visible to the client.
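
A sketch of what a continuation request to the fallback backend contains (message contents are illustrative):

# Continuation-mode request body (sketch)
messages:
  - role: user
    content: "Explain how the Raft consensus algorithm works."    # original user message
  - role: assistant
    content: "Raft elects a leader that..."                       # accumulated partial response
  - role: user
    content: "Continue from where you left off exactly. Do not repeat any previously generated content."    # continuation_prompt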

Edge Case Handling

The mid-stream fallback path addresses several edge cases automatically:

  • Global timeout budget: All fallback attempts share the original request start time. Each attempt checks remaining budget before sending, preventing indefinite timeout accumulation across the chain.
  • Cross-provider parameter translation: When the fallback model is on a different provider (e.g., OpenAI → Anthropic), request parameters are automatically translated: provider-specific fields removed and parameter names mapped.
  • Concurrent request storms: A global semaphore (50 permits) limits simultaneous fallback attempts. Requests that cannot acquire a permit within 5 seconds are rejected gracefully.
  • Accumulator truncation: When accumulated response content exceeds 100 KB, the continuation mode is forced to restart to avoid sending incoherent context to the fallback backend.
  • Health re-check: Backend health is re-verified before each fallback attempt in the chain. Unhealthy backends are skipped to the next entry.
  • Missing [DONE] marker: Streams ending without [DONE] but with finish_reason: "stop" are treated as completed successfully, preventing unnecessary fallback.

Metrics

Three Prometheus metrics track mid-stream fallback activity. See Mid-Stream Fallback Metrics for details.

Minimizing Failover Latency

When a backend goes down during streaming, the time until the fallback backend takes over depends on several configuration parameters across different subsystems. Below is a tuning guide for minimizing this switchover delay.

How failover delay is composed

The total time a client waits during a mid-stream failover is roughly:

failover_delay ≈ failure_detection_time + health_recheck_time + fallback_connection_time

Each component maps to specific configuration:

| Component | What determines it | Default | Tuning target |
| --- | --- | --- | --- |
| Failure detection | Stream inactivity timeout (hardcoded 60 s), TCP read error (immediate), or chunk_interval timeout | 30–60 s | Lower chunk_interval |
| Health re-check | Health check before fallback attempt | timeout: 5s | Keep low |
| Fallback connection | TCP connect + TLS handshake to fallback backend | connection: 10s | Lower connection |

# 1. Timeouts — the most impactful settings for failover speed
timeouts:
  connection: 5s               # Faster TCP connect timeout (default: 10s)
  request:
    streaming:
      first_byte: 30s          # How long to wait for the first token (default: 60s)
      chunk_interval: 10s      # Max silence between chunks before treating as failure (default: 30s)
      total: 600s              # Total streaming budget (keep generous)

# 2. Health checks — detect backend failures proactively
health_checks:
  interval: 10s                # Check every 10s instead of 30s (default: 30s)
  timeout: 3s                  # Fail health checks faster (default: 5s)
  unhealthy_threshold: 2       # Mark unhealthy after 2 failures (default: 3)
  healthy_threshold: 1         # Recover after 1 success (default: 2)
  warmup_check_interval: 1s   # Fast checks during backend startup

# 3. Circuit breaker — stop routing to a failed backend immediately
circuit_breaker:
  enabled: true
  failure_threshold: 3         # Open circuit after 3 failures (default: 5)
  timeout: 30s                 # Try recovery after 30s (default: 60s)
  half_open_max_requests: 2
  half_open_success_threshold: 1
  timeout_as_failure: true     # Count timeouts toward circuit breaker

# 4. Fallback chain — must be configured for mid-stream fallback to activate
fallback:
  enabled: true
  fallback_chains:
    "gpt-4o":
        - "gpt-4-turbo"
        - "gpt-3.5-turbo"
  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      circuit_breaker_open: true

# 5. Mid-stream fallback — continuation mode (default: enabled)
streaming:
  mid_stream_fallback:
    enabled: true              # Use continuation mode (default)
    max_fallback_attempts: 3   # Allow more retries for resilience (default: 2)
    min_accumulated_tokens: 30 # Lower threshold for continuation vs restart (default: 50)

Parameter impact summary

| Parameter | Effect on failover speed | Trade-off |
| --- | --- | --- |
| timeouts.request.streaming.chunk_interval | High — directly controls how quickly a stalled stream is detected | Too low may cause false positives on slow models (e.g., reasoning models with long thinking phases) |
| timeouts.connection | Medium — limits TCP connect delay to fallback backend | Too low may fail on high-latency networks |
| health_checks.interval | Medium — faster detection means the circuit breaker opens sooner, preventing requests from reaching a dead backend | More frequent checks increase backend load |
| health_checks.unhealthy_threshold | Medium — fewer failures needed to mark backend unhealthy | Lower values increase sensitivity to transient errors |
| circuit_breaker.failure_threshold | Medium — fewer failures to open circuit | Too aggressive may open circuit on temporary spikes |
| circuit_breaker.timeout | Low — affects recovery time, not failover speed | Shorter means faster recovery but more probing of unhealthy backends |
| mid_stream_fallback.max_fallback_attempts | Low — more attempts increase resilience but not speed of individual switchover | More attempts consume more of the global timeout budget |

Failure detection scenarios

Different failure types are detected at different speeds:

| Failure type | Detection time | Mechanism |
| --- | --- | --- |
| TCP connection reset / backend crash | Immediate (< 1 s) | Stream read error triggers instant fallback |
| Backend returns 5xx error | Immediate (< 1 s) | HTTP status check before streaming begins |
| Backend becomes unresponsive (stall) | chunk_interval (default 30 s) | Inactivity timeout on the stream |
| Backend sends error SSE events | After 5 errors | Error count threshold in stream processing |
| Backend process killed mid-response | Immediate (< 1 s) | TCP FIN/RST detected as stream read error |

The most common scenario in production, a backend becoming unresponsive, is governed by chunk_interval. For latency-sensitive applications, lowering this to 10–15 seconds is recommended, with model-specific overrides for slow models:

timeouts:
  request:
    streaming:
      chunk_interval: 10s      # Fast detection for most models
    model_overrides:
      gemini-2.5-pro:          # Reasoning models need longer intervals
        streaming:
          chunk_interval: 30s
          first_byte: 120s

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available

How It Works

  1. Client Identification: Each request is associated with a client using:

    • API key from Authorization: Bearer <token> header (preferred)
    • IP address from proxy headers (fallback)

  2. Dual-Window Approach:

    • Sustained limit: Prevents excessive usage over time
    • Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:

    • Client A with API key abc123...: 100 req/min
    • Client B with API key def456...: 100 req/min
    • Client C from IP 192.168.1.1: 100 req/min

Response Headers

When rate limited, the response includes:

  • Status Code: 429 Too Many Requests
  • Error Message: Indicates whether burst or sustained limit was exceeded

Cache TTL Optimization

To prevent cache poisoning attacks:

  • Empty model lists: Cached for 5 seconds only
  • Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.

Monitoring

Rate limit violations are tracked in metrics:

  • rate_limit_violations: Total rejected requests
  • empty_responses_returned: Empty model lists served
  • Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  • Configurable rate limits via YAML/environment variables
  • Per-endpoint rate limiting
  • Custom rate limits per API key
  • Redis-backed distributed rate limiting

Smart Routing

Smart routing classifies incoming requests by complexity and domain, then routes them to the most appropriate model tier using configurable policies. It combines a model tier registry (mapping models to tiers and domains) with a rule-based request classifier and a policy engine that maps classification results to routing decisions.

When model: "auto" is used in a chat completion request, the pipeline runs: classify the request, evaluate policies top-to-bottom, select a model from the matched tier. The same pipeline runs for all requests when intercept_all: true.

Configuration

smart_routing:
  enabled: true

  # Default tier when no profile matches and auto-inference is inconclusive.
  # 1 = Flagship, 2 = Standard, 3 = Lightweight. Defaults to 2.
  default_tier: 2

  # Model name that triggers smart routing. Defaults to "auto".
  virtual_model: "auto"

  # When true, all requests go through smart routing regardless of model name.
  intercept_all: false

  model_profiles:
    # Exact model name
    - model: "gpt-4o"
      tier: 1
      domains: [general, code, reasoning, creative]
      cost_per_1k_input_tokens: 0.005
      cost_per_1k_output_tokens: 0.015

    # Another exact match
    - model: "gpt-4o-mini"
      tier: 2
      domains: [general, code]
      cost_per_1k_input_tokens: 0.00015
      cost_per_1k_output_tokens: 0.0006

    # Glob pattern — matches all GGUF Q4_K_M quantized models
    - model_pattern: "*-q4_K_M"
      tier: 3
      domains: [general]

  # Routing policies: first match wins, top-to-bottom evaluation
  routing_policies:
    - name: "trivial_to_lightweight"
      when:
        complexity: [trivial, simple]
        domain: [general]
      route_to:
        tier: 3

    - name: "code_to_flagship"
      when:
        domain: [code]
        complexity: [moderate, complex, expert]
      route_to:
        tier: 1
        prefer_domains: [code]

    - name: "vision_required"
      when:
        requires: [vision]
      route_to:
        tier: 1
        require_capabilities: [vision]

    - name: "complex_to_flagship"
      when:
        complexity: [complex, expert]
      route_to:
        tier: 1

    - name: "default_to_standard"
      when: {}                      # Catch-all (always matches)
      route_to:
        tier: 2

Tier Classification

| Tier        | Value | Meaning                          | Typical Examples                           |
|-------------|-------|----------------------------------|--------------------------------------------|
| Flagship    | 1     | Highest capability, highest cost | gpt-4o, claude-3.5-sonnet, gemini-1.5-pro  |
| Standard    | 2     | Balanced capability and speed    | gpt-4o-mini, claude-3-haiku                |
| Lightweight | 3     | Optimized for speed and low cost | llama-3-8b, phi-3-mini, quantized variants |

Domain Specialization Tags

| Tag          | Description                         |
|--------------|-------------------------------------|
| general      | No specific specialty               |
| code         | Code generation, debugging, review  |
| reasoning    | Complex multi-step reasoning, math  |
| creative     | Creative writing, storytelling      |
| multilingual | Translation and multilingual tasks  |
| vision       | Image understanding                 |

Auto-Inference

When a model has no matching explicit profile or glob pattern, the router infers its tier automatically using three sources in priority order:

  1. Pricing (from model-metadata.yaml): input cost >= $3/1k tokens maps to Flagship; >= $0.50/1k to Standard; below that to Lightweight. Zero-cost models skip pricing inference.

  2. Capabilities (from model-metadata.yaml): models with 3+ high-value capabilities (vision, reasoning, audio, video, function_calling, tool) map to Flagship; 1+ such capability or 3+ total capabilities map to Standard.

  3. Name heuristics: keywords like pro, ultra, opus, sonnet, turbo map to Flagship; keywords like mini, small, tiny, nano, lite, flash, haiku and quantization markers (q4_, q5_, q8_, gguf, gptq, awq) map to Lightweight.

Auto-inferred results are cached per model ID (up to 10,000 entries). The cache clears on hot-reload and when the /admin/smart-routing/model-profiles PUT endpoint is called.

Glob Pattern Syntax

Patterns use * as the only wildcard character. Multiple wildcards are supported.

| Pattern     | Matches                    | Does Not Match    |
|-------------|----------------------------|-------------------|
| gpt-*       | gpt-4o, gpt-4o-mini        | claude-3          |
| *-q4_K_M    | llama-3-8b-q4_K_M          | llama-3-8b-q5_K_M |
| gpt-*-turbo | gpt-4-turbo, gpt-3.5-turbo | gpt-4o            |
| *           | everything                 | nothing           |
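Because multiple wildcards are allowed, a single profile can cover a whole family of model names. A small illustrative fragment (the pattern and tier are examples, not defaults):

model_profiles:
  # Matches llama-3-8b-instruct-q4_K_M, llama-3-70b-instruct-awq, etc.
  - model_pattern: "llama-*-instruct-*"
    tier: 2
    domains: [general, code]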

Request Classifier

The rule-based classifier analyzes each request using 11 signal types and produces a ClassificationResult containing complexity level, domain tag, required capabilities, and confidence score.

Complexity Levels

| Level    | Description                             | Example                      |
|----------|-----------------------------------------|------------------------------|
| trivial  | Greetings, yes/no, single-fact lookup   | "What is 2+2?"               |
| simple   | Short explanations, basic summaries     | "Summarize this paragraph"   |
| moderate | Multi-step reasoning, medium code tasks | "Refactor this function"     |
| complex  | Advanced algorithms, system design      | "Design a distributed cache" |
| expert   | Research-level problems, formal proofs  | "Prove this theorem"         |

Classification Signals

| Signal                   | What it detects                             |
|--------------------------|---------------------------------------------|
| message_length           | Total token count across all messages       |
| code_blocks              | Fenced code blocks or inline code           |
| math_notation            | LaTeX, equations, mathematical symbols      |
| system_prompt_complexity | Length and complexity of the system prompt  |
| conversation_depth       | Number of turns in the conversation         |
| image_attachments        | Multimodal image content in messages        |
| tool_definitions         | Tool/function definitions in the request    |
| complexity_keywords      | Words like "optimize", "architect", "prove" |
| language_detection       | Non-Latin scripts or multilingual content   |
| creative_markers         | Words like "story", "poem", "imagine"       |
| analysis_markers         | Words like "analyze", "compare", "evaluate" |

Each detected signal contributes to the final complexity level and domain tag. Conflicting signals (e.g., both creative and code markers present) reduce the confidence score.

LLM-Based Classifier

The rule-based classifier is fast but can be ambiguous for borderline requests. When that is not accurate enough, the LLM-based classifier sends the request to a small, inexpensive model for classification. Three operating modes are available via classifier.method:

| Method | Behavior                                                                                     |
|--------|----------------------------------------------------------------------------------------------|
| rule   | Rule-based only (default). No LLM calls.                                                     |
| llm    | Always calls the LLM classifier; falls back to rule-based on failure.                        |
| hybrid | Rule-based first; escalates to LLM only when rule confidence is below confidence_threshold. |

Hybrid mode is the recommended production setting: it adds latency only for genuinely ambiguous requests (typically 10-20% of traffic), while trivial and clear-cut requests are classified in microseconds.

smart_routing:
  enabled: true

  classifier:
    # Classification method: "rule" (default), "llm", or "hybrid".
    method: hybrid

    rule:
      # Confidence below this threshold triggers LLM escalation in hybrid mode.
      # Range: 0.0 – 1.0. Default: 0.7.
      confidence_threshold: 0.7

    llm:
      # Model used for classification. Any fast, cheap model works well.
      model: "gpt-4o-mini"

      # Backend name to route classification requests to. Must be a configured
      # backend. If omitted, the router uses the backend URL directly.
      backend: "openai-fast"

      # Maximum time allowed for a classification request (milliseconds).
      timeout_ms: 2000

      # Maximum input tokens sent to the classifier (content is truncated).
      max_input_tokens: 500

      # Number of retries after a parse failure (0 or 1). Default: 1.
      max_retries: 1

      # Temperature for classification. 0.0 gives deterministic output.
      temperature: 0.0

      # Maximum output tokens in the classification response.
      max_output_tokens: 150

      # Structured output strategy. "auto" selects the best method for the
      # configured backend: json_schema (OpenAI/vLLM), tool_use (Anthropic),
      # json_object (Ollama/Gemini/LM Studio), prompt_only (others).
      structured_output: auto   # auto | json_schema | tool_use | json_object | prompt_only

      # Include built-in few-shot examples in the system prompt.
      few_shot_examples: true

      # Custom few-shot examples appended after the built-in ones.
      custom_examples:
        - user: "What is the recommended dose of ibuprofen?"
          classification:
            complexity: simple
            domain: medical

      # Cache TTL for classification results (seconds). Default: 300.
      cache_ttl_seconds: 300

      # Maximum number of cached entries. Default: 10000.
      max_cache_entries: 10000

      # Disable the LLM classifier when load reaches this state.
      # "critical" (default) or "warning". Set to "" to never disable.
      disable_under_load_state: critical

    # Extend the built-in complexity taxonomy with custom levels.
    custom_complexity_levels:
      - name: specialized
        description: "Requires a domain-specific fine-tuned model"
        rank: 6   # Optional ordering hint (higher = harder)

    # Extend the built-in domain taxonomy with custom categories.
    custom_domains:
      - name: medical
        description: "Medical and clinical questions"

Structured Output Strategies

The LLM classifier needs structured JSON from the classifier model. The auto strategy picks the right mechanism based on the backend type, but you can override it explicitly:

| Strategy    | Mechanism                                     | Supported backends                           |
|-------------|-----------------------------------------------|----------------------------------------------|
| json_schema | response_format: { type: "json_schema" }      | OpenAI, Azure, vLLM                          |
| tool_use    | Tool/function calling                         | Anthropic, Gemini                            |
| json_object | response_format: { type: "json_object" }      | OpenAI, Ollama, Gemini, LM Studio, llama.cpp |
| prompt_only | JSON extracted from free-form text via regex  | Any backend                                  |

When the classifier response cannot be parsed, the router retries once with a correction prompt (controlled by max_retries). If the retry also fails, the result from the rule-based classifier is used instead.

Classification Cache

Classification results are cached in memory with a configurable TTL to avoid repeated LLM calls for the same request. The cache key is a SHA-256 hash of the truncated user message, so requests with identical (truncated) message text share the same cached result. The cache is per-process and not shared across router instances.

Custom Taxonomy

Both complexity levels and domain tags are extensible. Custom values added via custom_complexity_levels and custom_domains appear in the classifier's system prompt and are accepted in the structured-output schema. Routing policies can reference custom values just like built-in ones:

routing_policies:
  - name: specialized_to_flagship
    when:
      complexity: [specialized]
    route_to:
      tier: 1

Bypass Header

The LLM classifier sends an X-Smart-Route-Bypass: true header with its classification requests. The router skips smart routing for any request carrying this header, preventing circular classification loops when the classifier backend is itself behind the same router instance.

Routing Policies

Policies are evaluated top-to-bottom; the first match wins. If no policy matches and no catch-all is defined, the request falls back to default_tier.

Policy Condition Logic

  • Fields within a when block are AND-ed: all specified fields must match.
  • Values within a single field are OR-ed: complexity: [trivial, simple] matches either.
  • when: {} is a catch-all that always matches.

Policy Fields

when conditions:

| Field      | Type     | Description                                        |
|------------|----------|----------------------------------------------------|
| complexity | [string] | Complexity levels that match (OR logic)            |
| domain     | [string] | Domain tags that match (OR logic)                  |
| requires   | [string] | Capabilities that must all be present (AND logic)  |

route_to action:

| Field                | Type     | Description                                          |
|----------------------|----------|------------------------------------------------------|
| tier                 | int      | Target tier (1=Flagship, 2=Standard, 3=Lightweight)  |
| prefer_domains       | [string] | Soft preference for domain-specialized models        |
| require_capabilities | [string] | Hard filter: model must have these capabilities      |
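Putting the condition and action fields together, a policy that sends hard coding requests needing tool use to a Flagship model might look like this (illustrative values, not defaults):

routing_policies:
  - name: "tooling_code_to_flagship"
    when:
      domain: [code]                    # OR within the list
      complexity: [complex, expert]     # OR within the list; AND-ed with domain
      requires: [function_calling]      # every listed capability must be present
    route_to:
      tier: 1
      prefer_domains: [code]
      require_capabilities: [function_calling]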

Model Selection Within a Tier

When multiple models belong to the matched tier, the selector scores them using:

  1. Domain preference match (soft bonus)
  2. Capability match (hard filter if require_capabilities is set)
  3. Cost scoring (lower cost scores higher within a tier)
  4. Random tiebreak for equal-score models

If no models are available in the matched tier, the router tries adjacent tiers in order (e.g., if Lightweight is empty, tries Standard, then Flagship).

Load-Aware Dynamic Tier Adjustment

Under normal conditions, smart routing picks the tier that best matches each request. When traffic spikes or latency rises, the load monitor tracks real-time metrics and automatically shifts routing to lighter model tiers until the system recovers.

Load management is disabled by default. Enable it under smart_routing.load_management:

smart_routing:
  enabled: true
  load_management:
    enabled: true

    # How often load metrics are evaluated (milliseconds). Default: 1000.
    assessment_interval_ms: 1000

    # Thresholds for entering Warning and Critical states.
    # Any single threshold being exceeded triggers the state transition.
    thresholds:
      warning:
        requests_per_second: 100
        avg_latency_ms: 3000
        error_rate: 0.05
        in_flight_requests: 50
      critical:
        requests_per_second: 200
        avg_latency_ms: 5000
        error_rate: 0.15
        in_flight_requests: 100

    # Routing restrictions applied per load state.
    degradation:
      warning:
        max_tier: 2           # Cap routing at Standard tier
        prefer_quantized: false
        reject_expert: false
      critical:
        max_tier: 3           # Cap routing at Lightweight tier
        prefer_quantized: true
        reject_expert: false

    # Recovery behavior.
    recovery:
      cooldown_seconds: 30    # Minimum time before downgrading the load state
      hysteresis_factor: 0.8  # Metric must drop to 80% of threshold to recover

Load States

| State    | Meaning                                       | Default routing restriction                            |
|----------|-----------------------------------------------|--------------------------------------------------------|
| normal   | All metrics within bounds                     | No restriction                                         |
| warning  | At least one metric above Warning threshold   | Capped at Standard tier (tier 2)                       |
| critical | At least one metric above Critical threshold  | Capped at Lightweight tier (tier 3), prefer quantized  |

When a request would normally route to Flagship (tier 1) but the load state is Warning, it is silently downgraded to Standard. The routing decision log records the adjusted policy name as <original_policy>__load_warning or <original_policy>__load_critical.

Hysteresis and Cooldown

Rapid oscillation between load states can itself cause instability. Two mechanisms prevent it:

  • Hysteresis: to leave Warning state, a metric must drop below threshold * hysteresis_factor (default 0.8), not just below the threshold. A system that entered Warning at 100 RPS stays there until RPS drops below 80.
  • Cooldown: after any state transition, recovery to a lower state is blocked for cooldown_seconds (default 30). Escalation (Normal to Warning, Warning to Critical) always bypasses the cooldown.

Per-Tier Threshold Overrides

If different tiers have different capacity characteristics, you can override thresholds per tier:

smart_routing:
  load_management:
    enabled: true
    thresholds:
      warning:
        requests_per_second: 100
    tier_thresholds:
      "1":                      # Tier 1 (Flagship) has a lower RPS tolerance
        warning:
          requests_per_second: 50
      "3":                      # Tier 3 (Lightweight) can handle more
        warning:
          requests_per_second: 300

Prometheus Metrics

When the metrics feature is enabled, the following counters and gauges are exported for load management:

| Metric                                | Type    | Description                                               |
|---------------------------------------|---------|-----------------------------------------------------------|
| smart_routing_load_state              | Gauge   | Current load state: 0=Normal, 1=Warning, 2=Critical       |
| smart_routing_tier_degradation_total  | Counter | Number of times routing was degraded due to load          |
| smart_routing_load_transitions_total  | Counter | Number of load state transitions, labeled by from and to  |

The LLM classifier exports six additional metrics:

| Metric                                           | Type      | Description                                       |
|--------------------------------------------------|-----------|---------------------------------------------------|
| smart_routing_llm_classifier_calls_total         | Counter   | Total LLM classifier invocations                  |
| smart_routing_llm_classifier_cache_hits_total    | Counter   | Classification results served from cache          |
| smart_routing_llm_classifier_duration_seconds    | Histogram | End-to-end LLM classification latency             |
| smart_routing_llm_classifier_fallbacks_total     | Counter   | Times the LLM classifier fell back to rule-based  |
| smart_routing_llm_classifier_parse_errors_total  | Counter   | Response parse failures before retry              |
| smart_routing_llm_classifier_retries_total       | Counter   | Retry attempts after initial parse failure        |
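To spot-check these series, scrape the router's metrics endpoint and filter for the smart routing prefix. The path below assumes the conventional /metrics endpoint on the same port; adjust it to wherever your deployment exposes Prometheus metrics:

# Illustrative; the metrics path depends on your deployment.
curl -s http://localhost:8080/metrics | grep ^smart_routing_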

Debug Response Headers

Set debug_headers: true to include smart routing decision details in HTTP response headers. This is intended for development and staging environments.

smart_routing:
  enabled: true
  debug_headers: true  # Enable in development/staging

When enabled, smart-routed responses include:

| Header                   | Description                                |
|--------------------------|--------------------------------------------|
| X-Smart-Route-Source     | Original model requested (e.g., auto)      |
| X-Smart-Route-Target     | Selected model (e.g., gpt-4o-mini)         |
| X-Smart-Route-Complexity | Classified complexity level                |
| X-Smart-Route-Domain     | Classified domain                          |
| X-Smart-Route-Policy     | Policy that matched                        |
| X-Smart-Route-Load-State | Load state at routing time                 |
| X-Smart-Route-Classifier | Classifier used (rule_based or llm_based)  |
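With debug_headers enabled, the decision can be inspected directly from the response headers. A sketch, again assuming an OpenAI-compatible chat endpoint on localhost:8080 and a placeholder key:

# Dump response headers, discard the body, and keep only the routing headers.
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-example-key" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
  | grep -i '^x-smart-route'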

Admin API

Smart routing exposes several admin endpoints under /admin/smart-routing/ for observability and management. The full endpoint reference is in the Admin API documentation.

Key endpoints:

  • GET /status -- overall status, load state, policy count
  • POST /classify -- classify a request without routing (diagnostic)
  • POST /simulate -- simulate the full routing pipeline
  • GET /policies and PUT /policies -- view and hot-reload policies
  • GET /load-state -- current load state with assessment details
  • GET /cache/stats and POST /cache/clear -- LLM classifier cache management
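For example, to check the overall status, the current load state, and the classifier cache (assuming the router and its admin API are reachable on localhost:8080):

# Overall smart routing status: load state, policy count
curl http://localhost:8080/admin/smart-routing/status

# Current load state with assessment details
curl http://localhost:8080/admin/smart-routing/load-state

# LLM classifier cache statistics
curl http://localhost:8080/admin/smart-routing/cache/stats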

Structured Logging

All smart routing decisions are logged at DEBUG level with structured fields:

level=DEBUG msg="Smart routing decision"
  source_model="auto"
  target_model="gpt-4o-mini"
  complexity="simple"
  domain="general"
  policy="trivial_to_lightweight"
  load_state="normal"
  classifier="rule_based"
  confidence=0.92
  classification_ms=0.3

Load state transitions and policy changes are logged at INFO level.

Hot Reload

The smart_routing section reloads immediately when the config file changes: routing_policies and load_management settings take effect without restarting the server, and the inferred-profile cache is cleared so all models are re-evaluated on the next request. Policies can also be updated at runtime via the PUT /admin/smart-routing/policies endpoint.


Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
  - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
  - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
  - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers