
Advanced Configuration

Global Prompts

Global prompts allow you to inject system prompts into all requests, providing centralized policy management for security, compliance, and behavioral guidelines. Prompts can be defined inline or loaded from external Markdown files.

Basic Configuration

global_prompts:
  # Inline default prompt
  default: |
    You must follow company security policies.
    Never reveal internal system details.
    Be helpful and professional.

  # Merge strategy: prepend (default), append, or replace
  merge_strategy: prepend

  # Custom separator between global and user prompts
  separator: "\n\n---\n\n"

External Prompt Files

For complex prompts, you can load content from external Markdown files. This provides:

  • Better editing experience with syntax highlighting
  • Version control without config file noise
  • Hot-reload support for prompt updates

global_prompts:
  # Directory containing prompt files (relative to config directory)
  prompts_dir: "./prompts"

  # Load default prompt from file
  default_file: "system.md"

  # Backend-specific prompts from files
  backends:
    anthropic:
      prompt_file: "anthropic-system.md"
    openai:
      prompt_file: "openai-system.md"

  # Model-specific prompts from files
  models:
    gpt-4o:
      prompt_file: "gpt4o-system.md"
    claude-3-opus:
      prompt_file: "claude-opus-system.md"

  merge_strategy: prepend

Prompt Resolution Priority

When determining which prompt to use for a request:

  1. Model-specific prompt (highest priority) - global_prompts.models.<model-id>
  2. Backend-specific prompt - global_prompts.backends.<backend-name>
  3. Default prompt - global_prompts.default or global_prompts.default_file

For each level, if both prompt (inline) and prompt_file are specified, prompt_file takes precedence.
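
For example, a backend entry could carry both forms; under the rule above the file wins (a minimal sketch, assuming the inline key is named prompt as referenced above):

global_prompts:
  backends:
    anthropic:
      prompt: "Inline fallback prompt."        # ignored because prompt_file is also set
      prompt_file: "anthropic-system.md"       # takes precedence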

Merge Strategies

| Strategy | Behavior |
| --- | --- |
| prepend | Global prompt added before user's system prompt (default) |
| append | Global prompt added after user's system prompt |
| replace | Global prompt replaces user's system prompt entirely |
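
As an illustration, with merge_strategy: prepend the user's own system prompt is kept and the global prompt is placed in front of it, joined by the configured separator (a sketch using the values from the basic configuration above):

global_prompts:
  default: "You must follow company security policies."
  merge_strategy: prepend
  separator: "\n\n---\n\n"

# For a request whose own system prompt is "You are a SQL expert.",
# the effective system prompt becomes:
#
#   You must follow company security policies.
#
#   ---
#
#   You are a SQL expert.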

REST API Management

Prompt files can be managed at runtime via the Admin API:

# List all prompts
curl http://localhost:8080/admin/config/prompts

# Get specific prompt file
curl http://localhost:8080/admin/config/prompts/prompts/system.md

# Update prompt file
curl -X PUT http://localhost:8080/admin/config/prompts/prompts/system.md \
  -H "Content-Type: application/json" \
  -d '{"content": "# Updated System Prompt\n\nNew content here."}'

# Reload all prompt files from disk
curl -X POST http://localhost:8080/admin/config/prompts/reload

See Admin REST API Reference for complete API documentation.

Security Considerations

  • Path Traversal Protection: All file paths are validated to prevent directory traversal attacks
  • File Size Limits: Individual files limited to 1MB, total cache limited to 50MB
  • Relative Paths Only: Prompt files must be within the configured prompts_dir or config directory
  • Sandboxed Access: Files outside the allowed directory are rejected

Hot Reload

Global prompts support immediate hot-reload. Changes to prompt configuration or files take effect on the next request without server restart.

Model Metadata

Continuum Router supports rich model metadata to provide detailed information about model capabilities, pricing, and limits. This metadata is returned in /v1/models API responses and can be used by clients to make informed model selection decisions.

Metadata Sources

Model metadata can be configured in three ways (in priority order):

  1. Backend-specific model_configs (highest priority)
  2. External metadata file (model-metadata.yaml)
  3. No metadata (models work without metadata)

External Metadata File

Create a model-metadata.yaml file:

models:
    - id: "gpt-4"
      aliases:                    # Alternative IDs that share this metadata
        - "gpt-4-0125-preview"
        - "gpt-4-turbo-preview"
        - "gpt-4-vision-preview"
      metadata:
        display_name: "GPT-4"
        summary: "Most capable GPT-4 model for complex tasks"
        capabilities: ["text", "image", "function_calling"]
        knowledge_cutoff: "2024-04"
        pricing:
          input_tokens: 0.03   # Per 1000 tokens
          output_tokens: 0.06  # Per 1000 tokens
        limits:
          context_window: 128000
          max_output: 4096

    - id: "llama-3-70b"
      aliases:                    # Different quantizations of the same model
        - "llama-3-70b-instruct"
        - "llama-3-70b-chat"
        - "llama-3-70b-q4"
        - "llama-3-70b-q8"
      metadata:
        display_name: "Llama 3 70B"
        summary: "Open-source model with strong performance"
        capabilities: ["text", "code"]
        knowledge_cutoff: "2023-12"
        pricing:
          input_tokens: 0.001
          output_tokens: 0.002
        limits:
          context_window: 8192
          max_output: 2048

Reference it in your config:

model_metadata_file: "model-metadata.yaml"

Thinking Pattern Configuration

Some models output reasoning/thinking content in non-standard ways. The router supports configuring thinking patterns per model to properly transform streaming responses.

Pattern Types:

| Pattern | Description | Example Model |
| --- | --- | --- |
| none | No thinking pattern (default) | Most models |
| standard | Explicit start/end tags (<think>...</think>) | Custom reasoning models |
| unterminated_start | No start tag, only end tag | nemotron-3-nano |

Configuration Example:

models:
    - id: nemotron-3-nano
      metadata:
        display_name: "Nemotron 3 Nano"
        capabilities: ["chat", "reasoning"]
        # Thinking pattern configuration
        thinking:
          pattern: unterminated_start
          end_marker: "</think>"
          assume_reasoning_first: true

Thinking Pattern Fields:

| Field | Type | Description |
| --- | --- | --- |
| pattern | string | Pattern type: none, standard, or unterminated_start |
| start_marker | string | Start marker for standard pattern (e.g., <think>) |
| end_marker | string | End marker (e.g., </think>) |
| assume_reasoning_first | boolean | If true, treat first tokens as reasoning until end marker |

How It Works:

When a model has a thinking pattern configured:

  1. Streaming responses are intercepted and transformed
  2. Content before end_marker is sent as reasoning_content field
  3. Content after end_marker is sent as content field
  4. The output follows OpenAI's reasoning_content format for compatibility

Example Output:

// Reasoning content (before end marker)
{"choices": [{"delta": {"reasoning_content": "Let me analyze..."}}]}

// Regular content (after end marker)
{"choices": [{"delta": {"content": "The answer is 42."}}]}

Responses-API-only Models

OpenAI exposes some models exclusively via the Responses API (/v1/responses). These models are not reachable through /v1/chat/completions, so a request that targets them on a Chat Completions endpoint returns a 404 not_found from upstream.

The responses_only capability flag marks such models so the router can dispatch them to the Responses API surface instead. The flag defaults to false, so existing model entries do not need to be touched.

Configuration Example:

models:
    - id: gpt-5.4-pro
      metadata:
        display_name: "GPT-5.4 Pro"
        capabilities: ["chat", "vision", "code", "reasoning", "tool"]
        # Served only on /v1/responses; not available on /v1/chat/completions.
        responses_only: true
        limits:
          context_window: 1050000
          max_output: 128000

Models marked Responses-API-only out of the box

The list below is kept in sync with model-metadata.yaml and the built-in OpenAI registry (src/infrastructure/backends/openai/models/gpt5_family.rs). When a new Responses-API-only model is added upstream, both files should be updated together.

| Model ID | Source | Notes |
| --- | --- | --- |
| gpt-5.2-pro | Built-in OpenAI metadata + model-metadata.yaml | Smartest model for difficult questions; xhigh reasoning effort |
| gpt-5.4-pro | model-metadata.yaml | Frontier-class deep reasoning; supports medium, high, xhigh |
| gpt-5.5-pro | model-metadata.yaml | High-capability variant of GPT-5.5 for high-stakes workloads |

The flag follows the same lookup priority chain as the rest of the metadata (backend model_configs > model-metadata.yaml > built-in OpenAI metadata), so an operator-supplied entry can override the default for any model.

Marking a new model as Responses-API-only

To mark an additional model as Responses-API-only, add responses_only: true to the model entry's metadata block in any of the supported sources. Use the lookup priority that fits the deployment scope:

  • model-metadata.yaml for a router-wide default that applies to every backend. Add the flag alongside the existing capability metadata; no other field needs to change. This is the recommended location for newly-released Pro models that are uniformly Responses-API-only across providers.
  • Backend model_configs in config.yaml for a backend-specific override (for example, when a self-hosted clone of a Pro model is exposed on a Chat Completions endpoint and should not be dispatched to /v1/responses). A backend-level responses_only: false overrides the metadata-file default for that backend only.
  • Built-in OpenAI registry in src/infrastructure/backends/openai/models/gpt5_family.rs for models that ship with the binary. New entries here should also be reflected in model-metadata.yaml so externally-loaded metadata stays consistent.

After updating any of these sources, restart the router or trigger a hot reload so the new flag takes effect on subsequent requests.
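
For example, the backend-level override described above might look like this (a sketch; the backend name and URL are illustrative):

backends:
    - name: "self-hosted-clone"          # hypothetical backend serving a clone of the model
      url: "http://localhost:9000"
      model_configs:
        - id: "gpt-5.4-pro"
          metadata:
            # The clone is exposed on a Chat Completions endpoint, so override the
            # metadata-file default and keep dispatch on /v1/chat/completions.
            responses_only: false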

Dispatch behavior

The router honors responses_only=true on every public surface that would otherwise hit /v1/chat/completions:

  • /v1/chat/completions: requests transparently forward to the upstream /v1/responses endpoint and the response is translated back into a strict-mode chat.completion (or chat.completion.chunk for streaming) envelope.
  • /anthropic/v1/messages: the Anthropic-formatted request is converted to the Responses API shape, dispatched to /v1/responses, and the upstream response is translated back into Anthropic Messages JSON (or the Anthropic SSE event sequence for streaming). Tool-call round-trips, web-search emulation, and Unix-socket transports all branch on the flag.

In both cases the dispatch is transparent to the client: the request and response shapes match the surface the client called, so no client-side changes are required to use a responses_only model.

Backend-type constraint

Only OpenAI and Azure OpenAI backends serve /v1/responses. When a responses_only model is paired with a backend whose type is not OpenAI or Azure OpenAI, the router rejects the request with a 400 invalid_request_error (Anthropic-shaped on /anthropic/v1/messages, OpenAI-shaped on /v1/chat/completions) before any upstream dispatch. The message names both the model and the configured backend type so the misconfiguration is visible from the client log.

The first dispatch per (backend, model) pair logs at info level so operators can confirm Responses-API routing without enabling debug logs.

Namespace-Aware Matching

The router handles model IDs with namespace prefixes. For example:

  • Backend returns: "custom/gpt-4", "openai/gpt-4", "optimized/gpt-4"
  • Metadata defined for: "gpt-4"
  • Result: All variants match and receive the same metadata

This allows different backends to use their own naming conventions while sharing common metadata definitions.
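
A single metadata entry therefore covers all namespaced variants; a minimal sketch:

models:
    - id: "gpt-4"
      metadata:
        display_name: "GPT-4"
        # "custom/gpt-4", "openai/gpt-4", and "optimized/gpt-4" all resolve to this
        # entry via namespace stripping; no extra aliases are needed.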

Metadata Priority and Alias Resolution

When looking up metadata for a model, the router uses the following priority chain:

  1. Exact model ID match
  2. Exact alias match
  3. Date suffix normalization (automatic, zero-config)
  4. Quantization / format suffix normalization (automatic, zero-config; see below)
  5. Combined date + format suffix normalization
  6. Wildcard pattern alias match
  7. Base model name fallback (namespace stripping)

Within each source (backend config, metadata file, built-in), the same priority applies:

  1. Backend-specific model_configs (highest priority)

    backends:
      - name: "my-backend"
        model_configs:
          - id: "gpt-4"
            aliases: ["gpt-4-turbo", "gpt-4-vision"]
            metadata: {...}  # This takes precedence
    

  2. External metadata file (second priority)

    model_metadata_file: "model-metadata.yaml"
    

  3. Built-in metadata (for OpenAI and Gemini backends)

Automatic Date Suffix Handling

LLM providers frequently release model versions with date suffixes. The router automatically detects and normalizes date suffixes without any configuration:

Supported date patterns:

  • -YYYYMMDD (e.g., claude-opus-4-5-20251130)
  • -YYYY-MM-DD (e.g., gpt-4o-2024-08-06)
  • -YYMM (e.g., o1-mini-2409)
  • @YYYYMMDD (e.g., model@20251130)

How it works:

Request: claude-opus-4-5-20251215
         ↓ (date suffix detected)
Lookup:  claude-opus-4-5-20251101  (existing metadata entry)
         ↓ (base names match)
Result:  Uses claude-opus-4-5-20251101 metadata

This means you only need to configure metadata once per model family, and new dated versions automatically inherit the metadata.

Automatic Quantization and Format Suffix Handling

Real-world model IDs seen by /v1/models, the routing logic, and backend metadata enrichment frequently combine a canonical base ID with one or more trailing quantization, format, or flavor tokens. The router strips an allowlisted set of such tokens iteratively and retries exact-id, exact-alias, and date-suffix matching after each peel, so you only need to configure metadata for the canonical base ID.

Token Categories

The following trailing tokens are detected and stripped (case-insensitive):

| Category | Examples |
| --- | --- |
| Bit-width | -2bit, -3bit, -4bit, -5bit, -6bit, -8bit, -16bit |
| GGUF / llama.cpp quants | -Q4_K_M, -Q4_K_S, -Q5_K_M, -Q6_K, -Q8_0, -Q2_K, -IQ2_XS, -IQ3_XXS, -IQ4_XS, -F16, -F32, -BF16 |
| FP formats | -FP4, -FP8, -FP16, -FP32, -NVFP4, -MXFP4 |
| INT formats | -INT2, -INT4, -INT8 |
| Library tags | -AWQ, -GPTQ, -BNB, -HQQ, -EXL2, -EXL3, -MLX |
| Imatrix / abbreviated | -i1 through -i8, -q2 through -q8 |
| Unsloth dynamic | -UD-Q*, -UD-IQ* |
| Container formats | -GGUF, -GGML, -SAFETENSORS |
| Flavors | -it, -instruct, -chat, -base, -thinking, -qat |

Parameter-Count Suffixes are Preserved

Tokens that look like parameter counts are never stripped, even when they share a trailing b:

  • Kept: -32b, -70b, -8b, -4b, -a3b, -a22b, -0.6b, -1.7b, -e4b
  • Stripped: -4bit, -8bit, -16bit (the literal bit suffix marks quantization)

This discrimination ensures that a parameter-count variant like qwen3-32b resolves only to explicit qwen3-32b metadata, never to a generic qwen3 entry via accidental stripping.
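
In practice, a parameter-count variant that should carry metadata needs its own entry (or an explicit alias); a sketch:

models:
    - id: "qwen3-32b"          # parameter-count variant keeps its own entry
      metadata:
        display_name: "Qwen 3 32B"
    - id: "qwen3"              # generic entry; never reached by stripping -32b
      metadata:
        display_name: "Qwen 3"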

Layered Peeling

Tokens are stripped one at a time. After each peel, the router re-runs exact-id, exact-alias, and date-suffix match before attempting another peel. This lets alias configurations like gemma-3-4b-it-qat still win even when the request is gemma-3-4b-it-qat-4bit:

Request: gemma-3-4b-it-qat-4bit
         ↓ (peel -4bit)
Try:     gemma-3-4b-it-qat
         ↓ (matches alias of gemma-3-4b-qat)
Result:  Uses gemma-3-4b-qat metadata

Priority Note

Stripping runs after exact-id and exact-alias match. A canonical base ID that happens to end in an allowlisted token (for example gemma-3-12b-qat) wins before the peel phase runs, so existing configurations remain stable.

Suffix-Order Ambiguity

Both -qat-4bit and -4bit-qat orderings appear in real-world model IDs. Peeling removes one token at a time from the right, so the intermediate form mirrors the order in which tokens appeared in the input. The match sequence for gemma-3-12b-qat-4bit is gemma-3-12b-qat-4bit → gemma-3-12b-qat → gemma-3-12b, while gemma-3-12b-4bit-qat goes gemma-3-12b-4bit-qat → gemma-3-12b-4bit → gemma-3-12b. If both suffix orderings need to resolve to the same QAT-variant metadata, configure the canonical QAT base ID (gemma-3-12b-qat) with the matching metadata and let the non-QAT form (gemma-3-12b) carry its own entry; the deepest successful match wins at each peel depth. When the QAT and non-QAT variants need distinct tier or capability metadata, prefer aliases that enumerate the reorderings over relying on the peel order alone.

Length Bounds

The layered peel phase caps input length at 256 characters and iteration count at 8 peels as defense-in-depth against pathological inputs. Matching still runs (the exact-id and exact-alias phases remain in effect), but the peel phase short-circuits instead of walking a long allowlist-token chain. Request handlers enforce the same 256-character cap on the model field for every chat / completion / embedding endpoint, so normal traffic never hits the internal cap.

Case Insensitivity

Stripping is case-insensitive, so Qwen3.5-4B-4bit, QWEN3.5-4B-4BIT, and qwen3.5-4b-4bit all resolve to the same qwen3.5-4b metadata entry. Exact-id and exact-alias match phases (1 and 2) remain case-sensitive, so HuggingFace-style aliases like BAAI/bge-m3 keep their original behavior.

Wildcard Pattern Matching

Aliases support glob-style wildcard patterns using the * character:

  • Prefix matching: claude-* matches claude-opus, claude-sonnet, etc.
  • Suffix matching: *-preview matches gpt-4o-preview, o1-preview, etc.
  • Infix matching: gpt-*-turbo matches gpt-4-turbo, gpt-3.5-turbo, etc.

Example configuration with wildcard patterns:

models:
    - id: "claude-opus-4-5-20251101"
      aliases:
        - "claude-opus-4-5"     # Exact match for base name
        - "claude-opus-*"       # Wildcard for any claude-opus variant
      metadata:
        display_name: "Claude Opus 4.5"
        # Automatically matches: claude-opus-4-5-20251130, claude-opus-test, etc.

    - id: "gpt-4o"
      aliases:
        - "gpt-4o-*-preview"    # Matches preview versions
        - "*-4o-turbo"          # Suffix matching
      metadata:
        display_name: "GPT-4o"

Priority note: Exact aliases are always matched before wildcard patterns. When both could match, the exact alias wins.

Using Aliases for Model Variants

Aliases are particularly useful for:

  • Different quantizations: qwen3-32b-i1, qwen3-23b-i4 → all use qwen3 metadata
  • Version variations: gpt-4-0125-preview, gpt-4-turbo → share gpt-4 metadata
  • Deployment variations: llama-3-70b-instruct, llama-3-70b-chat → same base model
  • Dated versions: claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241201 → share metadata (automatic with date suffix handling)

Example configuration with aliases:

model_configs:
    - id: "qwen3"
      aliases:
        - "qwen3-32b-i1"     # 32B with 1-bit quantization
        - "qwen3-23b-i4"     # 23B with 4-bit quantization
        - "qwen3-16b-q8"     # 16B with 8-bit quantization
        - "qwen3-*"          # Wildcard for any other qwen3 variant
      metadata:
        display_name: "Qwen 3"
        summary: "Alibaba's Qwen model family"
        # ... rest of metadata

Aliases vs. suffix normalization: when to use which

Two coverage layers resolve non-canonical model ids to their owning metadata entry: explicit YAML aliases, and the layered suffix-peel allowlist in src/models/pattern_matching.rs. They are complementary, not redundant. This section explains how to choose between them when adding a new entry.

Matching phase order

The pipeline runs in this order, and a successful match in an earlier phase short-circuits the later ones:

  1. Exact model id (case-sensitive).
  2. Exact alias (case-sensitive).
  3. Date-suffix normalization (-YYYYMMDD, -YYYY-MM-DD, -YYMM, @YYYYMMDD).
  4. Layered quantization / format / flavor peel (case-insensitive; after each peel, exact-id + exact-alias + date-suffix phases re-run; combined date + format handled in the same loop).
  5. HuggingFace repo-prefix stripping (vendor/repo -> repo) with re-entry into phases 1-4 on the stripped residual. Single-hop re-entry; phase 5 does not recurse.
  6. Wildcard alias (glob-style * patterns).

A retained explicit alias runs in phase 2, strictly before the peel (phase 4) and before the prefix-strip layer (phase 5). When a retained alias and a peel-or-strip path would resolve to different metadata, the alias wins deterministically. Aliases are therefore a stronger intent signal than peel-or-strip coverage, not a weaker one.

The three alias classes

Every alias in model-metadata.yaml falls into one of three classes.

peel-coverable

Normalization reaches the same owner id without the alias, and the target metadata is the correct one. These are deletion candidates. Example: qwen3.6-35b-a3b-instruct as an alias of qwen3.6-35b-a3b. Phase 4 peels the FLAVOR token -instruct and lands on the base id directly, so the explicit alias adds no coverage. The 64 aliases removed in issue #557 were all in this class, and each one has a regression assert in tests/format_suffix_normalization_test.rs::real_metadata_removed_aliases_still_resolve.

vendor-prefix

The alias carries a vendor or repo prefix that suffix peel cannot strip, because peel only removes right-side tokens from a closed allowlist. Historically (pre-#555) such aliases were strictly load-bearing because the old namespace-fallback phase was case-sensitive; after #555 introduced phase 5 (HuggingFace prefix stripping with re-entry into phases 1-4, where phase 4 is case-insensitive), the mixed-case HF form resolves without the alias. Example: Qwen/Qwen3.6-35B-A3B as an alias of qwen3.6-35b-a3b. Phase 2 still wins on the explicit alias today, but phase 5 would also reach the base id via Qwen/ -> residual Qwen3.6-35B-A3B -> phase 4 case-insensitive match. These aliases are now peel-coverable-adjacent. Retroactive removal is deferred to a follow-up audit; keep for now, with a YAML comment noting the covering phase.

intentional-override

The alias deliberately routes a differently-weighted model under another entry's metadata, as an operator decision. Keep. Example: smoothie-qwen3-32b-i1 as an alias of smoothie-qwen3. The smoothie-qwen3-32b-i1 fine-tune has its own weights; the operator has chosen to surface it under the umbrella smoothie-qwen3 metadata rather than give it a dedicated entry. Peel must not infer this equivalence on its own. When an alias sits in this class, the YAML comment on the line must note that the underlying weights differ from the owner id, so a future reader or auditor can tell an intentional override apart from a mechanical normalization gap.

Guidance for adding a new alias

Before adding a line to model-metadata.yaml, ask whether peel already covers it.

  • If the new id is a canonical base with a trailing quantization, format, or flavor token already in the allowlist, and the weights are equivalent to the base metadata, do not add the alias. The peel handles it, and adding the alias would be dead code.
  • If the new id shares weights with the base but ends in a token class that peel does not yet handle (for example, a novel fine-tune label like -abliterated or a new quantization format like -nf4), prefer extending the peel allowlist in src/models/pattern_matching.rs. This is a code change with test coverage, and it lifts an entire class of future variants in one move.
  • If the new id has a vendor prefix, a repo namespace that normalization would not case-match, a parameter-count token blocking the peel chain (-Nb, -aNb, -eNb), or intentionally-different weights, add the alias with a YAML comment that states the reason. If weights differ from the owner id, say so in the comment.
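
For example, an intentional-override alias recorded with its reason (a sketch; the display name is illustrative):

models:
    - id: "smoothie-qwen3"
      aliases:
        # Intentional override: the smoothie-qwen3-32b-i1 fine-tune has different
        # weights from the owner id; the operator surfaces it under this entry.
        - "smoothie-qwen3-32b-i1"
      metadata:
        display_name: "Smoothie Qwen 3"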

Surface distinction: code-gated vs. YAML-gated

| Change site | Gate | Release cadence | Use for |
| --- | --- | --- | --- |
| Peel allowlist in src/models/pattern_matching.rs | Code review + Rust release | Ships with the next router release | Strategic normalization that covers a whole token class across all models. |
| Aliases in model-metadata.yaml | YAML review + hot reload | Same-day reload via admin API | Individual overrides, vendor-prefix fixes, weight-differing overrides, and emergency coverage for novel tokens before they earn a peel allowlist entry. |

The peel allowlist is the strategic layer. Aliases are the tactical override and emergency channel.

Token categories already on the peel allowlist

The allowlist in src/models/pattern_matching.rs currently covers:

  • BIT_WIDTH: -2bit, -3bit, -4bit, -5bit, -6bit, -8bit, -16bit
  • GGUF_QUANT: -Q4_K_M, -Q4_K_S, -Q5_K_M, -Q6_K, -Q8_0, -Q2_K, -IQ2_XS, -IQ3_XXS, -IQ4_XS, -F16, -F32, -BF16
  • FP_FORMAT: -FP4, -FP8, -FP16, -FP32, -NVFP4, -MXFP4
  • INT_FORMAT: -INT2, -INT4, -INT8
  • LIBRARY: -AWQ, -GPTQ, -BNB, -HQQ, -EXL2, -EXL3, -MLX
  • IMATRIX: -i1 through -i8, -q2 through -q8
  • UNSLOTH: -UD-Q<digit>_<KIND>, -UD-IQ<digit>_<KIND>
  • CONTAINER: -GGUF, -GGML, -SAFETENSORS
  • FLAVOR: -it, -instruct, -chat, -base, -thinking, -qat

Parameter-count suffixes (-Nb, -aNb, -eNb, -0.6b, -1.7b) are never peeled. They are part of canonical model identity and terminate the peel chain. This is why qwen3-32b-i1 must be kept as an explicit alias of qwen3: phase 4 strips -i1 and then halts at -32b, so without the alias the chain exhausts before reaching the base id.

HuggingFace repo-prefix stripping (phase 5)

Phase 5 normalizes HuggingFace-style vendor/repo prefixes off the left side of a model id, complementing the right-side suffix peel. It was added in issue #555 to resolve the common HF-GGUF class where a user submits an id like unsloth/Qwen3.6-35B-A3B-GGUF and expects it to route to the canonical qwen3.6-35b-a3b metadata without an explicit alias for every vendor x base x quant combination.

How phase 5 runs

  1. The input is inspected for a / separator. No /, no-op.
  2. Total segments (count of / plus one) must be at most MAX_PREFIX_SEGMENTS (3). org/team/repo is permitted; a/b/c/d/model is rejected outright.
  3. All segments must be non-empty and free of ASCII whitespace. Malformed inputs like /repo, vendor/, vendor//repo, or vendor /repo are rejected.
  4. On success, the residual is the substring after the last /. This residual is fed back into phases 1-4 with the re-entry gate closed. Phase 5 does not recurse: the inner call cannot trigger phase 5 again, so the recursion depth is exactly 1 by construction.

Composition with suffix peel

The re-entry runs through phase 4, so prefix stripping composes with suffix peel in a single lookup. unsloth/Qwen3.6-35B-A3B-GGUF strips to Qwen3.6-35B-A3B-GGUF, phase 4 peels -GGUF, case-insensitively matches qwen3.6-35b-a3b. This is the motivating case for the phase and covers HuggingFace GGUF forks without requiring hand-enumerated aliases.

Registered-alias precedence

Operators who explicitly register a vendor/repo form as a YAML alias keep deterministic control. Phase 2 runs before phase 5, so the exact alias wins before the stripping layer ever considers the input. Use this when the prefixed form must route to a different metadata entry than the canonical base id would.
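
A sketch of such a registration (the dedicated entry and its metadata are hypothetical):

models:
    - id: "qwen3.6-35b-a3b-custom"
      aliases:
        # Exact alias: phase 2 wins before phase 5 prefix stripping, so this
        # prefixed form never falls through to the canonical qwen3.6-35b-a3b entry.
        - "Qwen/Qwen3.6-35B-A3B"
      metadata:
        display_name: "Qwen 3.6 35B A3B (custom build)"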

Out of scope

  • Hyphen-delimited vendor prefixes (e.g., smoothie-qwen/smoothie-qwen3-32b-i1). Different semantic class, different detection difficulty, and often represents different weights where silent base-metadata routing is the wrong call. A future issue may revisit if demand materializes.
  • Automatic vendor discovery from the HuggingFace API. The layer is purely syntactic.
  • Extending the suffix peel allowlist. Orthogonal change; follow the peel-extension path for novel token classes.

Security bounds

Parallels the suffix peel:

| Bound | Value | Effect |
| --- | --- | --- |
| MAX_PREFIX_SEGMENTS | 3 | Inputs with more segments are rejected before any scan. |
| MAX_MODEL_ID_LEN | 256 | Oversized inputs skip phase 5 just like phase 4. |
| Re-entry depth | 1 | Structurally enforced via a recursion gate, not a counter. |

Phase 5 is constant-time on adversarial input: after the segment-count, emptiness, whitespace, and length guards, the work reduces to a single slice lookup plus one additional pass through phases 1-4.

Audit procedure

To re-audit the YAML, run:

cargo test --test alias_audit_helper -- --ignored --nocapture audit_metadata_aliases

The helper prints every alias with its classification (REDUNDANT, LOAD-BEARING-DRIFT, LOAD-BEARING-LOSS, or WILDCARD) and the post-removal resolution target. The current snapshot is captured in docs/reports/alias-audit-2026-04.md.

API Response

The /v1/models endpoint returns enriched model information:

{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openai",
      "backends": ["openai-proxy"],
      "metadata": {
        "display_name": "GPT-4",
        "summary": "Most capable GPT-4 model for complex tasks",
        "capabilities": ["text", "image", "function_calling"],
        "knowledge_cutoff": "2024-04",
        "pricing": {
          "input_tokens": 0.03,
          "output_tokens": 0.06
        },
        "limits": {
          "context_window": 128000,
          "max_output": 4096
        }
      }
    }
  ]
}

Hot Reload

Continuum Router supports hot reload for runtime configuration updates without server restart. Configuration changes are detected automatically and applied based on their classification.

Configuration Item Classification

Configuration items are classified into three categories based on their hot reload capability:

Immediate Update (No Service Interruption)

These settings update immediately without any service disruption:

# Logging configuration
logging:
  level: "info"                  # ✅ Immediate: Log level changes apply instantly
  format: "json"                 # ✅ Immediate: Log format changes apply instantly

# Rate limiting settings
rate_limiting:
  enabled: true                  # ✅ Immediate: Enable/disable rate limiting
  limits:
    per_client:
      requests_per_second: 10    # ✅ Immediate: New limits apply immediately
      burst_capacity: 20         # ✅ Immediate: Burst settings update instantly

# Circuit breaker configuration
circuit_breaker:
  enabled: true                  # ✅ Immediate: Enable/disable circuit breaker
  failure_threshold: 5           # ✅ Immediate: Threshold updates apply instantly
  timeout_seconds: 60            # ✅ Immediate: Timeout changes immediate

# Retry configuration
retry:
  max_attempts: 3                # ✅ Immediate: Retry policy updates instantly
  base_delay: "100ms"            # ✅ Immediate: Backoff settings apply immediately
  exponential_backoff: true      # ✅ Immediate: Strategy changes instant

# Global prompts
global_prompts:
  default: "You are helpful"       # ✅ Immediate: Prompt changes apply to new requests
  default_file: "prompts/system.md"  # ✅ Immediate: File-based prompts also hot-reload

# Admin statistics
admin:
  stats:
    retention_window: "24h"        # ✅ Immediate: Retention window updates instantly
    token_tracking: true           # ✅ Immediate: Token tracking toggle applies immediately

Gradual Update (Existing Connections Maintained)

These settings apply to new connections while maintaining existing ones:

# Backend configuration
backends:
    - name: "ollama"             # ✅ Gradual: New requests use updated backend pool
      url: "http://localhost:11434"
      weight: 2                  # ✅ Gradual: Load balancing updates for new requests
      models: ["llama3.2"]       # ✅ Gradual: Model routing updates gradually

# Health check settings
health_checks:
  interval: "30s"                # ✅ Gradual: Next health check cycle uses new interval
  timeout: "10s"                 # ✅ Gradual: New checks use updated timeout
  unhealthy_threshold: 3         # ✅ Gradual: Threshold applies to new evaluations
  healthy_threshold: 2           # ✅ Gradual: Recovery threshold updates gradually

# Timeout configuration
timeouts:
  connection: "10s"              # ✅ Gradual: New requests use updated timeouts
  request:
    standard:
      first_byte: "30s"          # ✅ Gradual: Applies to new requests
      total: "180s"              # ✅ Gradual: New requests use new timeout
    streaming:
      chunk_interval: "30s"      # ✅ Gradual: New streams use updated settings

Requires Restart (Hot Reload Not Possible)

These settings require a server restart to take effect. Changes are logged as warnings:

server:
  bind_address: "0.0.0.0:8080"   # ❌ Restart required: TCP/Unix socket binding
  # bind_address:                 # ❌ Restart required: Any address changes
  #   - "0.0.0.0:8080"
  #   - "unix:/var/run/router.sock"
  socket_mode: 0o660              # ❌ Restart required: Socket permissions
  workers: 4                      # ❌ Restart required: Worker thread pool size

When these settings are changed, the router will log a warning like:

WARN server.bind_address changed from '0.0.0.0:8080' to '0.0.0.0:9000' - requires restart to take effect

Hot Reload Process

  1. File System Watcher - Detects configuration file changes automatically
  2. Configuration Loading - New configuration is loaded and parsed
  3. Validation - New configuration is validated against schema
  4. Change Detection - ConfigDiff computation identifies what changed
  5. Classification - Changes are classified (immediate/gradual/restart)
  6. Atomic Update - Valid configuration is applied atomically
  7. Component Propagation - Updates are propagated to affected components:

    • HealthChecker updates check intervals and thresholds
    • RateLimitStore updates rate limiting rules
    • CircuitBreaker updates failure thresholds and timeouts
    • BackendPool updates backend configuration

  8. Immediate Health Check - When backends are added, an immediate health check is triggered so new backends become available within 1-2 seconds instead of waiting for the next periodic check
  9. Error Handling - If the new configuration is invalid, the error is logged and the old configuration is retained

Checking Hot Reload Status

Use the admin API to check hot reload status and capabilities:

# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status

# View current configuration
curl http://localhost:8080/admin/config

Hot Reload Behavior Examples

Example 1: Changing Log Level (Immediate)

# Before
logging:
  level: "info"

# After
logging:
  level: "debug"
Result: Log level changes immediately. No restart needed. Ongoing requests continue, new logs use debug level.

Example 2: Adding a Backend (Gradual with Immediate Health Check)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

Result: New backend added to pool with immediate health check triggered. The new backend becomes available within 1-2 seconds (instead of waiting up to 30 seconds for the next periodic health check). Existing requests continue to current backends. New requests can route to lmstudio once health check passes.

Example 2b: Removing a Backend (Graceful Draining)

# Before
backends:
    - name: "ollama"
      url: "http://localhost:11434"
    - name: "lmstudio"
      url: "http://localhost:1234"

# After
backends:
    - name: "ollama"
      url: "http://localhost:11434"
Result: Backend "lmstudio" enters draining state. New requests are not routed to it, but existing in-flight requests (including streaming) continue until completion. After all references are released (or after 5 minutes timeout), the backend is fully removed from memory.

Backend State Lifecycle

When a backend is removed from configuration, it goes through a graceful shutdown process:

  1. Active → Draining: Backend is marked as draining. New requests skip this backend.
  2. In-flight Completion: Existing requests/streams continue uninterrupted.
  3. Cleanup: Once all references are released, or after 5-minute timeout, the backend is removed.

This ensures zero impact on ongoing connections during configuration changes.

Example 3: Changing Bind Address (Requires Restart)

# Before
server:
  bind_address: "0.0.0.0:8080"

# After
server:
  bind_address: "0.0.0.0:9000"

Result: Warning logged. Change does not take effect. Restart required to bind to the new port.

Distributed Tracing

Continuum Router supports distributed tracing for request correlation across backend services. This feature helps with debugging and monitoring requests as they flow through multiple services.

Configuration

tracing:
  enabled: true                         # Enable/disable distributed tracing (default: true)
  w3c_trace_context: true               # Support W3C Trace Context header (default: true)
  headers:
    trace_id: "X-Trace-ID"              # Header name for trace ID (default)
    request_id: "X-Request-ID"          # Header name for request ID (default)
    correlation_id: "X-Correlation-ID"  # Header name for correlation ID (default)

How It Works

  1. Trace ID Extraction: When a request arrives, the router extracts trace IDs from headers in the following priority order:

    • W3C traceparent header (if W3C support enabled)
    • Configured trace_id header (X-Trace-ID)
    • Configured request_id header (X-Request-ID)
    • Configured correlation_id header (X-Correlation-ID)

  2. Trace ID Generation: If no trace ID is found in headers, a new UUID is generated.

  3. Header Propagation: The trace ID is propagated to backend services via multiple headers:

    • X-Request-ID: For broad compatibility
    • X-Trace-ID: Primary trace identifier
    • X-Correlation-ID: For correlation tracking
    • traceparent: W3C Trace Context (if enabled)
    • tracestate: W3C Trace State (if present in original request)

  4. Retry Preservation: The same trace ID is preserved across all retry attempts, making it easy to correlate multiple backend requests for a single client request.

Structured Logging

When tracing is enabled, all log messages include the trace_id field:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "message": "Processing chat completions request",
  "backend": "openai",
  "model": "gpt-4o"
}

W3C Trace Context

When w3c_trace_context is enabled, the router supports the W3C Trace Context standard:

  • Incoming: Parses traceparent header (format: 00-{trace_id}-{span_id}-{flags})
  • Outgoing: Generates new traceparent header with preserved trace ID and new span ID
  • State: Forwards tracestate header if present in original request

Example traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Disabling Tracing

To disable distributed tracing:

tracing:
  enabled: false

Load Balancing Strategies

load_balancer:
  strategy: "round_robin"         # round_robin, weighted, random
  health_aware: true              # Only use healthy backends

Strategies:

  • round_robin: Equal distribution across backends
  • weighted: Distribution based on backend weights
  • random: Random selection (good for avoiding patterns)
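
A sketch of a weighted setup (backend names, URLs, and weights are illustrative):

load_balancer:
  strategy: "weighted"
  health_aware: true

backends:
    - name: "primary-gpu"
      url: "http://10.0.0.10:8000"
      weight: 3                  # receives roughly three times the traffic of a weight-1 backend
    - name: "spare-gpu"
      url: "http://10.0.0.11:8000"
      weight: 1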

Per-Backend Retry Configuration

backends:
    - name: "slow-backend"
      url: "http://slow.example.com"
      retry_override:             # Override global retry settings
        max_attempts: 5           # More attempts for slower backends
        base_delay: "500ms"       # Longer delays
        max_delay: "60s"

Model Fallback

Continuum Router supports automatic model fallback when the primary model is unavailable. This feature integrates with the circuit breaker for layered failover protection.

Pre-Stream vs. Mid-Stream Fallback

The router provides two independent fallback mechanisms:

| Mechanism | When it activates | Config section | Default |
| --- | --- | --- | --- |
| Pre-stream fallback | Before or at the start of a response: connection errors, timeouts, trigger error codes, unhealthy backend at routing time | fallback | Enabled when fallback.enabled: true |
| Mid-stream fallback | After streaming has started and the backend fails mid-response | fallback + streaming.mid_stream_fallback | Activates when fallback.enabled: true and a fallback chain is configured. Continuation mode is enabled by default. |

When fallback.enabled: true and a fallback chain is configured for the requested model, mid-stream connection drops are suppressed and the router transparently switches to the next backend, even if streaming.mid_stream_fallback.enabled is false.

streaming.mid_stream_fallback.enabled controls continuation behavior only: whether the fallback backend receives a continuation prompt (using accumulated partial response) or a full restart of the original request. The default is true (continuation mode), which produces uninterrupted output for the client. Setting it to false forces restart mode, which may cause duplicate or incoherent content if partial output was already sent to the client.

Configuration

fallback:
  enabled: true

  # Define fallback chains for each primary model
  fallback_chains:
    # Same-provider fallback
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

    "claude-opus-4-5-20251101":
      - "claude-sonnet-4-5"
      - "claude-haiku-4-5"

    # Cross-provider fallback
    "gemini-2.5-pro":
      - "gemini-2.5-flash"
      - "gpt-4o"  # Falls back to OpenAI if Gemini unavailable

  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      model_not_found: true
      circuit_breaker_open: true

    max_fallback_attempts: 3
    fallback_timeout_multiplier: 1.5
    preserve_parameters: true

  model_settings:
    "gpt-4o":
      fallback_enabled: true
      notify_on_fallback: true

Trigger Conditions

| Condition | Description |
| --- | --- |
| error_codes | HTTP status codes that trigger fallback (e.g., 429, 500, 502, 503, 504) |
| timeout | Request timeout |
| connection_error | TCP connection failures |
| model_not_found | Model not available on backend |
| circuit_breaker_open | Backend circuit breaker is open |

Response Headers

When fallback is used, the following headers are added to the response:

| Header | Description | Example |
| --- | --- | --- |
| X-Fallback-Used | Indicates fallback was used | true |
| X-Original-Model | Originally requested model | gpt-4o |
| X-Fallback-Model | Model that served the request | gpt-4-turbo |
| X-Fallback-Reason | Why fallback was triggered | error_code_429 |
| X-Fallback-Attempts | Number of fallback attempts | 2 |

Cross-Provider Parameter Translation

When falling back across providers (e.g., OpenAI → Anthropic), the router automatically translates request parameters:

| OpenAI Parameter | Anthropic Parameter | Notes |
| --- | --- | --- |
| max_tokens | max_tokens | Auto-filled if missing (required by Anthropic) |
| temperature | temperature | Direct mapping |
| top_p | top_p | Direct mapping |
| stop | stop_sequences | Array conversion |

Provider-specific parameters are automatically removed or converted during cross-provider fallback.
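
Illustratively (a sketch of the mapping, not an exact wire format; the values are made up):

# Parameters on the original OpenAI-style request
original:
  temperature: 0.7
  top_p: 0.9
  stop: ["END"]
  # max_tokens omitted by the client

# Parameters after translation for the Anthropic fallback
translated:
  temperature: 0.7            # direct mapping
  top_p: 0.9                  # direct mapping
  stop_sequences: ["END"]     # stop converted to stop_sequences
  max_tokens: 4096            # auto-filled because Anthropic requires it (value illustrative)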

Integration with Circuit Breaker

The fallback system works in conjunction with the circuit breaker:

  1. Circuit Breaker detects failures and opens when threshold is exceeded
  2. Fallback chain activates when circuit breaker is open
  3. Requests route to fallback models based on configured chains
  4. Circuit breaker tests recovery and closes when backend recovers
# Example: Combined circuit breaker and fallback configuration
circuit_breaker:
  enabled: true
  failure_threshold: 5
  timeout: 60s

fallback:
  enabled: true
  fallback_policy:
    trigger_conditions:
      circuit_breaker_open: true  # Link to circuit breaker

Mid-Stream Fallback

Mid-stream fallback allows the router to transparently continue an active SSE stream on a fallback backend when the primary backend fails mid-response. The client's connection remains open and sees an uninterrupted response with only a brief pause during the switchover.

Mid-stream fallback activates automatically when fallback.enabled: true and a fallback chain is configured for the requested model. The streaming.mid_stream_fallback section controls how the fallback backend is invoked (continuation vs restart mode), not whether fallback happens.

Configuration

fallback:
  enabled: true  # Required: enables mid-stream fallback path
  fallback_chains:
    "gpt-4o":
      - "gpt-4-turbo"
      - "gpt-3.5-turbo"

streaming:
  mid_stream_fallback:
    # Enable continuation mode (default: true).
    # When true, accumulated partial response is used to build a continuation prompt,
    # producing uninterrupted output for the client.
    # When false, the fallback backend restarts the request from scratch, which may
    # cause duplicate or incoherent content if partial output was already sent.
    enabled: true

    # Minimum estimated tokens accumulated before using continuation mode (default: 50)
    # Below this threshold the request is restarted from scratch on the fallback backend
    # instead of appending a continuation prompt.
    min_accumulated_tokens: 50

    # Maximum fallback attempts per streaming request (default: 2, max: 10)
    max_fallback_attempts: 2

    # Prompt appended as a user message after the partial assistant response
    continuation_prompt: "Continue from where you left off exactly. Do not repeat any previously generated content."

How It Works

  1. The client sends a streaming chat completion request.
  2. The router begins streaming from the primary backend, accumulating response content.
  3. If the backend fails mid-stream (connection drop, timeout, error event):

    • The error is NOT forwarded to the client.
    • The accumulated partial response is captured.
    • The next healthy backend in the fallback chain is selected (unhealthy backends are skipped).
    • A continuation or restart request is sent to the fallback backend.
    • Streaming resumes on the fallback backend without closing the client connection.
  4. The client receives an uninterrupted response with only a brief pause during the switchover.

Continuation vs. Restart Mode

The min_accumulated_tokens threshold controls which recovery mode is used:

| Condition | Mode | Behavior |
| --- | --- | --- |
| enabled: true (default), tokens ≥ min_accumulated_tokens, and not truncated | Continuation | Original messages + partial assistant response + continuation prompt |
| enabled: true (default) and tokens < min_accumulated_tokens | Restart | Original request replayed (not enough context to continue) |
| enabled: true (default) and content truncated (> 100 KB) | Restart | Forced restart to avoid incoherent context |
| mid_stream_fallback.enabled: false | Restart | Original request replayed on fallback backend from scratch |

Continuation mode (the default) produces uninterrupted output for the client. Restart mode is used automatically when there is too little context to continue meaningfully, or when the accumulated response is too long to include safely. Explicitly setting enabled: false forces restart mode unconditionally, which may cause duplicate or incoherent content visible to the client.
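
A sketch of what a continuation request to the fallback backend contains (message contents are illustrative):

# Continuation-mode request body (sketch)
messages:
  - role: user
    content: "Explain how the Raft consensus algorithm works."    # original user message
  - role: assistant
    content: "Raft elects a leader that..."                       # accumulated partial response
  - role: user
    content: "Continue from where you left off exactly. Do not repeat any previously generated content."    # continuation_prompt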

Edge Case Handling

The mid-stream fallback path addresses several edge cases automatically:

  • Global timeout budget: All fallback attempts share the original request start time. Each attempt checks remaining budget before sending, preventing indefinite timeout accumulation across the chain.
  • Cross-provider parameter translation: When the fallback model is on a different provider (e.g., OpenAI → Anthropic), request parameters are automatically translated: provider-specific fields removed and parameter names mapped.
  • Concurrent request storms: A global semaphore (50 permits) limits simultaneous fallback attempts. Requests that cannot acquire a permit within 5 seconds are rejected gracefully.
  • Accumulator truncation: When accumulated response content exceeds 100 KB, the continuation mode is forced to restart to avoid sending incoherent context to the fallback backend.
  • Health re-check: Backend health is re-verified before each fallback attempt in the chain. Unhealthy backends are skipped to the next entry.
  • Missing [DONE] marker: Streams ending without [DONE] but with finish_reason: "stop" are treated as completed successfully, preventing unnecessary fallback.

Metrics

Three Prometheus metrics track mid-stream fallback activity. See Mid-Stream Fallback Metrics for details.

Minimizing Failover Latency

When a backend goes down during streaming, the time until the fallback backend takes over depends on several configuration parameters across different subsystems. Below is a tuning guide for minimizing this switchover delay.

How failover delay is composed

The total time a client waits during a mid-stream failover is roughly:

failover_delay ≈ failure_detection_time + health_recheck_time + fallback_connection_time

Each component maps to specific configuration:

| Component | What determines it | Default | Tuning target |
| --- | --- | --- | --- |
| Failure detection | Stream inactivity timeout (hardcoded 60 s), TCP read error (immediate), or chunk_interval timeout | 30–60 s | Lower chunk_interval |
| Health re-check | Health check before fallback attempt | timeout: 5s | Keep low |
| Fallback connection | TCP connect + TLS handshake to fallback backend | connection: 10s | Lower connection |

# 1. Timeouts — the most impactful settings for failover speed
timeouts:
  connection: 5s               # Faster TCP connect timeout (default: 10s)
  request:
    streaming:
      first_byte: 30s          # How long to wait for the first token (default: 60s)
      chunk_interval: 10s      # Max silence between chunks before treating as failure (default: 30s)
      total: 600s              # Total streaming budget (keep generous)

# 2. Health checks — detect backend failures proactively
health_checks:
  interval: 10s                # Check every 10s instead of 30s (default: 30s)
  timeout: 3s                  # Fail health checks faster (default: 5s)
  unhealthy_threshold: 2       # Mark unhealthy after 2 failures (default: 3)
  healthy_threshold: 1         # Recover after 1 success (default: 2)
  warmup_check_interval: 1s   # Fast checks during backend startup

# 3. Circuit breaker — stop routing to a failed backend immediately
circuit_breaker:
  enabled: true
  failure_threshold: 3         # Open circuit after 3 failures (default: 5)
  timeout: 30s                 # Try recovery after 30s (default: 60s)
  half_open_max_requests: 2
  half_open_success_threshold: 1
  timeout_as_failure: true     # Count timeouts toward circuit breaker

# 4. Fallback chain — must be configured for mid-stream fallback to activate
fallback:
  enabled: true
  fallback_chains:
    "gpt-4o":
        - "gpt-4-turbo"
        - "gpt-3.5-turbo"
  fallback_policy:
    trigger_conditions:
      error_codes: [429, 500, 502, 503, 504]
      timeout: true
      connection_error: true
      circuit_breaker_open: true

# 5. Mid-stream fallback — continuation mode (default: enabled)
streaming:
  mid_stream_fallback:
    enabled: true              # Use continuation mode (default)
    max_fallback_attempts: 3   # Allow more retries for resilience (default: 2)
    min_accumulated_tokens: 30 # Lower threshold for continuation vs restart (default: 50)

Parameter impact summary

| Parameter | Effect on failover speed | Trade-off |
| --- | --- | --- |
| timeouts.request.streaming.chunk_interval | High — directly controls how quickly a stalled stream is detected | Too low may cause false positives on slow models (e.g., reasoning models with long thinking phases) |
| timeouts.connection | Medium — limits TCP connect delay to fallback backend | Too low may fail on high-latency networks |
| health_checks.interval | Medium — faster detection means the circuit breaker opens sooner, preventing requests from reaching a dead backend | More frequent checks increase backend load |
| health_checks.unhealthy_threshold | Medium — fewer failures needed to mark backend unhealthy | Lower values increase sensitivity to transient errors |
| circuit_breaker.failure_threshold | Medium — fewer failures to open circuit | Too aggressive may open circuit on temporary spikes |
| circuit_breaker.timeout | Low — affects recovery time, not failover speed | Shorter means faster recovery but more probing of unhealthy backends |
| mid_stream_fallback.max_fallback_attempts | Low — more attempts increase resilience but not speed of individual switchover | More attempts consume more of the global timeout budget |

Failure detection scenarios

Different failure types are detected at different speeds:

| Failure type | Detection time | Mechanism |
| --- | --- | --- |
| TCP connection reset / backend crash | Immediate (< 1 s) | Stream read error triggers instant fallback |
| Backend returns 5xx error | Immediate (< 1 s) | HTTP status check before streaming begins |
| Backend becomes unresponsive (stall) | chunk_interval (default 30 s) | Inactivity timeout on the stream |
| Backend sends error SSE events | After 5 errors | Error count threshold in stream processing |
| Backend process killed mid-response | Immediate (< 1 s) | TCP FIN/RST detected as stream read error |

The most common scenario in production, a backend becoming unresponsive, is governed by chunk_interval. For latency-sensitive applications, lowering this to 10–15 seconds is recommended, with model-specific overrides for slow models:

timeouts:
  request:
    streaming:
      chunk_interval: 10s      # Fast detection for most models
    model_overrides:
      gemini-2.5-pro:          # Reasoning models need longer intervals
        streaming:
          chunk_interval: 30s
          first_byte: 120s

Rate Limiting

Continuum Router includes built-in rate limiting for the /v1/models endpoint to prevent abuse and ensure fair resource allocation.

Current Configuration

Rate limiting is currently configured with the following default values:

# Note: These values are currently hardcoded but may become configurable in future versions
rate_limiting:
  models_endpoint:
    # Per-client limits (identified by API key or IP address)
    sustained_limit: 100          # Maximum requests per minute
    burst_limit: 20               # Maximum requests in any 5-second window

    # Time windows
    window_duration: 60s          # Sliding window for sustained limit
    burst_window: 5s              # Window for burst detection

    # Client identification priority
    identification:
      - api_key                   # Bearer token (first 16 chars used as ID)
      - x_forwarded_for           # Proxy/load balancer header
      - x_real_ip                 # Alternative IP header
      - fallback: "unknown"       # When no identifier available

How It Works

  1. Client Identification: Each request is associated with a client using:

    • API key from Authorization: Bearer <token> header (preferred)
    • IP address from proxy headers (fallback)

  2. Dual-Window Approach:

    • Sustained limit: Prevents excessive usage over time
    • Burst protection: Catches rapid-fire requests

  3. Independent Quotas: Each client has separate rate limits:

    • Client A with API key abc123...: 100 req/min
    • Client B with API key def456...: 100 req/min
    • Client C from IP 192.168.1.1: 100 req/min

Response Headers

When rate limited, the response includes:

  • Status Code: 429 Too Many Requests
  • Error Message: Indicates whether burst or sustained limit was exceeded

Cache TTL Optimization

To prevent cache poisoning attacks:

  • Empty model lists: Cached for 5 seconds only
  • Normal responses: Cached for 60 seconds

This prevents attackers from forcing the router to cache empty responses during backend outages.

Monitoring

Rate limit violations are tracked in metrics:

  • rate_limit_violations: Total rejected requests
  • empty_responses_returned: Empty model lists served
  • Per-client violation tracking for identifying problematic clients

Future Enhancements

Future versions may support:

  • Configurable rate limits via YAML/environment variables
  • Per-endpoint rate limiting
  • Custom rate limits per API key
  • Redis-backed distributed rate limiting

Smart Routing

Smart routing classifies incoming requests by complexity and domain, then routes them to the most appropriate model tier using configurable policies. It combines a model tier registry (mapping models to tiers and domains) with a rule-based request classifier and a policy engine that maps classification results to routing decisions.

When model: "auto" is used in a chat completion request, the pipeline runs: classify the request, evaluate policies top-to-bottom, select a model from the matched tier. The same pipeline runs for all requests when intercept_all: true.

Configuration

smart_routing:
  enabled: true

  # Default tier when no profile matches and auto-inference is inconclusive.
  # 1 = Flagship, 2 = Standard, 3 = Lightweight. Defaults to 2.
  default_tier: 2

  # Model name that triggers smart routing. Defaults to "auto".
  virtual_model: "auto"

  # When true, all requests go through smart routing regardless of model name.
  intercept_all: false

  model_profiles:
    # Exact model name
    - model: "gpt-4o"
      tier: 1
      domains: [general, code, reasoning, creative]
      cost_per_1k_input_tokens: 0.005
      cost_per_1k_output_tokens: 0.015

    # Another exact match
    - model: "gpt-4o-mini"
      tier: 2
      domains: [general, code]
      cost_per_1k_input_tokens: 0.00015
      cost_per_1k_output_tokens: 0.0006

    # Glob pattern — matches all GGUF Q4_K_M quantized models
    - model_pattern: "*-q4_K_M"
      tier: 3
      domains: [general]

  # Routing policies: first match wins, top-to-bottom evaluation
  routing_policies:
    - name: "trivial_to_lightweight"
      when:
        complexity: [trivial, simple]
        domain: [general]
      route_to:
        tier: 3

    - name: "code_to_flagship"
      when:
        domain: [code]
        complexity: [moderate, complex, expert]
      route_to:
        tier: 1
        prefer_domains: [code]

    - name: "vision_required"
      when:
        requires: [vision]
      route_to:
        tier: 1
        require_capabilities: [vision]

    - name: "complex_to_flagship"
      when:
        complexity: [complex, expert]
      route_to:
        tier: 1

    - name: "default_to_standard"
      when: {}                      # Catch-all (always matches)
      route_to:
        tier: 2

Tier Classification

| Tier        | Value | Meaning                          | Typical Examples                           |
|-------------|-------|----------------------------------|--------------------------------------------|
| Flagship    | 1     | Highest capability, highest cost | gpt-4o, claude-3.5-sonnet, gemini-1.5-pro  |
| Standard    | 2     | Balanced capability and speed    | gpt-4o-mini, claude-3-haiku                |
| Lightweight | 3     | Optimized for speed and low cost | llama-3-8b, phi-3-mini, quantized variants |

Domain Specialization Tags

| Tag          | Description                         |
|--------------|-------------------------------------|
| general      | No specific specialty               |
| code         | Code generation, debugging, review  |
| reasoning    | Complex multi-step reasoning, math  |
| creative     | Creative writing, storytelling      |
| multilingual | Translation and multilingual tasks  |
| vision       | Image understanding                 |

Auto-Inference

When a model has no matching explicit profile or glob pattern, the router infers its tier automatically using three sources in priority order:

  1. Pricing (from model-metadata.yaml): input cost >= $3/1k tokens maps to Flagship; >= $0.50/1k to Standard; below that to Lightweight. Zero-cost models skip pricing inference.

  2. Capabilities (from model-metadata.yaml): models with 3+ high-value capabilities (vision, reasoning, audio, video, function_calling, tool) map to Flagship; 1+ such capability or 3+ total capabilities map to Standard.

  3. Name heuristics: keywords like pro, ultra, opus, sonnet, turbo map to Flagship; keywords like mini, small, tiny, nano, lite, flash, haiku and quantization markers (q4_, q5_, q8_, gguf, gptq, awq) map to Lightweight.

Auto-inferred results are cached per model ID (up to 10,000 entries). The cache clears on hot-reload and when the /admin/smart-routing/model-profiles PUT endpoint is called.

Glob Pattern Syntax

Patterns use * as the only wildcard character. Multiple wildcards are supported.

| Pattern     | Matches                    | Does Not Match    |
|-------------|----------------------------|-------------------|
| gpt-*       | gpt-4o, gpt-4o-mini        | claude-3          |
| *-q4_K_M    | llama-3-8b-q4_K_M          | llama-3-8b-q5_K_M |
| gpt-*-turbo | gpt-4-turbo, gpt-3.5-turbo | gpt-4o            |
| *           | everything                 | nothing           |
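Because multiple wildcards are allowed, a single profile can cover a whole family of model names. A small illustrative fragment (the pattern and tier are examples, not defaults):

model_profiles:
  # Matches llama-3-8b-instruct-q4_K_M, llama-3-70b-instruct-awq, etc.
  - model_pattern: "llama-*-instruct-*"
    tier: 2
    domains: [general, code]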

Request Classifier

The rule-based classifier analyzes each request using 11 signal types and produces a ClassificationResult containing complexity level, domain tag, required capabilities, and confidence score.

Complexity Levels

| Level    | Description                             | Example                      |
|----------|-----------------------------------------|------------------------------|
| trivial  | Greetings, yes/no, single-fact lookup   | "What is 2+2?"               |
| simple   | Short explanations, basic summaries     | "Summarize this paragraph"   |
| moderate | Multi-step reasoning, medium code tasks | "Refactor this function"     |
| complex  | Advanced algorithms, system design      | "Design a distributed cache" |
| expert   | Research-level problems, formal proofs  | "Prove this theorem"         |

Classification Signals

| Signal                   | What it detects                             |
|--------------------------|---------------------------------------------|
| message_length           | Total token count across all messages       |
| code_blocks              | Fenced code blocks or inline code           |
| math_notation            | LaTeX, equations, mathematical symbols      |
| system_prompt_complexity | Length and complexity of the system prompt  |
| conversation_depth       | Number of turns in the conversation         |
| image_attachments        | Multimodal image content in messages        |
| tool_definitions         | Tool/function definitions in the request    |
| complexity_keywords      | Words like "optimize", "architect", "prove" |
| language_detection       | Non-Latin scripts or multilingual content   |
| creative_markers         | Words like "story", "poem", "imagine"       |
| analysis_markers         | Words like "analyze", "compare", "evaluate" |

Each detected signal contributes to the final complexity level and domain tag. Conflicting signals (e.g., both creative and code markers present) reduce the confidence score.

LLM-Based Classifier

The rule-based classifier is fast but can be ambiguous for borderline requests. When that is not accurate enough, the LLM-based classifier sends the request to a small, inexpensive model for classification. Three operating modes are available via classifier.method:

| Method | Behavior                                                                                     |
|--------|----------------------------------------------------------------------------------------------|
| rule   | Rule-based only (default). No LLM calls.                                                     |
| llm    | Always calls the LLM classifier; falls back to rule-based on failure.                        |
| hybrid | Rule-based first; escalates to LLM only when rule confidence is below confidence_threshold. |

Hybrid mode is the recommended production setting: it adds latency only for genuinely ambiguous requests (typically 10-20% of traffic), while trivial and clear-cut requests are classified in microseconds.

smart_routing:
  enabled: true

  classifier:
    # Classification method: "rule" (default), "llm", or "hybrid".
    method: hybrid

    rule:
      # Confidence below this threshold triggers LLM escalation in hybrid mode.
      # Range: 0.0 – 1.0. Default: 0.7.
      confidence_threshold: 0.7

    llm:
      # Model used for classification. Any fast, cheap model works well.
      model: "gpt-4o-mini"

      # Backend name to route classification requests to. Must be a configured
      # backend. If omitted, the router uses the backend URL directly.
      backend: "openai-fast"

      # Maximum time allowed for a classification request (milliseconds).
      timeout_ms: 2000

      # Maximum input tokens sent to the classifier (content is truncated).
      max_input_tokens: 500

      # Number of retries after a parse failure (0 or 1). Default: 1.
      max_retries: 1

      # Temperature for classification. 0.0 gives deterministic output.
      temperature: 0.0

      # Maximum output tokens in the classification response.
      max_output_tokens: 150

      # Structured output strategy. "auto" selects the best method for the
      # configured backend: json_schema (OpenAI/vLLM), tool_use (Anthropic),
      # json_object (Ollama/Gemini/LM Studio), prompt_only (others).
      structured_output: auto   # auto | json_schema | tool_use | json_object | prompt_only

      # Include built-in few-shot examples in the system prompt.
      few_shot_examples: true

      # Custom few-shot examples appended after the built-in ones.
      custom_examples:
        - user: "What is the recommended dose of ibuprofen?"
          classification:
            complexity: simple
            domain: medical

      # Cache TTL for classification results (seconds). Default: 300.
      cache_ttl_seconds: 300

      # Maximum number of cached entries. Default: 10000.
      max_cache_entries: 10000

      # Disable the LLM classifier when load reaches this state.
      # "critical" (default) or "warning". Set to "" to never disable.
      disable_under_load_state: critical

    # Extend the built-in complexity taxonomy with custom levels.
    custom_complexity_levels:
      - name: specialized
        description: "Requires a domain-specific fine-tuned model"
        rank: 6   # Optional ordering hint (higher = harder)

    # Extend the built-in domain taxonomy with custom categories.
    custom_domains:
      - name: medical
        description: "Medical and clinical questions"

Structured Output Strategies

The LLM classifier needs structured JSON from the classifier model. The auto strategy picks the right mechanism based on the backend type, but you can override it explicitly:

| Strategy    | Mechanism                                     | Supported backends                           |
|-------------|-----------------------------------------------|----------------------------------------------|
| json_schema | response_format: { type: "json_schema" }      | OpenAI, Azure, vLLM                          |
| tool_use    | Tool/function calling                         | Anthropic, Gemini                            |
| json_object | response_format: { type: "json_object" }      | OpenAI, Ollama, Gemini, LM Studio, llama.cpp |
| prompt_only | JSON extracted from free-form text via regex  | Any backend                                  |

When the classifier response cannot be parsed, the router retries once with a correction prompt (controlled by max_retries). If the retry also fails, the result from the rule-based classifier is used instead.

Classification Cache

Classification results are cached in memory with a configurable TTL to avoid repeated LLM calls for the same request. The cache key is a SHA-256 hash of the truncated user message, so requests with identical (truncated) message text share the same cached result. The cache is per-process and not shared across router instances.

Custom Taxonomy

Both complexity levels and domain tags are extensible. Custom values added via custom_complexity_levels and custom_domains appear in the classifier's system prompt and are accepted in the structured-output schema. Routing policies can reference custom values just like built-in ones:

routing_policies:
  - name: specialized_to_flagship
    when:
      complexity: [specialized]
    route_to:
      tier: 1

Bypass Header

The LLM classifier sends an X-Smart-Route-Bypass: true header with its classification requests. The router skips smart routing for any request carrying this header, preventing circular classification loops when the classifier backend is itself behind the same router instance.

Routing Policies

Policies are evaluated top-to-bottom; the first match wins. If no policy matches and no catch-all is defined, the request falls back to default_tier.

Policy Condition Logic

  • Fields within a when block are AND-ed: all specified fields must match.
  • Values within a single field are OR-ed: complexity: [trivial, simple] matches either.
  • when: {} is a catch-all that always matches.

Policy Fields

when conditions:

| Field      | Type     | Description                                        |
|------------|----------|----------------------------------------------------|
| complexity | [string] | Complexity levels that match (OR logic)            |
| domain     | [string] | Domain tags that match (OR logic)                  |
| requires   | [string] | Capabilities that must all be present (AND logic)  |

route_to action:

| Field                | Type     | Description                                          |
|----------------------|----------|------------------------------------------------------|
| tier                 | int      | Target tier (1=Flagship, 2=Standard, 3=Lightweight)  |
| prefer_domains       | [string] | Soft preference for domain-specialized models        |
| require_capabilities | [string] | Hard filter: model must have these capabilities      |
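Putting the condition and action fields together, a policy that sends hard coding requests needing tool use to a Flagship model might look like this (illustrative values, not defaults):

routing_policies:
  - name: "tooling_code_to_flagship"
    when:
      domain: [code]                    # OR within the list
      complexity: [complex, expert]     # OR within the list; AND-ed with domain
      requires: [function_calling]      # every listed capability must be present
    route_to:
      tier: 1
      prefer_domains: [code]
      require_capabilities: [function_calling]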

Model Selection Within a Tier

When multiple models belong to the matched tier, the selector scores them using:

  1. Domain preference match (soft bonus)
  2. Capability match (hard filter if require_capabilities is set)
  3. Cost scoring (lower cost scores higher within a tier)
  4. Random tiebreak for equal-score models

If no models are available in the matched tier, the router tries adjacent tiers in order (e.g., if Lightweight is empty, tries Standard, then Flagship).

Load-Aware Dynamic Tier Adjustment

Under normal conditions, smart routing picks the tier that best matches each request. When traffic spikes or latency rises, the load monitor tracks real-time metrics and automatically shifts routing to lighter model tiers until the system recovers.

Load management is disabled by default. Enable it under smart_routing.load_management:

smart_routing:
  enabled: true
  load_management:
    enabled: true

    # How often load metrics are evaluated (milliseconds). Default: 1000.
    assessment_interval_ms: 1000

    # Thresholds for entering Warning and Critical states.
    # Any single threshold being exceeded triggers the state transition.
    thresholds:
      warning:
        requests_per_second: 100
        avg_latency_ms: 3000
        error_rate: 0.05
        in_flight_requests: 50
      critical:
        requests_per_second: 200
        avg_latency_ms: 5000
        error_rate: 0.15
        in_flight_requests: 100

    # Routing restrictions applied per load state.
    degradation:
      warning:
        max_tier: 2           # Cap routing at Standard tier
        prefer_quantized: false
        reject_expert: false
      critical:
        max_tier: 3           # Cap routing at Lightweight tier
        prefer_quantized: true
        reject_expert: false

    # Recovery behavior.
    recovery:
      cooldown_seconds: 30    # Minimum time before downgrading the load state
      hysteresis_factor: 0.8  # Metric must drop to 80% of threshold to recover

Load States

| State    | Meaning                                       | Default routing restriction                            |
|----------|-----------------------------------------------|--------------------------------------------------------|
| normal   | All metrics within bounds                     | No restriction                                         |
| warning  | At least one metric above Warning threshold   | Capped at Standard tier (tier 2)                       |
| critical | At least one metric above Critical threshold  | Capped at Lightweight tier (tier 3), prefer quantized  |

When a request would normally route to Flagship (tier 1) but the load state is Warning, it is silently downgraded to Standard. The routing decision log records the adjusted policy name as <original_policy>__load_warning or <original_policy>__load_critical.

Hysteresis and Cooldown

Rapid oscillation between load states can itself cause instability. Two mechanisms prevent it:

  • Hysteresis: to leave Warning state, a metric must drop below threshold * hysteresis_factor (default 0.8), not just below the threshold. A system that entered Warning at 100 RPS stays there until RPS drops below 80.
  • Cooldown: after any state transition, recovery to a lower state is blocked for cooldown_seconds (default 30). Escalation (Normal to Warning, Warning to Critical) always bypasses the cooldown.

Per-Tier Threshold Overrides

If different tiers have different capacity characteristics, you can override thresholds per tier:

smart_routing:
  load_management:
    enabled: true
    thresholds:
      warning:
        requests_per_second: 100
    tier_thresholds:
      "1":                      # Tier 1 (Flagship) has a lower RPS tolerance
        warning:
          requests_per_second: 50
      "3":                      # Tier 3 (Lightweight) can handle more
        warning:
          requests_per_second: 300

Prometheus Metrics

When the metrics feature is enabled, the following counters and gauges are exported for load management:

| Metric                                | Type    | Description                                               |
|---------------------------------------|---------|-----------------------------------------------------------|
| smart_routing_load_state              | Gauge   | Current load state: 0=Normal, 1=Warning, 2=Critical       |
| smart_routing_tier_degradation_total  | Counter | Number of times routing was degraded due to load          |
| smart_routing_load_transitions_total  | Counter | Number of load state transitions, labeled by from and to  |

The LLM classifier exports six additional metrics:

| Metric                                           | Type      | Description                                       |
|--------------------------------------------------|-----------|---------------------------------------------------|
| smart_routing_llm_classifier_calls_total         | Counter   | Total LLM classifier invocations                  |
| smart_routing_llm_classifier_cache_hits_total    | Counter   | Classification results served from cache          |
| smart_routing_llm_classifier_duration_seconds    | Histogram | End-to-end LLM classification latency             |
| smart_routing_llm_classifier_fallbacks_total     | Counter   | Times the LLM classifier fell back to rule-based  |
| smart_routing_llm_classifier_parse_errors_total  | Counter   | Response parse failures before retry              |
| smart_routing_llm_classifier_retries_total       | Counter   | Retry attempts after initial parse failure        |
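To spot-check these series, scrape the router's metrics endpoint and filter for the smart routing prefix. The path below assumes the conventional /metrics endpoint on the same port; adjust it to wherever your deployment exposes Prometheus metrics:

# Illustrative; the metrics path depends on your deployment.
curl -s http://localhost:8080/metrics | grep ^smart_routing_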

Debug Response Headers

Set debug_headers: true to include smart routing decision details in HTTP response headers. This is intended for development and staging environments.

smart_routing:
  enabled: true
  debug_headers: true  # Enable in development/staging

When enabled, smart-routed responses include:

| Header                   | Description                                |
|--------------------------|--------------------------------------------|
| X-Smart-Route-Source     | Original model requested (e.g., auto)      |
| X-Smart-Route-Target     | Selected model (e.g., gpt-4o-mini)         |
| X-Smart-Route-Complexity | Classified complexity level                |
| X-Smart-Route-Domain     | Classified domain                          |
| X-Smart-Route-Policy     | Policy that matched                        |
| X-Smart-Route-Load-State | Load state at routing time                 |
| X-Smart-Route-Classifier | Classifier used (rule_based or llm_based)  |
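With debug_headers enabled, the decision can be inspected directly from the response headers. A sketch, again assuming an OpenAI-compatible chat endpoint on localhost:8080 and a placeholder key:

# Dump response headers, discard the body, and keep only the routing headers.
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-example-key" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
  | grep -i '^x-smart-route'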

Admin API

Smart routing exposes several admin endpoints under /admin/smart-routing/ for observability and management. The full endpoint reference is in the Admin API documentation.

Key endpoints:

  • GET /status -- overall status, load state, policy count
  • POST /classify -- classify a request without routing (diagnostic)
  • POST /simulate -- simulate the full routing pipeline
  • GET /policies and PUT /policies -- view and hot-reload policies
  • GET /load-state -- current load state with assessment details
  • GET /cache/stats and POST /cache/clear -- LLM classifier cache management
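For example, to check the overall status, the current load state, and the classifier cache (assuming the router and its admin API are reachable on localhost:8080):

# Overall smart routing status: load state, policy count
curl http://localhost:8080/admin/smart-routing/status

# Current load state with assessment details
curl http://localhost:8080/admin/smart-routing/load-state

# LLM classifier cache statistics
curl http://localhost:8080/admin/smart-routing/cache/stats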

Structured Logging

All smart routing decisions are logged at DEBUG level with structured fields:

level=DEBUG msg="Smart routing decision"
  source_model="auto"
  target_model="gpt-4o-mini"
  complexity="simple"
  domain="general"
  policy="trivial_to_lightweight"
  load_state="normal"
  classifier="rule_based"
  confidence=0.92
  classification_ms=0.3

Load state transitions and policy changes are logged at INFO level.

Hot Reload

The smart_routing section reloads immediately when the config file changes: routing_policies and load_management settings take effect without restarting the server, and the inferred-profile cache is cleared so all models are re-evaluated on the next request. Policies can also be updated at runtime via the PUT /admin/smart-routing/policies endpoint.


Environment-Specific Configurations

Development Configuration

# config/development.yaml
server:
  bind_address: "127.0.0.1:8080"

backends:
  - name: "local-ollama"
    url: "http://localhost:11434"

health_checks:
  interval: "10s"                 # More frequent checks
  timeout: "5s"

logging:
  level: "debug"                  # Verbose logging
  format: "pretty"                # Human-readable
  enable_colors: true

Production Configuration

# config/production.yaml
server:
  bind_address: "0.0.0.0:8080"
  workers: 8                      # More workers for production
  connection_pool_size: 300       # Larger connection pool

backends:
  - name: "primary-openai"
    url: "https://api.openai.com"
    weight: 3
  - name: "secondary-azure"
    url: "https://azure-openai.example.com"
    weight: 2
  - name: "fallback-local"
    url: "http://internal-llm:11434"
    weight: 1

health_checks:
  interval: "60s"                 # Less frequent checks
  timeout: "15s"                  # Longer timeout for network latency
  unhealthy_threshold: 5          # More tolerance
  healthy_threshold: 3

request:
  timeout: "120s"                 # Shorter timeout for production
  max_retries: 5                  # More retries

logging:
  level: "warn"                   # Less verbose logging
  format: "json"                  # Structured logging

Container Configuration

# config/container.yaml - optimized for containers
server:
  bind_address: "0.0.0.0:8080"
  workers: 0                      # Auto-detect based on container limits

backends:
  - name: "backend-1"
    url: "${BACKEND_1_URL}"       # Environment variable substitution
  - name: "backend-2"
    url: "${BACKEND_2_URL}"

logging:
  level: "${LOG_LEVEL}"           # Configurable via environment
  format: "json"                  # Always JSON in containers