
Health, Caching & Logging

Health Checks Section

Configures backend health monitoring:

health_checks:
  enabled: true                    # Enable/disable health monitoring
  interval: "30s"                  # Check frequency
  timeout: "10s"                   # Request timeout
  unhealthy_threshold: 3           # Failures before marking unhealthy
  healthy_threshold: 2             # Successes before marking healthy
  endpoint: "/v1/models"           # Endpoint to check
  warmup_check_interval: "1s"      # Accelerated interval during warmup
  max_warmup_duration: "300s"      # Maximum warmup detection duration

Health Check Process:

  1. Router queries the health endpoint on each backend
  2. Successful responses increment the success counter
  3. Failed responses increment the failure counter
  4. Backends are marked unhealthy after reaching the failure threshold
  5. Backends are marked healthy after reaching the success threshold
  6. Only healthy backends receive traffic
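
The threshold behavior above is a simple hysteresis counter. A minimal Python sketch (illustrative only, not the router's actual code; it assumes a success resets the failure counter and vice versa):

```python
class BackendHealth:
    """Tracks consecutive check results and flips state at the thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.successes = 0
        self.failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one health check result; return current health state."""
        if check_ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With the defaults, three consecutive failures take a backend out of rotation and two consecutive successes bring it back.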

Accelerated Warmup Health Checks

The router supports accelerated health checks during backend warmup, which is particularly useful for backends like llama.cpp that return HTTP 503 while loading models.

Backend States:

| State      | HTTP Response           | Behavior                      |
|------------|-------------------------|-------------------------------|
| ready      | 200 OK                  | Normal interval checks        |
| warming_up | 503 Service Unavailable | Accelerated interval checks   |
| down       | Connection failure      | Normal interval checks        |
| unknown    | Initial state           | First check determines state  |

Warmup Configuration:

| Option                | Default | Description                              |
|-----------------------|---------|------------------------------------------|
| warmup_check_interval | 1s      | Accelerated check interval during warmup |
| max_warmup_duration   | 300s    | Maximum time to stay in accelerated mode |

How it works:

  1. When a backend returns HTTP 503, it enters the warming_up state
  2. Health checks switch to the accelerated interval (default: 1 second)
  3. Once the backend returns HTTP 200, it becomes ready and returns to normal interval
  4. If warmup exceeds max_warmup_duration, the backend is marked as unhealthy

This reduces model availability detection latency from a worst case of roughly 30 seconds (one full normal check interval) to approximately 1 second.
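
The state transitions above can be sketched as follows (a Python illustration; treating non-200/503 codes as failures is an assumption, not documented router behavior):

```python
def next_state(status):
    """Map one health check result to a backend state.

    status: HTTP status code, or None for a connection failure.
    """
    if status is None:
        return "down"           # connection failure
    if status == 503:
        return "warming_up"     # model still loading
    if status == 200:
        return "ready"
    return "down"               # assumption: other codes count as failures


def check_interval(state, normal="30s", warmup="1s"):
    """Only warming_up backends are polled at the accelerated interval."""
    return warmup if state == "warming_up" else normal
```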

Per-Backend Health Check Configuration

Each backend type has sensible default health check endpoints. You can override these defaults with a custom health_check configuration per backend.

Default Health Check Endpoints by Backend Type:

| Backend Type     | Primary Endpoint | Fallback Endpoints | Method | Notes                                  |
|------------------|------------------|--------------------|--------|----------------------------------------|
| openai           | /v1/models       | -                  | GET    | Standard OpenAI endpoint               |
| vllm             | /health          | /v1/models         | GET    | /health available after model load     |
| ollama           | /api/tags        | /                  | GET    | Ollama-specific endpoint               |
| llamacpp         | /health          | /v1/models         | GET    | llama-server endpoint                  |
| mlxcel           | /health          | /v1/models         | GET    | MLxcel server (llama-server compatible)|
| lmstudio         | /v1/models       | /api/v1/models     | GET    | OpenAI-compat + native API             |
| continuum-router | /health          | /v1/models         | GET    | Remote CR / Backend.AI GO              |
| anthropic        | /v1/messages     | -                  | POST   | Accepts 200, 400, 401, 429 as healthy  |
| gemini           | /models          | /v1beta/models     | GET    | Native Gemini endpoint                 |
| azure            | /health          | /v1/models         | GET    | Azure OpenAI endpoint                  |
| generic          | /health          | /v1/models         | GET    | Generic fallback                       |

Fallback Behavior:

When the primary health check endpoint returns HTTP 404, the router automatically tries the fallback endpoints in order. This ensures compatibility with backends that may not implement all standard endpoints.
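
The fallback iteration can be sketched like this (illustrative; `backend_get` is a hypothetical stand-in for the router's HTTP client):

```python
def probe(backend_get, endpoints, accept_status=(200,)):
    """Try each endpoint in order, falling through only on HTTP 404.

    backend_get: callable taking a path and returning an HTTP status code.
    Returns (healthy, endpoint_used).
    """
    for path in endpoints:
        status = backend_get(path)
        if status == 404:
            continue                          # endpoint not implemented -> try next
        return status in accept_status, path  # any other status is a real answer
    return False, None                        # all endpoints returned 404
```

Note that only 404 triggers fallback; a 500 or timeout on the primary endpoint is a genuine failure, not a reason to probe other paths.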

Custom Health Check Configuration:

backends:
  - name: vllm-custom
    type: vllm
    url: http://localhost:8000
    models:
      - my-model
    health_check:
      endpoint: /custom-health          # Primary endpoint
      fallback_endpoints:               # Tried if primary returns 404
        - /health
        - /v1/models
      method: GET                       # HTTP method: GET, POST, or HEAD
      timeout: 10s                      # Override global health check timeout
      accept_status:                    # Status codes indicating healthy
        - 200
        - 204
      warmup_status:                    # Status codes indicating model loading
        - 503

Health Check Configuration Options:

Option Type Default Description
endpoint string Backend-type specific Primary health check endpoint path
fallback_endpoints array Backend-type specific Endpoints to try if primary returns 404
method string GET HTTP method: GET, POST, or HEAD
body object null JSON body for POST requests
accept_status array [200] Status codes indicating the backend is healthy
warmup_status array [503] Status codes indicating the backend is warming up
timeout string Global timeout Override the global health check timeout

Example: Anthropic-Style Health Check:

For backends that use POST requests or accept error codes as healthy indicators:

backends:
  - name: custom-api
    type: generic
    url: http://localhost:9000
    models:
      - custom-model
    health_check:
      endpoint: /api/v1/health
      method: POST
      body:
        check: true
      accept_status:
        - 200
        - 400    # Bad request means server is up
        - 401    # Unauthorized means server is up
        - 429    # Rate limited means server is up

Request Section

Controls request handling behavior:

request:
  timeout: "300s"                  # Maximum request duration
  max_retries: 3                   # Retry attempts for failed requests
  retry_delay: "1s"                # Initial delay between retries

Timeout Considerations:

  • Long timeouts (300s) accommodate slow model inference
  • Streaming requests may take longer than non-streaming
  • Balance between user experience and resource usage

Retry Section

Global retry configuration for resilience:

retry:
  max_attempts: 3                  # Maximum retry attempts
  base_delay: "100ms"              # Base delay between retries
  max_delay: "30s"                 # Cap on retry delays
  exponential_backoff: true        # Use exponential backoff
  jitter: true                     # Add random jitter

Retry Strategy:

  • Exponential backoff: delays increase exponentially (100ms, 200ms, 400ms...)
  • Jitter: adds randomness to prevent thundering herd
  • Max delay: prevents extremely long waits
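
The delay computation can be sketched as follows (a Python illustration; the [0.5, 1.0) jitter factor is one common scheme and an assumption here, not the router's documented formula):

```python
import random


def retry_delay(attempt, base=0.1, cap=30.0, jitter=True, rng=random.random):
    """Delay in seconds before retry `attempt` (0-based).

    Exponential backoff: base * 2^attempt, capped at `cap`.
    With jitter, the delay is scaled by a random factor in [0.5, 1.0)
    so simultaneous clients do not retry in lockstep.
    """
    delay = min(base * (2 ** attempt), cap)
    if jitter:
        delay *= 0.5 + 0.5 * rng()
    return delay
```

With `base_delay: "100ms"` this yields the 100ms, 200ms, 400ms... progression described above, never exceeding `max_delay`.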

Cache Section

Controls caching and optimization:

cache:
  model_cache_ttl: "300s"         # How long to cache model lists
  deduplication_ttl: "60s"        # How long to cache identical requests
  enable_deduplication: true      # Enable request deduplication

Cache Stampede Prevention

The router implements three strategies to prevent cache stampede (thundering herd problem):

  1. Singleflight Pattern: Only one aggregation request runs at a time
  2. Stale-While-Revalidate: Return stale data while refreshing in background
  3. Background Refresh: Proactive cache updates before expiration

Advanced cache configuration:

model_aggregation:
  cache_ttl: 60                     # Cache TTL in seconds (default: 60)
  soft_ttl_ratio: 0.8               # When to trigger background refresh (default: 0.8 = 80%)
  empty_response_base_ttl_seconds: 5   # Base TTL for empty responses
  empty_response_max_ttl_seconds: 60   # Max TTL with exponential backoff
  max_cache_entries: 100            # Maximum cache entries
  background_refresh:
    enabled: true                   # Enable background refresh
    check_interval: 10s             # Check interval

| Option                            | Default | Description                                  |
|-----------------------------------|---------|----------------------------------------------|
| cache_ttl                         | 60s     | Hard TTL: the cache expires after this time  |
| soft_ttl_ratio                    | 0.8     | Soft TTL = cache_ttl * soft_ttl_ratio; the cache is stale but usable between the soft and hard TTL |
| empty_response_base_ttl_seconds   | 5       | Base TTL for empty responses (prevents DoS)  |
| empty_response_max_ttl_seconds    | 60      | Maximum TTL with exponential backoff (base * 2^n) |
| max_cache_entries                 | 100     | Maximum number of cache entries              |
| background_refresh.enabled        | true    | Enable proactive cache refresh               |
| background_refresh.check_interval | 10s     | How often to check cache freshness           |
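
The soft/hard TTL split drives the stale-while-revalidate behavior. A sketch of the freshness decision (illustrative, using the defaults above):

```python
def freshness(age, cache_ttl=60.0, soft_ttl_ratio=0.8):
    """Classify a cache entry by its age in seconds.

    fresh   -> serve from cache
    stale   -> serve from cache, trigger a background refresh
    expired -> fetch synchronously (hard TTL exceeded)
    """
    soft_ttl = cache_ttl * soft_ttl_ratio  # 48s with the defaults
    if age < soft_ttl:
        return "fresh"
    if age < cache_ttl:
        return "stale"
    return "expired"
```

Because stale entries are still served while the refresh runs in the background, clients rarely observe the hard-TTL miss path.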

Cache Benefits:

  • Model caching reduces backend queries
  • Deduplication prevents duplicate processing
  • TTL prevents stale data issues
  • Stampede prevention avoids thundering herd
  • Background refresh ensures cache is always fresh

Response Cache Section

Caches complete LLM responses for deterministic (temperature == 0) requests — both non-streaming and streaming. Repeated identical requests are served from memory without calling the backend.

response_cache:
  enabled: true                     # Enable response caching (default: false)
  backend: memory                   # Cache backend: "memory" (default), "redis", or "tiered"
  capacity: 1000                    # Maximum number of cached responses (LRU eviction)
  ttl: "5m"                         # Time-to-live for cached entries (e.g., "5m", "1h")
  max_response_size: 1048576        # Maximum response body size in bytes (default: 1 MiB)
  max_stream_buffer_size: 10485760  # Maximum streaming buffer size in bytes (default: 10 MiB)

Cache Eligibility

A response is cached only when all of the following conditions hold:

  • response_cache.enabled is true
  • The request's temperature field is 0 or absent (deterministic output)
  • The response body does not exceed max_response_size
  • For non-streaming requests: the response does not contain finish_reason: "error" in any choice
  • For streaming requests: the stream completes successfully (receives the final [DONE] event); interrupted or errored streams are discarded and not cached

Non-eligible requests are forwarded to the backend normally.
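
The non-streaming eligibility rules can be expressed as a single predicate (a sketch, not the router's actual code; request/response field access is illustrative):

```python
import json


def is_cacheable(request, response_body, enabled=True, max_size=1_048_576):
    """Return True when a non-streaming response may be stored in the cache."""
    if not enabled:
        return False                                 # response_cache.enabled
    if request.get("temperature", 0) != 0:
        return False                                 # deterministic requests only
    if len(response_body) > max_size:
        return False                                 # respects max_response_size
    choices = json.loads(response_body).get("choices", [])
    if any(c.get("finish_reason") == "error" for c in choices):
        return False                                 # never cache error responses
    return True
```

An absent `temperature` field defaults to 0 here, matching the "0 or absent" rule above.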

Streaming Cache Behaviour

When caching is enabled and temperature is 0, streaming (stream: true) chat completions are also eligible for caching:

  1. SSE events are buffered in memory alongside normal stream forwarding to the client — there is no observable latency increase.
  2. On successful stream completion (final [DONE] event), the router reconstructs a complete chat.completion JSON object from the buffered chunks and stores it in the cache.
  3. If the stream is interrupted (client disconnect, timeout, backend error), the buffer is discarded and nothing is stored.
  4. On cache hit, the cached response is replayed to the client as a synthetic SSE stream (text/event-stream) with an X-Cache: HIT header.
  5. The max_stream_buffer_size setting limits the streaming buffer. Streams whose total buffered content exceeds this threshold are not cached.

X-Cache Response Header

Every chat completion response includes an X-Cache header:

| Value  | Meaning                                                                                    |
|--------|--------------------------------------------------------------------------------------------|
| HIT    | Response was served from the cache; no backend call was made                               |
| MISS   | Cache was checked but no entry was found; response came from the backend and was stored    |
| BYPASS | Request is not cacheable (e.g., temperature > 0); backend was called, nothing was stored   |

Cache Configuration Options

| Option                    | Default           | Description                                  |
|---------------------------|-------------------|----------------------------------------------|
| enabled                   | false             | Whether response caching is active           |
| backend                   | "memory"          | Cache backend type: "memory", "redis", or "tiered". Switching requires a restart. |
| capacity                  | 1000              | Maximum number of entries; oldest entries are evicted when the limit is reached (LRU). Ignored by the Redis and tiered backends. |
| ttl                       | "5m"              | How long a cached entry remains valid. Supports duration strings: "30s", "5m", "1h" |
| max_response_size         | 1048576           | Non-streaming responses larger than this value (in bytes) are not cached |
| max_stream_buffer_size    | 10485760          | Streaming responses whose accumulated buffer exceeds this value (in bytes) are not cached |
| redis.url                 | --                | Redis/Valkey connection URL (requires the redis-cache build feature and backend: redis) |
| redis.pool_size           | 8                 | Number of Redis connections in the pool      |
| redis.key_prefix          | "cr:resp:"        | Namespace prefix for all cache keys; prevents collisions with other applications on the same Redis instance |
| redis.connect_timeout_ms  | 3000              | Timeout in milliseconds for establishing a new Redis connection |
| redis.command_timeout_ms  | 1000              | Timeout in milliseconds for individual Redis commands |
| redis.tls                 | false             | Whether to use TLS for the Redis connection (or use the rediss:// scheme in the URL) |
| redis.fallback_to_memory  | true              | Fall back to the in-memory cache when Redis is unreachable |
| l1.type                   | "memory"          | L1 tier type for the tiered backend: "memory" or "redis" (redis requires the redis-cache feature) |
| l1.max_value_size         | 1048576           | Maximum value size (bytes) eligible for L1 storage; larger values are stored in L2 only |
| l2.type                   | "s3"              | L2 tier type (currently only "s3" is supported; requires the s3-cache feature) |
| l2.endpoint               | --                | S3-compatible endpoint URL (e.g., https://s3.example.com:8080) |
| l2.bucket                 | --                | S3 bucket name for cache object storage      |
| l2.key_prefix             | "response-cache/" | Object key prefix within the bucket          |
| l2.region                 | "us-east-1"       | AWS-compatible region string                 |
| l2.access_key             | --                | S3 access key; supports ${ENV_VAR} expansion |
| l2.secret_key             | --                | S3 secret key; supports ${ENV_VAR} expansion |
| l2.ttl_override           | --                | Optional TTL override for L2 entries (e.g., "24h"); overrides the global ttl for L2 storage only |
| tiered.promote_on_hit     | true              | Whether L2 hits are promoted back to L1 for faster subsequent access |
| tiered.l1_promotion_ttl   | "5m"              | TTL applied to values promoted from L2 to L1 |

Redis/Valkey Backend

By default the response cache stores entries in process memory. To share the cache across multiple router instances or survive restarts, configure a Redis or Valkey backend.

Build requirement: The binary must be compiled with the redis-cache Cargo feature:

cargo build --release --features redis-cache

Configuration:

response_cache:
  enabled: true
  backend: redis                     # Select Redis backend
  ttl: "5m"
  redis:
    url: "redis://localhost:6379"    # plain TCP
    # url: "rediss://redis.example.com:6380"   # TLS
    # url: "redis://:password@localhost:6379"  # with auth
    pool_size: 8                     # connection pool size (default: 8)
    key_prefix: "cr:resp:"           # key namespace prefix (default: "cr:resp:")
    connect_timeout_ms: 3000         # connection timeout (default: 3000)
    command_timeout_ms: 1000         # per-command timeout (default: 1000)
    tls: false                       # use TLS (default: false)
    fallback_to_memory: true         # fallback to in-memory on failure (default: true)

Automatic failover: When fallback_to_memory is true (the default) and Redis becomes unreachable, the router transparently falls back to an in-memory cache and logs a warning. A background health monitor periodically sends PING commands and switches back to Redis once connectivity is restored. Set fallback_to_memory: false to disable this behaviour -- operations will fail instead of falling back.

Key safety: clear() uses SCAN + DEL with the configured prefix pattern instead of FLUSHDB. Only keys matching the prefix are removed, making it safe to share a Redis instance with other applications.

S3-Compatible Tiered Backend

The tiered backend composes a fast, bounded L1 cache with a virtually unlimited L2 cache backed by an S3-compatible API. Large responses are stored only in L2 to avoid eviction pressure on L1. When an L2 hit occurs, the value is optionally promoted back to L1 for faster subsequent access.

Any S3-compatible storage can be used as L2, including AWS S3, MinIO, Ceph, or VAST Data.

Build requirement: The binary must be compiled with the s3-cache Cargo feature:

cargo build --release --features s3-cache

Tiered cache write path:

  1. Every set() writes to L2 (S3) unconditionally.
  2. If the value is smaller than l1.max_value_size, it is also written to L1.
  3. Values exceeding the threshold are demoted to L2-only; the demotion counter is incremented.

Tiered cache read path:

  1. L1 is checked first. On hit, the value is returned immediately.
  2. On L1 miss, L2 (S3) is queried.
  3. On L2 hit, if tiered.promote_on_hit is true and the value fits in L1, it is promoted back to L1 with tiered.l1_promotion_ttl as the TTL.

TTL enforcement: Each S3 object carries an expires-at metadata field (Unix timestamp). On GET, if the timestamp has passed, the object is deleted lazily and a cache miss is returned. Use l2.ttl_override to keep L2 entries alive longer than the global ttl.
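
The read path with promotion and lazy expiry can be sketched as follows (plain dicts stand in for the real memory/Redis L1 and S3 L2 tiers; illustrative only):

```python
import time


def tiered_get(l1, l2, key, promote_on_hit=True, now=time.time):
    """Tiered read: L1 first, then L2 with lazy TTL enforcement.

    l1: dict key -> value
    l2: dict key -> (value, expires_at)  # expires_at is a Unix timestamp,
                                         # mirroring the S3 expires-at metadata
    """
    if key in l1:
        return l1[key]                   # L1 hit: return immediately
    entry = l2.get(key)
    if entry is None:
        return None                      # miss in both tiers
    value, expires_at = entry
    if now() >= expires_at:
        del l2[key]                      # lazy expiry: delete and report a miss
        return None
    if promote_on_hit:
        l1[key] = value                  # promote the L2 hit back to L1
    return value
```

In the real backend, promotion also honors `l1.max_value_size` and applies `tiered.l1_promotion_ttl`; both are omitted here for brevity.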

Configuration:

response_cache:
  enabled: true
  backend: tiered                    # Select tiered L1/L2 backend
  ttl: "5m"                          # Default TTL for both tiers

  l1:                                # L1 (hot) tier — fast, bounded
    type: memory                     # "memory" or "redis" (redis requires redis-cache feature)
    max_value_size: 1048576          # Values larger than this go to L2 only (default: 1 MiB)

  l2:                                # L2 (warm) tier — S3-compatible storage
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"    # Object key prefix (default: "response-cache/")
    region: "us-east-1"              # Region string (default: "us-east-1")
    access_key: "${S3_ACCESS_KEY}"   # Supports env var expansion
    secret_key: "${S3_SECRET_KEY}"
    ttl_override: "24h"              # Optional: keep L2 entries longer than global TTL

  tiered:                            # Promotion/demotion behavior
    promote_on_hit: true             # Promote L2 hits back to L1 (default: true)
    l1_promotion_ttl: "5m"           # TTL for promoted entries (default: "5m")

Metrics: The s3-cache build adds Prometheus counters for L1/L2 hit rates, promotion/demotion counts, and an S3 operation latency histogram (continuum_cache_s3_latency_seconds).

Using Redis as L1: Combine the redis-cache and s3-cache features to use Redis as the L1 tier:

cargo build --release --features redis-cache,s3-cache
response_cache:
  backend: tiered
  l1:
    type: redis                      # Use Redis as L1 for cross-instance sharing
    max_value_size: 1048576
  redis:                             # Shared Redis config reused by l1
    url: "redis://localhost:6379"
    pool_size: 8
  l2:
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    access_key: "${S3_ACCESS_KEY}"
    secret_key: "${S3_SECRET_KEY}"

Limitations

  • Streaming cache is supported for OpenAI-compatible backends only. Anthropic and Gemini native streaming formats are not cached.
  • The Anthropic (/anthropic/v1/messages) and OpenAI (/v1/chat/completions) endpoints share a single cache instance when response_cache.enabled: true.
  • Without the Redis backend configured, cache is stored in-memory and does not survive server restarts.

KV Cache Index Section

Enables real-time KV cache state tracking for precise overlap-scored routing. When a vLLM backend processes a request, it emits SSE events describing which token prefixes are cached. The router subscribes to these events, maintains an index of cache state across all backends, and uses this data to route requests to the backend most likely to have relevant KV cache data.

kv_cache_index:
  enabled: true                      # Enable KV cache index (default: false)
  backend: memory                    # Index backend: "memory" (default) or "redis"
  max_entries: 100000                # Maximum prefix hash entries tracked
  entry_ttl_seconds: 600             # TTL for index entries in seconds
  scoring:                           # Scoring weights for backend selection
    overlap_weight: 0.6              # Weight for cache overlap signal (0.0-1.0)
    load_weight: 0.3                 # Weight for backend load signal (0.0-1.0)
    health_weight: 0.1               # Weight for backend health signal (0.0-1.0)
    min_overlap_threshold: 0.3       # Minimum overlap to activate KV-aware routing
    gpu_tier_weight: 1.0             # Tier multiplier for GPU-resident (hot) cached data
    storage_tier_weight: 0.6         # Tier multiplier for storage-offloaded (warm) cached data
  storage_offloading:                # Tiered storage awareness (GPU hot / storage warm)
    enabled: false                   # Enable storage tier tracking (default: false)
    treat_eviction_as_offload: true  # Treat eviction as offload to warm storage (default: true)
  event_sources:                     # vLLM backend KV event stream endpoints
    - backend_name: "vllm-1"
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: "vllm-2"
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000

How It Works

  1. Event consumption: The router subscribes to SSE event streams from configured vLLM backends. Each event reports a CacheCreated, CacheEvicted, CacheOffloaded, CacheReloaded, or CachePurged action with a prefix hash and token count.
  2. Index update: Events update an in-memory (or Redis-backed) index mapping prefix hashes to backends with their cached token counts and storage tier (GpuHot or StorageWarm).
  3. Scoring: On each request, the overlap scorer queries the index for the request's prefix key and computes a weighted score combining cache overlap (adjusted by tier multiplier), backend load, and backend health.
  4. Routing decision: If the best score exceeds min_overlap_threshold, the scorer selects that backend. Otherwise, the configured selection strategy (e.g., PrefixAwareHash) is used as fallback.
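
Steps 1-2 amount to applying each event to a prefix-hash index. A Python sketch (event field names and lower-cased action strings are assumptions for illustration, not the wire format):

```python
def apply_event(index, event, treat_eviction_as_offload=True,
                storage_offloading=True):
    """Apply one vLLM KV event to the index.

    index: {(prefix_hash, backend): {"tokens": int, "tier": str}}
    event: {"action": ..., "prefix_hash": ..., "backend": ..., "tokens": ...}
    """
    key = (event["prefix_hash"], event["backend"])
    action = event["action"]
    if action == "cache_created":
        index[key] = {"tokens": event["tokens"], "tier": "GpuHot"}
    elif action == "cache_reloaded":
        if key in index:
            index[key]["tier"] = "GpuHot"       # back in GPU memory
    elif action == "cache_offloaded":
        if storage_offloading and key in index:
            index[key]["tier"] = "StorageWarm"  # moved to warm storage
    elif action == "cache_evicted":
        if storage_offloading and treat_eviction_as_offload:
            if key in index:
                index[key]["tier"] = "StorageWarm"  # downgrade, keep entry
        else:
            index.pop(key, None)                # remove entirely
    elif action == "cache_purged":
        index.pop(key, None)
```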

Scoring Formula

final_score = overlap_weight * (raw_overlap * tier_multiplier)
            + load_weight * (1 - load_ratio)
            + health_weight * health_score

Where:

  • raw_overlap = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
  • tier_multiplier = gpu_tier_weight for GPU-resident data, or storage_tier_weight for offloaded data
  • load_ratio = inflight_requests / max_inflight (0.0 to 1.0)
  • health_score = backend success_rate (0.0 to 1.0)

When storage_offloading.enabled is false, all data is treated as GpuHot and tier_multiplier always equals gpu_tier_weight (default: 1.0).
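
A direct transcription of the formula, using the default weights from the configuration above (a sketch for reasoning about scores, not the router's code):

```python
def kv_score(backend_tokens, max_tokens, inflight, max_inflight, success_rate,
             tier="GpuHot", overlap_weight=0.6, load_weight=0.3,
             health_weight=0.1, gpu_tier_weight=1.0, storage_tier_weight=0.6):
    """Weighted KV-routing score combining overlap, load, and health."""
    raw_overlap = backend_tokens / max_tokens if max_tokens else 0.0
    tier_multiplier = gpu_tier_weight if tier == "GpuHot" else storage_tier_weight
    load_ratio = inflight / max_inflight if max_inflight else 0.0
    return (overlap_weight * raw_overlap * tier_multiplier
            + load_weight * (1 - load_ratio)
            + health_weight * success_rate)
```

An idle, healthy backend with full GPU-resident overlap scores 1.0; the same overlap offloaded to warm storage scores 0.6 * 0.6 + 0.3 + 0.1 = 0.76.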

Configuration Options

| Option                                        | Default  | Description                           |
|-----------------------------------------------|----------|---------------------------------------|
| enabled                                       | false    | Whether the KV cache index is active  |
| backend                                       | "memory" | Index storage backend: "memory" or "redis". Redis requires the redis-cache build feature and the shared Redis config from response_cache.redis. |
| max_entries                                   | 100000   | Maximum number of prefix hash entries; LRU eviction kicks in at capacity |
| entry_ttl_seconds                             | 600      | TTL for index entries; stale entries are cleaned up periodically |
| scoring.overlap_weight                        | 0.6      | Weight for the cache overlap signal; higher values favor cache affinity |
| scoring.load_weight                           | 0.3      | Weight for the backend load signal; higher values favor less-loaded backends |
| scoring.health_weight                         | 0.1      | Weight for the backend health signal; higher values favor healthier backends |
| scoring.min_overlap_threshold                 | 0.3      | Minimum overlap score to activate KV-aware routing; if no backend exceeds this, the fallback strategy is used |
| scoring.gpu_tier_weight                       | 1.0      | Multiplier applied to the overlap score when data is GPU-resident (GpuHot); effective only when storage_offloading.enabled is true |
| scoring.storage_tier_weight                   | 0.6      | Multiplier applied to the overlap score when data is storage-offloaded (StorageWarm); effective only when storage_offloading.enabled is true |
| storage_offloading.enabled                    | false    | Enable storage tier tracking; when false, eviction events remove entries entirely |
| storage_offloading.treat_eviction_as_offload  | true     | When true, cache_evicted events downgrade the entry from GpuHot to StorageWarm instead of removing it; only effective when storage_offloading.enabled is true |
| event_sources[].backend_name                  | --       | Name of the backend (must match a configured backend name) |
| event_sources[].endpoint                      | --       | URL of the vLLM KV event SSE endpoint (http, https, ws, or wss) |
| event_sources[].reconnect_interval_ms         | 5000     | Reconnect delay after connection loss |

Redis Backend

When backend: redis, the KV index uses the shared Redis connection from response_cache.redis. This enables multiple router instances to share the same index for consistent routing decisions across the fleet.

response_cache:
  redis:
    url: "redis://localhost:6379"
    pool_size: 8

kv_cache_index:
  enabled: true
  backend: redis    # Uses the Redis config from response_cache.redis

Admin Endpoints

  • GET /admin/kv-index/stats — Index size, event rates, connection status, scoring distribution
  • GET /admin/kv-index/backends — Per-backend cache state summary
  • POST /admin/kv-index/clear — Clear the index (for debugging)

See Admin API for full response schemas.

Limitations

  • Event sources currently support only vLLM-compatible SSE endpoints.
  • Changing backend between memory and redis requires a restart.
  • The KV index is an optional enhancement — its unavailability does not break routing.

Logging Section

Configures logging output:

logging:
  level: "info"                   # trace, debug, info, warn, error
  format: "json"                  # json, pretty
  enable_colors: false            # Colored output (pretty format only)

Log Levels:

  • trace: Extremely verbose, includes all details
  • debug: Detailed debugging information
  • info: General operational information
  • warn: Warning messages and potential issues
  • error: Error conditions only

Log Formats:

  • json: Structured JSON logging (recommended for production)
  • pretty: Human-readable format (good for development)