Health, Caching & Logging¶
Health Checks Section¶
Configures backend health monitoring:
```yaml
health_checks:
  enabled: true                # Enable/disable health monitoring
  interval: "30s"              # Check frequency
  timeout: "10s"               # Request timeout
  unhealthy_threshold: 3       # Failures before marking unhealthy
  healthy_threshold: 2         # Successes before marking healthy
  endpoint: "/v1/models"       # Endpoint to check
  warmup_check_interval: "1s"  # Accelerated interval during warmup
  max_warmup_duration: "300s"  # Maximum warmup detection duration
```
Health Check Process:
1. Router queries the health endpoint on each backend
2. Successful responses increment the success counter
3. Failed responses increment the failure counter
4. Backends are marked unhealthy after reaching the failure threshold
5. Backends are marked healthy after reaching the success threshold
6. Only healthy backends receive traffic
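The threshold logic above can be sketched as a small state tracker. This is an illustrative Python sketch with hypothetical names, not the router's actual implementation:

```python
class BackendHealth:
    """Tracks consecutive successes/failures and flips health state at thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True       # backends start out eligible for traffic
        self.successes = 0
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one health-check result and return the current health state."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True    # enough consecutive successes: back in rotation
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False   # enough consecutive failures: out of rotation
        return self.healthy
```

With the defaults, a backend drops out after three consecutive failures and rejoins after two consecutive successes.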
Accelerated Warmup Health Checks¶
The router supports accelerated health checks during backend warmup, which is particularly useful for backends like llama.cpp that return HTTP 503 while loading models.
Backend States:
| State | HTTP Response | Behavior |
|---|---|---|
| ready | 200 OK | Normal interval checks |
| warming_up | 503 Service Unavailable | Accelerated interval checks |
| down | Connection failure | Normal interval checks |
| unknown | Initial state | First check determines state |
Warmup Configuration:
| Option | Default | Description |
|---|---|---|
| warmup_check_interval | 1s | Accelerated check interval during warmup |
| max_warmup_duration | 300s | Maximum time to stay in accelerated mode |
How it works:
- When a backend returns HTTP 503, it enters the `warming_up` state
- Health checks switch to the accelerated interval (default: 1 second)
- Once the backend returns HTTP 200, it becomes `ready` and returns to the normal interval
- If warmup exceeds `max_warmup_duration`, the backend is marked as unhealthy
This reduces model availability detection latency from up to 30 seconds (worst case) to approximately 1 second.
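The interval selection described above amounts to a simple function of backend state. A minimal sketch, assuming the defaults shown in the configuration (function and parameter names are hypothetical):

```python
def next_check_interval(state: str, warmup_elapsed_s: float,
                        normal_s: float = 30.0, warmup_s: float = 1.0,
                        max_warmup_s: float = 300.0) -> float:
    """Pick the delay before the next health check based on backend state."""
    if state == "warming_up" and warmup_elapsed_s < max_warmup_s:
        return warmup_s   # accelerated polling while the model loads
    return normal_s       # ready, down, and unknown all use the normal interval
```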
Per-Backend Health Check Configuration¶
Each backend type has sensible default health check endpoints. You can override these defaults with a custom health_check configuration per backend.
Default Health Check Endpoints by Backend Type:
| Backend Type | Primary Endpoint | Fallback Endpoints | Method | Notes |
|---|---|---|---|---|
| openai | /v1/models | - | GET | Standard OpenAI endpoint |
| vllm | /health | /v1/models | GET | /health available after model load |
| ollama | /api/tags | / | GET | Ollama-specific endpoint |
| llamacpp | /health | /v1/models | GET | llama-server endpoint |
| mlxcel | /health | /v1/models | GET | MLxcel server (llama-server compatible) |
| lmstudio | /v1/models | /api/v1/models | GET | OpenAI-compat + native API |
| continuum-router | /health | /v1/models | GET | Remote CR / Backend.AI GO |
| anthropic | /v1/messages | - | POST | Accepts 200, 400, 401, 429 as healthy |
| gemini | /models | /v1beta/models | GET | Native Gemini endpoint |
| azure | /health | /v1/models | GET | Azure OpenAI endpoint |
| generic | /health | /v1/models | GET | Generic fallback |
Fallback Behavior:
When the primary health check endpoint returns HTTP 404, the router automatically tries the fallback endpoints in order. This ensures compatibility with backends that may not implement all standard endpoints.
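The fallback walk can be sketched as a small helper. This is illustrative only; `check` stands in for whatever HTTP probe the router performs and returns a status code:

```python
def probe_with_fallback(check, endpoints):
    """Try endpoints in order; a 404 moves on to the next fallback.

    Any other status (200, 503, ...) is treated as the backend's answer,
    since only 404 means "this endpoint does not exist here".
    """
    status = None
    for ep in endpoints:
        status = check(ep)
        if status != 404:
            return ep, status   # first endpoint that exists decides health
    return endpoints[-1], status
```

For example, a vLLM backend that has not implemented `/health` would be probed at `/v1/models` instead.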
Custom Health Check Configuration:
```yaml
backends:
  - name: vllm-custom
    type: vllm
    url: http://localhost:8000
    models:
      - my-model
    health_check:
      endpoint: /custom-health   # Primary endpoint
      fallback_endpoints:        # Tried if primary returns 404
        - /health
        - /v1/models
      method: GET                # HTTP method: GET, POST, or HEAD
      timeout: 10s               # Override global health check timeout
      accept_status:             # Status codes indicating healthy
        - 200
        - 204
      warmup_status:             # Status codes indicating model loading
        - 503
```
Health Check Configuration Options:
| Option | Type | Default | Description |
|---|---|---|---|
| endpoint | string | Backend-type specific | Primary health check endpoint path |
| fallback_endpoints | array | Backend-type specific | Endpoints to try if primary returns 404 |
| method | string | GET | HTTP method: GET, POST, or HEAD |
| body | object | null | JSON body for POST requests |
| accept_status | array | [200] | Status codes indicating the backend is healthy |
| warmup_status | array | [503] | Status codes indicating the backend is warming up |
| timeout | string | Global timeout | Override the global health check timeout |
Example: Anthropic-Style Health Check:
For backends that use POST requests or accept error codes as healthy indicators:
```yaml
backends:
  - name: custom-api
    type: generic
    url: http://localhost:9000
    models:
      - custom-model
    health_check:
      endpoint: /api/v1/health
      method: POST
      body:
        check: true
      accept_status:
        - 200
        - 400   # Bad request means server is up
        - 401   # Unauthorized means server is up
        - 429   # Rate limited means server is up
```
Request Section¶
Controls request handling behavior:
```yaml
request:
  timeout: "300s"     # Maximum request duration
  max_retries: 3      # Retry attempts for failed requests
  retry_delay: "1s"   # Initial delay between retries
```
Timeout Considerations:
- Long timeouts (300s) accommodate slow model inference
- Streaming requests may take longer than non-streaming
- Balance between user experience and resource usage
Retry Section¶
Global retry configuration for resilience:
```yaml
retry:
  max_attempts: 3            # Maximum retry attempts
  base_delay: "100ms"        # Base delay between retries
  max_delay: "30s"           # Cap on retry delays
  exponential_backoff: true  # Use exponential backoff
  jitter: true               # Add random jitter
```
Retry Strategy:
- Exponential backoff: delays increase exponentially (100ms, 200ms, 400ms...)
- Jitter: adds randomness to prevent thundering herd
- Max delay: prevents extremely long waits
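The delay computation above can be sketched as follows. The exact jitter scheme is not specified in this document, so the random scaling below is an assumption:

```python
import random

def retry_delay(attempt: int, base: float = 0.1, max_delay: float = 30.0,
                exponential: bool = True, jitter: bool = True) -> float:
    """Delay in seconds before retry `attempt` (0-based)."""
    delay = base * (2 ** attempt) if exponential else base
    delay = min(delay, max_delay)   # cap prevents extremely long waits
    if jitter:
        # Assumption: full jitter in [0, delay]; spreads retries to avoid
        # a thundering herd of simultaneous re-requests.
        delay = random.uniform(0, delay)
    return delay
```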
Cache Section¶
Controls caching and optimization:
```yaml
cache:
  model_cache_ttl: "300s"      # How long to cache model lists
  deduplication_ttl: "60s"     # How long to cache identical requests
  enable_deduplication: true   # Enable request deduplication
```
Cache Stampede Prevention¶
The router implements three strategies to prevent cache stampede (thundering herd problem):
- Singleflight Pattern: Only one aggregation request runs at a time
- Stale-While-Revalidate: Return stale data while refreshing in background
- Background Refresh: Proactive cache updates before expiration
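The stale-while-revalidate strategy boils down to classifying an entry by its age relative to the soft and hard TTLs. A sketch, using the `cache_ttl` and `soft_ttl_ratio` options defined below (function name is hypothetical):

```python
def cache_state(age_s: float, cache_ttl: float = 60.0,
                soft_ttl_ratio: float = 0.8) -> str:
    """Classify a cache entry as fresh, stale-but-servable, or expired."""
    soft_ttl = cache_ttl * soft_ttl_ratio
    if age_s < soft_ttl:
        return "fresh"
    if age_s < cache_ttl:
        return "stale"    # serve it, and trigger a background refresh
    return "expired"      # must refetch; singleflight lets only one refetch run
```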
Advanced cache configuration:
```yaml
model_aggregation:
  cache_ttl: 60                        # Cache TTL in seconds (default: 60)
  soft_ttl_ratio: 0.8                  # When to trigger background refresh (default: 0.8 = 80%)
  empty_response_base_ttl_seconds: 5   # Base TTL for empty responses
  empty_response_max_ttl_seconds: 60   # Max TTL with exponential backoff
  max_cache_entries: 100               # Maximum cache entries
  background_refresh:
    enabled: true                      # Enable background refresh
    check_interval: 10s                # Check interval
```
| Option | Default | Description |
|---|---|---|
| cache_ttl | 60s | Hard TTL - cache expires after this time |
| soft_ttl_ratio | 0.8 | Soft TTL = cache_ttl * soft_ttl_ratio. Cache is stale but usable between soft and hard TTL |
| empty_response_base_ttl_seconds | 5 | Base TTL for empty responses (prevents DoS) |
| empty_response_max_ttl_seconds | 60 | Maximum TTL with exponential backoff (base * 2^n) |
| max_cache_entries | 100 | Maximum number of cache entries |
| background_refresh.enabled | true | Enable proactive cache refresh |
| background_refresh.check_interval | 10s | How often to check cache freshness |
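The empty-response backoff in the table is a plain `base * 2^n` capped at the maximum. A one-line sketch:

```python
def empty_response_ttl(consecutive_empty: int, base: int = 5, max_ttl: int = 60) -> int:
    """TTL for an empty aggregation result: exponential backoff, capped.

    Prevents repeatedly hammering backends that keep returning nothing.
    """
    return min(base * (2 ** consecutive_empty), max_ttl)
```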
Cache Benefits:
- Model caching reduces backend queries
- Deduplication prevents duplicate processing
- TTL prevents stale data issues
- Stampede prevention avoids thundering herd
- Background refresh ensures cache is always fresh
Response Cache Section¶
Caches complete LLM responses for deterministic (temperature == 0) requests — both non-streaming and streaming. Repeated identical requests are served from memory without calling the backend.
```yaml
response_cache:
  enabled: true                      # Enable response caching (default: false)
  backend: memory                    # Cache backend: "memory" (default), "redis", or "tiered"
  capacity: 1000                     # Maximum number of cached responses (LRU eviction)
  ttl: "5m"                          # Time-to-live for cached entries (e.g., "5m", "1h")
  max_response_size: 1048576         # Maximum response body size in bytes (default: 1 MiB)
  max_stream_buffer_size: 10485760   # Maximum streaming buffer size in bytes (default: 10 MiB)
```
Cache Eligibility¶
A response is cached only when all of the following conditions hold:
- `response_cache.enabled` is `true`
- The request's `temperature` field is `0` or absent (deterministic output)
- The response body does not exceed `max_response_size`
- For non-streaming requests: the response does not contain `finish_reason: "error"` in any choice
- For streaming requests: the stream completes successfully (receives the final `[DONE]` event); interrupted or errored streams are discarded and not cached
Non-eligible requests are forwarded to the backend normally.
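The non-streaming rules above can be expressed as a single predicate. An illustrative sketch only, using simplified dict shapes for the request and response:

```python
def is_cacheable(request: dict, response: dict,
                 enabled: bool = True, max_response_size: int = 1_048_576) -> bool:
    """Mirror of the eligibility rules for non-streaming responses."""
    if not enabled:
        return False
    if request.get("temperature", 0) != 0:       # deterministic requests only
        return False
    if len(response.get("body", "")) > max_response_size:
        return False
    if any(c.get("finish_reason") == "error"     # error responses are never cached
           for c in response.get("choices", [])):
        return False
    return True
```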
Streaming Cache Behaviour¶
When caching is enabled and temperature is 0, streaming (stream: true) chat completions are also eligible for caching:
- SSE events are buffered in memory alongside normal stream forwarding to the client; there is no observable latency increase.
- On successful stream completion (final `[DONE]` event), the router reconstructs a complete `chat.completion` JSON object from the buffered chunks and stores it in the cache.
- If the stream is interrupted (client disconnect, timeout, backend error), the buffer is discarded and nothing is stored.
- On a cache hit, the cached response is replayed to the client as a synthetic SSE stream (`text/event-stream`) with an `X-Cache: HIT` header.
- The `max_stream_buffer_size` setting limits the streaming buffer. Streams whose total buffered content exceeds this threshold are not cached.
X-Cache Response Header¶
Every chat completion response includes an X-Cache header:
| Value | Meaning |
|---|---|
| HIT | Response was served from the cache; no backend call was made |
| MISS | Cache was checked but no entry was found; response came from the backend and was stored |
| BYPASS | Request is not cacheable (e.g., temperature > 0); backend was called, nothing was stored |
Cache Configuration Options¶
| Option | Default | Description |
|---|---|---|
| enabled | false | Whether response caching is active |
| backend | "memory" | Cache backend type: "memory", "redis", or "tiered". Switching requires a restart. |
| capacity | 1000 | Maximum number of entries; oldest entries are evicted when the limit is reached (LRU). Ignored when using the Redis or tiered backend. |
| ttl | "5m" | How long a cached entry remains valid. Supports duration strings: "30s", "5m", "1h" |
| max_response_size | 1048576 | Non-streaming responses larger than this value (in bytes) are not cached |
| max_stream_buffer_size | 10485760 | Streaming responses whose accumulated buffer exceeds this value (in bytes) are not cached (default: 10 MiB) |
| redis.url | -- | Redis/Valkey connection URL (requires redis-cache build feature and backend: redis) |
| redis.pool_size | 8 | Number of Redis connections in the pool |
| redis.key_prefix | "cr:resp:" | Namespace prefix for all cache keys. Prevents collisions with other applications on the same Redis instance. |
| redis.connect_timeout_ms | 3000 | Timeout in milliseconds for establishing a new Redis connection |
| redis.command_timeout_ms | 1000 | Timeout in milliseconds for individual Redis commands |
| redis.tls | false | Whether to use TLS for the Redis connection (or use rediss:// scheme in URL) |
| redis.fallback_to_memory | true | Fall back to in-memory cache when Redis is unreachable |
| l1.type | "memory" | L1 tier type for tiered backend: "memory" or "redis" (requires redis-cache feature) |
| l1.max_value_size | 1048576 | Maximum value size (bytes) eligible for L1 storage. Larger values are stored in L2 only. |
| l2.type | "s3" | L2 tier type (currently only "s3" is supported; requires s3-cache feature) |
| l2.endpoint | -- | S3-compatible endpoint URL (e.g., https://s3.example.com:8080) |
| l2.bucket | -- | S3 bucket name for cache object storage |
| l2.key_prefix | "response-cache/" | Object key prefix within the bucket |
| l2.region | "us-east-1" | AWS-compatible region string |
| l2.access_key | -- | S3 access key. Supports ${ENV_VAR} expansion. |
| l2.secret_key | -- | S3 secret key. Supports ${ENV_VAR} expansion. |
| l2.ttl_override | -- | Optional TTL override for L2 entries (e.g., "24h"). Overrides the global ttl for L2 storage only. |
| tiered.promote_on_hit | true | Whether L2 hits are promoted back to L1 for faster subsequent access |
| tiered.l1_promotion_ttl | "5m" | TTL applied to values promoted from L2 to L1 |
Redis/Valkey Backend¶
By default the response cache stores entries in process memory. To share the cache across multiple router instances or survive restarts, configure a Redis or Valkey backend.
Build requirement: The binary must be compiled with the `redis-cache` Cargo feature (e.g., `cargo build --release --features redis-cache`).
Configuration:
```yaml
response_cache:
  enabled: true
  backend: redis                              # Select Redis backend
  ttl: "5m"
  redis:
    url: "redis://localhost:6379"             # plain TCP
    # url: "rediss://redis.example.com:6380"  # TLS
    # url: "redis://:password@localhost:6379" # with auth
    pool_size: 8                              # connection pool size (default: 8)
    key_prefix: "cr:resp:"                    # key namespace prefix (default: "cr:resp:")
    connect_timeout_ms: 3000                  # connection timeout (default: 3000)
    command_timeout_ms: 1000                  # per-command timeout (default: 1000)
    tls: false                                # use TLS (default: false)
    fallback_to_memory: true                  # fallback to in-memory on failure (default: true)
```
Automatic failover: When fallback_to_memory is true (the default) and Redis becomes unreachable, the router transparently falls back to an in-memory cache and logs a warning. A background health monitor periodically sends PING commands and switches back to Redis once connectivity is restored. Set fallback_to_memory: false to disable this behaviour -- operations will fail instead of falling back.
Key safety: clear() uses SCAN + DEL with the configured prefix pattern instead of FLUSHDB. Only keys matching the prefix are removed, making it safe to share a Redis instance with other applications.
S3-Compatible Tiered Backend¶
The tiered backend composes a fast, bounded L1 cache with a virtually unlimited L2 cache backed by an S3-compatible API. Large responses are stored only in L2 to avoid eviction pressure on L1. When an L2 hit occurs, the value is optionally promoted back to L1 for faster subsequent access.
Any S3-compatible storage can be used as L2, including AWS S3, MinIO, Ceph, or VAST Data.
Build requirement: The binary must be compiled with the `s3-cache` Cargo feature (e.g., `cargo build --release --features s3-cache`).
Tiered cache write path:
- Every `set()` writes to L2 (S3) unconditionally.
- If the value is smaller than `l1.max_value_size`, it is also written to L1.
- Values exceeding the threshold are demoted to L2-only; the demotion counter is incremented.
Tiered cache read path:
- L1 is checked first. On hit, the value is returned immediately.
- On L1 miss, L2 (S3) is queried.
- On L2 hit, if `tiered.promote_on_hit` is `true` and the value fits in L1, it is promoted back to L1 with `tiered.l1_promotion_ttl` as the TTL.
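The write and read paths can be sketched together in a few lines. This is an illustrative model with in-memory dicts standing in for the real L1 cache and S3 bucket; names are assumptions, not the router's actual types:

```python
class TieredCache:
    """L1/L2 sketch: every set() goes to L2; small values also go to L1."""

    def __init__(self, l1_max_value_size: int = 1_048_576, promote_on_hit: bool = True):
        self.l1_max = l1_max_value_size
        self.promote_on_hit = promote_on_hit
        self.l1, self.l2 = {}, {}   # dicts stand in for memory/Redis and S3
        self.demotions = 0

    def set(self, key: str, value: bytes) -> None:
        self.l2[key] = value          # L2 write is unconditional
        if len(value) < self.l1_max:
            self.l1[key] = value      # small values are kept hot in L1 too
        else:
            self.demotions += 1       # large value: L2-only

    def get(self, key: str):
        if key in self.l1:
            return self.l1[key]       # L1 hit: fastest path
        value = self.l2.get(key)      # L1 miss: consult L2
        if value is not None and self.promote_on_hit and len(value) < self.l1_max:
            self.l1[key] = value      # promote the L2 hit back to L1
        return value
```

Storing large responses in L2 only keeps them from evicting many small, frequently hit entries out of the bounded L1 tier.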
TTL enforcement: Each S3 object carries an expires-at metadata field (Unix timestamp). On GET, if the timestamp has passed, the object is deleted lazily and a cache miss is returned. Use l2.ttl_override to keep L2 entries alive longer than the global ttl.
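The lazy expiry check could look like this (a sketch, assuming each stored entry carries a `(value, expires_at)` pair, as the S3 objects carry an expires-at metadata field):

```python
def get_with_lazy_expiry(store: dict, key: str, now: float):
    """Return the value if its expires-at timestamp is still in the future.

    Expired entries are deleted on read and reported as a cache miss,
    mirroring the lazy TTL enforcement on S3 GET.
    """
    entry = store.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if now >= expires_at:
        del store[key]   # lazy deletion: no background sweeper needed
        return None
    return value
```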
Configuration:
```yaml
response_cache:
  enabled: true
  backend: tiered                    # Select tiered L1/L2 backend
  ttl: "5m"                          # Default TTL for both tiers
  l1:                                # L1 (hot) tier — fast, bounded
    type: memory                     # "memory" or "redis" (redis requires redis-cache feature)
    max_value_size: 1048576          # Values larger than this go to L2 only (default: 1 MiB)
  l2:                                # L2 (warm) tier — S3-compatible storage
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"    # Object key prefix (default: "response-cache/")
    region: "us-east-1"              # Region string (default: "us-east-1")
    access_key: "${S3_ACCESS_KEY}"   # Supports env var expansion
    secret_key: "${S3_SECRET_KEY}"
    ttl_override: "24h"              # Optional: keep L2 entries longer than global TTL
  tiered:                            # Promotion/demotion behavior
    promote_on_hit: true             # Promote L2 hits back to L1 (default: true)
    l1_promotion_ttl: "5m"           # TTL for promoted entries (default: "5m")
```
Metrics: The s3-cache build adds Prometheus counters for L1/L2 hit rates, promotion/demotion counts, and an S3 operation latency histogram (continuum_cache_s3_latency_seconds).
Using Redis as L1: Combine the redis-cache and s3-cache features to use Redis as the L1 tier:
```yaml
response_cache:
  backend: tiered
  l1:
    type: redis              # Use Redis as L1 for cross-instance sharing
    max_value_size: 1048576
  redis:                     # Shared Redis config reused by l1
    url: "redis://localhost:6379"
    pool_size: 8
  l2:
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    access_key: "${S3_ACCESS_KEY}"
    secret_key: "${S3_SECRET_KEY}"
```
Limitations¶
- Streaming cache is supported for OpenAI-compatible backends only. Anthropic and Gemini native streaming formats are not cached.
- The Anthropic (`/anthropic/v1/messages`) and OpenAI (`/v1/chat/completions`) endpoints share the same cache instance when `response_cache.enabled: true`.
- Without the Redis backend configured, the cache is stored in memory and does not survive server restarts.
KV Cache Index Section¶
Enables real-time KV cache state tracking for precise overlap-scored routing. When a vLLM backend processes a request, it emits SSE events describing which token prefixes are cached. The router subscribes to these events, maintains an index of cache state across all backends, and uses this data to route requests to the backend most likely to have relevant KV cache data.
```yaml
kv_cache_index:
  enabled: true                       # Enable KV cache index (default: false)
  backend: memory                     # Index backend: "memory" (default) or "redis"
  max_entries: 100000                 # Maximum prefix hash entries tracked
  entry_ttl_seconds: 600              # TTL for index entries in seconds
  scoring:                            # Scoring weights for backend selection
    overlap_weight: 0.6               # Weight for cache overlap signal (0.0-1.0)
    load_weight: 0.3                  # Weight for backend load signal (0.0-1.0)
    health_weight: 0.1                # Weight for backend health signal (0.0-1.0)
    min_overlap_threshold: 0.3        # Minimum overlap to activate KV-aware routing
    gpu_tier_weight: 1.0              # Tier multiplier for GPU-resident (hot) cached data
    storage_tier_weight: 0.6          # Tier multiplier for storage-offloaded (warm) cached data
  storage_offloading:                 # Tiered storage awareness (GPU hot / storage warm)
    enabled: false                    # Enable storage tier tracking (default: false)
    treat_eviction_as_offload: true   # Treat eviction as offload to warm storage (default: true)
  event_sources:                      # vLLM backend KV event stream endpoints
    - backend_name: "vllm-1"
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: "vllm-2"
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
```
How It Works¶
1. Event consumption: The router subscribes to SSE event streams from configured vLLM backends. Each event reports a `CacheCreated`, `CacheEvicted`, `CacheOffloaded`, `CacheReloaded`, or `CachePurged` action with a prefix hash and token count.
2. Index update: Events update an in-memory (or Redis-backed) index mapping prefix hashes to backends with their cached token counts and storage tier (`GpuHot` or `StorageWarm`).
3. Scoring: On each request, the overlap scorer queries the index for the request's prefix key and computes a weighted score combining cache overlap (adjusted by the tier multiplier), backend load, and backend health.
4. Routing decision: If the best score exceeds `min_overlap_threshold`, the scorer selects that backend. Otherwise, the configured selection strategy (e.g., `PrefixAwareHash`) is used as a fallback.
Scoring Formula¶
```
final_score = overlap_weight * (raw_overlap * tier_multiplier)
            + load_weight   * (1 - load_ratio)
            + health_weight * health_score
```
Where:
- `raw_overlap` = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
- `tier_multiplier` = `gpu_tier_weight` for GPU-resident data, or `storage_tier_weight` for offloaded data
- `load_ratio` = inflight_requests / max_inflight (0.0 to 1.0)
- `health_score` = backend success_rate (0.0 to 1.0)
When storage_offloading.enabled is false, all data is treated as GpuHot and tier_multiplier always equals gpu_tier_weight (default: 1.0).
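The formula can be checked numerically with a small helper. Illustrative only; parameter names mirror the config options:

```python
def kv_score(token_count: int, max_token_count: int,
             inflight: int, max_inflight: int, success_rate: float,
             gpu_resident: bool = True,
             overlap_weight: float = 0.6, load_weight: float = 0.3,
             health_weight: float = 0.1,
             gpu_tier_weight: float = 1.0, storage_tier_weight: float = 0.6) -> float:
    """Weighted backend score combining cache overlap, load, and health."""
    raw_overlap = token_count / max_token_count if max_token_count else 0.0
    tier = gpu_tier_weight if gpu_resident else storage_tier_weight
    load_ratio = inflight / max_inflight if max_inflight else 0.0
    return (overlap_weight * raw_overlap * tier
            + load_weight * (1 - load_ratio)
            + health_weight * success_rate)
```

An idle, fully healthy backend holding the full prefix in GPU memory scores 1.0; the same backend with the prefix offloaded to warm storage scores lower, since only the overlap term is discounted by the tier multiplier.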
Configuration Options¶
| Option | Default | Description |
|---|---|---|
| enabled | false | Whether KV cache index is active |
| backend | "memory" | Index storage backend: "memory" or "redis". Redis requires the redis-cache build feature and the shared Redis config from response_cache.redis. |
| max_entries | 100000 | Maximum number of prefix hash entries. LRU eviction kicks in at capacity. |
| entry_ttl_seconds | 600 | TTL for index entries. Stale entries are cleaned up periodically. |
| scoring.overlap_weight | 0.6 | Weight for the cache overlap signal. Higher values favor cache affinity. |
| scoring.load_weight | 0.3 | Weight for the backend load signal. Higher values favor less-loaded backends. |
| scoring.health_weight | 0.1 | Weight for the backend health signal. Higher values favor healthier backends. |
| scoring.min_overlap_threshold | 0.3 | Minimum overlap score to activate KV-aware routing. If no backend exceeds this, the fallback strategy is used. |
| scoring.gpu_tier_weight | 1.0 | Multiplier applied to the overlap score when data is GPU-resident (GpuHot). Effective only when storage_offloading.enabled is true. |
| scoring.storage_tier_weight | 0.6 | Multiplier applied to the overlap score when data is storage-offloaded (StorageWarm). Effective only when storage_offloading.enabled is true. |
| storage_offloading.enabled | false | Enable storage tier tracking. When false, eviction events remove entries entirely. |
| storage_offloading.treat_eviction_as_offload | true | When true, cache_evicted events downgrade the entry from GpuHot to StorageWarm instead of removing it. Only effective when storage_offloading.enabled is true. |
| event_sources[].backend_name | -- | Name of the backend (must match a configured backend name) |
| event_sources[].endpoint | -- | URL of the vLLM KV event SSE endpoint (http, https, ws, or wss) |
| event_sources[].reconnect_interval_ms | 5000 | Reconnect delay after connection loss |
Redis Backend¶
When backend: redis, the KV index uses the shared Redis connection from response_cache.redis. This enables multiple router instances to share the same index for consistent routing decisions across the fleet.
```yaml
response_cache:
  redis:
    url: "redis://localhost:6379"
    pool_size: 8

kv_cache_index:
  enabled: true
  backend: redis   # Uses the Redis config from response_cache.redis
```
Admin Endpoints¶
- `GET /admin/kv-index/stats`: index size, event rates, connection status, scoring distribution
- `GET /admin/kv-index/backends`: per-backend cache state summary
- `POST /admin/kv-index/clear`: clear the index (for debugging)
See Admin API for full response schemas.
Limitations¶
- Event sources only support vLLM-compatible SSE endpoints currently.
- Changing `backend` between `memory` and `redis` requires a restart.
- The KV index is an optional enhancement; its unavailability does not break routing.
Logging Section¶
Configures logging output:
```yaml
logging:
  level: "info"          # trace, debug, info, warn, error
  format: "json"         # json, pretty
  enable_colors: false   # Colored output (pretty format only)
```
Log Levels:
- `trace`: Extremely verbose, includes all details
- `debug`: Detailed debugging information
- `info`: General operational information
- `warn`: Warning messages and potential issues
- `error`: Error conditions only
Log Formats:
- `json`: Structured JSON logging (recommended for production)
- `pretty`: Human-readable format (good for development)