
Health, Caching & Logging

Health Checks Section

Configures backend health monitoring:

health_checks:
  enabled: true                    # Enable/disable health monitoring
  interval: "30s"                  # Check frequency
  timeout: "10s"                   # Request timeout
  unhealthy_threshold: 3           # Failures before marking unhealthy
  healthy_threshold: 2             # Successes before marking healthy
  endpoint: "/v1/models"           # Endpoint to check
  warmup_check_interval: "1s"      # Accelerated interval during warmup
  max_warmup_duration: "300s"      # Maximum warmup detection duration

Health Check Process:

  1. Router queries the health endpoint on each backend
  2. Successful responses increment the success counter
  3. Failed responses increment the failure counter
  4. Backends are marked unhealthy after reaching the failure threshold
  5. Backends are marked healthy after reaching the success threshold
  6. Only healthy backends receive traffic
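
The threshold behavior above is a simple hysteresis counter. A minimal Python sketch (illustrative only, not the router's actual code; it assumes a success resets the failure counter and vice versa):

```python
class BackendHealth:
    """Tracks consecutive check results and flips state at the thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.successes = 0
        self.failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one health check result; return current health state."""
        if check_ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With the defaults, three consecutive failures take a backend out of rotation and two consecutive successes bring it back.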

Accelerated Warmup Health Checks

The router supports accelerated health checks during backend warmup, which is particularly useful for backends like llama.cpp that return HTTP 503 while loading models.

Backend States:

| State      | HTTP Response           | Behavior                      |
|------------|-------------------------|-------------------------------|
| ready      | 200 OK                  | Normal interval checks        |
| warming_up | 503 Service Unavailable | Accelerated interval checks   |
| down       | Connection failure      | Normal interval checks        |
| unknown    | Initial state           | First check determines state  |

Warmup Configuration:

| Option                | Default | Description                              |
|-----------------------|---------|------------------------------------------|
| warmup_check_interval | 1s      | Accelerated check interval during warmup |
| max_warmup_duration   | 300s    | Maximum time to stay in accelerated mode |

How it works:

  1. When a backend returns HTTP 503, it enters the warming_up state
  2. Health checks switch to the accelerated interval (default: 1 second)
  3. Once the backend returns HTTP 200, it becomes ready and returns to normal interval
  4. If warmup exceeds max_warmup_duration, the backend is marked as unhealthy

This reduces model availability detection latency from a worst case of roughly 30 seconds (one full normal check interval) to approximately 1 second.
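
The state transitions above can be sketched as follows (a Python illustration; treating non-200/503 codes as failures is an assumption, not documented router behavior):

```python
def next_state(status):
    """Map one health check result to a backend state.

    status: HTTP status code, or None for a connection failure.
    """
    if status is None:
        return "down"           # connection failure
    if status == 503:
        return "warming_up"     # model still loading
    if status == 200:
        return "ready"
    return "down"               # assumption: other codes count as failures


def check_interval(state, normal="30s", warmup="1s"):
    """Only warming_up backends are polled at the accelerated interval."""
    return warmup if state == "warming_up" else normal
```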

Per-Backend Health Check Configuration

Each backend type has sensible default health check endpoints. You can override these defaults with a custom health_check configuration per backend.

Default Health Check Endpoints by Backend Type:

| Backend Type     | Primary Endpoint | Fallback Endpoints | Method | Notes                                  |
|------------------|------------------|--------------------|--------|----------------------------------------|
| openai           | /v1/models       | -                  | GET    | Standard OpenAI endpoint               |
| vllm             | /health          | /v1/models         | GET    | /health available after model load     |
| ollama           | /api/tags        | /                  | GET    | Ollama-specific endpoint               |
| llamacpp         | /health          | /v1/models         | GET    | llama-server endpoint                  |
| mlxcel           | /health          | /v1/models         | GET    | MLxcel server (llama-server compatible)|
| lmstudio         | /v1/models       | /api/v1/models     | GET    | OpenAI-compat + native API             |
| continuum-router | /health          | /v1/models         | GET    | Remote CR / Backend.AI GO              |
| anthropic        | /v1/messages     | -                  | POST   | Accepts 200, 400, 401, 429 as healthy  |
| gemini           | /models          | /v1beta/models     | GET    | Native Gemini endpoint                 |
| azure            | /health          | /v1/models         | GET    | Azure OpenAI endpoint                  |
| generic          | /health          | /v1/models         | GET    | Generic fallback                       |

Fallback Behavior:

When the primary health check endpoint returns HTTP 404, the router automatically tries the fallback endpoints in order. This ensures compatibility with backends that may not implement all standard endpoints.
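
The fallback iteration can be sketched like this (illustrative; `backend_get` is a hypothetical stand-in for the router's HTTP client):

```python
def probe(backend_get, endpoints, accept_status=(200,)):
    """Try each endpoint in order, falling through only on HTTP 404.

    backend_get: callable taking a path and returning an HTTP status code.
    Returns (healthy, endpoint_used).
    """
    for path in endpoints:
        status = backend_get(path)
        if status == 404:
            continue                          # endpoint not implemented -> try next
        return status in accept_status, path  # any other status is a real answer
    return False, None                        # all endpoints returned 404
```

Note that only 404 triggers fallback; a 500 or timeout on the primary endpoint is a genuine failure, not a reason to probe other paths.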

Custom Health Check Configuration:

backends:
  - name: vllm-custom
    type: vllm
    url: http://localhost:8000
    models:
      - my-model
    health_check:
      endpoint: /custom-health          # Primary endpoint
      fallback_endpoints:               # Tried if primary returns 404
        - /health
        - /v1/models
      method: GET                       # HTTP method: GET, POST, or HEAD
      timeout: 10s                      # Override global health check timeout
      accept_status:                    # Status codes indicating healthy
        - 200
        - 204
      warmup_status:                    # Status codes indicating model loading
        - 503

Health Check Configuration Options:

Option Type Default Description
endpoint string Backend-type specific Primary health check endpoint path
fallback_endpoints array Backend-type specific Endpoints to try if primary returns 404
method string GET HTTP method: GET, POST, or HEAD
body object null JSON body for POST requests
accept_status array [200] Status codes indicating the backend is healthy
warmup_status array [503] Status codes indicating the backend is warming up
timeout string Global timeout Override the global health check timeout

Example: Anthropic-Style Health Check:

For backends that use POST requests or accept error codes as healthy indicators:

backends:
  - name: custom-api
    type: generic
    url: http://localhost:9000
    models:
      - custom-model
    health_check:
      endpoint: /api/v1/health
      method: POST
      body:
        check: true
      accept_status:
        - 200
        - 400    # Bad request means server is up
        - 401    # Unauthorized means server is up
        - 429    # Rate limited means server is up

Request Section

Controls request handling behavior:

request:
  timeout: "300s"                  # Maximum request duration
  max_retries: 3                   # Retry attempts for failed requests
  retry_delay: "1s"                # Initial delay between retries

Timeout Considerations:

  • Long timeouts (300s) accommodate slow model inference
  • Streaming requests may take longer than non-streaming
  • Balance between user experience and resource usage

Retry Section

Global retry configuration for resilience:

retry:
  max_attempts: 3                  # Maximum retry attempts
  base_delay: "100ms"              # Base delay between retries
  max_delay: "30s"                 # Cap on retry delays
  exponential_backoff: true        # Use exponential backoff
  jitter: true                     # Add random jitter

Retry Strategy:

  • Exponential backoff: delays increase exponentially (100ms, 200ms, 400ms...)
  • Jitter: adds randomness to prevent thundering herd
  • Max delay: prevents extremely long waits
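
The delay computation can be sketched as follows (a Python illustration; the [0.5, 1.0) jitter factor is one common scheme and an assumption here, not the router's documented formula):

```python
import random


def retry_delay(attempt, base=0.1, cap=30.0, jitter=True, rng=random.random):
    """Delay in seconds before retry `attempt` (0-based).

    Exponential backoff: base * 2^attempt, capped at `cap`.
    With jitter, the delay is scaled by a random factor in [0.5, 1.0)
    so simultaneous clients do not retry in lockstep.
    """
    delay = min(base * (2 ** attempt), cap)
    if jitter:
        delay *= 0.5 + 0.5 * rng()
    return delay
```

With `base_delay: "100ms"` this yields the 100ms, 200ms, 400ms... progression described above, never exceeding `max_delay`.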

Cache Section

Controls caching and optimization:

cache:
  model_cache_ttl: "300s"         # How long to cache model lists
  deduplication_ttl: "60s"        # How long to cache identical requests
  enable_deduplication: true      # Enable request deduplication

Cache Stampede Prevention

The router implements three strategies to prevent cache stampede (thundering herd problem):

  1. Singleflight Pattern: Only one aggregation request runs at a time
  2. Stale-While-Revalidate: Return stale data while refreshing in background
  3. Background Refresh: Proactive cache updates before expiration

Advanced cache configuration:

model_aggregation:
  cache_ttl: 60                     # Cache TTL in seconds (default: 60)
  soft_ttl_ratio: 0.8               # When to trigger background refresh (default: 0.8 = 80%)
  empty_response_base_ttl_seconds: 5   # Base TTL for empty responses
  empty_response_max_ttl_seconds: 60   # Max TTL with exponential backoff
  max_cache_entries: 100            # Maximum cache entries
  background_refresh:
    enabled: true                   # Enable background refresh
    check_interval: 10s             # Check interval

| Option                            | Default | Description                                  |
|-----------------------------------|---------|----------------------------------------------|
| cache_ttl                         | 60s     | Hard TTL: the cache expires after this time  |
| soft_ttl_ratio                    | 0.8     | Soft TTL = cache_ttl * soft_ttl_ratio; the cache is stale but usable between the soft and hard TTL |
| empty_response_base_ttl_seconds   | 5       | Base TTL for empty responses (prevents DoS)  |
| empty_response_max_ttl_seconds    | 60      | Maximum TTL with exponential backoff (base * 2^n) |
| max_cache_entries                 | 100     | Maximum number of cache entries              |
| background_refresh.enabled        | true    | Enable proactive cache refresh               |
| background_refresh.check_interval | 10s     | How often to check cache freshness           |
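
The soft/hard TTL split drives the stale-while-revalidate behavior. A sketch of the freshness decision (illustrative, using the defaults above):

```python
def freshness(age, cache_ttl=60.0, soft_ttl_ratio=0.8):
    """Classify a cache entry by its age in seconds.

    fresh   -> serve from cache
    stale   -> serve from cache, trigger a background refresh
    expired -> fetch synchronously (hard TTL exceeded)
    """
    soft_ttl = cache_ttl * soft_ttl_ratio  # 48s with the defaults
    if age < soft_ttl:
        return "fresh"
    if age < cache_ttl:
        return "stale"
    return "expired"
```

Because stale entries are still served while the refresh runs in the background, clients rarely observe the hard-TTL miss path.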

Cache Benefits:

  • Model caching reduces backend queries
  • Deduplication prevents duplicate processing
  • TTL prevents stale data issues
  • Stampede prevention avoids thundering herd
  • Background refresh ensures cache is always fresh

Response Cache Section

Caches complete LLM responses for deterministic (temperature == 0) requests — both non-streaming and streaming. Repeated identical requests are served from memory without calling the backend.

response_cache:
  enabled: true                     # Enable response caching (default: false)
  backend: memory                   # Cache backend: "memory" (default), "redis", or "tiered"
  capacity: 1000                    # Maximum number of cached responses (LRU eviction)
  ttl: "5m"                         # Time-to-live for cached entries (e.g., "5m", "1h")
  max_response_size: 1048576        # Maximum response body size in bytes (default: 1 MiB)
  max_stream_buffer_size: 10485760  # Maximum streaming buffer size in bytes (default: 10 MiB)

Cache Eligibility

A response is cached only when all of the following conditions hold:

  • response_cache.enabled is true
  • The request's temperature field is 0 or absent (deterministic output)
  • The response body does not exceed max_response_size
  • For non-streaming requests: the response does not contain finish_reason: "error" in any choice
  • For streaming requests: the stream completes successfully (receives the final [DONE] event); interrupted or errored streams are discarded and not cached

Non-eligible requests are forwarded to the backend normally.
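
The non-streaming eligibility rules can be expressed as a single predicate (a sketch, not the router's actual code; request/response field access is illustrative):

```python
import json


def is_cacheable(request, response_body, enabled=True, max_size=1_048_576):
    """Return True when a non-streaming response may be stored in the cache."""
    if not enabled:
        return False                                 # response_cache.enabled
    if request.get("temperature", 0) != 0:
        return False                                 # deterministic requests only
    if len(response_body) > max_size:
        return False                                 # respects max_response_size
    choices = json.loads(response_body).get("choices", [])
    if any(c.get("finish_reason") == "error" for c in choices):
        return False                                 # never cache error responses
    return True
```

An absent `temperature` field defaults to 0 here, matching the "0 or absent" rule above.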

Streaming Cache Behaviour

When caching is enabled and temperature is 0, streaming (stream: true) chat completions are also eligible for caching:

  1. SSE events are buffered in memory alongside normal stream forwarding to the client — there is no observable latency increase.
  2. On successful stream completion (final [DONE] event), the router reconstructs a complete chat.completion JSON object from the buffered chunks and stores it in the cache.
  3. If the stream is interrupted (client disconnect, timeout, backend error), the buffer is discarded and nothing is stored.
  4. On cache hit, the cached response is replayed to the client as a synthetic SSE stream (text/event-stream) with an X-Cache: HIT header.
  5. The max_stream_buffer_size setting limits the streaming buffer. Streams whose total buffered content exceeds this threshold are not cached.

X-Cache Response Header

Every chat completion response includes an X-Cache header:

| Value  | Meaning                                                                                    |
|--------|--------------------------------------------------------------------------------------------|
| HIT    | Response was served from the cache; no backend call was made                               |
| MISS   | Cache was checked but no entry was found; response came from the backend and was stored    |
| BYPASS | Request is not cacheable (e.g., temperature > 0); backend was called, nothing was stored   |

Cache Configuration Options

| Option                    | Default           | Description                                  |
|---------------------------|-------------------|----------------------------------------------|
| enabled                   | false             | Whether response caching is active           |
| backend                   | "memory"          | Cache backend type: "memory", "redis", or "tiered". Switching requires a restart. |
| capacity                  | 1000              | Maximum number of entries; oldest entries are evicted when the limit is reached (LRU). Ignored by the Redis and tiered backends. |
| ttl                       | "5m"              | How long a cached entry remains valid. Supports duration strings: "30s", "5m", "1h" |
| max_response_size         | 1048576           | Non-streaming responses larger than this value (in bytes) are not cached |
| max_stream_buffer_size    | 10485760          | Streaming responses whose accumulated buffer exceeds this value (in bytes) are not cached |
| redis.url                 | --                | Redis/Valkey connection URL (requires the redis-cache build feature and backend: redis) |
| redis.pool_size           | 8                 | Number of Redis connections in the pool      |
| redis.key_prefix          | "cr:resp:"        | Namespace prefix for all cache keys; prevents collisions with other applications on the same Redis instance |
| redis.connect_timeout_ms  | 3000              | Timeout in milliseconds for establishing a new Redis connection |
| redis.command_timeout_ms  | 1000              | Timeout in milliseconds for individual Redis commands |
| redis.tls                 | false             | Whether to use TLS for the Redis connection (or use the rediss:// scheme in the URL) |
| redis.fallback_to_memory  | true              | Fall back to the in-memory cache when Redis is unreachable |
| l1.type                   | "memory"          | L1 tier type for the tiered backend: "memory" or "redis" (redis requires the redis-cache feature) |
| l1.max_value_size         | 1048576           | Maximum value size (bytes) eligible for L1 storage; larger values are stored in L2 only |
| l2.type                   | "s3"              | L2 tier type (currently only "s3" is supported; requires the s3-cache feature) |
| l2.endpoint               | --                | S3-compatible endpoint URL (e.g., https://s3.example.com:8080) |
| l2.bucket                 | --                | S3 bucket name for cache object storage      |
| l2.key_prefix             | "response-cache/" | Object key prefix within the bucket          |
| l2.region                 | "us-east-1"       | AWS-compatible region string                 |
| l2.access_key             | --                | S3 access key; supports ${ENV_VAR} expansion |
| l2.secret_key             | --                | S3 secret key; supports ${ENV_VAR} expansion |
| l2.ttl_override           | --                | Optional TTL override for L2 entries (e.g., "24h"); overrides the global ttl for L2 storage only |
| tiered.promote_on_hit     | true              | Whether L2 hits are promoted back to L1 for faster subsequent access |
| tiered.l1_promotion_ttl   | "5m"              | TTL applied to values promoted from L2 to L1 |

Redis/Valkey Backend

By default the response cache stores entries in process memory. To share the cache across multiple router instances or survive restarts, configure a Redis or Valkey backend.

Build requirement: The binary must be compiled with the redis-cache Cargo feature:

cargo build --release --features redis-cache

Configuration:

response_cache:
  enabled: true
  backend: redis                     # Select Redis backend
  ttl: "5m"
  redis:
    url: "redis://localhost:6379"    # plain TCP
    # url: "rediss://redis.example.com:6380"   # TLS
    # url: "redis://:password@localhost:6379"  # with auth
    pool_size: 8                     # connection pool size (default: 8)
    key_prefix: "cr:resp:"           # key namespace prefix (default: "cr:resp:")
    connect_timeout_ms: 3000         # connection timeout (default: 3000)
    command_timeout_ms: 1000         # per-command timeout (default: 1000)
    tls: false                       # use TLS (default: false)
    fallback_to_memory: true         # fallback to in-memory on failure (default: true)

Automatic failover: When fallback_to_memory is true (the default) and Redis becomes unreachable, the router transparently falls back to an in-memory cache and logs a warning. A background health monitor periodically sends PING commands and switches back to Redis once connectivity is restored. Set fallback_to_memory: false to disable this behaviour -- operations will fail instead of falling back.

Key safety: clear() uses SCAN + DEL with the configured prefix pattern instead of FLUSHDB. Only keys matching the prefix are removed, making it safe to share a Redis instance with other applications.

S3-Compatible Tiered Backend

The tiered backend composes a fast, bounded L1 cache with a virtually unlimited L2 cache backed by an S3-compatible API. Large responses are stored only in L2 to avoid eviction pressure on L1. When an L2 hit occurs, the value is optionally promoted back to L1 for faster subsequent access.

Any S3-compatible storage can be used as L2, including AWS S3, MinIO, Ceph, or VAST Data.

Build requirement: The binary must be compiled with the s3-cache Cargo feature:

cargo build --release --features s3-cache

Tiered cache write path:

  1. Every set() writes to L2 (S3) unconditionally.
  2. If the value is smaller than l1.max_value_size, it is also written to L1.
  3. Values exceeding the threshold are demoted to L2-only; the demotion counter is incremented.

Tiered cache read path:

  1. L1 is checked first. On hit, the value is returned immediately.
  2. On L1 miss, L2 (S3) is queried.
  3. On L2 hit, if tiered.promote_on_hit is true and the value fits in L1, it is promoted back to L1 with tiered.l1_promotion_ttl as the TTL.

TTL enforcement: Each S3 object carries an expires-at metadata field (Unix timestamp). On GET, if the timestamp has passed, the object is deleted lazily and a cache miss is returned. Use l2.ttl_override to keep L2 entries alive longer than the global ttl.
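
The read path with promotion and lazy expiry can be sketched as follows (plain dicts stand in for the real memory/Redis L1 and S3 L2 tiers; illustrative only):

```python
import time


def tiered_get(l1, l2, key, promote_on_hit=True, now=time.time):
    """Tiered read: L1 first, then L2 with lazy TTL enforcement.

    l1: dict key -> value
    l2: dict key -> (value, expires_at)  # expires_at is a Unix timestamp,
                                         # mirroring the S3 expires-at metadata
    """
    if key in l1:
        return l1[key]                   # L1 hit: return immediately
    entry = l2.get(key)
    if entry is None:
        return None                      # miss in both tiers
    value, expires_at = entry
    if now() >= expires_at:
        del l2[key]                      # lazy expiry: delete and report a miss
        return None
    if promote_on_hit:
        l1[key] = value                  # promote the L2 hit back to L1
    return value
```

In the real backend, promotion also honors `l1.max_value_size` and applies `tiered.l1_promotion_ttl`; both are omitted here for brevity.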

Configuration:

response_cache:
  enabled: true
  backend: tiered                    # Select tiered L1/L2 backend
  ttl: "5m"                          # Default TTL for both tiers

  l1:                                # L1 (hot) tier — fast, bounded
    type: memory                     # "memory" or "redis" (redis requires redis-cache feature)
    max_value_size: 1048576          # Values larger than this go to L2 only (default: 1 MiB)

  l2:                                # L2 (warm) tier — S3-compatible storage
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"    # Object key prefix (default: "response-cache/")
    region: "us-east-1"              # Region string (default: "us-east-1")
    access_key: "${S3_ACCESS_KEY}"   # Supports env var expansion
    secret_key: "${S3_SECRET_KEY}"
    ttl_override: "24h"              # Optional: keep L2 entries longer than global TTL

  tiered:                            # Promotion/demotion behavior
    promote_on_hit: true             # Promote L2 hits back to L1 (default: true)
    l1_promotion_ttl: "5m"           # TTL for promoted entries (default: "5m")

Metrics: The s3-cache build adds Prometheus counters for L1/L2 hit rates, promotion/demotion counts, and an S3 operation latency histogram (continuum_cache_s3_latency_seconds).

Using Redis as L1: Combine the redis-cache and s3-cache features to use Redis as the L1 tier:

cargo build --release --features redis-cache,s3-cache
response_cache:
  backend: tiered
  l1:
    type: redis                      # Use Redis as L1 for cross-instance sharing
    max_value_size: 1048576
  redis:                             # Shared Redis config reused by l1
    url: "redis://localhost:6379"
    pool_size: 8
  l2:
    type: s3
    endpoint: "https://s3.example.com:8080"
    bucket: "llm-response-cache"
    access_key: "${S3_ACCESS_KEY}"
    secret_key: "${S3_SECRET_KEY}"

Limitations

  • Streaming cache is supported for OpenAI-compatible backends only. Anthropic and Gemini native streaming formats are not cached.
  • The Anthropic (/anthropic/v1/messages) and OpenAI (/v1/chat/completions) endpoints share a single cache instance when response_cache.enabled: true.
  • Without the Redis backend configured, cache is stored in-memory and does not survive server restarts.

KV Cache Index Section

Enables real-time KV cache state tracking for precise overlap-scored routing. When a vLLM backend processes a request, it emits SSE events describing which token prefixes are cached. The router subscribes to these events, maintains an index of cache state across all backends, and uses this data to route requests to the backend most likely to have relevant KV cache data.

kv_cache_index:
  enabled: true                      # Enable KV cache index (default: false)
  backend: memory                    # Index backend: "memory" (default) or "redis"
  max_entries: 100000                # Maximum prefix hash entries tracked
  entry_ttl_seconds: 600             # TTL for index entries in seconds
  scoring:                           # Scoring weights for backend selection
    overlap_weight: 0.6              # Weight for cache overlap signal (0.0-1.0)
    load_weight: 0.3                 # Weight for backend load signal (0.0-1.0)
    health_weight: 0.1               # Weight for backend health signal (0.0-1.0)
    min_overlap_threshold: 0.3       # Minimum overlap to activate KV-aware routing
    gpu_tier_weight: 1.0             # Tier multiplier for GPU-resident (hot) cached data
    storage_tier_weight: 0.6         # Tier multiplier for storage-offloaded (warm) cached data
  storage_offloading:                # Tiered storage awareness (GPU hot / storage warm)
    enabled: false                   # Enable storage tier tracking (default: false)
    treat_eviction_as_offload: true  # Treat eviction as offload to warm storage (default: true)
  event_sources:                     # vLLM backend KV event stream endpoints
    - backend_name: "vllm-1"
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: "vllm-2"
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000

How It Works

  1. Event consumption: The router subscribes to SSE event streams from configured vLLM backends. Each event reports a CacheCreated, CacheEvicted, CacheOffloaded, CacheReloaded, or CachePurged action with a prefix hash and token count.
  2. Index update: Events update an in-memory (or Redis-backed) index mapping prefix hashes to backends with their cached token counts and storage tier (GpuHot or StorageWarm).
  3. Scoring: On each request, the overlap scorer queries the index for the request's prefix key and computes a weighted score combining cache overlap (adjusted by tier multiplier), backend load, and backend health.
  4. Routing decision: If the best score exceeds min_overlap_threshold, the scorer selects that backend. Otherwise, the configured selection strategy (e.g., PrefixAwareHash) is used as fallback.
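
Steps 1-2 amount to applying each event to a prefix-hash index. A Python sketch (event field names and lower-cased action strings are assumptions for illustration, not the wire format):

```python
def apply_event(index, event, treat_eviction_as_offload=True,
                storage_offloading=True):
    """Apply one vLLM KV event to the index.

    index: {(prefix_hash, backend): {"tokens": int, "tier": str}}
    event: {"action": ..., "prefix_hash": ..., "backend": ..., "tokens": ...}
    """
    key = (event["prefix_hash"], event["backend"])
    action = event["action"]
    if action == "cache_created":
        index[key] = {"tokens": event["tokens"], "tier": "GpuHot"}
    elif action == "cache_reloaded":
        if key in index:
            index[key]["tier"] = "GpuHot"       # back in GPU memory
    elif action == "cache_offloaded":
        if storage_offloading and key in index:
            index[key]["tier"] = "StorageWarm"  # moved to warm storage
    elif action == "cache_evicted":
        if storage_offloading and treat_eviction_as_offload:
            if key in index:
                index[key]["tier"] = "StorageWarm"  # downgrade, keep entry
        else:
            index.pop(key, None)                # remove entirely
    elif action == "cache_purged":
        index.pop(key, None)
```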

Scoring Formula

final_score = overlap_weight * (raw_overlap * tier_multiplier)
            + load_weight * (1 - load_ratio)
            + health_weight * health_score

Where:

  • raw_overlap = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
  • tier_multiplier = gpu_tier_weight for GPU-resident data, or storage_tier_weight for offloaded data
  • load_ratio = inflight_requests / max_inflight (0.0 to 1.0)
  • health_score = backend success_rate (0.0 to 1.0)

When storage_offloading.enabled is false, all data is treated as GpuHot and tier_multiplier always equals gpu_tier_weight (default: 1.0).
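
A direct transcription of the formula, using the default weights from the configuration above (a sketch for reasoning about scores, not the router's code):

```python
def kv_score(backend_tokens, max_tokens, inflight, max_inflight, success_rate,
             tier="GpuHot", overlap_weight=0.6, load_weight=0.3,
             health_weight=0.1, gpu_tier_weight=1.0, storage_tier_weight=0.6):
    """Weighted KV-routing score combining overlap, load, and health."""
    raw_overlap = backend_tokens / max_tokens if max_tokens else 0.0
    tier_multiplier = gpu_tier_weight if tier == "GpuHot" else storage_tier_weight
    load_ratio = inflight / max_inflight if max_inflight else 0.0
    return (overlap_weight * raw_overlap * tier_multiplier
            + load_weight * (1 - load_ratio)
            + health_weight * success_rate)
```

An idle, healthy backend with full GPU-resident overlap scores 1.0; the same overlap offloaded to warm storage scores 0.6 * 0.6 + 0.3 + 0.1 = 0.76.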

Configuration Options

| Option                                        | Default  | Description                           |
|-----------------------------------------------|----------|---------------------------------------|
| enabled                                       | false    | Whether the KV cache index is active  |
| backend                                       | "memory" | Index storage backend: "memory" or "redis". Redis requires the redis-cache build feature and the shared Redis config from response_cache.redis. |
| max_entries                                   | 100000   | Maximum number of prefix hash entries; LRU eviction kicks in at capacity |
| entry_ttl_seconds                             | 600      | TTL for index entries; stale entries are cleaned up periodically |
| scoring.overlap_weight                        | 0.6      | Weight for the cache overlap signal; higher values favor cache affinity |
| scoring.load_weight                           | 0.3      | Weight for the backend load signal; higher values favor less-loaded backends |
| scoring.health_weight                         | 0.1      | Weight for the backend health signal; higher values favor healthier backends |
| scoring.min_overlap_threshold                 | 0.3      | Minimum overlap score to activate KV-aware routing; if no backend exceeds this, the fallback strategy is used |
| scoring.gpu_tier_weight                       | 1.0      | Multiplier applied to the overlap score when data is GPU-resident (GpuHot); effective only when storage_offloading.enabled is true |
| scoring.storage_tier_weight                   | 0.6      | Multiplier applied to the overlap score when data is storage-offloaded (StorageWarm); effective only when storage_offloading.enabled is true |
| storage_offloading.enabled                    | false    | Enable storage tier tracking; when false, eviction events remove entries entirely |
| storage_offloading.treat_eviction_as_offload  | true     | When true, cache_evicted events downgrade the entry from GpuHot to StorageWarm instead of removing it; only effective when storage_offloading.enabled is true |
| event_sources[].backend_name                  | --       | Name of the backend (must match a configured backend name) |
| event_sources[].endpoint                      | --       | URL of the vLLM KV event SSE endpoint (http, https, ws, or wss) |
| event_sources[].reconnect_interval_ms         | 5000     | Reconnect delay after connection loss |

Redis Backend

When backend: redis, the KV index uses the shared Redis connection from response_cache.redis. This enables multiple router instances to share the same index for consistent routing decisions across the fleet.

response_cache:
  redis:
    url: "redis://localhost:6379"
    pool_size: 8

kv_cache_index:
  enabled: true
  backend: redis    # Uses the Redis config from response_cache.redis

Admin Endpoints

  • GET /admin/kv-index/stats — Index size, event rates, connection status, scoring distribution
  • GET /admin/kv-index/backends — Per-backend cache state summary
  • POST /admin/kv-index/clear — Clear the index (for debugging)

See Admin API for full response schemas.

Limitations

  • Event sources currently support only vLLM-compatible SSE endpoints.
  • Changing backend between memory and redis requires a restart.
  • The KV index is an optional enhancement — its unavailability does not break routing.

Logging Section

Configures logging output:

logging:
  level: "info"                   # trace, debug, info, warn, error
  format: "json"                  # json, pretty
  enable_colors: false            # Colored output (pretty format only)

Log Levels:

  • trace: Extremely verbose, includes all details
  • debug: Detailed debugging information
  • info: General operational information
  • warn: Warning messages and potential issues
  • error: Error conditions only

Log Formats:

  • json: Structured JSON logging (recommended for production)
  • pretty: Human-readable format (good for development)