KV Cache Optimization¶
Continuum Router implements a four-tier KV cache optimization system that reduces redundant computation on LLM backends. Each tier is independently configurable and the tiers compose together during backend selection.
Table of Contents¶
- Overview
- Four-Tier Caching Strategy
- Tier 1: Prefix-Aware Sticky Routing
- Tier 2: Response Cache
- Tier 3: Shared External Cache
- Tier 4: Backend KV Cache Index
- Backend Selection Pipeline
- Configuration Reference
- Metrics
- Admin Endpoints
- Deployment Guide
- Performance Characteristics
Overview¶
Modern LLM inference engines (vLLM, TensorRT-LLM, SGLang) maintain a KV cache in GPU memory that stores attention key-value tensors computed for the token prefix of a request. When the same prefix is seen again, the engine can skip recomputing those tensors — a substantial GPU time saving for long system prompts or repeated context.
Continuum Router maximizes KV cache reuse across backends through four complementary mechanisms:
- Prefix-Aware Sticky Routing — routes requests sharing the same prompt prefix to the same backend via consistent hashing, keeping GPU KV cache warm.
- Response Cache — serves repeated deterministic requests directly from router memory or Redis without hitting any backend at all.
- Shared External Cache — stores response cache state in Redis/Valkey so multiple router instances share the same cache entries.
- Backend KV Cache Index — tracks which backends actually hold GPU-resident KV tensors for recent prefixes, enabling fine-grained routing decisions informed by real cache state.
Four-Tier Caching Strategy¶
flowchart TD
Client([Client Request])
RC{Response\nCache hit?}
PRL{Prefix key\navailable?}
CHWBL[CHWBL Hash Ring\nPrefix → Backend]
KVI{KV Index\noverlap ≥ threshold?}
SCORE[Composite Score\noverlap + load + health]
FALLBACK[Default Selection\nStrategy]
BACKEND([Selected Backend])
STORE_RC[Store Response\nin Cache]
Client --> RC
RC -->|HIT| Client
RC -->|MISS| PRL
PRL -->|Yes| CHWBL
CHWBL --> KVI
KVI -->|Yes| SCORE
SCORE --> BACKEND
KVI -->|No| FALLBACK
PRL -->|No| FALLBACK
FALLBACK --> BACKEND
BACKEND --> STORE_RC
STORE_RC --> Client

The tiers are not mutually exclusive: Tiers 1 and 4 both participate in backend selection. Tier 2 intercepts the entire request before any backend is contacted. Tier 3 provides the storage substrate for Tier 2.
Tier 1: Prefix-Aware Sticky Routing¶
Tier 1 routes requests that share a common prompt prefix to the same backend, maximizing the probability that the GPU KV cache on that backend is already warm for those tokens.
Prefix Key Extraction¶
For each incoming chat completion request, the router extracts a prefix key — a 32-byte SHA256 digest that uniquely identifies the semantic anchor of the request.
The extraction logic handles both OpenAI and Anthropic request formats:
| Format | Preferred anchor | Fallback anchor |
|---|---|---|
| OpenAI | messages[].role == "system" content | First non-system message |
| Anthropic | Top-level system string or content-block array | First non-system message |
The hash is computed as:
# With system prompt:
SHA256(model_bytes ++ "\x00" ++ "S" ++ system_bytes[:max_prefix_length])
# Without system prompt (first message fallback):
SHA256(model_bytes ++ "\x00" ++ "M" ++ first_msg_bytes[:max_prefix_length])
The \x00 separator prevents concatenation-ambiguity collisions between the model name and content (without it, model "ab" with content "c" would hash identically to model "a" with content "bc"). The tag bytes S/M prevent identical text from hashing to the same value when it appears as a system prompt versus a first user message.
The max_prefix_length parameter (default: 1024 bytes) truncates the content before hashing, with UTF-8 boundary awareness to avoid splitting multibyte characters.
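As an illustrative sketch of this scheme in Rust (the standard library's DefaultHasher stands in for SHA256, which in the real router would come from a crypto crate; the function names here are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Truncate to at most `max_len` bytes without splitting a multibyte
/// UTF-8 character: back up past any continuation bytes (0b10xxxxxx).
fn truncate_utf8(bytes: &[u8], max_len: usize) -> &[u8] {
    if bytes.len() <= max_len {
        return bytes;
    }
    let mut end = max_len;
    while end > 0 && (bytes[end] & 0b1100_0000) == 0b1000_0000 {
        end -= 1;
    }
    &bytes[..end]
}

/// Hash input: model ++ \x00 ++ tag ++ truncated content.
/// `tag` is b'S' for a system prompt, b'M' for the first-message fallback.
fn prefix_key(model: &str, tag: u8, content: &str, max_prefix_length: usize) -> u64 {
    let mut input = Vec::with_capacity(model.len() + 2 + max_prefix_length);
    input.extend_from_slice(model.as_bytes());
    input.push(0x00); // separator: no ambiguity between model and content bytes
    input.push(tag);  // role tag: system vs. first-message text hash differently
    input.extend_from_slice(truncate_utf8(content.as_bytes(), max_prefix_length));
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}
```

Note how the same text yields different keys under the S and M tags, which is exactly the collision the tag byte exists to prevent.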
Implementation: src/core/prefix_key.rs, src/core/hashing.rs
Consistent Hashing with Bounded Loads (CHWBL)¶
The PrefixAwareHash selection strategy uses a consistent hash ring to map prefix keys to backends. Simple consistent hashing can produce uneven load distribution when some prefixes are far more popular than others. Continuum Router addresses this with the Consistent Hashing with Bounded Loads (CHWBL) algorithm.
CHWBL adds a load cap: a backend can handle at most (1 + epsilon) * average_load requests simultaneously. When a backend is at its load cap, the request overflows to the next node clockwise on the ring.
The ring is populated with virtual_nodes (default: 150) virtual replicas per backend to improve key distribution uniformity.
Routing decision labels:
- prefix_hash — request was placed on the backend that owns this prefix on the hash ring
- overflow — preferred backend was at load cap, request went to the next ring node
- fallback — no prefix key was extractable, fell back to the configured selection strategy
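The bounded-load walk can be sketched as follows (hashing and ring construction are simplified, and build_ring/select are hypothetical names; the real ring hashes the 32-byte prefix key and uses 150 virtual nodes per backend):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

fn point(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Build a ring with `vnodes` virtual nodes per backend.
fn build_ring(backends: &[&str], vnodes: usize) -> BTreeMap<u64, String> {
    let mut ring = BTreeMap::new();
    for b in backends {
        for i in 0..vnodes {
            ring.insert(point(&format!("{}#{}", b, i)), b.to_string());
        }
    }
    ring
}

/// Walk clockwise from `key`, skipping backends at the load cap
/// (1 + epsilon) * average in-flight load. Returns (backend, overflowed).
fn select(
    ring: &BTreeMap<u64, String>,
    loads: &HashMap<String, usize>,
    key: u64,
    epsilon: f64,
) -> Option<(String, bool)> {
    let total: usize = loads.values().sum();
    let avg = (total as f64 / loads.len().max(1) as f64).max(1.0);
    let cap = (1.0 + epsilon) * avg;
    let mut overflowed = false;
    for (_, backend) in ring.range(key..).chain(ring.range(..key)) {
        if (*loads.get(backend).unwrap_or(&0) as f64) < cap {
            return Some((backend.clone(), overflowed));
        }
        overflowed = true; // preferred owner was at cap; spill clockwise
    }
    None
}
```

An overloaded backend is skipped deterministically: every router instance computing the same prefix key and observing similar loads spills to the same next node.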
Anthropic Cache Control Injection¶
When anthropic_cache_control_injection: true is set, the router automatically adds cache_control: { type: "ephemeral" } markers to system prompt content blocks in Anthropic API requests. This activates Anthropic's server-side prompt caching, which is distinct from the router-level KV cache but complementary to it.
Tier 2: Response Cache¶
The response cache stores complete LLM responses for deterministic requests, allowing the router to serve repeated queries without contacting any backend.
Cache Eligibility¶
A request is eligible for caching when all of the following are true:
- temperature is 0 or absent
- The request does not use streaming (or uses streaming with buffering enabled)
- The accumulated response size is within max_response_size
Requests with non-zero temperature are probabilistic and are never cached. The response header X-Cache: HIT, X-Cache: MISS, or X-Cache: BYPASS indicates the cache disposition.
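These rules reduce to a small predicate; the sketch below uses illustrative parameter names, not the router's actual types:

```rust
/// Cache disposition per the eligibility rules: only deterministic
/// (temperature 0 or absent), non-streaming-or-buffered, size-bounded
/// responses are cacheable.
fn cache_disposition(
    temperature: Option<f64>,
    streaming: bool,
    stream_buffering: bool,
    response_size: usize,
    max_response_size: usize,
) -> &'static str {
    let deterministic = temperature.unwrap_or(0.0) == 0.0;
    let stream_ok = !streaming || stream_buffering;
    if deterministic && stream_ok && response_size <= max_response_size {
        "CACHEABLE" // the lookup result then decides X-Cache: HIT vs. MISS
    } else {
        "BYPASS" // served with X-Cache: BYPASS, never stored
    }
}
```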
Cache Key Computation¶
The cache key is a SHA256 hash over all parameters that affect LLM output:
SHA256(
model,
"\x00",
SHA256(messages), // pre-hashed messages array
"\x00",
temperature_bytes,
"\x00",
SOME/NONE_tag + max_tokens_bytes,
"\x00",
SOME/NONE_tag + top_p_bytes,
"\x00",
SOME/NONE_tag + tenant_id_bytes,
)
SOME/NONE tag bytes prevent collisions between None and Some(0) for optional parameters. The tenant ID is included to provide multi-tenant isolation — tenants cannot read each other's cached responses.
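The tagging can be sketched with a hypothetical helper:

```rust
/// Append an optional u32 parameter to the hash input. A tag byte
/// distinguishes None (0x00, no payload) from Some(x) (0x01 + big-endian
/// bytes), so None and Some(0) can never produce the same input.
fn encode_opt_u32(buf: &mut Vec<u8>, value: Option<u32>) {
    match value {
        None => buf.push(0x00),
        Some(x) => {
            buf.push(0x01);
            buf.extend_from_slice(&x.to_be_bytes());
        }
    }
}
```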
Implementation: src/infrastructure/cache/response_cache.rs
Streaming Cache¶
For streaming responses, the router accumulates the SSE stream into a buffer up to max_stream_buffer_size (default: 10 MiB). If the complete stream fits within the limit, the buffer is stored as a single serialized blob and replayed as a synthetic SSE stream on cache hits.
Cache Eviction¶
The in-memory backend uses LRU eviction when the entry count reaches capacity. The Redis backend relies on TTL-based expiration managed by Redis itself.
Tier 3: Shared External Cache¶
The shared external cache provides a CacheStore trait abstraction over Redis/Valkey, allowing multiple router instances to share response cache state.
CacheStore Trait¶
pub trait CacheStore: Send + Sync + 'static {
async fn get(&self, key: &str) -> CacheStoreResult<Option<Vec<u8>>>;
async fn set(&self, key: &str, value: &[u8], ttl: Duration) -> CacheStoreResult<()>;
async fn delete(&self, key: &str) -> CacheStoreResult<()>;
async fn clear(&self) -> CacheStoreResult<()>;
async fn stats(&self) -> CacheStoreStats;
}
Implementations: InMemoryCacheStore (default, LRU + TTL), RedisCacheStore (Redis/Valkey with connection pooling).
Implementation: src/infrastructure/cache/store.rs
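A minimal synchronous sketch of the in-memory variant (the real InMemoryCacheStore is async and adds LRU eviction on top of TTL; names here are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry {
    value: Vec<u8>,
    expires_at: Instant,
}

/// TTL-only in-memory store: entries expire lazily on read.
struct MemStore {
    map: HashMap<String, Entry>,
}

impl MemStore {
    fn new() -> Self {
        MemStore { map: HashMap::new() }
    }

    fn set(&mut self, key: &str, value: &[u8], ttl: Duration) {
        let entry = Entry { value: value.to_vec(), expires_at: Instant::now() + ttl };
        self.map.insert(key.to_string(), entry);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map
            .get(key)
            .filter(|e| Instant::now() < e.expires_at) // expired entries read as misses
            .map(|e| e.value.clone())
    }
}
```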
Redis Backend¶
RedisCacheStore uses deadpool-redis for connection pooling. All keys are namespaced with a configurable prefix (default: cr:resp:) to avoid collisions with other applications sharing the same Redis instance.
Key namespacing format: <key_prefix><cache_key> (the configured prefix followed by the response cache key).
Operations use SET EX for writes and GET for reads, with configurable command timeouts (default: 1s).
Automatic Fallback¶
When Redis is unreachable, RedisCacheStore transparently activates an in-memory fallback cache. The fallback flag is set on the first connection failure, and a background health-monitor task (running every 30 seconds) attempts to restore the Redis connection. On recovery, the flag is cleared and subsequent operations go back to Redis.
The continuum_cache_fallback_active metric (value 1) indicates that fallback mode is currently active.
Connection Pool Sharing¶
The deadpool_redis::Pool is stored as an Arc in AppState and shared between the response cache and the KV cache index (Tier 4). This avoids double-counting connections and simplifies configuration: both consumers reuse the same pool credentials.
Tier 4: Backend KV Cache Index¶
Tier 4 tracks in real time which backends hold GPU-resident KV tensors for specific token prefix hashes. This enables routing decisions based on actual GPU cache state rather than statistical affinity.
Event Consumption¶
Each vLLM backend exposes a KV cache event stream at an SSE endpoint (e.g., http://vllm-1:8000/v1/kv_events). The KvEventConsumerManager spawns a background Tokio task per backend that subscribes to this stream and processes events.
Event types:
| Event | Meaning |
|---|---|
cache_created | A KV block for a token prefix was created on this backend (data enters GPU VRAM) |
cache_evicted | A KV block was evicted from GPU memory on this backend |
cache_offloaded | A KV block was explicitly offloaded from GPU to external storage (e.g., S3-compatible storage) |
cache_reloaded | A KV block was reloaded from external storage back into GPU memory |
cache_purged | A KV block was permanently removed from all storage tiers |
Each event carries a prefix_hash (hex string) and an optional token_count indicating how many tokens are cached for that prefix on that backend.
SSE parsing details:
- The consumer preferentially uses the SSE event: field to determine the event type; the JSON event field is a fallback.
- The buffer is capped at 1 MiB (MAX_SSE_BUFFER_SIZE) to protect against malformed streams.
- On connection failure or stream end, the consumer applies exponential backoff (initial: 1s, max: 60s) before reconnecting.
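The field extraction can be sketched like this (frame splitting on blank lines and JSON decoding are omitted; parse_sse_frame is a hypothetical name):

```rust
/// Parse one SSE frame into (event_type, data). The SSE `event:` field
/// takes precedence; the caller falls back to a JSON `event` field in
/// `data` only when no `event:` line was present.
fn parse_sse_frame(frame: &str) -> (Option<String>, String) {
    let mut event = None;
    let mut data = String::new();
    for line in frame.lines() {
        if let Some(v) = line.strip_prefix("event:") {
            event = Some(v.trim().to_string());
        } else if let Some(v) = line.strip_prefix("data:") {
            if !data.is_empty() {
                data.push('\n'); // multiple data lines are joined
            }
            data.push_str(v.trim());
        }
    }
    (event, data)
}
```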
Implementation: src/infrastructure/kv_index/event_consumer.rs
Index Structure¶
Events are fed into a KvCacheIndex implementation, which maintains a mapping from each prefix_hash to the set of (backend_id, token_count) entries for backends currently holding that prefix.
The token count acts as the score: a backend holding 2048 cached tokens for a prefix ranks higher than one holding 512.
Two implementations are provided:
InMemoryKvIndex¶
- DashMap<String, PrefixEntry> for lock-free concurrent reads
- LRU eviction when entry count reaches max_entries (evicts the oldest 10% of entries)
- TTL-based expiration checked lazily on query_backends()
- Periodic cleanup_expired() removes stale entries proactively
- Defaults: 100,000 max entries, 300s TTL
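The mapping and ranking can be sketched with plain HashMaps (the real InMemoryKvIndex uses DashMap and tracks TTLs; names and types here are illustrative):

```rust
use std::collections::HashMap;

/// prefix_hash -> (backend_id -> cached token count)
struct KvIndex {
    map: HashMap<String, HashMap<String, u32>>,
}

impl KvIndex {
    fn new() -> Self {
        KvIndex { map: HashMap::new() }
    }

    fn on_created(&mut self, prefix: &str, backend: &str, tokens: u32) {
        self.map
            .entry(prefix.to_string())
            .or_default()
            .insert(backend.to_string(), tokens);
    }

    fn on_evicted(&mut self, prefix: &str, backend: &str) {
        if let Some(backends) = self.map.get_mut(prefix) {
            backends.remove(backend);
        }
    }

    /// Backends holding this prefix, ranked by descending token count.
    fn query_backends(&self, prefix: &str) -> Vec<(String, u32)> {
        let mut out: Vec<(String, u32)> = self
            .map
            .get(prefix)
            .map(|b| b.iter().map(|(k, v)| (k.clone(), *v)).collect())
            .unwrap_or_default();
        out.sort_by(|a, b| b.1.cmp(&a.1));
        out
    }
}
```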
RedisKvIndex¶
- Stores each prefix as a Redis sorted set: ZADD cr:kvidx:<prefix_hash> <token_count> <backend_id>
- EXPIRE is pipelined with ZADD in a single round-trip
- ZREVRANGEBYSCORE cr:kvidx:<prefix_hash> +inf -inf WITHSCORES returns backends ranked by descending score
- Enables sharing of KV index state across multiple router instances
- Key prefix: cr:kvidx:
Implementation: src/infrastructure/kv_index/index.rs
Storage Tier Awareness¶
When storage_offloading.enabled is true, the index tracks two storage tiers for each (prefix, backend) entry:
| Tier | Name | Description |
|---|---|---|
| Hot | GpuHot | KV data is resident in GPU VRAM. Immediate cache hit, no reload latency. |
| Warm | StorageWarm | KV data has been offloaded to external storage (e.g., S3-compatible storage). Cache hit with additional reload latency. |
Tier transitions from events:
| Event | Resulting tier |
|---|---|
cache_created | GpuHot |
cache_offloaded | StorageWarm |
cache_reloaded | GpuHot |
cache_evicted (when treat_eviction_as_offload: true) | StorageWarm |
cache_evicted (when treat_eviction_as_offload: false) | Entry removed |
cache_purged | Entry removed |
The treat_eviction_as_offload option controls whether generic cache_evicted events (which do not carry tier information) are treated as offloads to warm storage or as permanent removals. This is useful when vLLM backends emit only cache_created and cache_evicted events without the explicit cache_offloaded event type.
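The transition table reduces to a small function (illustrative types; the real index keys entries by (prefix, backend)):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    GpuHot,
    StorageWarm,
}

/// Map an event to the resulting storage tier; None means the
/// (prefix, backend) entry is removed from the index.
fn next_tier(event: &str, treat_eviction_as_offload: bool) -> Option<Tier> {
    match event {
        "cache_created" | "cache_reloaded" => Some(Tier::GpuHot),
        "cache_offloaded" => Some(Tier::StorageWarm),
        "cache_evicted" if treat_eviction_as_offload => Some(Tier::StorageWarm),
        "cache_evicted" | "cache_purged" => None,
        _ => None, // unknown events are ignored
    }
}
```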
Implementation: src/infrastructure/kv_index/types.rs
Overlap Scoring¶
The KvOverlapScorer implements the BackendScorer trait and computes a composite score for each backend:
final_score = overlap_weight * (raw_overlap * tier_multiplier)
+ load_weight * (1.0 - load_ratio)
+ health_weight * health_score
Where:
- raw_overlap = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
- tier_multiplier = gpu_tier_weight for GpuHot data, or storage_tier_weight for StorageWarm data
- load_ratio = backend_in_flight / max_in_flight_across_backends (0.0 to 1.0)
- health_score = backend_success_rate (0.0 to 1.0)
Default weights:
| Parameter | Default | Description |
|---|---|---|
overlap_weight | 0.6 | Weight for the cache overlap signal |
load_weight | 0.3 | Weight for the backend load signal |
health_weight | 0.1 | Weight for the backend health signal |
gpu_tier_weight | 1.0 | Tier multiplier for GPU-resident (GpuHot) data |
storage_tier_weight | 0.6 | Tier multiplier for storage-offloaded (StorageWarm) data |
The three main weights (overlap_weight, load_weight, health_weight) must sum to exactly 1.0 (validated at startup).
The tier weight multipliers (gpu_tier_weight, storage_tier_weight) are independent — they scale the raw overlap score before the weighted sum is computed. A StorageWarm backend with 100 cached tokens scores lower than a GpuHot backend with 100 cached tokens because the tier multiplier reduces the effective overlap signal.
Minimum overlap threshold: If the best overlap score across all backends is below min_overlap_threshold (default: 0.3), the scorer returns 0.0 for all backends and the pool falls back to its default selection strategy. This prevents routing to a sub-optimal backend when no backend has meaningful cache coverage.
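Plugging in the default weights and tier multipliers, the score computation looks like this sketch:

```rust
/// Composite score with the default weights. `gpu_hot` selects the tier
/// multiplier applied to the raw overlap before the weighted sum.
fn composite_score(
    raw_overlap: f64,  // 0.0 to 1.0
    gpu_hot: bool,
    load_ratio: f64,   // 0.0 to 1.0
    health_score: f64, // 0.0 to 1.0
) -> f64 {
    let (overlap_w, load_w, health_w) = (0.6, 0.3, 0.1); // must sum to 1.0
    let tier_multiplier = if gpu_hot { 1.0 } else { 0.6 }; // gpu vs. storage tier weight
    overlap_w * (raw_overlap * tier_multiplier)
        + load_w * (1.0 - load_ratio)
        + health_w * health_score
}
```

With full overlap, zero load, and perfect health, a GpuHot backend scores 1.0 while a StorageWarm backend scores 0.76: the tier multiplier discounts only the overlap term.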
Async preparation model: KvCacheIndex.query_backends() is async but BackendScorer.score() must be synchronous. The scorer uses a two-phase design: prepare() fetches and caches the index query result (with a 100ms internal TTL), then score() reads synchronously from the cache.
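A sketch of this two-phase pattern (simplified to a single cached result; the real scorer caches per prefix and awaits KvCacheIndex::query_backends() in prepare):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Two-phase scorer: `prepare` performs the (potentially async) index
/// lookup and caches it; `score` is a synchronous read of that cache.
struct OverlapScorer {
    cached: Option<(Instant, HashMap<String, f64>)>,
    cache_ttl: Duration,
}

impl OverlapScorer {
    fn new() -> Self {
        OverlapScorer { cached: None, cache_ttl: Duration::from_millis(100) }
    }

    fn prepare(&mut self, scores: HashMap<String, f64>) {
        self.cached = Some((Instant::now(), scores));
    }

    fn score(&self, backend: &str) -> f64 {
        match &self.cached {
            Some((at, scores)) if at.elapsed() < self.cache_ttl => {
                *scores.get(backend).unwrap_or(&0.0)
            }
            _ => 0.0, // no fresh data: neutral score, pool uses default strategy
        }
    }
}
```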
Implementation: src/infrastructure/kv_index/scorer.rs
Backend Selection Pipeline¶
When a chat completion request arrives, the router executes the following pipeline:
sequenceDiagram
participant C as Client
participant R as Router
participant RC as Response Cache
participant PK as Prefix Key
participant CHWBL as CHWBL Ring
participant KVI as KV Index
participant B as Backend
C->>R: POST /v1/chat/completions
R->>RC: Lookup cache key (T=0 only)
alt Cache HIT
RC-->>R: Cached response
R-->>C: 200 OK (X-Cache: HIT)
else Cache MISS
R->>PK: Extract prefix key
alt Prefix key available
R->>CHWBL: Map prefix → preferred backend
R->>KVI: prepare(prefix_hash)
KVI-->>R: Backend scores from index
R->>R: Score = overlap + load + health
alt Best overlap ≥ threshold
R->>B: Forward request (KV-aware routing)
else Below threshold
R->>CHWBL: Use CHWBL result (prefix routing)
R->>B: Forward request
end
else No prefix key
R->>B: Forward request (default strategy)
end
B-->>R: Response
R->>RC: Store response (T=0)
R-->>C: 200 OK (X-Cache: MISS)
end

The KV overlap scorer composes with prefix-aware routing rather than replacing it. When the KV index has data and the overlap exceeds the threshold, the overlap scorer steers the selection. When data is absent or insufficient, the CHWBL ring result is used.
Configuration Reference¶
Tier 1: Prefix-Aware Routing¶
prefix_routing:
enabled: true
# Maximum bytes of prompt content used in prefix hash (default: 1024)
max_prefix_length: 1024
# CHWBL load cap epsilon: backend can handle (1 + epsilon) * avg_load
# Range: 0.01 to 10.0 (default: 0.25)
load_factor_epsilon: 0.25
# Virtual nodes per backend on the consistent hash ring (default: 150)
# Higher values improve distribution uniformity
virtual_nodes: 150
# Inject Anthropic cache_control markers into system prompts (default: false)
anthropic_cache_control_injection: false
Tier 2: Response Cache¶
response_cache:
enabled: true
# Cache backend: "memory" (default) or "redis"
# Changing backend requires restart; other fields support hot-reload
backend: memory
# Maximum cached entries before LRU eviction (default: 1000)
capacity: 1000
# TTL for cached entries (default: "5m")
ttl: "5m"
# Maximum response body size eligible for caching (default: 1 MiB)
max_response_size: 1048576
# Maximum streaming buffer size eligible for caching (default: 10 MiB)
max_stream_buffer_size: 10485760
Tier 3: Shared External Cache (Redis backend)¶
response_cache:
enabled: true
backend: redis
redis:
# Redis/Valkey connection URL
url: "redis://redis:6379"
# rediss:// for TLS; or set tls: true with redis:// URL
# url: "rediss://redis:6380"
# Connection pool size (default: 8)
pool_size: 8
# Key namespace prefix (must not contain glob characters)
key_prefix: "cr:resp:"
# Connection timeout in milliseconds (default: 3000)
connect_timeout_ms: 3000
# Per-command timeout in milliseconds (default: 1000)
command_timeout_ms: 1000
# Fallback capacity for in-memory cache when Redis is unreachable (default: 1000)
fallback_capacity: 1000
# TTL for fallback in-memory entries in seconds (default: 300)
fallback_ttl_seconds: 300
# Fall back to in-memory cache on Redis failure (default: true)
fallback_to_memory: true
Tier 4: KV Cache Index¶
kv_cache_index:
enabled: true
# Index backend: "memory" (default) or "redis"
# When "redis", reuses the connection pool from response_cache.redis
backend: memory
# Maximum prefix hash entries tracked (default: 100000)
# Range: 100 to 10,000,000
max_entries: 100000
# TTL for index entries in seconds (default: 600)
# Range: 1 to 86400
entry_ttl_seconds: 600
# Backend scoring weights (overlap + load + health must sum to 1.0)
scoring:
overlap_weight: 0.6
load_weight: 0.3
health_weight: 0.1
# Minimum best-overlap to activate KV-aware routing (default: 0.3)
# If no backend exceeds this, falls back to configured strategy
min_overlap_threshold: 0.3
# Tier weight multipliers (independent of the three main weights)
# Applied to the raw overlap score before the weighted sum
gpu_tier_weight: 1.0 # Multiplier for GpuHot (GPU-resident) data
storage_tier_weight: 0.6 # Multiplier for StorageWarm (offloaded) data
# Tiered storage awareness (GPU hot vs. external storage warm)
storage_offloading:
enabled: false # Enable storage tier tracking (default: false)
treat_eviction_as_offload: true # Treat cache_evicted as offload to warm (default: true)
# vLLM backends to subscribe to for KV cache events
event_sources:
- backend_name: vllm-1
endpoint: "http://vllm-1:8000/v1/kv_events"
reconnect_interval_ms: 5000
- backend_name: vllm-2
endpoint: "http://vllm-2:8000/v1/kv_events"
reconnect_interval_ms: 5000
Supported endpoint schemes for event_sources[].endpoint: http, https, ws, wss.
Metrics¶
All KV cache metrics use the prefix continuum_. Label values are sanitized against allowlists to prevent cardinality explosion.
Tier 1: Prefix Routing¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_prefix_routing_requests_total | Counter | strategy | Routing decisions by strategy (prefix_hash, overflow, fallback) |
continuum_prefix_routing_backend_distribution | Gauge | backend | In-flight requests per backend under prefix routing |
continuum_prefix_routing_prefix_cardinality | Gauge | — | Approximate count of unique prefix keys seen |
Example PromQL:
# Overflow rate (CHWBL load cap activations)
rate(continuum_prefix_routing_requests_total{strategy="overflow"}[5m])
/ rate(continuum_prefix_routing_requests_total[5m])
# Per-backend load distribution
continuum_prefix_routing_backend_distribution
Tier 2: Response Cache¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_response_cache_requests_total | Counter | result | Cache lookups by result (hit, miss, skip) |
continuum_response_cache_entries | Gauge | — | Current number of cached entries |
continuum_response_cache_size_bytes | Gauge | — | Approximate memory usage in bytes |
continuum_response_cache_evictions_total | Counter | — | LRU evictions from the response cache |
continuum_response_cache_hit_rate | Gauge | — | Rolling hit rate (0.0 to 1.0) |
Example PromQL:
# Cache hit rate over 5 minutes
rate(continuum_response_cache_requests_total{result="hit"}[5m])
/ rate(continuum_response_cache_requests_total{result=~"hit|miss"}[5m])
# Cache bypass rate (non-deterministic requests)
rate(continuum_response_cache_requests_total{result="skip"}[5m])
Tier 3: Redis Backend¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_cache_backend_type | Gauge | backend | Active backend type (memory=1 or redis=1) |
continuum_cache_redis_connections_active | Gauge | — | Active (in-use) Redis connections |
continuum_cache_redis_connections_idle | Gauge | — | Idle Redis connections in pool |
continuum_cache_redis_latency_seconds | Histogram | operation | Redis operation latency (get, set, delete) |
continuum_cache_redis_errors_total | Counter | type | Redis errors (connection, timeout, other) |
continuum_cache_fallback_active | Gauge | — | 1 when in-memory fallback is active |
Example PromQL:
# Redis P99 GET latency
histogram_quantile(0.99,
rate(continuum_cache_redis_latency_seconds_bucket{operation="get"}[5m])
)
# Redis error rate
rate(continuum_cache_redis_errors_total[5m])
# Alert: fallback active
continuum_cache_fallback_active == 1
Tier 4: KV Cache Index¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_kv_event_received_total | Counter | backend | KV events received per backend |
continuum_kv_event_processed_total | Counter | backend | KV events successfully processed per backend |
continuum_kv_event_dropped_total | Counter | backend | KV events dropped due to channel backpressure |
continuum_kv_consumer_connected | Gauge | backend | 1 when consumer is connected to SSE stream |
continuum_kv_consumer_reconnects_total | Counter | backend | SSE reconnection attempts per backend |
continuum_kv_index_entries | Gauge | — | Current number of tracked (prefix, backend) pairs |
continuum_kv_index_events_total | Counter | backend, type | Index mutations (created, evicted) |
continuum_kv_index_query_latency_seconds | Histogram | — | KV index query latency in seconds |
continuum_kv_index_routing_decisions_total | Counter | decision | KV-aware routing decisions (kv_aware, fallback) |
continuum_kv_index_overlap_score | Histogram | — | Overlap scores for routed requests (0.0 to 1.0) |
continuum_kv_index_event_source_status | Gauge | backend, status | Event source connection status per backend |
Example PromQL:
# KV-aware routing activation rate
rate(continuum_kv_index_routing_decisions_total{decision="kv_aware"}[5m])
/ rate(continuum_kv_index_routing_decisions_total[5m])
# Event drop rate per backend (indicates backpressure)
rate(continuum_kv_event_dropped_total[5m])
# P50 overlap score for routed requests
histogram_quantile(0.50, rate(continuum_kv_index_overlap_score_bucket[5m]))
# Disconnected event consumers
continuum_kv_consumer_connected == 0
Admin Endpoints¶
All admin endpoints are under the /admin prefix and require authentication if configured.
Prefix Routing¶
GET /admin/prefix-routing/stats¶
Returns prefix routing statistics including routing decision counts, overflow rate, backend load distribution, and CHWBL configuration.
Example response:
{
"enabled": true,
"config": {
"max_prefix_length": 1024,
"load_factor_epsilon": 0.25,
"virtual_nodes": 150,
"anthropic_cache_control_injection": false
},
"routing_decisions": {
"total": 4926,
"prefix_hash": 4821,
"overflow": 93,
"fallback": 12,
"overflow_rate": "0.0189"
},
"backend_distribution": [
{ "backend": "vllm-1", "in_flight_requests": 4 },
{ "backend": "vllm-2", "in_flight_requests": 3 }
],
"unique_prefixes": 247
}
Response Cache¶
GET /admin/response-cache/stats¶
Returns response cache statistics: hit/miss/skip counts, hit rate, entry count, memory usage, and Redis connection info (when applicable).
Example response:
{
"enabled": true,
"backend_type": "redis",
"entries": 1243,
"capacity": 5000,
"requests": {
"hit": 8912,
"miss": 2341,
"skip": 441,
"total": 11694
},
"hit_rate": "0.7924",
"evictions": 0,
"size_bytes": 0,
"config": {
"backend": "redis",
"ttl": "30m",
"capacity": 5000,
"max_response_size": 1048576,
"max_stream_buffer_size": 10485760
},
"redis": {
"connections": {
"active": 3,
"idle": 5
},
"fallback_active": false,
"errors": {
"connection": 0,
"timeout": 2,
"other": 0
}
}
}
POST /admin/response-cache/invalidate¶
Invalidates cached responses. Accepts a JSON body:
Example response:
KV Cache Index¶
GET /admin/kv-index/stats¶
Returns KV cache index statistics: entry counts, routing decision breakdown, query latency counters, and overlap score counts.
Example response:
{
"enabled": true,
"config": {
"backend": "memory",
"max_entries": 100000,
"entry_ttl_seconds": 600,
"event_sources_count": 2,
"scoring": {
"overlap_weight": 0.6,
"load_weight": 0.3,
"health_weight": 0.1,
"min_overlap_threshold": 0.3
}
},
"index": {
"prefix_count": 312,
"entry_count": 618,
"total_hits": 12490,
"total_evictions": 83
},
"event_sources": [
{
"backend_name": "vllm-1",
"connected": true,
"events_received": 8412,
"events_dropped": 0,
"last_event_at": "2026-03-13T10:24:17Z",
"reconnect_count": 0
}
],
"routing_decisions": {
"kv_aware": 9841,
"fallback": 2649,
"total": 12490
},
"query_latency_count": 12490,
"overlap_score_count": 9841
}
GET /admin/kv-index/backends¶
Returns per-backend KV cache event statistics: events received, processed, dropped, connection status, and index event counts (created/evicted).
Example response:
{
"enabled": true,
"backends": [
{
"backend_name": "vllm-1",
"connection": {
"connected": true,
"reconnect_count": 0,
"last_event_at": "2026-03-13T10:24:17Z"
},
"events": {
"received": 8412,
"dropped": 0,
"index_created": 7981,
"index_evicted": 431
}
}
]
}
POST /admin/kv-index/clear¶
Clears all entries from the KV cache index. The index rebuilds automatically from incoming events. Intended for debugging.
Example response:
Deployment Guide¶
Tier 1 Only (Minimal Configuration)¶
Enable prefix routing for GPU KV cache locality without any external dependencies:
prefix_routing:
  enabled: true
Effective when multiple instances of the same model run across backends and system prompts are long (>128 tokens).
Tier 1 + 2 (Response Cache)¶
Add response caching to eliminate repeated deterministic requests entirely:
prefix_routing:
enabled: true
response_cache:
enabled: true
backend: memory
capacity: 5000
ttl: "10m"
Effective when the application makes repeated identical requests (e.g., document QA with fixed system prompts and fixed queries).
Tier 1 + 2 + 3 (Distributed Response Cache)¶
For multi-instance deployments, share the response cache across all router instances:
prefix_routing:
enabled: true
response_cache:
enabled: true
backend: redis
ttl: "30m"
redis:
url: "redis://redis-service:6379"
pool_size: 16
key_prefix: "cr:resp:"
fallback_to_memory: true
fallback_capacity: 2000
Redis/Valkey must be accessible from all router pods. Use rediss:// URL or tls: true for encrypted connections.
All Tiers (Full KV-Aware Routing)¶
Enable all tiers for maximum GPU cache reuse:
prefix_routing:
enabled: true
load_factor_epsilon: 0.20
response_cache:
enabled: true
backend: redis
ttl: "30m"
redis:
url: "redis://redis-service:6379"
pool_size: 16
kv_cache_index:
enabled: true
backend: redis # shares the pool from response_cache.redis
max_entries: 500000
entry_ttl_seconds: 900
scoring:
overlap_weight: 0.6
load_weight: 0.3
health_weight: 0.1
min_overlap_threshold: 0.25
gpu_tier_weight: 1.0 # GPU-resident data gets full overlap credit
storage_tier_weight: 0.6 # Offloaded data is still valuable but discounted
storage_offloading:
enabled: true # Track GPU hot vs. storage warm tiers
treat_eviction_as_offload: true
event_sources:
- backend_name: vllm-1
endpoint: "http://vllm-1:8000/v1/kv_events"
- backend_name: vllm-2
endpoint: "http://vllm-2:8000/v1/kv_events"
vLLM Requirements¶
The kv_events SSE endpoint must be enabled on each vLLM backend. vLLM exposes this endpoint when launched with:
The event stream URL is typically http://<host>:<port>/v1/kv_events.
Redis/Valkey Sizing¶
For response cache sizing, estimate:
- Average serialized response size: 2–10 KB per entry
capacity = (target_hit_rate * rps * avg_unique_rate) / eviction_frequency
For the KV index:
- Each (prefix, backend) entry consumes approximately 200 bytes in memory
- Set max_entries to at least num_unique_prefixes * num_backends * 2 for headroom
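For example, with the ~200-byte estimate above (helper names are hypothetical):

```rust
/// Recommended max_entries: unique prefixes x backends x 2 for headroom.
fn recommended_max_entries(unique_prefixes: usize, num_backends: usize) -> usize {
    unique_prefixes * num_backends * 2
}

/// Approximate in-memory footprint at ~200 bytes per (prefix, backend) entry.
fn approx_index_bytes(entries: usize) -> usize {
    entries * 200
}
```

So 1,000 unique prefixes across 4 backends suggests max_entries of 8,000, or roughly 1.6 MB of index memory.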
High Availability Considerations¶
- The response cache and KV index tolerate Redis failure via automatic in-memory fallback (Tier 3).
- The KV index rebuilds from the SSE streams on restart; no persistence is required.
- Prefix routing (Tier 1) has no external dependencies and is always available.
- Deploy Redis with replication (Sentinel or Cluster) if cache persistence across Redis restarts is required.
Performance Characteristics¶
Tier 1: Prefix Routing¶
- Prefix key extraction (SHA256): < 10 µs per request
- CHWBL ring lookup: O(log N) where N = virtual_nodes * num_backends; < 5 µs for typical deployments
- No network I/O; operates entirely in-process
Tier 2: Response Cache¶
- In-memory cache lookup: < 1 µs
- Cache key computation (SHA256): < 5 µs
- Cache hit serves the entire response without any backend latency
Tier 3: Redis Backend¶
- Redis GET latency (LAN): 0.1–2 ms typical; P99 < 5 ms
- Redis SET latency: similar to GET
- Command timeout default: 1 second; operations exceeding this activate fallback
Tier 4: KV Cache Index¶
- InMemoryKvIndex.query_backends(): < 100 µs (DashMap read, no allocation on empty result)
- RedisKvIndex.query_backends(): same as Redis GET latency (0.1–2 ms)
- KvOverlapScorer.prepare(): one query_backends() call per unique prefix per 100 ms window
- KvOverlapScorer.score(): < 1 µs (synchronous read from pre-fetched cache)
- 1000 scoring calls: < 100 ms total (verified by unit benchmark in scorer.rs)
Expected Gains¶
The following are illustrative estimates based on typical LLM workload patterns:
| Scenario | Metric | Expected Improvement |
|---|---|---|
| Long system prompt (>512 tokens), repeated across requests | Time-to-first-token | 20–40% reduction via KV cache reuse |
| Fixed document QA (same doc + same questions) | Backend requests | Up to 100% elimination via response cache |
| Multi-replica vLLM, hot prefixes | Cache hit rate (Tier 4) | 60–80% of requests routed to backend with warm cache |
| Redis failure | Service availability | No degradation; fallback to in-memory within one request |
Actual gains depend on workload prefix overlap, GPU memory capacity, and backend configuration.