KV Cache Optimization¶
Continuum Router implements a four-tier KV cache optimization system that reduces redundant computation on LLM backends. Each tier is independently configurable and the tiers compose together during backend selection.
Table of Contents¶
- Overview
- Four-Tier Caching Strategy
- Tier 1: Prefix-Aware Sticky Routing
- Tier 2: Response Cache
- Tier 3: Shared External Cache
- Tier 4: Backend KV Cache Index
- Backend Selection Pipeline
- Configuration Reference
- Metrics
- Admin Endpoints
- Deployment Guide
- Performance Characteristics
Overview¶
Modern LLM inference engines (vLLM, TensorRT-LLM, SGLang) maintain a KV cache in GPU memory that stores attention key-value tensors computed for the token prefix of a request. When the same prefix is seen again, the engine can skip recomputing those tensors — a substantial GPU time saving for long system prompts or repeated context.
Continuum Router maximizes KV cache reuse across backends through four complementary mechanisms:
- Prefix-Aware Sticky Routing — routes requests sharing the same prompt prefix to the same backend via consistent hashing, keeping GPU KV cache warm.
- Response Cache — serves repeated deterministic requests directly from router memory or Redis without hitting any backend at all.
- Shared External Cache — stores response cache state in Redis/Valkey so multiple router instances share the same cache entries.
- Backend KV Cache Index — tracks which backends actually hold GPU-resident KV tensors for recent prefixes, enabling fine-grained routing decisions informed by real cache state.
Four-Tier Caching Strategy¶
flowchart TD
Client([Client Request])
RC{Response\nCache hit?}
PRL{Prefix key\navailable?}
CHWBL[CHWBL Hash Ring\nPrefix → Backend]
KVI{KV Index\noverlap ≥ threshold?}
SCORE[Composite Score\noverlap + load + health]
FALLBACK[Default Selection\nStrategy]
BACKEND([Selected Backend])
STORE_RC[Store Response\nin Cache]
Client --> RC
RC -->|HIT| Client
RC -->|MISS| PRL
PRL -->|Yes| CHWBL
CHWBL --> KVI
KVI -->|Yes| SCORE
SCORE --> BACKEND
KVI -->|No| FALLBACK
PRL -->|No| FALLBACK
FALLBACK --> BACKEND
BACKEND --> STORE_RC
STORE_RC --> Client

The tiers are not mutually exclusive: Tiers 1 and 4 both participate in backend selection. Tier 2 intercepts the entire request before any backend is contacted. Tier 3 provides the storage substrate for Tier 2.
Tier 1: Prefix-Aware Sticky Routing¶
Tier 1 routes requests that share a common prompt prefix to the same backend, maximizing the probability that the GPU KV cache on that backend is already warm for those tokens.
Prefix Key Extraction¶
For each incoming chat completion request, the router extracts a prefix key — a 32-byte SHA256 digest that uniquely identifies the semantic anchor of the request.
The extraction logic handles both OpenAI and Anthropic request formats:
| Format | Preferred anchor | Fallback anchor |
|---|---|---|
| OpenAI | messages[].role == "system" content | First non-system message |
| Anthropic | Top-level system string or content-block array | First non-system message |
The hash is computed as:
# With system prompt:
SHA256(model_bytes ++ "\x00" ++ "S" ++ system_bytes[:max_prefix_length])
# Without system prompt (first message fallback):
SHA256(model_bytes ++ "\x00" ++ "M" ++ first_msg_bytes[:max_prefix_length])
The \x00 separator prevents concatenation-ambiguity collisions between the model name and content (without it, model "ab" with content "c" would hash identically to model "a" with content "bc"). The tag bytes S/M prevent identical text from hashing to the same value when it appears as a system prompt versus a first user message.
The max_prefix_length parameter (default: 1024 bytes) truncates the content before hashing, with UTF-8 boundary awareness to avoid splitting multibyte characters.
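As an illustrative sketch of this scheme in Rust (the standard library's DefaultHasher stands in for SHA256, which in the real router would come from a crypto crate; the function names here are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Truncate to at most `max_len` bytes without splitting a multibyte
/// UTF-8 character: back up past any continuation bytes (0b10xxxxxx).
fn truncate_utf8(bytes: &[u8], max_len: usize) -> &[u8] {
    if bytes.len() <= max_len {
        return bytes;
    }
    let mut end = max_len;
    while end > 0 && (bytes[end] & 0b1100_0000) == 0b1000_0000 {
        end -= 1;
    }
    &bytes[..end]
}

/// Hash input: model ++ \x00 ++ tag ++ truncated content.
/// `tag` is b'S' for a system prompt, b'M' for the first-message fallback.
fn prefix_key(model: &str, tag: u8, content: &str, max_prefix_length: usize) -> u64 {
    let mut input = Vec::with_capacity(model.len() + 2 + max_prefix_length);
    input.extend_from_slice(model.as_bytes());
    input.push(0x00); // separator: no ambiguity between model and content bytes
    input.push(tag);  // role tag: system vs. first-message text hash differently
    input.extend_from_slice(truncate_utf8(content.as_bytes(), max_prefix_length));
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    hasher.finish()
}
```

Note how the same text yields different keys under the S and M tags, which is exactly the collision the tag byte exists to prevent.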
Implementation: src/core/prefix_key.rs, src/core/hashing.rs
Consistent Hashing with Bounded Loads (CHWBL)¶
The PrefixAwareHash selection strategy uses a consistent hash ring to map prefix keys to backends. Simple consistent hashing can produce uneven load distribution when some prefixes are far more popular than others. Continuum Router addresses this with the Consistent Hashing with Bounded Loads (CHWBL) algorithm.
CHWBL adds a load cap: a backend can handle at most (1 + epsilon) * average_load requests simultaneously. When a backend is at its load cap, the request overflows to the next node clockwise on the ring.
The ring is populated with virtual_nodes (default: 150) virtual replicas per backend to improve key distribution uniformity.
Routing decision labels:
- prefix_hash — request was placed on the backend that owns this prefix on the hash ring
- overflow — preferred backend was at load cap, request went to the next ring node
- fallback — no prefix key was extractable, fell back to the configured selection strategy
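The bounded-load walk can be sketched as follows (hashing and ring construction are simplified, and build_ring/select are hypothetical names; the real ring hashes the 32-byte prefix key and uses 150 virtual nodes per backend):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

fn point(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Build a ring with `vnodes` virtual nodes per backend.
fn build_ring(backends: &[&str], vnodes: usize) -> BTreeMap<u64, String> {
    let mut ring = BTreeMap::new();
    for b in backends {
        for i in 0..vnodes {
            ring.insert(point(&format!("{}#{}", b, i)), b.to_string());
        }
    }
    ring
}

/// Walk clockwise from `key`, skipping backends at the load cap
/// (1 + epsilon) * average in-flight load. Returns (backend, overflowed).
fn select(
    ring: &BTreeMap<u64, String>,
    loads: &HashMap<String, usize>,
    key: u64,
    epsilon: f64,
) -> Option<(String, bool)> {
    let total: usize = loads.values().sum();
    let avg = (total as f64 / loads.len().max(1) as f64).max(1.0);
    let cap = (1.0 + epsilon) * avg;
    let mut overflowed = false;
    for (_, backend) in ring.range(key..).chain(ring.range(..key)) {
        if (*loads.get(backend).unwrap_or(&0) as f64) < cap {
            return Some((backend.clone(), overflowed));
        }
        overflowed = true; // preferred owner was at cap; spill clockwise
    }
    None
}
```

An overloaded backend is skipped deterministically: every router instance computing the same prefix key and observing similar loads spills to the same next node.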
Anthropic Cache Control Injection¶
When anthropic_cache_control_injection: true is set, the router automatically adds cache_control: { type: "ephemeral" } markers to system prompt content blocks in Anthropic API requests. This activates Anthropic's server-side prompt caching, which is distinct from the router-level KV cache but complementary to it.
Tier 2: Response Cache¶
The response cache stores complete LLM responses for deterministic requests, allowing the router to serve repeated queries without contacting any backend.
Cache Eligibility¶
A request is eligible for caching when all of the following are true:
- temperature is 0 or absent
- The request does not use streaming (or uses streaming with buffering enabled)
- The accumulated response size is within max_response_size
Requests with non-zero temperature are probabilistic and are never cached. The response header X-Cache: HIT, X-Cache: MISS, or X-Cache: BYPASS indicates the cache disposition.
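These rules reduce to a small predicate; the sketch below uses illustrative parameter names, not the router's actual types:

```rust
/// Cache disposition per the eligibility rules: only deterministic
/// (temperature 0 or absent), non-streaming-or-buffered, size-bounded
/// responses are cacheable.
fn cache_disposition(
    temperature: Option<f64>,
    streaming: bool,
    stream_buffering: bool,
    response_size: usize,
    max_response_size: usize,
) -> &'static str {
    let deterministic = temperature.unwrap_or(0.0) == 0.0;
    let stream_ok = !streaming || stream_buffering;
    if deterministic && stream_ok && response_size <= max_response_size {
        "CACHEABLE" // the lookup result then decides X-Cache: HIT vs. MISS
    } else {
        "BYPASS" // served with X-Cache: BYPASS, never stored
    }
}
```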
Cache Key Computation¶
The cache key is a SHA256 hash over all parameters that affect LLM output:
SHA256(
model,
"\x00",
SHA256(messages), // pre-hashed messages array
"\x00",
temperature_bytes,
"\x00",
SOME/NONE_tag + max_tokens_bytes,
"\x00",
SOME/NONE_tag + top_p_bytes,
"\x00",
SOME/NONE_tag + tenant_id_bytes,
)
SOME/NONE tag bytes prevent collisions between None and Some(0) for optional parameters. The tenant ID is included to provide multi-tenant isolation — tenants cannot read each other's cached responses.
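The tagging can be sketched with a hypothetical helper:

```rust
/// Append an optional u32 parameter to the hash input. A tag byte
/// distinguishes None (0x00, no payload) from Some(x) (0x01 + big-endian
/// bytes), so None and Some(0) can never produce the same input.
fn encode_opt_u32(buf: &mut Vec<u8>, value: Option<u32>) {
    match value {
        None => buf.push(0x00),
        Some(x) => {
            buf.push(0x01);
            buf.extend_from_slice(&x.to_be_bytes());
        }
    }
}
```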
Implementation: src/infrastructure/cache/response_cache.rs
Streaming Cache¶
For streaming responses, the router accumulates the SSE stream into a buffer up to max_stream_buffer_size (default: 10 MiB). If the complete stream fits within the limit, the buffer is stored as a single serialized blob and replayed as a synthetic SSE stream on cache hits.
Cache Eviction¶
The in-memory backend uses LRU eviction when the entry count reaches capacity. The Redis backend relies on TTL-based expiration managed by Redis itself.
Tier 3: Shared External Cache¶
The shared external cache provides a CacheStore trait abstraction over Redis/Valkey, allowing multiple router instances to share response cache state.
CacheStore Trait¶
pub trait CacheStore: Send + Sync + 'static {
async fn get(&self, key: &str) -> CacheStoreResult<Option<Vec<u8>>>;
async fn set(&self, key: &str, value: &[u8], ttl: Duration) -> CacheStoreResult<()>;
async fn delete(&self, key: &str) -> CacheStoreResult<()>;
async fn clear(&self) -> CacheStoreResult<()>;
async fn stats(&self) -> CacheStoreStats;
}
Implementations: InMemoryCacheStore (default, LRU + TTL), RedisCacheStore (Redis/Valkey with connection pooling).
Implementation: src/infrastructure/cache/store.rs
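A minimal synchronous sketch of the in-memory variant (the real InMemoryCacheStore is async and adds LRU eviction on top of TTL; names here are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry {
    value: Vec<u8>,
    expires_at: Instant,
}

/// TTL-only in-memory store: entries expire lazily on read.
struct MemStore {
    map: HashMap<String, Entry>,
}

impl MemStore {
    fn new() -> Self {
        MemStore { map: HashMap::new() }
    }

    fn set(&mut self, key: &str, value: &[u8], ttl: Duration) {
        let entry = Entry { value: value.to_vec(), expires_at: Instant::now() + ttl };
        self.map.insert(key.to_string(), entry);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.map
            .get(key)
            .filter(|e| Instant::now() < e.expires_at) // expired entries read as misses
            .map(|e| e.value.clone())
    }
}
```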
Redis Backend¶
RedisCacheStore uses deadpool-redis for connection pooling. All keys are namespaced with a configurable prefix (default: cr:resp:) to avoid collisions with other applications sharing the same Redis instance.
Key namespacing format: <key_prefix><cache_key> (the configured prefix followed by the response cache key).
Operations use SET EX for writes and GET for reads, with configurable command timeouts (default: 1s).
Automatic Fallback¶
When Redis is unreachable, RedisCacheStore transparently activates an in-memory fallback cache. The fallback flag is set on the first connection failure, and a background health-monitor task (running every 30 seconds) attempts to restore the Redis connection. On recovery, the flag is cleared and subsequent operations go back to Redis.
The continuum_cache_fallback_active metric (value 1) indicates that fallback mode is currently active.
Connection Pool Sharing¶
The deadpool_redis::Pool is stored as an Arc in AppState and shared between the response cache and the KV cache index (Tier 4). This avoids double-counting connections and simplifies configuration: both consumers reuse the same pool credentials.
Tier 4: Backend KV Cache Index¶
Tier 4 tracks in real time which backends hold GPU-resident KV tensors for specific token prefix hashes. This enables routing decisions based on actual GPU cache state rather than statistical affinity.
Event Consumption¶
Each vLLM backend exposes a KV cache event stream at an SSE endpoint (e.g., http://vllm-1:8000/v1/kv_events). The KvEventConsumerManager spawns a background Tokio task per backend that subscribes to this stream and processes events.
Event types:
| Event | Meaning |
|---|---|
cache_created | A KV block for a token prefix was created on this backend (data enters GPU VRAM) |
cache_evicted | A KV block was evicted from GPU memory on this backend |
cache_offloaded | A KV block was explicitly offloaded from GPU to external storage (e.g., S3-compatible storage) |
cache_reloaded | A KV block was reloaded from external storage back into GPU memory |
cache_purged | A KV block was permanently removed from all storage tiers |
Each event carries a prefix_hash (hex string) and an optional token_count indicating how many tokens are cached for that prefix on that backend.
SSE parsing details:
- The consumer preferentially uses the SSE event: field to determine the event type; the JSON event field is a fallback.
- The buffer is capped at 1 MiB (MAX_SSE_BUFFER_SIZE) to protect against malformed streams.
- On connection failure or stream end, the consumer applies exponential backoff (initial: 1s, max: 60s) before reconnecting.
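The field extraction can be sketched like this (frame splitting on blank lines and JSON decoding are omitted; parse_sse_frame is a hypothetical name):

```rust
/// Parse one SSE frame into (event_type, data). The SSE `event:` field
/// takes precedence; the caller falls back to a JSON `event` field in
/// `data` only when no `event:` line was present.
fn parse_sse_frame(frame: &str) -> (Option<String>, String) {
    let mut event = None;
    let mut data = String::new();
    for line in frame.lines() {
        if let Some(v) = line.strip_prefix("event:") {
            event = Some(v.trim().to_string());
        } else if let Some(v) = line.strip_prefix("data:") {
            if !data.is_empty() {
                data.push('\n'); // multiple data lines are joined
            }
            data.push_str(v.trim());
        }
    }
    (event, data)
}
```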
Implementation: src/infrastructure/kv_index/event_consumer.rs
Index Structure¶
Events are fed into a KvCacheIndex implementation, which maintains a mapping from each prefix_hash to the set of (backend_id, token_count) entries for backends currently holding that prefix.
The token count acts as the score: a backend holding 2048 cached tokens for a prefix ranks higher than one holding 512.
Two implementations are provided:
InMemoryKvIndex¶
- DashMap<String, PrefixEntry> for lock-free concurrent reads
- LRU eviction when entry count reaches max_entries (evicts the oldest 10% of entries)
- TTL-based expiration checked lazily on query_backends()
- Periodic cleanup_expired() removes stale entries proactively
- Defaults: 100,000 max entries, 300s TTL
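The mapping and ranking can be sketched with plain HashMaps (the real InMemoryKvIndex uses DashMap and tracks TTLs; names and types here are illustrative):

```rust
use std::collections::HashMap;

/// prefix_hash -> (backend_id -> cached token count)
struct KvIndex {
    map: HashMap<String, HashMap<String, u32>>,
}

impl KvIndex {
    fn new() -> Self {
        KvIndex { map: HashMap::new() }
    }

    fn on_created(&mut self, prefix: &str, backend: &str, tokens: u32) {
        self.map
            .entry(prefix.to_string())
            .or_default()
            .insert(backend.to_string(), tokens);
    }

    fn on_evicted(&mut self, prefix: &str, backend: &str) {
        if let Some(backends) = self.map.get_mut(prefix) {
            backends.remove(backend);
        }
    }

    /// Backends holding this prefix, ranked by descending token count.
    fn query_backends(&self, prefix: &str) -> Vec<(String, u32)> {
        let mut out: Vec<(String, u32)> = self
            .map
            .get(prefix)
            .map(|b| b.iter().map(|(k, v)| (k.clone(), *v)).collect())
            .unwrap_or_default();
        out.sort_by(|a, b| b.1.cmp(&a.1));
        out
    }
}
```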
RedisKvIndex¶
- Stores each prefix as a Redis sorted set: ZADD cr:kvidx:<prefix_hash> <token_count> <backend_id>
- EXPIRE is pipelined with ZADD in a single round-trip
- ZREVRANGEBYSCORE cr:kvidx:<prefix_hash> +inf -inf WITHSCORES returns backends ranked by descending score
- Enables sharing of KV index state across multiple router instances
- Key prefix: cr:kvidx:
Implementation: src/infrastructure/kv_index/index.rs
Storage Tier Awareness¶
When storage_offloading.enabled is true, the index tracks two storage tiers for each (prefix, backend) entry:
| Tier | Name | Description |
|---|---|---|
| Hot | GpuHot | KV data is resident in GPU VRAM. Immediate cache hit, no reload latency. |
| Warm | StorageWarm | KV data has been offloaded to external storage (e.g., S3-compatible storage). Cache hit with additional reload latency. |
Tier transitions from events:
| Event | Resulting tier |
|---|---|
cache_created | GpuHot |
cache_offloaded | StorageWarm |
cache_reloaded | GpuHot |
cache_evicted (when treat_eviction_as_offload: true) | StorageWarm |
cache_evicted (when treat_eviction_as_offload: false) | Entry removed |
cache_purged | Entry removed |
The treat_eviction_as_offload option controls whether generic cache_evicted events (which do not carry tier information) are treated as offloads to warm storage or as permanent removals. This is useful when vLLM backends emit only cache_created and cache_evicted events without the explicit cache_offloaded event type.
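The transition table reduces to a small function (illustrative types; the real index keys entries by (prefix, backend)):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    GpuHot,
    StorageWarm,
}

/// Map an event to the resulting storage tier; None means the
/// (prefix, backend) entry is removed from the index.
fn next_tier(event: &str, treat_eviction_as_offload: bool) -> Option<Tier> {
    match event {
        "cache_created" | "cache_reloaded" => Some(Tier::GpuHot),
        "cache_offloaded" => Some(Tier::StorageWarm),
        "cache_evicted" if treat_eviction_as_offload => Some(Tier::StorageWarm),
        "cache_evicted" | "cache_purged" => None,
        _ => None, // unknown events are ignored
    }
}
```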
Implementation: src/infrastructure/kv_index/types.rs
Overlap Scoring¶
The KvOverlapScorer implements the BackendScorer trait and computes a composite score for each backend:
final_score = overlap_weight * (raw_overlap * tier_multiplier)
+ load_weight * (1.0 - load_ratio)
+ health_weight * health_score
Where:
- raw_overlap = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
- tier_multiplier = gpu_tier_weight for GpuHot data, or storage_tier_weight for StorageWarm data
- load_ratio = backend_in_flight / max_in_flight_across_backends (0.0 to 1.0)
- health_score = backend_success_rate (0.0 to 1.0)
Default weights:
| Parameter | Default | Description |
|---|---|---|
overlap_weight | 0.6 | Weight for the cache overlap signal |
load_weight | 0.3 | Weight for the backend load signal |
health_weight | 0.1 | Weight for the backend health signal |
gpu_tier_weight | 1.0 | Tier multiplier for GPU-resident (GpuHot) data |
storage_tier_weight | 0.6 | Tier multiplier for storage-offloaded (StorageWarm) data |
The three main weights (overlap_weight, load_weight, health_weight) must sum to exactly 1.0 (validated at startup).
The tier weight multipliers (gpu_tier_weight, storage_tier_weight) are independent — they scale the raw overlap score before the weighted sum is computed. A StorageWarm backend with 100 cached tokens scores lower than a GpuHot backend with 100 cached tokens because the tier multiplier reduces the effective overlap signal.
Minimum overlap threshold: If the best overlap score across all backends is below min_overlap_threshold (default: 0.3), the scorer returns 0.0 for all backends and the pool falls back to its default selection strategy. This prevents routing to a sub-optimal backend when no backend has meaningful cache coverage.
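Plugging in the default weights and tier multipliers, the score computation looks like this sketch:

```rust
/// Composite score with the default weights. `gpu_hot` selects the tier
/// multiplier applied to the raw overlap before the weighted sum.
fn composite_score(
    raw_overlap: f64,  // 0.0 to 1.0
    gpu_hot: bool,
    load_ratio: f64,   // 0.0 to 1.0
    health_score: f64, // 0.0 to 1.0
) -> f64 {
    let (overlap_w, load_w, health_w) = (0.6, 0.3, 0.1); // must sum to 1.0
    let tier_multiplier = if gpu_hot { 1.0 } else { 0.6 }; // gpu vs. storage tier weight
    overlap_w * (raw_overlap * tier_multiplier)
        + load_w * (1.0 - load_ratio)
        + health_w * health_score
}
```

With full overlap, zero load, and perfect health, a GpuHot backend scores 1.0 while a StorageWarm backend scores 0.76: the tier multiplier discounts only the overlap term.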
Async preparation model: KvCacheIndex.query_backends() is async but BackendScorer.score() must be synchronous. The scorer uses a two-phase design: prepare() fetches and caches the index query result (with a 100ms internal TTL), then score() reads synchronously from the cache.
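A sketch of this two-phase pattern (simplified to a single cached result; the real scorer caches per prefix and awaits KvCacheIndex::query_backends() in prepare):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Two-phase scorer: `prepare` performs the (potentially async) index
/// lookup and caches it; `score` is a synchronous read of that cache.
struct OverlapScorer {
    cached: Option<(Instant, HashMap<String, f64>)>,
    cache_ttl: Duration,
}

impl OverlapScorer {
    fn new() -> Self {
        OverlapScorer { cached: None, cache_ttl: Duration::from_millis(100) }
    }

    fn prepare(&mut self, scores: HashMap<String, f64>) {
        self.cached = Some((Instant::now(), scores));
    }

    fn score(&self, backend: &str) -> f64 {
        match &self.cached {
            Some((at, scores)) if at.elapsed() < self.cache_ttl => {
                *scores.get(backend).unwrap_or(&0.0)
            }
            _ => 0.0, // no fresh data: neutral score, pool uses default strategy
        }
    }
}
```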
Implementation: src/infrastructure/kv_index/scorer.rs
Backend Selection Pipeline¶
When a chat completion request arrives, the router executes the following pipeline:
sequenceDiagram
participant C as Client
participant R as Router
participant RC as Response Cache
participant PK as Prefix Key
participant CHWBL as CHWBL Ring
participant KVI as KV Index
participant B as Backend
C->>R: POST /v1/chat/completions
R->>RC: Lookup cache key (T=0 only)
alt Cache HIT
RC-->>R: Cached response
R-->>C: 200 OK (X-Cache: HIT)
else Cache MISS
R->>PK: Extract prefix key
alt Prefix key available
R->>CHWBL: Map prefix → preferred backend
R->>KVI: prepare(prefix_hash)
KVI-->>R: Backend scores from index
R->>R: Score = overlap + load + health
alt Best overlap ≥ threshold
R->>B: Forward request (KV-aware routing)
else Below threshold
R->>CHWBL: Use CHWBL result (prefix routing)
R->>B: Forward request
end
else No prefix key
R->>B: Forward request (default strategy)
end
B-->>R: Response
R->>RC: Store response (T=0)
R-->>C: 200 OK (X-Cache: MISS)
end

The KV overlap scorer composes with prefix-aware routing rather than replacing it. When the KV index has data and the overlap exceeds the threshold, the overlap scorer steers the selection. When data is absent or insufficient, the CHWBL ring result is used.
Configuration Reference¶
Tier 1: Prefix-Aware Routing¶
prefix_routing:
enabled: true
# Maximum bytes of prompt content used in prefix hash (default: 1024)
max_prefix_length: 1024
# CHWBL load cap epsilon: backend can handle (1 + epsilon) * avg_load
# Range: 0.01 to 10.0 (default: 0.25)
load_factor_epsilon: 0.25
# Virtual nodes per backend on the consistent hash ring (default: 150)
# Higher values improve distribution uniformity
virtual_nodes: 150
# Inject Anthropic cache_control markers into system prompts (default: false)
anthropic_cache_control_injection: false
Tier 2: Response Cache¶
response_cache:
enabled: true
# Cache backend: "memory" (default) or "redis"
# Changing backend requires restart; other fields support hot-reload
backend: memory
# Maximum cached entries before LRU eviction (default: 1000)
capacity: 1000
# TTL for cached entries (default: "5m")
ttl: "5m"
# Maximum response body size eligible for caching (default: 1 MiB)
max_response_size: 1048576
# Maximum streaming buffer size eligible for caching (default: 10 MiB)
max_stream_buffer_size: 10485760
Tier 3: Shared External Cache (Redis backend)¶
response_cache:
enabled: true
backend: redis
redis:
# Redis/Valkey connection URL
url: "redis://redis:6379"
# rediss:// for TLS; or set tls: true with redis:// URL
# url: "rediss://redis:6380"
# Connection pool size (default: 8)
pool_size: 8
# Key namespace prefix (must not contain glob characters)
key_prefix: "cr:resp:"
# Connection timeout in milliseconds (default: 3000)
connect_timeout_ms: 3000
# Per-command timeout in milliseconds (default: 1000)
command_timeout_ms: 1000
# Fallback capacity for in-memory cache when Redis is unreachable (default: 1000)
fallback_capacity: 1000
# TTL for fallback in-memory entries in seconds (default: 300)
fallback_ttl_seconds: 300
# Fall back to in-memory cache on Redis failure (default: true)
fallback_to_memory: true
Tier 4: KV Cache Index¶
kv_cache_index:
enabled: true
# Index backend: "memory" (default) or "redis"
# When "redis", reuses the connection pool from response_cache.redis
backend: memory
# Maximum prefix hash entries tracked (default: 100000)
# Range: 100 to 10,000,000
max_entries: 100000
# TTL for index entries in seconds (default: 600)
# Range: 1 to 86400
entry_ttl_seconds: 600
# Backend scoring weights (overlap + load + health must sum to 1.0)
scoring:
overlap_weight: 0.6
load_weight: 0.3
health_weight: 0.1
# Minimum best-overlap to activate KV-aware routing (default: 0.3)
# If no backend exceeds this, falls back to configured strategy
min_overlap_threshold: 0.3
# Tier weight multipliers (independent of the three main weights)
# Applied to the raw overlap score before the weighted sum
gpu_tier_weight: 1.0 # Multiplier for GpuHot (GPU-resident) data
storage_tier_weight: 0.6 # Multiplier for StorageWarm (offloaded) data
# Tiered storage awareness (GPU hot vs. external storage warm)
storage_offloading:
enabled: false # Enable storage tier tracking (default: false)
treat_eviction_as_offload: true # Treat cache_evicted as offload to warm (default: true)
# vLLM backends to subscribe to for KV cache events
event_sources:
- backend_name: vllm-1
endpoint: "http://vllm-1:8000/v1/kv_events"
reconnect_interval_ms: 5000
- backend_name: vllm-2
endpoint: "http://vllm-2:8000/v1/kv_events"
reconnect_interval_ms: 5000
Supported endpoint schemes for event_sources[].endpoint: http, https, ws, wss.
Metrics¶
All KV cache metrics use the prefix continuum_. Label values are sanitized against allowlists to prevent cardinality explosion.
Tier 1: Prefix Routing¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_prefix_routing_requests_total | Counter | strategy | Routing decisions by strategy (prefix_hash, overflow, fallback) |
continuum_prefix_routing_backend_distribution | Gauge | backend | In-flight requests per backend under prefix routing |
continuum_prefix_routing_prefix_cardinality | Gauge | — | Approximate count of unique prefix keys seen |
Example PromQL:
# Overflow rate (CHWBL load cap activations)
rate(continuum_prefix_routing_requests_total{strategy="overflow"}[5m])
/ rate(continuum_prefix_routing_requests_total[5m])
# Per-backend load distribution
continuum_prefix_routing_backend_distribution
Tier 2: Response Cache¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_response_cache_requests_total | Counter | result | Cache lookups by result (hit, miss, skip) |
continuum_response_cache_entries | Gauge | — | Current number of cached entries |
continuum_response_cache_size_bytes | Gauge | — | Approximate memory usage in bytes |
continuum_response_cache_evictions_total | Counter | — | LRU evictions from the response cache |
continuum_response_cache_hit_rate | Gauge | — | Rolling hit rate (0.0 to 1.0) |
Example PromQL:
# Cache hit rate over 5 minutes
rate(continuum_response_cache_requests_total{result="hit"}[5m])
/ rate(continuum_response_cache_requests_total{result=~"hit|miss"}[5m])
# Cache bypass rate (non-deterministic requests)
rate(continuum_response_cache_requests_total{result="skip"}[5m])
Tier 3: Redis Backend¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_cache_backend_type | Gauge | backend | Active backend type (memory=1 or redis=1) |
continuum_cache_redis_connections_active | Gauge | — | Active (in-use) Redis connections |
continuum_cache_redis_connections_idle | Gauge | — | Idle Redis connections in pool |
continuum_cache_redis_latency_seconds | Histogram | operation | Redis operation latency (get, set, delete) |
continuum_cache_redis_errors_total | Counter | type | Redis errors (connection, timeout, other) |
continuum_cache_fallback_active | Gauge | — | 1 when in-memory fallback is active |
Example PromQL:
# Redis P99 GET latency
histogram_quantile(0.99,
rate(continuum_cache_redis_latency_seconds_bucket{operation="get"}[5m])
)
# Redis error rate
rate(continuum_cache_redis_errors_total[5m])
# Alert: fallback active
continuum_cache_fallback_active == 1
Tier 4: KV Cache Index¶
| Metric | Type | Labels | Description |
|---|---|---|---|
continuum_kv_event_received_total | Counter | backend | KV events received per backend |
continuum_kv_event_processed_total | Counter | backend | KV events successfully processed per backend |
continuum_kv_event_dropped_total | Counter | backend | KV events dropped due to channel backpressure |
continuum_kv_consumer_connected | Gauge | backend | 1 when consumer is connected to SSE stream |
continuum_kv_consumer_reconnects_total | Counter | backend | SSE reconnection attempts per backend |
continuum_kv_index_entries | Gauge | — | Current number of tracked (prefix, backend) pairs |
continuum_kv_index_events_total | Counter | backend, type | Index mutations (created, evicted) |
continuum_kv_index_query_latency_seconds | Histogram | — | KV index query latency in seconds |
continuum_kv_index_routing_decisions_total | Counter | decision | KV-aware routing decisions (kv_aware, fallback) |
continuum_kv_index_overlap_score | Histogram | — | Overlap scores for routed requests (0.0 to 1.0) |
continuum_kv_index_event_source_status | Gauge | backend, status | Event source connection status per backend |
Example PromQL:
# KV-aware routing activation rate
rate(continuum_kv_index_routing_decisions_total{decision="kv_aware"}[5m])
/ rate(continuum_kv_index_routing_decisions_total[5m])
# Event drop rate per backend (indicates backpressure)
rate(continuum_kv_event_dropped_total[5m])
# P50 overlap score for routed requests
histogram_quantile(0.50, rate(continuum_kv_index_overlap_score_bucket[5m]))
# Disconnected event consumers
continuum_kv_consumer_connected == 0
Admin Endpoints¶
All admin endpoints are under the /admin prefix and require authentication if configured.
Prefix Routing¶
GET /admin/prefix-routing/stats¶
Returns prefix routing statistics including routing decision counts, overflow rate, backend load distribution, and CHWBL configuration.
Example response:
{
"enabled": true,
"config": {
"max_prefix_length": 1024,
"load_factor_epsilon": 0.25,
"virtual_nodes": 150,
"anthropic_cache_control_injection": false
},
"routing_decisions": {
"total": 4926,
"prefix_hash": 4821,
"overflow": 93,
"fallback": 12,
"overflow_rate": "0.0189"
},
"backend_distribution": [
{ "backend": "vllm-1", "in_flight_requests": 4 },
{ "backend": "vllm-2", "in_flight_requests": 3 }
],
"unique_prefixes": 247
}
Response Cache¶
GET /admin/response-cache/stats¶
Returns response cache statistics: hit/miss/skip counts, hit rate, entry count, memory usage, and Redis connection info (when applicable).
Example response:
{
"enabled": true,
"backend_type": "redis",
"entries": 1243,
"capacity": 5000,
"requests": {
"hit": 8912,
"miss": 2341,
"skip": 441,
"total": 11694
},
"hit_rate": "0.7924",
"evictions": 0,
"size_bytes": 0,
"config": {
"backend": "redis",
"ttl": "30m",
"capacity": 5000,
"max_response_size": 1048576,
"max_stream_buffer_size": 10485760
},
"redis": {
"connections": {
"active": 3,
"idle": 5
},
"fallback_active": false,
"errors": {
"connection": 0,
"timeout": 2,
"other": 0
}
}
}
POST /admin/response-cache/invalidate¶
Invalidates cached responses. Accepts a JSON body:
Example response:
KV Cache Index¶
GET /admin/kv-index/stats¶
Returns KV cache index statistics: entry counts, routing decision breakdown, query latency counters, and overlap score counts.
Example response:
{
"enabled": true,
"config": {
"backend": "memory",
"max_entries": 100000,
"entry_ttl_seconds": 600,
"event_sources_count": 2,
"scoring": {
"overlap_weight": 0.6,
"load_weight": 0.3,
"health_weight": 0.1,
"min_overlap_threshold": 0.3
}
},
"index": {
"prefix_count": 312,
"entry_count": 618,
"total_hits": 12490,
"total_evictions": 83
},
"event_sources": [
{
"backend_name": "vllm-1",
"connected": true,
"events_received": 8412,
"events_dropped": 0,
"last_event_at": "2026-03-13T10:24:17Z",
"reconnect_count": 0
}
],
"routing_decisions": {
"kv_aware": 9841,
"fallback": 2649,
"total": 12490
},
"query_latency_count": 12490,
"overlap_score_count": 9841
}
GET /admin/kv-index/backends¶
Returns per-backend KV cache event statistics: events received, processed, dropped, connection status, and index event counts (created/evicted).
Example response:
{
"enabled": true,
"backends": [
{
"backend_name": "vllm-1",
"connection": {
"connected": true,
"reconnect_count": 0,
"last_event_at": "2026-03-13T10:24:17Z"
},
"events": {
"received": 8412,
"dropped": 0,
"index_created": 7981,
"index_evicted": 431
}
}
]
}
POST /admin/kv-index/clear¶
Clears all entries from the KV cache index. The index rebuilds automatically from incoming events. Intended for debugging.
Example response:
Deployment Guide¶
Tier 1 Only (Minimal Configuration)¶
Enable prefix routing for GPU KV cache locality without any external dependencies:
prefix_routing:
  enabled: true
Effective when multiple instances of the same model run across backends and system prompts are long (>128 tokens).
Tier 1 + 2 (Response Cache)¶
Add response caching to eliminate repeated deterministic requests entirely:
prefix_routing:
enabled: true
response_cache:
enabled: true
backend: memory
capacity: 5000
ttl: "10m"
Effective when the application makes repeated identical requests (e.g., document QA with fixed system prompts and fixed queries).
Tier 1 + 2 + 3 (Distributed Response Cache)¶
For multi-instance deployments, share the response cache across all router instances:
prefix_routing:
enabled: true
response_cache:
enabled: true
backend: redis
ttl: "30m"
redis:
url: "redis://redis-service:6379"
pool_size: 16
key_prefix: "cr:resp:"
fallback_to_memory: true
fallback_capacity: 2000
Redis/Valkey must be accessible from all router pods. Use rediss:// URL or tls: true for encrypted connections.
All Tiers (Full KV-Aware Routing)¶
Enable all tiers for maximum GPU cache reuse:
prefix_routing:
enabled: true
load_factor_epsilon: 0.20
response_cache:
enabled: true
backend: redis
ttl: "30m"
redis:
url: "redis://redis-service:6379"
pool_size: 16
kv_cache_index:
enabled: true
backend: redis # shares the pool from response_cache.redis
max_entries: 500000
entry_ttl_seconds: 900
scoring:
overlap_weight: 0.6
load_weight: 0.3
health_weight: 0.1
min_overlap_threshold: 0.25
gpu_tier_weight: 1.0 # GPU-resident data gets full overlap credit
storage_tier_weight: 0.6 # Offloaded data is still valuable but discounted
storage_offloading:
enabled: true # Track GPU hot vs. storage warm tiers
treat_eviction_as_offload: true
event_sources:
- backend_name: vllm-1
endpoint: "http://vllm-1:8000/v1/kv_events"
- backend_name: vllm-2
endpoint: "http://vllm-2:8000/v1/kv_events"
vLLM Requirements¶
The kv_events SSE endpoint must be enabled on each vLLM backend. vLLM exposes this endpoint when launched with:
The event stream URL is typically http://<host>:<port>/v1/kv_events.
Redis/Valkey Sizing¶
For response cache sizing, estimate:
- Average serialized response size: 2–10 KB per entry
capacity = (target_hit_rate * rps * avg_unique_rate) / eviction_frequency
For the KV index:
- Each (prefix, backend) entry consumes approximately 200 bytes in memory
- Set max_entries to at least num_unique_prefixes * num_backends * 2 for headroom
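For example, with the ~200-byte estimate above (helper names are hypothetical):

```rust
/// Recommended max_entries: unique prefixes x backends x 2 for headroom.
fn recommended_max_entries(unique_prefixes: usize, num_backends: usize) -> usize {
    unique_prefixes * num_backends * 2
}

/// Approximate in-memory footprint at ~200 bytes per (prefix, backend) entry.
fn approx_index_bytes(entries: usize) -> usize {
    entries * 200
}
```

So 1,000 unique prefixes across 4 backends suggests max_entries of 8,000, or roughly 1.6 MB of index memory.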
High Availability Considerations¶
- The response cache and KV index tolerate Redis failure via automatic in-memory fallback (Tier 3).
- The KV index rebuilds from the SSE streams on restart; no persistence is required.
- Prefix routing (Tier 1) has no external dependencies and is always available.
- Deploy Redis with replication (Sentinel or Cluster) if cache persistence across Redis restarts is required.
Performance Characteristics¶
Tier 1: Prefix Routing¶
- Prefix key extraction (SHA256): < 10 µs per request
- CHWBL ring lookup: O(log N) where N = virtual_nodes * num_backends; < 5 µs for typical deployments
- No network I/O; operates entirely in-process
Tier 2: Response Cache¶
- In-memory cache lookup: < 1 µs
- Cache key computation (SHA256): < 5 µs
- Cache hit serves the entire response without any backend latency
Tier 3: Redis Backend¶
- Redis GET latency (LAN): 0.1–2 ms typical; P99 < 5 ms
- Redis SET latency: similar to GET
- Command timeout default: 1 second; operations exceeding this activate fallback
Tier 4: KV Cache Index¶
- InMemoryKvIndex.query_backends(): < 100 µs (DashMap read, no allocation on empty result)
- RedisKvIndex.query_backends(): same as Redis GET latency (0.1–2 ms)
- KvOverlapScorer.prepare(): one query_backends() call per unique prefix per 100 ms window
- KvOverlapScorer.score(): < 1 µs (synchronous read from pre-fetched cache)
- 1000 scoring calls: < 100 ms total (verified by unit benchmark in scorer.rs)
Expected Gains¶
The following are illustrative estimates based on typical LLM workload patterns:
| Scenario | Metric | Expected Improvement |
|---|---|---|
| Long system prompt (>512 tokens), repeated across requests | Time-to-first-token | 20–40% reduction via KV cache reuse |
| Fixed document QA (same doc + same questions) | Backend requests | Up to 100% elimination via response cache |
| Multi-replica vLLM, hot prefixes | Cache hit rate (Tier 4) | 60–80% of requests routed to backend with warm cache |
| Redis failure | Service availability | No degradation; fallback to in-memory within one request |
Actual gains depend on workload prefix overlap, GPU memory capacity, and backend configuration.