
KV Cache Optimization

Continuum Router implements a four-tier KV cache optimization system that reduces redundant computation on LLM backends. Each tier is independently configurable, and the tiers compose during backend selection.

Overview

Modern LLM inference engines (vLLM, TensorRT-LLM, SGLang) maintain a KV cache in GPU memory that stores attention key-value tensors computed for the token prefix of a request. When the same prefix is seen again, the engine can skip recomputing those tensors — a substantial GPU time saving for long system prompts or repeated context.

Continuum Router maximizes KV cache reuse across backends through four complementary mechanisms:

  1. Prefix-Aware Sticky Routing — routes requests sharing the same prompt prefix to the same backend via consistent hashing, keeping GPU KV cache warm.
  2. Response Cache — serves repeated deterministic requests directly from router memory or Redis without hitting any backend at all.
  3. Shared External Cache — stores response cache state in Redis/Valkey so multiple router instances share the same cache entries.
  4. Backend KV Cache Index — tracks which backends actually hold GPU-resident KV tensors for recent prefixes, enabling fine-grained routing decisions informed by real cache state.

Four-Tier Caching Strategy

flowchart TD
    Client([Client Request])
    RC{Response\nCache hit?}
    PRL{Prefix key\navailable?}
    CHWBL[CHWBL Hash Ring\nPrefix → Backend]
    KVI{KV Index\noverlap ≥ threshold?}
    SCORE[Composite Score\noverlap + load + health]
    FALLBACK[Default Selection\nStrategy]
    BACKEND([Selected Backend])
    STORE_RC[Store Response\nin Cache]

    Client --> RC
    RC -->|HIT| Client
    RC -->|MISS| PRL
    PRL -->|Yes| CHWBL
    CHWBL --> KVI
    KVI -->|Yes| SCORE
    SCORE --> BACKEND
    KVI -->|No| FALLBACK
    PRL -->|No| FALLBACK
    FALLBACK --> BACKEND
    BACKEND --> STORE_RC
    STORE_RC --> Client

The tiers are not mutually exclusive: Tiers 1 and 4 both participate in backend selection. Tier 2 intercepts the entire request before any backend is contacted. Tier 3 provides the storage substrate for Tier 2.


Tier 1: Prefix-Aware Sticky Routing

Tier 1 routes requests that share a common prompt prefix to the same backend, maximizing the probability that the GPU KV cache on that backend is already warm for those tokens.

Prefix Key Extraction

For each incoming chat completion request, the router extracts a prefix key — a 32-byte SHA256 digest that uniquely identifies the semantic anchor of the request.

The extraction logic handles both OpenAI and Anthropic request formats:

| Format    | Preferred anchor                                 | Fallback anchor          |
|-----------|--------------------------------------------------|--------------------------|
| OpenAI    | messages[].role == "system" content              | First non-system message |
| Anthropic | Top-level system string or content-block array   | First non-system message |

The hash is computed as:

# With system prompt:
SHA256(model_bytes ++ "\x00" ++ "S" ++ system_bytes[:max_prefix_length])

# Without system prompt (first message fallback):
SHA256(model_bytes ++ "\x00" ++ "M" ++ first_msg_bytes[:max_prefix_length])

The \x00 separator prevents boundary-shifting collisions between the model name and content (without it, model "a" with content "bc" and model "ab" with content "c" would hash identically). The tag bytes S/M prevent identical text from hashing to the same value when it appears as a system prompt versus as a first user message.

The max_prefix_length parameter (default: 1024 bytes) truncates the content before hashing, with UTF-8 boundary awareness to avoid splitting multibyte characters.
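The boundary-aware truncation can be sketched with std alone; `truncate_utf8` is a hypothetical helper name, not the router's actual function:

```rust
/// Truncate `content` to at most `max_len` bytes without splitting a
/// multibyte UTF-8 character (illustrative sketch of the behavior
/// described above; the real helper lives in src/core/prefix_key.rs).
fn truncate_utf8(content: &str, max_len: usize) -> &str {
    if content.len() <= max_len {
        return content;
    }
    // Walk back from max_len until we land on a char boundary.
    let mut end = max_len;
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[..end]
}

fn main() {
    // "héllo" is 6 bytes: h(1) é(2) l(1) l(1) o(1); cutting at byte 2
    // would split 'é', so the cut moves back to the previous boundary.
    assert_eq!(truncate_utf8("héllo", 2), "h");
    assert_eq!(truncate_utf8("héllo", 3), "hé");
    assert_eq!(truncate_utf8("héllo", 10), "héllo");
    println!("ok");
}
```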

Implementation: src/core/prefix_key.rs, src/core/hashing.rs

Consistent Hashing with Bounded Loads (CHWBL)

The PrefixAwareHash selection strategy uses a consistent hash ring to map prefix keys to backends. Simple consistent hashing can produce uneven load distribution when some prefixes are far more popular than others. Continuum Router addresses this with the Consistent Hashing with Bounded Loads (CHWBL) algorithm.

CHWBL adds a load cap: a backend can handle at most (1 + epsilon) * average_load requests simultaneously. When a backend is at its load cap, the request overflows to the next node clockwise on the ring.

epsilon = 0.25  →  a backend can be at most 25% above average load

The ring is populated with virtual_nodes (default: 150) virtual replicas per backend to improve key distribution uniformity.

Routing decision labels:

  • prefix_hash — request was placed on the backend that owns this prefix on the hash ring
  • overflow — preferred backend was at load cap, request went to the next ring node
  • fallback — no prefix key was extractable, fell back to the configured selection strategy
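The ring construction and bounded-load overflow described above can be sketched in std-only Rust. This is a simplified model, not the router's implementation: `DefaultHasher` stands in for the production hash function, and the `Ring` type and its methods are illustrative names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

// Minimal CHWBL sketch: virtual nodes on a BTreeMap ring, and a load
// cap of ceil((1 + epsilon) * (total_in_flight + 1) / num_backends).
struct Ring {
    ring: BTreeMap<u64, String>,  // ring point -> backend
    load: HashMap<String, usize>, // backend -> in-flight count
    epsilon: f64,
}

fn hash<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

impl Ring {
    fn new(backends: &[&str], virtual_nodes: u32, epsilon: f64) -> Self {
        let mut ring = BTreeMap::new();
        for b in backends {
            for v in 0..virtual_nodes {
                ring.insert(hash(&format!("{b}#{v}")), b.to_string());
            }
        }
        let load = backends.iter().map(|b| (b.to_string(), 0)).collect();
        Ring { ring, load, epsilon }
    }

    /// Pick a backend for `prefix_key`, walking clockwise past any
    /// backend that is already at its load cap (the "overflow" case).
    fn select(&mut self, prefix_key: &str) -> String {
        let total: usize = self.load.values().sum();
        let avg = (total as f64 + 1.0) / self.load.len() as f64;
        let cap = ((1.0 + self.epsilon) * avg).ceil() as usize;
        let point = hash(&prefix_key);
        let candidates = self.ring.range(point..).chain(self.ring.range(..point));
        for (_, backend) in candidates {
            if self.load[backend] < cap {
                let b = backend.clone();
                *self.load.get_mut(&b).unwrap() += 1;
                return b;
            }
        }
        unreachable!("some backend is always below the cap");
    }
}

fn main() {
    let mut ring = Ring::new(&["vllm-1", "vllm-2"], 150, 0.25);
    let b1 = ring.select("prefix-A");
    let b2 = ring.select("prefix-A");
    // The same prefix sticks to the same backend while it is under cap.
    assert_eq!(b1, b2);
    println!("ok");
}
```

Because requests for the same prefix hash to the same ring point, they stick to one backend until that backend exceeds the cap, at which point the walk continues clockwise.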

Anthropic Cache Control Injection

When anthropic_cache_control_injection: true is set, the router automatically adds cache_control: { type: "ephemeral" } markers to system prompt content blocks in Anthropic API requests. This activates Anthropic's server-side prompt caching, which is distinct from the router-level KV cache but complementary to it.
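Illustratively, with injection enabled a system content block would carry the marker like this (field names follow Anthropic's Messages API; exact placement in the router's output is not shown in this document):

```
{
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant...",
      "cache_control": { "type": "ephemeral" }
    }
  ]
}
```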


Tier 2: Response Cache

The response cache stores complete LLM responses for deterministic requests, allowing the router to serve repeated queries without contacting any backend.

Cache Eligibility

A request is eligible for caching when all of the following are true:

  • temperature is 0 or absent
  • The request does not use streaming (or uses streaming with buffering enabled)
  • The accumulated response size is within max_response_size

Requests with non-zero temperature are probabilistic and are never cached. The response header X-Cache: HIT, X-Cache: MISS, or X-Cache: BYPASS indicates the cache disposition.

Cache Key Computation

The cache key is a SHA256 hash over all parameters that affect LLM output:

SHA256(
    model,
    "\x00",
    SHA256(messages),   // pre-hashed messages array
    "\x00",
    temperature_bytes,
    "\x00",
    SOME/NONE_tag + max_tokens_bytes,
    "\x00",
    SOME/NONE_tag + top_p_bytes,
    "\x00",
    SOME/NONE_tag + tenant_id_bytes,
)

SOME/NONE tag bytes prevent collisions between None and Some(0) for optional parameters. The tenant ID is included to provide multi-tenant isolation — tenants cannot read each other's cached responses.
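The tag-byte idea can be illustrated with the std hasher standing in for SHA-256 (an assumption for brevity; the real cache key in response_cache.rs is a SHA-256 digest, and `feed_opt`/`key` are hypothetical names):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Feed an optional parameter with a SOME/NONE tag byte so that None
// and Some(0) produce different input streams to the hasher.
fn feed_opt(h: &mut DefaultHasher, v: Option<u32>) {
    match v {
        Some(x) => {
            h.write_u8(1); // SOME tag
            h.write_u32(x);
        }
        None => h.write_u8(0), // NONE tag, no value bytes
    }
}

fn key(model: &str, max_tokens: Option<u32>) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(model.as_bytes());
    h.write_u8(0); // \x00 field separator
    feed_opt(&mut h, max_tokens);
    h.finish()
}

fn main() {
    // Without the tag byte, None and Some(0) could collide; with it,
    // the input streams differ and so do the keys.
    assert_ne!(key("m", None), key("m", Some(0)));
    assert_eq!(key("m", Some(7)), key("m", Some(7)));
    println!("ok");
}
```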

Implementation: src/infrastructure/cache/response_cache.rs

Streaming Cache

For streaming responses, the router accumulates the SSE stream into a buffer up to max_stream_buffer_size (default: 10 MiB). If the complete stream fits within the limit, the buffer is stored as a single serialized blob and replayed as a synthetic SSE stream on cache hits.

Cache Eviction

The in-memory backend uses LRU eviction when the entry count reaches capacity. The Redis backend relies on TTL-based expiration managed by Redis itself.


Tier 3: Shared External Cache

The shared external cache provides a CacheStore trait abstraction over Redis/Valkey, allowing multiple router instances to share response cache state.

CacheStore Trait

pub trait CacheStore: Send + Sync + 'static {
    async fn get(&self, key: &str) -> CacheStoreResult<Option<Vec<u8>>>;
    async fn set(&self, key: &str, value: &[u8], ttl: Duration) -> CacheStoreResult<()>;
    async fn delete(&self, key: &str) -> CacheStoreResult<()>;
    async fn clear(&self) -> CacheStoreResult<()>;
    async fn stats(&self) -> CacheStoreStats;
}

Implementations: InMemoryCacheStore (default, LRU + TTL), RedisCacheStore (Redis/Valkey with connection pooling).

Implementation: src/infrastructure/cache/store.rs

Redis Backend

RedisCacheStore uses deadpool-redis for connection pooling. All keys are namespaced with a configurable prefix (default: cr:resp:) to avoid collisions with other applications sharing the same Redis instance.

Key namespacing format:

cr:resp:<sha256-cache-key-hex>

Operations use SET EX for writes and GET for reads, with configurable command timeouts (default: 1s).

Automatic Fallback

When Redis is unreachable, RedisCacheStore transparently activates an in-memory fallback cache. The fallback is activated on the first connection failure, and a background health-monitor task (running every 30 seconds) attempts to restore the Redis connection. On recovery, the fallback flag is cleared and subsequent operations go back to Redis.

The continuum_cache_fallback_active metric (value 1) indicates that fallback mode is currently active.

Connection Pool Sharing

The deadpool_redis::Pool is stored as an Arc in AppState and shared between the response cache and the KV cache index (Tier 4). This avoids double-counting connections and simplifies configuration: both consumers reuse the same pool credentials.


Tier 4: Backend KV Cache Index

Tier 4 tracks in real time which backends hold GPU-resident KV tensors for specific token prefix hashes. This enables routing decisions based on actual GPU cache state rather than statistical affinity.

Event Consumption

Each vLLM backend exposes a KV cache event stream at an SSE endpoint (e.g., http://vllm-1:8000/v1/kv_events). The KvEventConsumerManager spawns a background Tokio task per backend that subscribes to this stream and processes events.

Event types:

| Event | Meaning |
|---|---|
| cache_created | A KV block for a token prefix was created on this backend (data enters GPU VRAM) |
| cache_evicted | A KV block was evicted from GPU memory on this backend |
| cache_offloaded | A KV block was explicitly offloaded from GPU to external storage (e.g., S3-compatible storage) |
| cache_reloaded | A KV block was reloaded from external storage back into GPU memory |
| cache_purged | A KV block was permanently removed from all storage tiers |

Each event carries a prefix_hash (hex string) and an optional token_count indicating how many tokens are cached for that prefix on that backend.

SSE parsing details:

  • The consumer preferentially uses the SSE event: field to determine event type; the JSON event field is a fallback.
  • The buffer is capped at 1 MiB (MAX_SSE_BUFFER_SIZE) to protect against malformed streams.
  • On connection failure or stream end, the consumer applies exponential backoff (initial: 1s, max: 60s) before reconnecting.
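The framing rules can be sketched as a minimal SSE frame parser (std-only; the real consumer in event_consumer.rs works on a byte stream, and `parse_sse_frame` is a hypothetical name):

```rust
/// Parse one SSE frame (the lines between two blank lines) into its
/// optional `event:` type and any `data:` payload lines.
fn parse_sse_frame(frame: &str) -> (Option<&str>, Vec<&str>) {
    let mut event = None;
    let mut data = Vec::new();
    for line in frame.lines() {
        if let Some(v) = line.strip_prefix("event:") {
            event = Some(v.trim());
        } else if let Some(v) = line.strip_prefix("data:") {
            data.push(v.trim());
        }
        // Comment lines (starting with ':') and unknown fields are ignored.
    }
    (event, data)
}

fn main() {
    let frame = "event: cache_created\ndata: {\"prefix_hash\":\"ab12\",\"token_count\":512}";
    let (event, data) = parse_sse_frame(frame);
    assert_eq!(event, Some("cache_created"));
    assert_eq!(data.len(), 1);
    println!("ok");
}
```

When the `event:` field is present it wins, matching the preference described above; if it is absent, the consumer falls back to the `event` field inside the JSON payload.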

Implementation: src/infrastructure/kv_index/event_consumer.rs

Index Structure

Events are fed into a KvCacheIndex implementation, which maintains a mapping:

prefix_hash → {backend_id → token_count}

The token count acts as the score: a backend holding 2048 cached tokens for a prefix ranks higher than one holding 512.

Two implementations are provided:

InMemoryKvIndex

  • DashMap<String, PrefixEntry> for lock-free concurrent reads
  • LRU eviction when entry count reaches max_entries (evicts the oldest 10% of entries)
  • TTL-based expiration checked lazily on query_backends()
  • Periodic cleanup_expired() removes stale entries proactively
  • Default: 100,000 max entries, 300s TTL
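The core mapping with lazy TTL expiry can be sketched with std collections. This is a simplified single-threaded model (the production InMemoryKvIndex uses DashMap plus LRU eviction; `KvIndex` and its methods here are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry {
    backends: HashMap<String, u32>, // backend_id -> cached token count
    updated_at: Instant,
}

struct KvIndex {
    entries: HashMap<String, Entry>, // prefix_hash -> entry
    ttl: Duration,
}

impl KvIndex {
    fn new(ttl: Duration) -> Self {
        KvIndex { entries: HashMap::new(), ttl }
    }

    fn record_created(&mut self, prefix: &str, backend: &str, tokens: u32) {
        let e = self.entries.entry(prefix.to_string()).or_insert_with(|| Entry {
            backends: HashMap::new(),
            updated_at: Instant::now(),
        });
        e.backends.insert(backend.to_string(), tokens);
        e.updated_at = Instant::now();
    }

    /// Backends ranked by descending token count; expired entries are
    /// dropped lazily here rather than by a background sweep.
    fn query_backends(&mut self, prefix: &str) -> Vec<(String, u32)> {
        let expired = match self.entries.get(prefix) {
            Some(e) if e.updated_at.elapsed() > self.ttl => true,
            Some(e) => {
                let mut ranked: Vec<(String, u32)> =
                    e.backends.iter().map(|(b, t)| (b.clone(), *t)).collect();
                ranked.sort_by(|a, b| b.1.cmp(&a.1));
                return ranked;
            }
            None => return Vec::new(),
        };
        if expired {
            self.entries.remove(prefix);
        }
        Vec::new()
    }
}

fn main() {
    let mut idx = KvIndex::new(Duration::from_secs(300));
    idx.record_created("ab12", "vllm-1", 2048);
    idx.record_created("ab12", "vllm-2", 512);
    let ranked = idx.query_backends("ab12");
    // The backend holding more cached tokens ranks first.
    assert_eq!(ranked[0], ("vllm-1".to_string(), 2048));
    println!("ok");
}
```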

RedisKvIndex

  • Stores each prefix as a Redis sorted set: ZADD cr:kvidx:<prefix_hash> <token_count> <backend_id>
  • EXPIRE is pipelined with ZADD in a single atomic round-trip
  • ZREVRANGEBYSCORE +inf -inf WITHSCORES returns backends ranked by descending score
  • Enables sharing of KV index state across multiple router instances
  • Key prefix: cr:kvidx:

Implementation: src/infrastructure/kv_index/index.rs

Storage Tier Awareness

When storage_offloading.enabled is true, the index tracks two storage tiers for each (prefix, backend) entry:

| Tier | Name | Description |
|---|---|---|
| Hot | GpuHot | KV data is resident in GPU VRAM. Immediate cache hit, no reload latency. |
| Warm | StorageWarm | KV data has been offloaded to external storage (e.g., S3-compatible storage). Cache hit with additional reload latency. |

Tier transitions from events:

| Event | Resulting tier |
|---|---|
| cache_created | GpuHot |
| cache_offloaded | StorageWarm |
| cache_reloaded | GpuHot |
| cache_evicted (when treat_eviction_as_offload: true) | StorageWarm |
| cache_evicted (when treat_eviction_as_offload: false) | Entry removed |
| cache_purged | Entry removed |

The treat_eviction_as_offload option controls whether generic cache_evicted events (which do not carry tier information) are treated as offloads to warm storage or as permanent removals. This is useful when vLLM backends emit only cache_created and cache_evicted events without the explicit cache_offloaded event type.
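The transition table above amounts to a small state function. A sketch (the `Tier` enum mirrors the documented names; `apply_event` is a hypothetical helper):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    GpuHot,
    StorageWarm,
}

/// Apply one KV event to the current tier of a (prefix, backend) entry.
/// `None` means the entry is absent/removed.
fn apply_event(current: Option<Tier>, event: &str, treat_eviction_as_offload: bool) -> Option<Tier> {
    match event {
        "cache_created" | "cache_reloaded" => Some(Tier::GpuHot),
        "cache_offloaded" => Some(Tier::StorageWarm),
        "cache_evicted" if treat_eviction_as_offload => Some(Tier::StorageWarm),
        "cache_evicted" | "cache_purged" => None, // entry removed
        _ => current, // unknown events leave the entry untouched
    }
}

fn main() {
    let t = apply_event(None, "cache_created", true);
    assert_eq!(t, Some(Tier::GpuHot));
    let t = apply_event(t, "cache_evicted", true); // treated as offload
    assert_eq!(t, Some(Tier::StorageWarm));
    let t = apply_event(t, "cache_evicted", false); // treated as removal
    assert_eq!(t, None);
    println!("ok");
}
```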

Implementation: src/infrastructure/kv_index/types.rs

Overlap Scoring

The KvOverlapScorer implements the BackendScorer trait and computes a composite score for each backend:

final_score = overlap_weight   * (raw_overlap * tier_multiplier)
            + load_weight      * (1.0 - load_ratio)
            + health_weight    * health_score

Where:

  • raw_overlap = backend_token_count / max_token_count_across_backends (0.0 to 1.0)
  • tier_multiplier = gpu_tier_weight for GpuHot data, or storage_tier_weight for StorageWarm data
  • load_ratio = backend_in_flight / max_in_flight_across_backends (0.0 to 1.0)
  • health_score = backend_success_rate (0.0 to 1.0)

Default weights:

| Parameter | Default | Description |
|---|---|---|
| overlap_weight | 0.6 | Weight for the cache overlap signal |
| load_weight | 0.3 | Weight for the backend load signal |
| health_weight | 0.1 | Weight for the backend health signal |
| gpu_tier_weight | 1.0 | Tier multiplier for GPU-resident (GpuHot) data |
| storage_tier_weight | 0.6 | Tier multiplier for storage-offloaded (StorageWarm) data |

The three main weights (overlap_weight, load_weight, health_weight) must sum to exactly 1.0 (validated at startup).

The tier weight multipliers (gpu_tier_weight, storage_tier_weight) are independent — they scale the raw overlap score before the weighted sum is computed. A StorageWarm backend with 100 cached tokens scores lower than a GpuHot backend with 100 cached tokens because the tier multiplier reduces the effective overlap signal.

Minimum overlap threshold: If the best overlap score across all backends is below min_overlap_threshold (default: 0.3), the scorer returns 0.0 for all backends and the pool falls back to its default selection strategy. This prevents routing to a sub-optimal backend when no backend has meaningful cache coverage.
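The formula and tier discount can be checked numerically with the documented default weights (the input values below are illustrative, and `composite_score` is a sketch, not the scorer's actual signature):

```rust
struct Weights {
    overlap: f64,
    load: f64,
    health: f64,
}

/// Composite score following the formula above.
fn composite_score(
    raw_overlap: f64,     // backend_token_count / max_token_count
    tier_multiplier: f64, // gpu_tier_weight or storage_tier_weight
    load_ratio: f64,      // backend_in_flight / max_in_flight
    health_score: f64,    // backend success rate
    w: &Weights,
) -> f64 {
    w.overlap * (raw_overlap * tier_multiplier)
        + w.load * (1.0 - load_ratio)
        + w.health * health_score
}

fn main() {
    let w = Weights { overlap: 0.6, load: 0.3, health: 0.1 };
    // GpuHot backend: full overlap, half-loaded, perfectly healthy.
    let hot = composite_score(1.0, 1.0, 0.5, 1.0, &w);
    // Same backend state, but the data sits in warm storage (tier 0.6).
    let warm = composite_score(1.0, 0.6, 0.5, 1.0, &w);
    assert!((hot - 0.85).abs() < 1e-9);
    assert!((warm - 0.61).abs() < 1e-9);
    assert!(hot > warm); // offloaded data is discounted, as described
    println!("hot={hot:.2} warm={warm:.2}");
}
```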

Async preparation model: KvCacheIndex.query_backends() is async but BackendScorer.score() must be synchronous. The scorer uses a two-phase design: prepare() fetches and caches the index query result (with a 100ms internal TTL), then score() reads synchronously from the cache.

Implementation: src/infrastructure/kv_index/scorer.rs


Backend Selection Pipeline

When a chat completion request arrives, the router executes the following pipeline:

sequenceDiagram
    participant C as Client
    participant R as Router
    participant RC as Response Cache
    participant PK as Prefix Key
    participant CHWBL as CHWBL Ring
    participant KVI as KV Index
    participant B as Backend

    C->>R: POST /v1/chat/completions
    R->>RC: Lookup cache key (T=0 only)
    alt Cache HIT
        RC-->>R: Cached response
        R-->>C: 200 OK (X-Cache: HIT)
    else Cache MISS
        R->>PK: Extract prefix key
        alt Prefix key available
            R->>CHWBL: Map prefix → preferred backend
            R->>KVI: prepare(prefix_hash)
            KVI-->>R: Backend scores from index
            R->>R: Score = overlap + load + health
            alt Best overlap ≥ threshold
                R->>B: Forward request (KV-aware routing)
            else Below threshold
                R->>CHWBL: Use CHWBL result (prefix routing)
                R->>B: Forward request
            end
        else No prefix key
            R->>B: Forward request (default strategy)
        end
        B-->>R: Response
        R->>RC: Store response (T=0)
        R-->>C: 200 OK (X-Cache: MISS)
    end

The KV overlap scorer composes with prefix-aware routing rather than replacing it. When the KV index has data and the overlap exceeds the threshold, the overlap scorer steers the selection. When data is absent or insufficient, the CHWBL ring result is used.


Configuration Reference

Tier 1: Prefix-Aware Routing

prefix_routing:
  enabled: true

  # Maximum bytes of prompt content used in prefix hash (default: 1024)
  max_prefix_length: 1024

  # CHWBL load cap epsilon: backend can handle (1 + epsilon) * avg_load
  # Range: 0.01 to 10.0 (default: 0.25)
  load_factor_epsilon: 0.25

  # Virtual nodes per backend on the consistent hash ring (default: 150)
  # Higher values improve distribution uniformity
  virtual_nodes: 150

  # Inject Anthropic cache_control markers into system prompts (default: false)
  anthropic_cache_control_injection: false

Tier 2: Response Cache

response_cache:
  enabled: true

  # Cache backend: "memory" (default) or "redis"
  # Changing backend requires restart; other fields support hot-reload
  backend: memory

  # Maximum cached entries before LRU eviction (default: 1000)
  capacity: 1000

  # TTL for cached entries (default: "5m")
  ttl: "5m"

  # Maximum response body size eligible for caching (default: 1 MiB)
  max_response_size: 1048576

  # Maximum streaming buffer size eligible for caching (default: 10 MiB)
  max_stream_buffer_size: 10485760

Tier 3: Shared External Cache (Redis backend)

response_cache:
  enabled: true
  backend: redis

  redis:
    # Redis/Valkey connection URL
    url: "redis://redis:6379"

    # rediss:// for TLS; or set tls: true with redis:// URL
    # url: "rediss://redis:6380"

    # Connection pool size (default: 8)
    pool_size: 8

    # Key namespace prefix (must not contain glob characters)
    key_prefix: "cr:resp:"

    # Connection timeout in milliseconds (default: 3000)
    connect_timeout_ms: 3000

    # Per-command timeout in milliseconds (default: 1000)
    command_timeout_ms: 1000

    # Fallback capacity for in-memory cache when Redis is unreachable (default: 1000)
    fallback_capacity: 1000

    # TTL for fallback in-memory entries in seconds (default: 300)
    fallback_ttl_seconds: 300

    # Fall back to in-memory cache on Redis failure (default: true)
    fallback_to_memory: true

Tier 4: KV Cache Index

kv_cache_index:
  enabled: true

  # Index backend: "memory" (default) or "redis"
  # When "redis", reuses the connection pool from response_cache.redis
  backend: memory

  # Maximum prefix hash entries tracked (default: 100000)
  # Range: 100 to 10,000,000
  max_entries: 100000

  # TTL for index entries in seconds (default: 600)
  # Range: 1 to 86400
  entry_ttl_seconds: 600

  # Backend scoring weights (overlap + load + health must sum to 1.0)
  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1

    # Minimum best-overlap to activate KV-aware routing (default: 0.3)
    # If no backend exceeds this, falls back to configured strategy
    min_overlap_threshold: 0.3

    # Tier weight multipliers (independent of the three main weights)
    # Applied to the raw overlap score before the weighted sum
    gpu_tier_weight: 1.0       # Multiplier for GpuHot (GPU-resident) data
    storage_tier_weight: 0.6   # Multiplier for StorageWarm (offloaded) data

  # Tiered storage awareness (GPU hot vs. external storage warm)
  storage_offloading:
    enabled: false               # Enable storage tier tracking (default: false)
    treat_eviction_as_offload: true  # Treat cache_evicted as offload to warm (default: true)

  # vLLM backends to subscribe to for KV cache events
  event_sources:
    - backend_name: vllm-1
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000

    - backend_name: vllm-2
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000

Supported endpoint schemes for event_sources[].endpoint: http, https, ws, wss.


Metrics

All KV cache metrics use the prefix continuum_. Label values are sanitized against allowlists to prevent cardinality explosion.

Tier 1: Prefix Routing

| Metric | Type | Labels | Description |
|---|---|---|---|
| continuum_prefix_routing_requests_total | Counter | strategy | Routing decisions by strategy (prefix_hash, overflow, fallback) |
| continuum_prefix_routing_backend_distribution | Gauge | backend | In-flight requests per backend under prefix routing |
| continuum_prefix_routing_prefix_cardinality | Gauge | (none) | Approximate count of unique prefix keys seen |

Example PromQL:

# Overflow rate (CHWBL load cap activations)
rate(continuum_prefix_routing_requests_total{strategy="overflow"}[5m])
/ rate(continuum_prefix_routing_requests_total[5m])

# Per-backend load distribution
continuum_prefix_routing_backend_distribution

Tier 2: Response Cache

| Metric | Type | Labels | Description |
|---|---|---|---|
| continuum_response_cache_requests_total | Counter | result | Cache lookups by result (hit, miss, skip) |
| continuum_response_cache_entries | Gauge | (none) | Current number of cached entries |
| continuum_response_cache_size_bytes | Gauge | (none) | Approximate memory usage in bytes |
| continuum_response_cache_evictions_total | Counter | (none) | LRU evictions from the response cache |
| continuum_response_cache_hit_rate | Gauge | (none) | Rolling hit rate (0.0 to 1.0) |

Example PromQL:

# Cache hit rate over 5 minutes
rate(continuum_response_cache_requests_total{result="hit"}[5m])
/ rate(continuum_response_cache_requests_total{result=~"hit|miss"}[5m])

# Cache bypass rate (non-deterministic requests)
rate(continuum_response_cache_requests_total{result="skip"}[5m])

Tier 3: Redis Backend

| Metric | Type | Labels | Description |
|---|---|---|---|
| continuum_cache_backend_type | Gauge | backend | Active backend type (memory=1 or redis=1) |
| continuum_cache_redis_connections_active | Gauge | (none) | Active (in-use) Redis connections |
| continuum_cache_redis_connections_idle | Gauge | (none) | Idle Redis connections in pool |
| continuum_cache_redis_latency_seconds | Histogram | operation | Redis operation latency (get, set, delete) |
| continuum_cache_redis_errors_total | Counter | type | Redis errors (connection, timeout, other) |
| continuum_cache_fallback_active | Gauge | (none) | 1 when in-memory fallback is active |

Example PromQL:

# Redis P99 GET latency
histogram_quantile(0.99,
  rate(continuum_cache_redis_latency_seconds_bucket{operation="get"}[5m])
)

# Redis error rate
rate(continuum_cache_redis_errors_total[5m])

# Alert: fallback active
continuum_cache_fallback_active == 1

Tier 4: KV Cache Index

| Metric | Type | Labels | Description |
|---|---|---|---|
| continuum_kv_event_received_total | Counter | backend | KV events received per backend |
| continuum_kv_event_processed_total | Counter | backend | KV events successfully processed per backend |
| continuum_kv_event_dropped_total | Counter | backend | KV events dropped due to channel backpressure |
| continuum_kv_consumer_connected | Gauge | backend | 1 when consumer is connected to SSE stream |
| continuum_kv_consumer_reconnects_total | Counter | backend | SSE reconnection attempts per backend |
| continuum_kv_index_entries | Gauge | (none) | Current number of tracked (prefix, backend) pairs |
| continuum_kv_index_events_total | Counter | backend, type | Index mutations (created, evicted) |
| continuum_kv_index_query_latency_seconds | Histogram | (none) | KV index query latency in seconds |
| continuum_kv_index_routing_decisions_total | Counter | decision | KV-aware routing decisions (kv_aware, fallback) |
| continuum_kv_index_overlap_score | Histogram | (none) | Overlap scores for routed requests (0.0 to 1.0) |
| continuum_kv_index_event_source_status | Gauge | backend, status | Event source connection status per backend |

Example PromQL:

# KV-aware routing activation rate
rate(continuum_kv_index_routing_decisions_total{decision="kv_aware"}[5m])
/ rate(continuum_kv_index_routing_decisions_total[5m])

# Event drop rate per backend (indicates backpressure)
rate(continuum_kv_event_dropped_total[5m])

# P50 overlap score for routed requests
histogram_quantile(0.50, rate(continuum_kv_index_overlap_score_bucket[5m]))

# Disconnected event consumers
continuum_kv_consumer_connected == 0

Admin Endpoints

All admin endpoints are under the /admin prefix and require authentication if configured.

Prefix Routing

GET /admin/prefix-routing/stats

Returns prefix routing statistics including routing decision counts, overflow rate, backend load distribution, and CHWBL configuration.

Example response:

{
  "enabled": true,
  "config": {
    "max_prefix_length": 1024,
    "load_factor_epsilon": 0.25,
    "virtual_nodes": 150,
    "anthropic_cache_control_injection": false
  },
  "routing_decisions": {
    "total": 4926,
    "prefix_hash": 4821,
    "overflow": 93,
    "fallback": 12,
    "overflow_rate": "0.0189"
  },
  "backend_distribution": [
    { "backend": "vllm-1", "in_flight_requests": 4 },
    { "backend": "vllm-2", "in_flight_requests": 3 }
  ],
  "unique_prefixes": 247
}

Response Cache

GET /admin/response-cache/stats

Returns response cache statistics: hit/miss/skip counts, hit rate, entry count, memory usage, and Redis connection info (when applicable).

Example response:

{
  "enabled": true,
  "backend_type": "redis",
  "entries": 1243,
  "capacity": 5000,
  "requests": {
    "hit": 8912,
    "miss": 2341,
    "skip": 441,
    "total": 11694
  },
  "hit_rate": "0.7924",
  "evictions": 0,
  "size_bytes": 0,
  "config": {
    "backend": "redis",
    "ttl": "30m",
    "capacity": 5000,
    "max_response_size": 1048576,
    "max_stream_buffer_size": 10485760
  },
  "redis": {
    "connections": {
      "active": 3,
      "idle": 5
    },
    "fallback_active": false,
    "errors": {
      "connection": 0,
      "timeout": 2,
      "other": 0
    }
  }
}

POST /admin/response-cache/invalidate

Invalidates cached responses. Accepts a JSON body:

{ "clear_all": true }

Example response:

{
  "success": true,
  "action": "clear_all",
  "cleared_entries": 1243
}

KV Cache Index

GET /admin/kv-index/stats

Returns KV cache index statistics: entry counts, routing decision breakdown, query latency counters, and overlap score counts.

Example response:

{
  "enabled": true,
  "config": {
    "backend": "memory",
    "max_entries": 100000,
    "entry_ttl_seconds": 600,
    "event_sources_count": 2,
    "scoring": {
      "overlap_weight": 0.6,
      "load_weight": 0.3,
      "health_weight": 0.1,
      "min_overlap_threshold": 0.3
    }
  },
  "index": {
    "prefix_count": 312,
    "entry_count": 618,
    "total_hits": 12490,
    "total_evictions": 83
  },
  "event_sources": [
    {
      "backend_name": "vllm-1",
      "connected": true,
      "events_received": 8412,
      "events_dropped": 0,
      "last_event_at": "2026-03-13T10:24:17Z",
      "reconnect_count": 0
    }
  ],
  "routing_decisions": {
    "kv_aware": 9841,
    "fallback": 2649,
    "total": 12490
  },
  "query_latency_count": 12490,
  "overlap_score_count": 9841
}

GET /admin/kv-index/backends

Returns per-backend KV cache event statistics: events received, processed, dropped, connection status, and index event counts (created/evicted).

Example response:

{
  "enabled": true,
  "backends": [
    {
      "backend_name": "vllm-1",
      "connection": {
        "connected": true,
        "reconnect_count": 0,
        "last_event_at": "2026-03-13T10:24:17Z"
      },
      "events": {
        "received": 8412,
        "dropped": 0,
        "index_created": 7981,
        "index_evicted": 431
      }
    }
  ]
}

POST /admin/kv-index/clear

Clears all entries from the KV cache index. The index rebuilds automatically from incoming events. Intended for debugging.

Example response:

{
  "success": true,
  "entries_before_clear": 618,
  "cleared_entries": 2
}

Deployment Guide

Tier 1 Only (Minimal Configuration)

Enable prefix routing for GPU KV cache locality without any external dependencies:

prefix_routing:
  enabled: true
  load_factor_epsilon: 0.25
  virtual_nodes: 150

Effective when multiple instances of the same model run across backends and system prompts are long (>128 tokens).

Tier 1 + 2 (Response Cache)

Add response caching to eliminate repeated deterministic requests entirely:

prefix_routing:
  enabled: true

response_cache:
  enabled: true
  backend: memory
  capacity: 5000
  ttl: "10m"

Effective when the application makes repeated identical requests (e.g., document QA with fixed system prompts and fixed queries).

Tier 1 + 2 + 3 (Distributed Response Cache)

For multi-instance deployments, share the response cache across all router instances:

prefix_routing:
  enabled: true

response_cache:
  enabled: true
  backend: redis
  ttl: "30m"
  redis:
    url: "redis://redis-service:6379"
    pool_size: 16
    key_prefix: "cr:resp:"
    fallback_to_memory: true
    fallback_capacity: 2000

Redis/Valkey must be accessible from all router pods. Use rediss:// URL or tls: true for encrypted connections.

All Tiers (Full KV-Aware Routing)

Enable all tiers for maximum GPU cache reuse:

prefix_routing:
  enabled: true
  load_factor_epsilon: 0.20

response_cache:
  enabled: true
  backend: redis
  ttl: "30m"
  redis:
    url: "redis://redis-service:6379"
    pool_size: 16

kv_cache_index:
  enabled: true
  backend: redis   # shares the pool from response_cache.redis
  max_entries: 500000
  entry_ttl_seconds: 900
  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1
    min_overlap_threshold: 0.25
    gpu_tier_weight: 1.0       # GPU-resident data gets full overlap credit
    storage_tier_weight: 0.6   # Offloaded data is still valuable but discounted
  storage_offloading:
    enabled: true              # Track GPU hot vs. storage warm tiers
    treat_eviction_as_offload: true
  event_sources:
    - backend_name: vllm-1
      endpoint: "http://vllm-1:8000/v1/kv_events"
    - backend_name: vllm-2
      endpoint: "http://vllm-2:8000/v1/kv_events"

vLLM Requirements

The kv_events SSE endpoint must be enabled on each vLLM backend. vLLM exposes this endpoint when launched with:

vllm serve <model> \
  --enable-prefix-caching \
  --kv-cache-dtype auto

The event stream URL is typically http://<host>:<port>/v1/kv_events.

Redis/Valkey Sizing

For response cache sizing, estimate:

  • Average serialized response size: 2–10 KB per entry
  • capacity = (target_hit_rate * rps * avg_unique_rate) / eviction_frequency

For the KV index:

  • Each (prefix, backend) entry consumes approximately 200 bytes in memory
  • Set max_entries to at least num_unique_prefixes * num_backends * 2 for headroom
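As a quick worked example of the headroom rule above (the workload numbers are illustrative):

```rust
// KV index sizing: max_entries >= unique_prefixes * backends * 2,
// at roughly 200 bytes per (prefix, backend) entry in memory.
fn main() {
    let num_unique_prefixes: u64 = 5_000;
    let num_backends: u64 = 4;
    let max_entries = num_unique_prefixes * num_backends * 2;
    let approx_bytes = max_entries * 200;
    assert_eq!(max_entries, 40_000);
    println!("max_entries = {max_entries}, approx memory = {} KiB", approx_bytes / 1024);
}
```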

High Availability Considerations

  • The response cache and KV index tolerate Redis failure via automatic in-memory fallback (Tier 3).
  • The KV index rebuilds from the SSE streams on restart; no persistence is required.
  • Prefix routing (Tier 1) has no external dependencies and is always available.
  • Deploy Redis with replication (Sentinel or Cluster) if cache persistence across Redis restarts is required.

Performance Characteristics

Tier 1: Prefix Routing

  • Prefix key extraction (SHA256): < 10 µs per request
  • CHWBL ring lookup: O(log N) where N = virtual_nodes * num_backends; < 5 µs for typical deployments
  • No network I/O; operates entirely in-process

Tier 2: Response Cache

  • In-memory cache lookup: < 1 µs
  • Cache key computation (SHA256): < 5 µs
  • Cache hit serves the entire response without any backend latency

Tier 3: Redis Backend

  • Redis GET latency (LAN): 0.1–2 ms typical; P99 < 5 ms
  • Redis SET latency: similar to GET
  • Command timeout default: 1 second; operations exceeding this activate fallback

Tier 4: KV Cache Index

  • InMemoryKvIndex.query_backends(): < 100 µs (DashMap read, no allocation on empty result)
  • RedisKvIndex.query_backends(): same as Redis GET latency (0.1–2 ms)
  • KvOverlapScorer.prepare(): one query_backends() call per unique prefix per 100 ms window
  • KvOverlapScorer.score(): < 1 µs (synchronous read from pre-fetched cache)
  • 1000 scoring calls: < 100 ms total (verified by unit benchmark in scorer.rs)

Expected Gains

The following are illustrative estimates based on typical LLM workload patterns:

| Scenario | Metric | Expected Improvement |
|---|---|---|
| Long system prompt (>512 tokens), repeated across requests | Time-to-first-token | 20–40% reduction via KV cache reuse |
| Fixed document QA (same doc + same questions) | Backend requests | Up to 100% elimination via response cache |
| Multi-replica vLLM, hot prefixes | Cache hit rate (Tier 4) | 60–80% of requests routed to backend with warm cache |
| Redis failure | Service availability | No degradation; fallback to in-memory within one request |

Actual gains depend on workload prefix overlap, GPU memory capacity, and backend configuration.