VAST Data Integration Guide¶
Continuum Router integrates with VAST Data storage at three distinct points: the response cache L2 tier, the KV cache index for storage-offloaded tensor tracking, and the KV tensor transfer layer in disaggregated prefill/decode serving. This guide covers each integration point with practical deployment examples and a full production configuration.
Table of Contents¶
- VAST Data Connection Methods
- Prerequisites
- Example: Response Cache on VAST (S3 API)
- Example: KV Tensor Offloading (vLLM + VAST)
- Example: Disaggregated Prefill/Decode with VAST
- Example: Full Production Setup
- Performance Benchmarks
VAST Data Connection Methods¶
VAST Data exposes multiple access protocols. Continuum Router uses the S3-compatible API for response cache storage and the HTTP endpoint for KV tensor transfer in disaggregated serving.
| Protocol | Continuum Router Usage | VAST Data Feature | Typical Bandwidth | Notes |
|---|---|---|---|---|
| S3 API (HTTP/HTTPS) | Response cache L2 tier | VAST S3 Gateway | 10–100 Gbps | Standard AWS S3 SDK compatibility; used for cached response blobs |
| HTTP endpoint | KV tensor transfer (disaggregated serving) | VAST Element Store | 10–100 Gbps | Used by disaggregated_serving.default_external_storage.endpoint |
| NFS | Not directly used by router | VAST Universal Storage | 10–40 Gbps | Can be used by vLLM directly for model weights |
| NVMe-oF / RDMA | Not directly used by router | VAST NVMe-over-Fabric | 100–400 Gbps | Available to vLLM backends for ultra-low-latency tensor access |
For router-level integration, you need two things from VAST Data:
- An S3-compatible endpoint with a bucket and credentials (for response cache)
- An HTTP endpoint for KV tensor transfer (for disaggregated serving)
These can point to the same VAST cluster using different ports or paths.
Prerequisites¶
VAST Data Cluster Requirements¶
- VAST Data software version 4.0 or later
- S3 Gateway enabled for response cache integration
- Sufficient capacity for your workload:
- Response cache: estimate 2–20 KB per unique request; plan for millions of entries
- KV tensors: estimate `2 × num_layers × num_heads × head_dim × seq_len × 2 bytes` per cached prefix
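For sizing intuition, the per-prefix formula can be evaluated directly. The figures below (80 layers, 8 KV heads, head dimension 128, fp16 cache) are hypothetical model dimensions chosen for illustration, not values taken from this guide:

```python
def kv_bytes_per_prefix(num_layers: int, num_heads: int,
                        head_dim: int, seq_len: int,
                        bytes_per_elem: int = 2) -> int:
    """Per-prefix KV size: 2 (K and V) x layers x heads x head_dim x seq_len x dtype width."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions: 80 layers, 8 KV heads, head_dim 128, fp16 cache
size = kv_bytes_per_prefix(num_layers=80, num_heads=8,
                           head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB")  # → 1.25 GiB for a 4096-token prefix
```

A few thousand cached prefixes of this size quickly reach terabyte scale, which is why capacity planning matters.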
Credentials and Access¶
- S3 access key and secret key with read/write permissions on the target bucket
- Bucket created in advance (the router does not create buckets automatically)
- Network path from each router instance to the VAST cluster
Network Requirements¶
- L3 connectivity between router hosts and VAST cluster (jumbo frames recommended for large tensor transfers)
- Firewall rules allowing:
- TCP port 443 or 80 to the VAST S3 Gateway (response cache)
- TCP port 8080 (or your configured port) to the VAST HTTP endpoint (disaggregated serving)
- For NVMe-oF/RDMA: dedicated RDMA NIC and fabric (not required for router-level integration)
Environment Variables¶
Store sensitive credentials in environment variables rather than in config files. The router expands `${ENV_VAR}` references in configuration values at startup.
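For example, export the keys in the router's environment before startup. The expansion behavior can be sketched as follows; this regex-based helper is illustrative only, not the router's actual implementation:

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace ${ENV_VAR} references with environment values; leave unknown refs intact."""
    return _ENV_REF.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)

os.environ["VAST_ACCESS_KEY"] = "AKIAEXAMPLE"  # normally set by the shell, not code
print(expand_env("${VAST_ACCESS_KEY}"))        # → AKIAEXAMPLE
print(expand_env("${UNSET_VAR}"))              # → ${UNSET_VAR} (left as-is)
```

Leaving unresolved references intact (rather than substituting an empty string) makes missing credentials easy to spot in logs.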
Example: Response Cache on VAST (S3 API)¶
This configuration uses Redis as the hot L1 cache and VAST Data S3 as the durable L2 cache. Responses that miss L1 are checked in VAST before hitting a backend. L2 hits are promoted back to L1 for faster subsequent access.
How it works¶
- Client sends a deterministic request (temperature = 0)
- Router computes a cache key from the request parameters
- L1 (Redis) is checked — if hit, response is returned immediately
- On L1 miss, L2 (VAST S3) is checked — if hit, the response is returned and promoted to L1
- On both misses, the backend is called and the response is stored in both L1 and L2
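The five steps above can be sketched with plain dictionaries standing in for Redis (L1) and VAST S3 (L2). The real router uses the configured clients and TTLs, so treat this as a behavioral sketch only:

```python
l1: dict[str, str] = {}   # stands in for Redis
l2: dict[str, str] = {}   # stands in for VAST S3

def get_or_compute(key: str, call_backend) -> tuple[str, str]:
    """Return (response, source) following the L1 -> L2 -> backend order."""
    if key in l1:
        return l1[key], "l1"
    if key in l2:
        l1[key] = l2[key]        # promote the L2 hit back into L1
        return l2[key], "l2"
    resp = call_backend()        # double miss: call the backend...
    l1[key] = resp               # ...and store the response in both tiers
    l2[key] = resp
    return resp, "backend"

_, src = get_or_compute("req-1", lambda: "answer")
assert src == "backend"
l1.clear()                       # simulate L1 TTL expiry
_, src = get_or_compute("req-1", lambda: "answer")
assert src == "l2" and "req-1" in l1   # L2 hit, promoted back to L1
```

The promotion step is what lets long-lived L2 entries repopulate the hot tier after Redis evictions or restarts.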
Configuration¶
response_cache:
  enabled: true
  # "tiered" enables the L1 + L2 architecture
  backend: tiered
  # Global TTL for all cache entries
  ttl: "24h"
  # Maximum response body size eligible for caching
  max_response_size: 1048576       # 1 MiB
  max_stream_buffer_size: 10485760 # 10 MiB
  # L1: Redis (hot cache — fast lookup, limited capacity)
  l1:
    type: memory            # or "redis" for distributed L1
    max_value_size: 1048576 # values larger than 1 MiB go directly to L2
  # Redis config used by L1 when l1.type is "redis"
  redis:
    url: "redis://redis-service:6379"
    pool_size: 16
    key_prefix: "cr:resp:"
    fallback_to_memory: true
  # L2: VAST Data S3 (warm cache — high capacity, durable)
  l2:
    type: s3
    endpoint: "https://vast-s3.example.com"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    # Optional: override TTL for L2 entries (default: inherits global ttl)
    ttl_override: "7d"
  # Tiered cache promotion behavior
  tiered:
    promote_on_hit: true    # Promote L2 hits back to L1
    l1_promotion_ttl: "30m" # TTL for promoted L1 entries
Field Reference¶
| Field | Description |
|---|---|
| `l2.type` | Must be `"s3"` for the S3-compatible backend |
| `l2.endpoint` | VAST S3 Gateway URL (HTTP or HTTPS) |
| `l2.bucket` | Pre-created S3 bucket name |
| `l2.key_prefix` | Key prefix within the bucket (default: `"response-cache/"`) |
| `l2.region` | AWS-compatible region string (default: `"us-east-1"`) |
| `l2.access_key` | S3 access key; supports `${ENV_VAR}` expansion |
| `l2.secret_key` | S3 secret key; supports `${ENV_VAR}` expansion; redacted in logs |
| `l2.ttl_override` | Optional TTL override for L2 entries (e.g., `"7d"`, `"24h"`) |
| `tiered.promote_on_hit` | Whether to copy L2 hits into L1 (default: `true`) |
| `tiered.l1_promotion_ttl` | TTL applied to L1 entries created by promotion (default: `"5m"`) |
Example: KV Tensor Offloading (vLLM + VAST)¶
This configuration enables the KV cache index with storage offloading awareness. When vLLM offloads GPU KV tensors to VAST, the router tracks those tensors in the StorageWarm tier and applies a reduced scoring weight compared to GPU-resident (GpuHot) data.
How it works¶
- vLLM computes KV tensors for a prompt and reports a `cache_created` event (tier: `GpuHot`)
- Under GPU memory pressure, vLLM offloads tensors to VAST and reports `cache_offloaded`
- The KV index downgrades the entry to `StorageWarm`
- The router routes new requests with matching prefixes to the backend holding `StorageWarm` data, with a reduced overlap score relative to `GpuHot`
- vLLM reloads the tensors from VAST and reports `cache_reloaded` — the index upgrades the entry back to `GpuHot`
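A minimal sketch of these tier transitions, including the `treat_eviction_as_offload` behavior from the configuration below. The dictionary index is illustrative, not the router's internal data structure:

```python
index: dict[str, str] = {}   # prefix hash -> cache tier

def apply_event(prefix: str, event: str,
                treat_eviction_as_offload: bool = True) -> None:
    """Update the tier for a prefix based on a vLLM KV cache event."""
    if event in ("cache_created", "cache_reloaded"):
        index[prefix] = "GpuHot"
    elif event == "cache_offloaded":
        index[prefix] = "StorageWarm"
    elif event == "cache_evicted":
        if treat_eviction_as_offload:
            index[prefix] = "StorageWarm"   # assume the tensors landed on VAST
        else:
            index.pop(prefix, None)         # treat the entry as permanently gone

apply_event("ab12", "cache_created")
apply_event("ab12", "cache_offloaded")
assert index["ab12"] == "StorageWarm"
apply_event("ab12", "cache_reloaded")
assert index["ab12"] == "GpuHot"
```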
Continuum Router Configuration¶
kv_cache_index:
  enabled: true
  backend: memory # or "redis" for multi-instance deployments
  # Scale max_entries with the number of unique prompts × number of backends
  max_entries: 500000
  # Entries expire after 15 minutes; adjust based on vLLM eviction rate
  entry_ttl_seconds: 900
  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1
    # Only activate KV-aware routing when a backend holds ≥30% coverage
    min_overlap_threshold: 0.30
    # GPU-resident data gets full overlap credit
    gpu_tier_weight: 1.0
    # Storage-offloaded data is valuable but incurs reload latency — discount it
    storage_tier_weight: 0.6
  # Track GPU hot vs. VAST warm tiers
  storage_offloading:
    enabled: true
    # When vLLM emits only cache_created/cache_evicted (no cache_offloaded),
    # treat evictions as offloads to VAST rather than permanent removal
    treat_eviction_as_offload: true
  # Subscribe to KV events from each vLLM backend
  event_sources:
    - backend_name: vllm-1
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: vllm-2
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
vLLM Launch Command¶
vLLM must be configured to emit KV events and use VAST Data for tensor offloading:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--max-model-len 32768 \
--kv-transfer-config '{"kv_connector":"VastKVConnector","kv_buffer_device":"cpu","kv_buffer_size":4e9,"kv_role":"kv_both","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'
Key flags:
- `--enable-prefix-caching`: activates KV prefix caching on the backend
- `--kv-cache-dtype auto`: use the model's native dtype for cache storage
- `--kv-transfer-config`: configures the VAST KV connector for offloading
The KV events SSE stream is available at `http://<vllm-host>:<port>/v1/kv_events` once `--enable-prefix-caching` is active.
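The stream uses the standard SSE wire format (`event:` and `data:` fields). A minimal parser for one event block is sketched below; the payload fields shown are an assumed shape for illustration, not a documented schema:

```python
import json

def parse_sse_event(block: str) -> tuple[str, dict]:
    """Parse one SSE event block into (event_type, decoded JSON payload)."""
    event_type, data_lines = "message", []
    for line in block.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    return event_type, json.loads("\n".join(data_lines))

# Hypothetical event block; the payload fields are an assumption for illustration.
raw = 'event: cache_offloaded\ndata: {"prefix_hash": "ab12", "tier": "StorageWarm"}'
etype, payload = parse_sse_event(raw)
assert etype == "cache_offloaded" and payload["tier"] == "StorageWarm"
```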
Scoring Behavior with VAST Offloading¶
| Scenario | Tier | Score Multiplier | Effect |
|---|---|---|---|
| Tensors in GPU VRAM | GpuHot | 1.0 (full credit) | Highest routing priority |
| Tensors offloaded to VAST | StorageWarm | 0.6 (discounted) | Medium routing priority; vLLM reloads on hit |
| No tensors cached | — | 0.0 | Falls back to configured selection strategy |
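One plausible way the tier multiplier combines with the other scoring weights (`overlap_weight` 0.6, `load_weight` 0.3, `health_weight` 0.1 from the configuration above). The router's exact formula is not shown in this guide, so treat this as an illustrative sketch:

```python
TIER_WEIGHT = {"GpuHot": 1.0, "StorageWarm": 0.6, None: 0.0}

def backend_score(overlap: float, tier, load: float, health: float) -> float:
    """Overlap credit discounted by tier, combined with load and health terms."""
    return (0.6 * overlap * TIER_WEIGHT[tier]   # overlap_weight x tier multiplier
            + 0.3 * (1.0 - load)                # load_weight: prefer idle backends
            + 0.1 * health)                     # health_weight

# Same 50% prefix overlap, same load and health — only the tier differs:
gpu  = backend_score(0.5, "GpuHot",      load=0.2, health=1.0)
warm = backend_score(0.5, "StorageWarm", load=0.2, health=1.0)
cold = backend_score(0.0, None,          load=0.2, health=1.0)
assert gpu > warm > cold   # GPU-resident wins; VAST-offloaded still beats no cache
```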
Example: Disaggregated Prefill/Decode with VAST¶
This configuration separates prefill (prompt processing) from decode (token generation). VAST Data is the KV tensor transit layer: prefill workers write tensors to VAST, decode workers read them. The router orchestrates the two-phase flow.
How it works¶
- Router checks KV index for the request's prefix hash
- If a decode worker holds GPU-resident tensors (`GpuHot`): route directly to that worker (fast decode)
- If tensors are in VAST (`StorageWarm`): route to the least-loaded decode worker (it loads from VAST)
- If no cache: select a prefill worker, run the prefill phase (tensors are written to VAST), then route to a decode worker for token generation
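The decision order above can be sketched as follows. The KV index entry shape and the hard-coded prefill choice are simplifications for illustration; the real router applies its configured selection strategy at each step:

```python
def route(prefix_hash: str, kv_index: dict, decode_load: dict) -> tuple[str, str]:
    """Return (routing_path, backend) following the three-way decision order."""
    entry = kv_index.get(prefix_hash)
    if entry and entry["tier"] == "GpuHot":
        return "fast_decode", entry["backend"]           # GPU-resident KV: go direct
    if entry and entry["tier"] == "StorageWarm":
        backend = min(decode_load, key=decode_load.get)  # least-loaded decode worker
        return "fast_decode", backend                    # it reloads KV from VAST
    return "prefill_then_decode", "prefill-worker-1"     # simplified prefill choice

kv_index = {"p1": {"tier": "GpuHot", "backend": "decode-worker-2"}}
load = {"decode-worker-1": 0.4, "decode-worker-2": 0.9, "decode-worker-3": 0.1}
assert route("p1", kv_index, load) == ("fast_decode", "decode-worker-2")
assert route("p2", kv_index, load) == ("prefill_then_decode", "prefill-worker-1")
```

The returned path names match the `X-Continuum-Routing-Path` header values documented later in this section.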
Configuration¶
disaggregated_serving:
  enabled: true
  # Timeout for the prefill computation phase
  prefill_timeout: "60s"
  # Timeout for KV tensor transfer between VAST and workers
  kv_transfer_timeout: "15s"
  # Fall back to unified backends if disaggregated backends are unavailable
  fallback_to_unified: true
  # Default VAST storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://vast-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # Credentials optional if VAST is configured for anonymous access
    # credentials: "${VAST_CREDENTIALS}"

backends:
  # Prefill workers — compute KV tensors for input prompts
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://vast-cluster:8080"
      kv_namespace: "inference/kv-cache"
  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    # Inherits default_external_storage when external_storage is omitted
  # Decode workers — generate tokens from cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2
  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2
  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2
  # Unified fallback — handles both phases when the disaggregated pool is unavailable
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified
vLLM Worker Launch Commands¶
Prefill worker:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_producer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'
Decode worker:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'
The kv_role field distinguishes producer (prefill) from consumer (decode) workers.
Routing Path Headers¶
| Header | Values | Meaning |
|---|---|---|
| `X-Continuum-Routing-Path` | `prefill_then_decode`, `fast_decode`, `unified`, `fallback` | Which routing path was taken |
| `X-Continuum-Prefill-Backend` | backend name | Which prefill worker ran the prefill phase |
| `X-Continuum-Decode-Backend` | backend name | Which decode worker generated the tokens |
Pool Sizing Recommendations¶
- Start with a 3:1 decode-to-prefill ratio — decode workers spend longer generating output tokens than prefill workers spend processing prompts
- Increase the prefill pool for workloads with high prompt diversity (many unique system prompts)
- Increase the decode pool for workloads with long output sequences or high concurrency
Example: Full Production Setup¶
This configuration combines all three VAST integration points: tiered response cache (Redis L1 + VAST S3 L2), KV cache index with storage offloading awareness, and disaggregated prefill/decode serving.
server:
  host: "0.0.0.0"
  port: 8080
  selection_strategy: LeastLatency

# ─── Prefix-Aware Routing (Tier 1) ───────────────────────────────────────────
prefix_routing:
  enabled: true
  max_prefix_length: 2048
  load_factor_epsilon: 0.20
  virtual_nodes: 200

# ─── Tiered Response Cache (Tier 2 + VAST S3 L2) ─────────────────────────────
response_cache:
  enabled: true
  backend: tiered
  ttl: "24h"
  max_response_size: 2097152       # 2 MiB
  max_stream_buffer_size: 20971520 # 20 MiB
  # L1: Redis (hot, fast, limited capacity)
  l1:
    type: redis
    max_value_size: 524288 # Values >512 KiB go directly to L2
  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 32
    key_prefix: "cr:resp:"
    connect_timeout_ms: 3000
    command_timeout_ms: 1000
    fallback_to_memory: true
  # L2: VAST Data S3 (warm, durable, high capacity)
  l2:
    type: s3
    endpoint: "https://vast-s3.prod.example.com"
    bucket: "prod-llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    ttl_override: "7d"
  tiered:
    promote_on_hit: true
    l1_promotion_ttl: "30m"

# ─── KV Cache Index with VAST Offloading Awareness (Tier 4) ──────────────────
kv_cache_index:
  enabled: true
  backend: redis # Reuses the redis pool from response_cache.redis
  max_entries: 1000000
  entry_ttl_seconds: 1800
  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1
    min_overlap_threshold: 0.25
    gpu_tier_weight: 1.0
    storage_tier_weight: 0.6
  storage_offloading:
    enabled: true
    treat_eviction_as_offload: true
  event_sources:
    - backend_name: prefill-worker-1
      endpoint: "http://vllm-prefill-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: prefill-worker-2
      endpoint: "http://vllm-prefill-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-1
      endpoint: "http://vllm-decode-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-2
      endpoint: "http://vllm-decode-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-3
      endpoint: "http://vllm-decode-3:8000/v1/kv_events"
      reconnect_interval_ms: 5000

# ─── Disaggregated Prefill/Decode Serving ────────────────────────────────────
disaggregated_serving:
  enabled: true
  prefill_timeout: "60s"
  kv_transfer_timeout: "15s"
  fallback_to_unified: true
  default_external_storage:
    endpoint: "http://vast-cluster.prod.example.com:8080"
    kv_namespace: "prod/inference/kv-cache"

# ─── Backends ─────────────────────────────────────────────────────────────────
backends:
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]
  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]
  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]
  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

# ─── Health Checks ────────────────────────────────────────────────────────────
health_checks:
  interval: "15s"
  timeout: "5s"
  unhealthy_threshold: 3
  healthy_threshold: 2

# ─── Circuit Breaker ──────────────────────────────────────────────────────────
circuit_breaker:
  enabled: true
  failure_threshold: 5
  recovery_timeout: "30s"
vLLM Launch Commands for Full Production Setup¶
Prefill workers (2 instances):
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--kv-transfer-config '{
"kv_connector": "VastKVConnector",
"kv_role": "kv_producer",
"kv_buffer_device": "cpu",
"kv_buffer_size": 8e9,
"kv_connector_extra_config": {
"vast_endpoint": "http://vast-cluster.prod.example.com:8080",
"kv_namespace": "prod/inference/kv-cache"
}
}'
Decode workers (3 instances):
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--kv-transfer-config '{
"kv_connector": "VastKVConnector",
"kv_role": "kv_consumer",
"kv_buffer_device": "cpu",
"kv_buffer_size": 8e9,
"kv_connector_extra_config": {
"vast_endpoint": "http://vast-cluster.prod.example.com:8080",
"kv_namespace": "prod/inference/kv-cache"
}
}'
Unified fallback (1 instance):
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--max-model-len 32768
The unified fallback does not need VAST configuration since it handles both phases independently.
Performance Benchmarks¶
The following figures are illustrative estimates based on typical LLM serving workloads. Actual results depend on prompt diversity, GPU memory capacity, network bandwidth to VAST, and concurrent request volume.
Response Cache Hit Rate (with VAST L2)¶
| Workload Pattern | L1 (Redis) Hit Rate | L2 (VAST S3) Hit Rate | Combined Hit Rate |
|---|---|---|---|
| Document QA (fixed doc + varied questions) | 30–50% | 20–35% | 50–75% |
| RAG pipeline (fixed system prompt + retrieval) | 15–30% | 10–20% | 25–45% |
| API with repeated identical requests | 70–95% | 3–10% | 75–98% |
| Fully unique requests | < 5% | < 2% | < 7% |
The L2 tier benefits workloads where identical requests recur over hours or days, beyond the L1 TTL.
Disaggregated Serving Latency¶
Latency improvements compared to unified (single-worker) inference on a 70B parameter model:
| Routing Path | TTFT (Time to First Token) | Relative to Unified |
|---|---|---|
| `fast_decode` (GPU-resident KV) | 50–100 ms | 40–60% reduction |
| `fast_decode` (VAST-loaded KV) | 100–200 ms | 15–35% reduction |
| `prefill_then_decode` | 200–500 ms | 0–10% overhead (one-time prefill cost) |
| `unified` (fallback) | 300–600 ms | baseline |
The `prefill_then_decode` path adds overhead on the first request for a prefix, but subsequent requests with the same prefix benefit from the `fast_decode` path.
KV Cache Index Routing Effectiveness¶
| Scenario | KV-Aware Routing Rate | TTFT Reduction |
|---|---|---|
| High prefix overlap (same system prompt, many users) | 70–85% | 20–40% |
| Medium prefix overlap (varied system prompts) | 40–60% | 10–25% |
| Low prefix overlap (fully unique prompts) | < 10% | < 5% |
VAST Data Access Latency¶
| Operation | Latency Range | Bandwidth |
|---|---|---|
| S3 GET (response cache read) | 1–10 ms | Up to 10 Gbps per connection |
| S3 PUT (response cache write) | 2–15 ms | Up to 10 Gbps per connection |
| KV tensor write (prefill → VAST) | 5–50 ms | Network-limited; use RDMA for < 5 ms |
| KV tensor read (VAST → decode) | 5–50 ms | Network-limited; use RDMA for < 5 ms |
For production deployments with strict latency requirements on the KV transfer path, consider placing the VAST cluster in the same rack or using RDMA-capable network adapters with the NVMe-oF protocol accessed directly from vLLM.