
VAST Data Integration Guide

Continuum Router integrates with VAST Data storage at three distinct points: the response cache L2 tier, the KV cache index for storage-offloaded tensor tracking, and the KV tensor transfer layer in disaggregated prefill/decode serving. This guide covers each integration point with practical deployment examples and a full production configuration.

VAST Data Connection Methods

VAST Data exposes multiple access protocols. Continuum Router uses the S3-compatible API for response cache storage and the HTTP endpoint for KV tensor transfer in disaggregated serving.

| Protocol | Continuum Router Usage | VAST Data Feature | Typical Bandwidth | Notes |
|---|---|---|---|---|
| S3 API (HTTP/HTTPS) | Response cache L2 tier | VAST S3 Gateway | 10–100 Gbps | Standard AWS S3 SDK compatibility; used for cached response blobs |
| HTTP endpoint | KV tensor transfer (disaggregated serving) | VAST Element Store | 10–100 Gbps | Used by disaggregated_serving.default_external_storage.endpoint |
| NFS | Not directly used by router | VAST Universal Storage | 10–40 Gbps | Can be used by vLLM directly for model weights |
| NVMe-oF / RDMA | Not directly used by router | VAST NVMe-over-Fabric | 100–400 Gbps | Available to vLLM backends for ultra-low-latency tensor access |

For router-level integration, you need two things from VAST Data:

  1. An S3-compatible endpoint with a bucket and credentials (for response cache)
  2. An HTTP endpoint for KV tensor transfer (for disaggregated serving)

These can point to the same VAST cluster using different ports or paths.
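Since both endpoints typically come from one cluster, it can help to derive them together. A minimal sketch — the helper name and default ports are illustrative, not part of the router:

```python
def vast_endpoints(cluster_host: str, s3_port: int = 443, kv_port: int = 8080):
    """Build the two router-facing VAST endpoints from one cluster hostname."""
    s3_endpoint = f"https://{cluster_host}:{s3_port}"  # response cache L2 (S3 API)
    kv_endpoint = f"http://{cluster_host}:{kv_port}"   # KV tensor transfer (HTTP)
    return s3_endpoint, kv_endpoint
```

Adjust the ports to match your VAST S3 Gateway and HTTP endpoint configuration.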


Prerequisites

VAST Data Cluster Requirements

  • VAST Data software version 4.0 or later
  • S3 Gateway enabled for response cache integration
  • Sufficient capacity for your workload:
    • Response cache: estimate 2–20 KB per unique request; plan for millions of entries
    • KV tensors: estimate 2 × num_layers × num_heads × head_dim × seq_len × 2 bytes per cached prefix (the leading 2 covers K and V; the final 2 bytes assumes FP16)
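The sizing formula above translates directly into a capacity estimate. A sketch assuming FP16 (2 bytes per element) and Llama-3.1-70B's geometry — note that for GQA models the KV head count (8 here), not the attention head count, determines cache size:

```python
def kv_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
             seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size in bytes: the leading 2 covers K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_bytes(80, 8, 128, 1)         # 327,680 bytes ≈ 320 KiB per token
full_context = kv_bytes(80, 8, 128, 32768)  # 10 GiB at 32k context
```

At roughly 320 KiB per token, a few hundred cached 32k-token prefixes already consume terabytes — plan VAST capacity accordingly.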

Credentials and Access

  • S3 access key and secret key with read/write permissions on the target bucket
  • Bucket created in advance (the router does not create buckets automatically)
  • Network path from each router instance to the VAST cluster

Network Requirements

  • L3 connectivity between router hosts and VAST cluster (jumbo frames recommended for large tensor transfers)
  • Firewall rules allowing:
    • TCP port 443 or 80 to the VAST S3 Gateway (response cache)
    • TCP port 8080 (or your configured port) to the VAST HTTP endpoint (disaggregated serving)
  • For NVMe-oF/RDMA: dedicated RDMA NIC and fabric (not required for router-level integration)

Environment Variables

Store sensitive credentials in environment variables rather than config files:

export VAST_ACCESS_KEY="your-access-key"
export VAST_SECRET_KEY="your-secret-key"

The router expands ${ENV_VAR} references in configuration values at startup.
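The expansion semantics can be illustrated with a few lines of Python. This is a sketch of the behavior, not the router's actual implementation; whether unset variables raise or expand to empty is an assumption here:

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace each ${VAR} reference with its environment value."""
    def substitute(match: re.Match) -> str:
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"undefined environment variable: {var}")
        return os.environ[var]
    return _ENV_REF.sub(substitute, value)
```

With VAST_ACCESS_KEY set, a config value like "${VAST_ACCESS_KEY}" resolves to the key material at startup, so the secret never appears in the config file itself.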


Example: Response Cache on VAST (S3 API)

This configuration uses Redis as the hot L1 cache and VAST Data S3 as the durable L2 cache. On an L1 miss, the router checks VAST before calling a backend. L2 hits are promoted back to L1 for faster subsequent access.

How it works

  1. Client sends a deterministic request (temperature = 0)
  2. Router computes a cache key from the request parameters
  3. L1 (Redis) is checked — if hit, response is returned immediately
  4. On L1 miss, L2 (VAST S3) is checked — if hit, the response is returned and promoted to L1
  5. On both misses, the backend is called and the response is stored in both L1 and L2
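The five steps above can be sketched as a single lookup function. The key schema and hashing shown here are illustrative (the router's real key derivation is not specified in this guide):

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    """Deterministic key: hash the canonicalized request body (step 2)."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "cr:resp:" + hashlib.sha256(canonical.encode()).hexdigest()

def lookup(request: dict, l1: dict, l2: dict, call_backend):
    key = cache_key(request)
    if (resp := l1.get(key)) is not None:  # step 3: L1 hit
        return resp
    if (resp := l2.get(key)) is not None:  # step 4: L2 hit, promote to L1
        l1[key] = resp
        return resp
    resp = call_backend(request)           # step 5: both miss, store in both
    l1[key] = resp
    l2[key] = resp
    return resp
```

In the real deployment the l1/l2 stores are Redis and VAST S3 rather than in-process dicts, and promoted L1 entries get their own TTL (tiered.l1_promotion_ttl).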

Configuration

response_cache:
  enabled: true

  # "tiered" enables the L1 + L2 architecture
  backend: tiered

  # Global TTL for all cache entries
  ttl: "24h"

  # Maximum response body size eligible for caching
  max_response_size: 1048576         # 1 MiB
  max_stream_buffer_size: 10485760   # 10 MiB

  # L1: Redis (hot cache — fast lookup, limited capacity)
  l1:
    type: memory   # or "redis" for distributed L1
    max_value_size: 1048576   # values larger than 1 MiB go directly to L2

  # Redis config used by L1 when l1.type is "redis"
  redis:
    url: "redis://redis-service:6379"
    pool_size: 16
    key_prefix: "cr:resp:"
    fallback_to_memory: true

  # L2: VAST Data S3 (warm cache — high capacity, durable)
  l2:
    type: s3
    endpoint: "https://vast-s3.example.com"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    # Optional: override TTL for L2 entries (default: inherits global ttl)
    ttl_override: "7d"

  # Tiered cache promotion behavior
  tiered:
    promote_on_hit: true       # Promote L2 hits back to L1
    l1_promotion_ttl: "30m"    # TTL for promoted L1 entries

Field Reference

| Field | Description |
|---|---|
| l2.type | Must be "s3" for S3-compatible backend |
| l2.endpoint | VAST S3 Gateway URL (HTTP or HTTPS) |
| l2.bucket | Pre-created S3 bucket name |
| l2.key_prefix | Key prefix within the bucket (default: "response-cache/") |
| l2.region | AWS-compatible region string (default: "us-east-1") |
| l2.access_key | S3 access key; supports ${ENV_VAR} expansion |
| l2.secret_key | S3 secret key; supports ${ENV_VAR} expansion; redacted in logs |
| l2.ttl_override | Optional TTL override for L2 entries (e.g., "7d", "24h") |
| tiered.promote_on_hit | Whether to copy L2 hits into L1 (default: true) |
| tiered.l1_promotion_ttl | TTL applied to L1 entries created by promotion (default: "5m") |

Example: KV Tensor Offloading (vLLM + VAST)

This configuration enables the KV cache index with storage offloading awareness. When vLLM offloads GPU KV tensors to VAST, the router tracks those tensors in the StorageWarm tier and applies a reduced scoring weight compared to GPU-resident (GpuHot) data.

How it works

  1. vLLM computes KV tensors for a prompt and reports a cache_created event (tier: GpuHot)
  2. Under GPU memory pressure, vLLM offloads tensors to VAST and reports cache_offloaded
  3. The KV index downgrades the entry to StorageWarm
  4. The router routes new requests with matching prefixes to the backend holding StorageWarm data, with a reduced overlap score relative to GpuHot
  5. vLLM reloads the tensors from VAST and reports cache_reloaded — index upgrades back to GpuHot
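The tier bookkeeping driven by these events can be sketched as a small state transition function. This mirrors the steps above; the exact event names and transition rules of the real index should be confirmed against the router's KV event schema:

```python
def next_tier(current, event, treat_eviction_as_offload=False):
    """Return the entry's new tier after a vLLM KV event (None = removed)."""
    if event == "cache_created":
        return "GpuHot"                                   # step 1
    if event == "cache_offloaded" and current == "GpuHot":
        return "StorageWarm"                              # steps 2-3
    if event == "cache_reloaded" and current == "StorageWarm":
        return "GpuHot"                                   # step 5
    if event == "cache_evicted":
        # Mirrors storage_offloading.treat_eviction_as_offload
        return "StorageWarm" if treat_eviction_as_offload else None
    return current
```

With treat_eviction_as_offload enabled, an eviction demotes the entry to StorageWarm instead of dropping it, matching backends that never emit explicit cache_offloaded events.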

Continuum Router Configuration

kv_cache_index:
  enabled: true
  backend: memory   # or "redis" for multi-instance deployments

  # Scale max_entries with the number of unique prompts × number of backends
  max_entries: 500000

  # Entries expire after 15 minutes; adjust based on vLLM eviction rate
  entry_ttl_seconds: 900

  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1

    # Only activate KV-aware routing when a backend holds ≥30% coverage
    min_overlap_threshold: 0.30

    # GPU-resident data gets full overlap credit
    gpu_tier_weight: 1.0

    # Storage-offloaded data is valuable but incurs reload latency — discount it
    storage_tier_weight: 0.6

  # Track GPU hot vs. VAST warm tiers
  storage_offloading:
    enabled: true
    # When vLLM emits only cache_created/cache_evicted (no cache_offloaded),
    # treat evictions as offloads to VAST rather than permanent removal
    treat_eviction_as_offload: true

  # Subscribe to KV events from each vLLM backend
  event_sources:
    - backend_name: vllm-1
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000

    - backend_name: vllm-2
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000

vLLM Launch Command

vLLM must be configured to emit KV events and use VAST Data for tensor offloading:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_buffer_device":"cpu","kv_buffer_size":4e9,"kv_role":"kv_both","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

Key flags:

  • --enable-prefix-caching: activates KV prefix caching on the backend
  • --kv-cache-dtype auto: use the model's native dtype for cache storage
  • --kv-transfer-config: configures the VAST KV connector for offloading

The KV events SSE stream is available at http://<vllm-host>:<port>/v1/kv_events once --enable-prefix-caching is active.
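A consumer of that stream only needs to pick out the data: lines and decode their JSON payloads. The parser below is a minimal sketch, and the payload fields in the sample (event, tier, prefix_hash) are assumptions about the event schema:

```python
import json

def parse_sse(lines):
    """Yield decoded JSON payloads from the data: lines of an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())
```

In production the line iterator would wrap the HTTP response body from /v1/kv_events, with reconnect handling per event_sources.reconnect_interval_ms.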

Scoring Behavior with VAST Offloading

| Scenario | Tier | Score Multiplier | Effect |
|---|---|---|---|
| Tensors in GPU VRAM | GpuHot | 1.0 (full credit) | Highest routing priority |
| Tensors offloaded to VAST | StorageWarm | 0.6 (discounted) | Medium routing priority; vLLM reloads on hit |
| No tensors cached | (none) | 0.0 | Falls back to configured selection strategy |
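The guide does not spell out the exact scoring formula, so the worked example below combines the configured weights in the most direct way — a weighted sum, with min_overlap_threshold gating the tier-weighted overlap. Treat both the formula and the gating choice as assumptions:

```python
def backend_score(overlap, tier_weight, load, health,
                  overlap_weight=0.6, load_weight=0.3, health_weight=0.1,
                  min_overlap_threshold=0.30):
    """Score one backend; None means fall back to the selection strategy."""
    effective_overlap = overlap * tier_weight  # GpuHot: x1.0, StorageWarm: x0.6
    if effective_overlap < min_overlap_threshold:
        return None
    return (overlap_weight * effective_overlap
            + load_weight * (1.0 - load)      # less loaded scores higher
            + health_weight * health)

# 80% prefix overlap, 50% load, healthy backend:
gpu = backend_score(0.8, 1.0, 0.5, 1.0)   # GPU-resident:     0.73
vast = backend_score(0.8, 0.6, 0.5, 1.0)  # VAST-offloaded:   0.538
```

The same 80% overlap scores noticeably lower when the tensors sit in VAST, reflecting the reload latency the storage_tier_weight discount encodes.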

Example: Disaggregated Prefill/Decode with VAST

This configuration separates prefill (prompt processing) from decode (token generation). VAST Data is the KV tensor transit layer: prefill workers write tensors to VAST, decode workers read them. The router orchestrates the two-phase flow.

How it works

  1. Router checks KV index for the request's prefix hash
  2. If a decode worker holds GPU-resident tensors (GpuHot): route directly to that worker (fast decode)
  3. If tensors are in VAST (StorageWarm): route to the least-loaded decode worker (it loads from VAST)
  4. If no cache: select a prefill worker, run the prefill phase, tensors are written to VAST, then route to a decode worker for token generation
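The decision tree above can be sketched as a routing function. The worker records and the "least-loaded" tie-break are illustrative; the router's actual selection within a pool follows its configured strategy:

```python
def route(prefix_tier, decode_workers, prefill_workers, holder=None):
    """Pick a routing path from the prefix's cache tier (sketch of steps 2-4)."""
    if prefix_tier == "GpuHot" and holder is not None:
        return ("fast_decode", holder)            # step 2: go to the holder
    if prefix_tier == "StorageWarm":
        target = min(decode_workers, key=lambda w: w["load"])
        return ("fast_decode", target)            # step 3: any decode worker reloads
    target = min(prefill_workers, key=lambda w: w["load"])
    return ("prefill_then_decode", target)        # step 4: prefill first
```

The first element of the result corresponds to the X-Continuum-Routing-Path response header.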

Configuration

disaggregated_serving:
  enabled: true

  # Timeout for the prefill computation phase
  prefill_timeout: "60s"

  # Timeout for KV tensor transfer between VAST and workers
  kv_transfer_timeout: "15s"

  # Fall back to unified backends if disaggregated backends are unavailable
  fallback_to_unified: true

  # Default VAST storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://vast-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # Credentials optional if VAST is configured for anonymous access
    # credentials: "${VAST_CREDENTIALS}"

backends:
  # Prefill workers — compute KV tensors for input prompts
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://vast-cluster:8080"
      kv_namespace: "inference/kv-cache"

  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    # Inherits default_external_storage when external_storage is omitted

  # Decode workers — generate tokens from cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2

  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2

  # Unified fallback — handles both phases when disaggregated pool is unavailable
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified

vLLM Worker Launch Commands

Prefill worker:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_producer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

Decode worker:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

The kv_role field distinguishes producer (prefill) from consumer (decode) workers.

Routing Path Headers

| Header | Values | Meaning |
|---|---|---|
| X-Continuum-Routing-Path | prefill_then_decode, fast_decode, unified, fallback | Which routing path was taken |
| X-Continuum-Prefill-Backend | backend name | Which prefill worker ran the prefill phase |
| X-Continuum-Decode-Backend | backend name | Which decode worker generated the tokens |

Pool Sizing Recommendations

  • Start with a 3:1 decode-to-prefill ratio — decode workers spend more time generating tokens than prefill workers spend processing prompts
  • Increase the prefill pool for workloads with high prompt diversity (many unique system prompts)
  • Increase the decode pool for workloads with long output sequences or high concurrency
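The 3:1 starting point follows from the time each phase occupies a worker. A back-of-envelope sketch, with the per-request durations as placeholder assumptions:

```python
def decode_to_prefill_ratio(avg_prefill_s: float, avg_decode_s: float) -> float:
    """Size pools proportionally to the time each phase holds a worker busy."""
    return avg_decode_s / avg_prefill_s

# e.g. 0.5 s of prefill vs 1.5 s of decode per request suggests a 3:1 ratio
```

Measure the actual phase durations for your workload and resize the pools from there.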

Example: Full Production Setup

This configuration combines all three VAST integration points: tiered response cache (Redis L1 + VAST S3 L2), KV cache index with storage offloading awareness, and disaggregated prefill/decode serving.

server:
  host: "0.0.0.0"
  port: 8080

selection_strategy: LeastLatency

# ─── Prefix-Aware Routing (Tier 1) ───────────────────────────────────────────
prefix_routing:
  enabled: true
  max_prefix_length: 2048
  load_factor_epsilon: 0.20
  virtual_nodes: 200

# ─── Tiered Response Cache (Tier 2 + VAST S3 L2) ─────────────────────────────
response_cache:
  enabled: true
  backend: tiered
  ttl: "24h"
  max_response_size: 2097152       # 2 MiB
  max_stream_buffer_size: 20971520 # 20 MiB

  # L1: Redis (hot, fast, limited capacity)
  l1:
    type: redis
    max_value_size: 524288   # Values >512 KiB go directly to L2

  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 32
    key_prefix: "cr:resp:"
    connect_timeout_ms: 3000
    command_timeout_ms: 1000
    fallback_to_memory: true

  # L2: VAST Data S3 (warm, durable, high capacity)
  l2:
    type: s3
    endpoint: "https://vast-s3.prod.example.com"
    bucket: "prod-llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    ttl_override: "7d"

  tiered:
    promote_on_hit: true
    l1_promotion_ttl: "30m"

# ─── KV Cache Index with VAST Offloading Awareness (Tier 4) ──────────────────
kv_cache_index:
  enabled: true
  backend: redis   # Reuses the redis pool from response_cache.redis
  max_entries: 1000000
  entry_ttl_seconds: 1800

  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1
    min_overlap_threshold: 0.25
    gpu_tier_weight: 1.0
    storage_tier_weight: 0.6

  storage_offloading:
    enabled: true
    treat_eviction_as_offload: true

  event_sources:
    - backend_name: prefill-worker-1
      endpoint: "http://vllm-prefill-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: prefill-worker-2
      endpoint: "http://vllm-prefill-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-1
      endpoint: "http://vllm-decode-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-2
      endpoint: "http://vllm-decode-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-3
      endpoint: "http://vllm-decode-3:8000/v1/kv_events"
      reconnect_interval_ms: 5000

# ─── Disaggregated Prefill/Decode Serving ────────────────────────────────────
disaggregated_serving:
  enabled: true
  prefill_timeout: "60s"
  kv_transfer_timeout: "15s"
  fallback_to_unified: true
  default_external_storage:
    endpoint: "http://vast-cluster.prod.example.com:8080"
    kv_namespace: "prod/inference/kv-cache"

# ─── Backends ─────────────────────────────────────────────────────────────────
backends:
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

# ─── Health Checks ────────────────────────────────────────────────────────────
health_checks:
  interval: "15s"
  timeout: "5s"
  unhealthy_threshold: 3
  healthy_threshold: 2

# ─── Circuit Breaker ──────────────────────────────────────────────────────────
circuit_breaker:
  enabled: true
  failure_threshold: 5
  recovery_timeout: "30s"

vLLM Launch Commands for Full Production Setup

Prefill workers (2 instances):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --kv-transfer-config '{
    "kv_connector": "VastKVConnector",
    "kv_role": "kv_producer",
    "kv_buffer_device": "cpu",
    "kv_buffer_size": 8e9,
    "kv_connector_extra_config": {
      "vast_endpoint": "http://vast-cluster.prod.example.com:8080",
      "kv_namespace": "prod/inference/kv-cache"
    }
  }'

Decode workers (3 instances):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-transfer-config '{
    "kv_connector": "VastKVConnector",
    "kv_role": "kv_consumer",
    "kv_buffer_device": "cpu",
    "kv_buffer_size": 8e9,
    "kv_connector_extra_config": {
      "vast_endpoint": "http://vast-cluster.prod.example.com:8080",
      "kv_namespace": "prod/inference/kv-cache"
    }
  }'

Unified fallback (1 instance):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768

The unified fallback does not need VAST configuration since it handles both phases independently.


Performance Benchmarks

The following figures are illustrative estimates based on typical LLM serving workloads. Actual results depend on prompt diversity, GPU memory capacity, network bandwidth to VAST, and concurrent request volume.

Response Cache Hit Rate (with VAST L2)

| Workload Pattern | L1 (Redis) Hit Rate | L2 (VAST S3) Hit Rate | Combined Hit Rate |
|---|---|---|---|
| Document QA (fixed doc + varied questions) | 30–50% | 20–35% | 50–75% |
| RAG pipeline (fixed system prompt + retrieval) | 15–30% | 10–20% | 25–45% |
| API with repeated identical requests | 70–95% | 3–10% | 75–98% |
| Fully unique requests | < 5% | < 2% | < 7% |

L2 benefits workloads where identical requests recur over hours or days, outliving the L1 TTL.

Disaggregated Serving Latency

Latency improvements compared to unified (single-worker) inference on a 70B-parameter model:

| Routing Path | TTFT (Time to First Token) | Relative to Unified |
|---|---|---|
| fast_decode (GPU-resident KV) | 50–100 ms | 40–60% reduction |
| fast_decode (VAST-loaded KV) | 100–200 ms | 15–35% reduction |
| prefill_then_decode | 200–500 ms | 0–10% overhead (one-time prefill cost) |
| unified (fallback) | 300–600 ms | baseline |

The prefill_then_decode path adds overhead on the first request for a prefix, but subsequent requests with the same prefix benefit from the fast_decode path.

KV Cache Index Routing Effectiveness

| Scenario | KV-Aware Routing Rate | TTFT Reduction |
|---|---|---|
| High prefix overlap (same system prompt, many users) | 70–85% | 20–40% |
| Medium prefix overlap (varied system prompts) | 40–60% | 10–25% |
| Low prefix overlap (fully unique prompts) | < 10% | < 5% |

VAST Data Access Latency

| Operation | Latency Range | Bandwidth |
|---|---|---|
| S3 GET (response cache read) | 1–10 ms | Up to 10 Gbps per connection |
| S3 PUT (response cache write) | 2–15 ms | Up to 10 Gbps per connection |
| KV tensor write (prefill → VAST) | 5–50 ms | Network-limited; use RDMA for < 5 ms |
| KV tensor read (VAST → decode) | 5–50 ms | Network-limited; use RDMA for < 5 ms |

For production deployments with strict latency requirements on the KV transfer path, consider placing the VAST cluster in the same rack, or accessing it from vLLM directly over NVMe-oF with RDMA-capable network adapters.