
VAST Data Integration Guide

Continuum Router integrates with VAST Data storage at three distinct points: the response cache L2 tier, the KV cache index for storage-offloaded tensor tracking, and the KV tensor transfer layer in disaggregated prefill/decode serving. This guide covers each integration point with practical deployment examples and a full production configuration.

VAST Data Connection Methods

VAST Data exposes multiple access protocols. Continuum Router uses the S3-compatible API for response cache storage and the HTTP endpoint for KV tensor transfer in disaggregated serving.

| Protocol | Continuum Router Usage | VAST Data Feature | Typical Bandwidth | Notes |
|---|---|---|---|---|
| S3 API (HTTP/HTTPS) | Response cache L2 tier | VAST S3 Gateway | 10–100 Gbps | Standard AWS S3 SDK compatibility; used for cached response blobs |
| HTTP endpoint | KV tensor transfer (disaggregated serving) | VAST Element Store | 10–100 Gbps | Used by disaggregated_serving.default_external_storage.endpoint |
| NFS | Not directly used by router | VAST Universal Storage | 10–40 Gbps | Can be used by vLLM directly for model weights |
| NVMe-oF / RDMA | Not directly used by router | VAST NVMe-over-Fabric | 100–400 Gbps | Available to vLLM backends for ultra-low-latency tensor access |

For router-level integration, you need two things from VAST Data:

  1. An S3-compatible endpoint with a bucket and credentials (for response cache)
  2. An HTTP endpoint for KV tensor transfer (for disaggregated serving)

These can point to the same VAST cluster using different ports or paths.
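Since both endpoints typically come from one cluster, it can help to derive them together. A minimal sketch — the helper name and default ports are illustrative, not part of the router:

```python
def vast_endpoints(cluster_host: str, s3_port: int = 443, kv_port: int = 8080):
    """Build the two router-facing VAST endpoints from one cluster hostname."""
    s3_endpoint = f"https://{cluster_host}:{s3_port}"  # response cache L2 (S3 API)
    kv_endpoint = f"http://{cluster_host}:{kv_port}"   # KV tensor transfer (HTTP)
    return s3_endpoint, kv_endpoint
```

Adjust the ports to match your VAST S3 Gateway and HTTP endpoint configuration.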


Prerequisites

VAST Data Cluster Requirements

  • VAST Data software version 4.0 or later
  • S3 Gateway enabled for response cache integration
  • Sufficient capacity for your workload:
    • Response cache: estimate 2–20 KB per unique request; plan for millions of entries
    • KV tensors: estimate 2 × num_layers × num_heads × head_dim × seq_len × 2 bytes per cached prefix (the leading 2 covers K and V; the final 2 bytes assumes FP16)
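The sizing formula above translates directly into a capacity estimate. A sketch assuming FP16 (2 bytes per element) and Llama-3.1-70B's geometry — note that for GQA models the KV head count (8 here), not the attention head count, determines cache size:

```python
def kv_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
             seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size in bytes: the leading 2 covers K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_bytes(80, 8, 128, 1)         # 327,680 bytes ≈ 320 KiB per token
full_context = kv_bytes(80, 8, 128, 32768)  # 10 GiB at 32k context
```

At roughly 320 KiB per token, a few hundred cached 32k-token prefixes already consume terabytes — plan VAST capacity accordingly.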

Credentials and Access

  • S3 access key and secret key with read/write permissions on the target bucket
  • Bucket created in advance (the router does not create buckets automatically)
  • Network path from each router instance to the VAST cluster

Network Requirements

  • L3 connectivity between router hosts and VAST cluster (jumbo frames recommended for large tensor transfers)
  • Firewall rules allowing:
    • TCP port 443 or 80 to the VAST S3 Gateway (response cache)
    • TCP port 8080 (or your configured port) to the VAST HTTP endpoint (disaggregated serving)
  • For NVMe-oF/RDMA: dedicated RDMA NIC and fabric (not required for router-level integration)

Environment Variables

Store sensitive credentials in environment variables rather than config files:

export VAST_ACCESS_KEY="your-access-key"
export VAST_SECRET_KEY="your-secret-key"

The router expands ${ENV_VAR} references in configuration values at startup.
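The expansion semantics can be illustrated with a few lines of Python. This is a sketch of the behavior, not the router's actual implementation; whether unset variables raise or expand to empty is an assumption here:

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(value: str) -> str:
    """Replace each ${VAR} reference with its environment value."""
    def substitute(match: re.Match) -> str:
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"undefined environment variable: {var}")
        return os.environ[var]
    return _ENV_REF.sub(substitute, value)
```

With VAST_ACCESS_KEY set, a config value like "${VAST_ACCESS_KEY}" resolves to the key material at startup, so the secret never appears in the config file itself.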


Example: Response Cache on VAST (S3 API)

This configuration uses Redis as the hot L1 cache and VAST Data S3 as the durable L2 cache. On an L1 miss, the router checks VAST before calling a backend. L2 hits are promoted back to L1 for faster subsequent access.

How it works

  1. Client sends a deterministic request (temperature = 0)
  2. Router computes a cache key from the request parameters
  3. L1 (Redis) is checked — if hit, response is returned immediately
  4. On L1 miss, L2 (VAST S3) is checked — if hit, the response is returned and promoted to L1
  5. On both misses, the backend is called and the response is stored in both L1 and L2
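The five steps above can be sketched as a single lookup function. The key schema and hashing shown here are illustrative (the router's real key derivation is not specified in this guide):

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    """Deterministic key: hash the canonicalized request body (step 2)."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "cr:resp:" + hashlib.sha256(canonical.encode()).hexdigest()

def lookup(request: dict, l1: dict, l2: dict, call_backend):
    key = cache_key(request)
    if (resp := l1.get(key)) is not None:  # step 3: L1 hit
        return resp
    if (resp := l2.get(key)) is not None:  # step 4: L2 hit, promote to L1
        l1[key] = resp
        return resp
    resp = call_backend(request)           # step 5: both miss, store in both
    l1[key] = resp
    l2[key] = resp
    return resp
```

In the real deployment the l1/l2 stores are Redis and VAST S3 rather than in-process dicts, and promoted L1 entries get their own TTL (tiered.l1_promotion_ttl).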

Configuration

response_cache:
  enabled: true

  # "tiered" enables the L1 + L2 architecture
  backend: tiered

  # Global TTL for all cache entries
  ttl: "24h"

  # Maximum response body size eligible for caching
  max_response_size: 1048576         # 1 MiB
  max_stream_buffer_size: 10485760   # 10 MiB

  # L1: Redis (hot cache — fast lookup, limited capacity)
  l1:
    type: memory   # or "redis" for distributed L1
    max_value_size: 1048576   # values larger than 1 MiB go directly to L2

  # Redis config used by L1 when l1.type is "redis"
  redis:
    url: "redis://redis-service:6379"
    pool_size: 16
    key_prefix: "cr:resp:"
    fallback_to_memory: true

  # L2: VAST Data S3 (warm cache — high capacity, durable)
  l2:
    type: s3
    endpoint: "https://vast-s3.example.com"
    bucket: "llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    # Optional: override TTL for L2 entries (default: inherits global ttl)
    ttl_override: "7d"

  # Tiered cache promotion behavior
  tiered:
    promote_on_hit: true       # Promote L2 hits back to L1
    l1_promotion_ttl: "30m"    # TTL for promoted L1 entries

Field Reference

| Field | Description |
|---|---|
| l2.type | Must be "s3" for S3-compatible backend |
| l2.endpoint | VAST S3 Gateway URL (HTTP or HTTPS) |
| l2.bucket | Pre-created S3 bucket name |
| l2.key_prefix | Key prefix within the bucket (default: "response-cache/") |
| l2.region | AWS-compatible region string (default: "us-east-1") |
| l2.access_key | S3 access key; supports ${ENV_VAR} expansion |
| l2.secret_key | S3 secret key; supports ${ENV_VAR} expansion; redacted in logs |
| l2.ttl_override | Optional TTL override for L2 entries (e.g., "7d", "24h") |
| tiered.promote_on_hit | Whether to copy L2 hits into L1 (default: true) |
| tiered.l1_promotion_ttl | TTL applied to L1 entries created by promotion (default: "5m") |

Example: KV Tensor Offloading (vLLM + VAST)

This configuration enables the KV cache index with storage offloading awareness. When vLLM offloads GPU KV tensors to VAST, the router tracks those tensors in the StorageWarm tier and applies a reduced scoring weight compared to GPU-resident (GpuHot) data.

How it works

  1. vLLM computes KV tensors for a prompt and reports a cache_created event (tier: GpuHot)
  2. Under GPU memory pressure, vLLM offloads tensors to VAST and reports cache_offloaded
  3. The KV index downgrades the entry to StorageWarm
  4. The router routes new requests with matching prefixes to the backend holding StorageWarm data, with a reduced overlap score relative to GpuHot
  5. vLLM reloads the tensors from VAST and reports cache_reloaded — index upgrades back to GpuHot
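The tier bookkeeping driven by these events can be sketched as a small state transition function. This mirrors the steps above; the exact event names and transition rules of the real index should be confirmed against the router's KV event schema:

```python
def next_tier(current, event, treat_eviction_as_offload=False):
    """Return the entry's new tier after a vLLM KV event (None = removed)."""
    if event == "cache_created":
        return "GpuHot"                                   # step 1
    if event == "cache_offloaded" and current == "GpuHot":
        return "StorageWarm"                              # steps 2-3
    if event == "cache_reloaded" and current == "StorageWarm":
        return "GpuHot"                                   # step 5
    if event == "cache_evicted":
        # Mirrors storage_offloading.treat_eviction_as_offload
        return "StorageWarm" if treat_eviction_as_offload else None
    return current
```

With treat_eviction_as_offload enabled, an eviction demotes the entry to StorageWarm instead of dropping it, matching backends that never emit explicit cache_offloaded events.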

Continuum Router Configuration

kv_cache_index:
  enabled: true
  backend: memory   # or "redis" for multi-instance deployments

  # Scale max_entries with the number of unique prompts × number of backends
  max_entries: 500000

  # Entries expire after 15 minutes; adjust based on vLLM eviction rate
  entry_ttl_seconds: 900

  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1

    # Only activate KV-aware routing when a backend holds ≥30% coverage
    min_overlap_threshold: 0.30

    # GPU-resident data gets full overlap credit
    gpu_tier_weight: 1.0

    # Storage-offloaded data is valuable but incurs reload latency — discount it
    storage_tier_weight: 0.6

  # Track GPU hot vs. VAST warm tiers
  storage_offloading:
    enabled: true
    # When vLLM emits only cache_created/cache_evicted (no cache_offloaded),
    # treat evictions as offloads to VAST rather than permanent removal
    treat_eviction_as_offload: true

  # Subscribe to KV events from each vLLM backend
  event_sources:
    - backend_name: vllm-1
      endpoint: "http://vllm-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000

    - backend_name: vllm-2
      endpoint: "http://vllm-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000

vLLM Launch Command

vLLM must be configured to emit KV events and use VAST Data for tensor offloading:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_buffer_device":"cpu","kv_buffer_size":4e9,"kv_role":"kv_both","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

Key flags:

  • --enable-prefix-caching: activates KV prefix caching on the backend
  • --kv-cache-dtype auto: use the model's native dtype for cache storage
  • --kv-transfer-config: configures the VAST KV connector for offloading

The KV events SSE stream is available at http://<vllm-host>:<port>/v1/kv_events once --enable-prefix-caching is active.
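A consumer of that stream only needs to pick out the data: lines and decode their JSON payloads. The parser below is a minimal sketch, and the payload fields in the sample (event, tier, prefix_hash) are assumptions about the event schema:

```python
import json

def parse_sse(lines):
    """Yield decoded JSON payloads from the data: lines of an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())
```

In production the line iterator would wrap the HTTP response body from /v1/kv_events, with reconnect handling per event_sources.reconnect_interval_ms.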

Scoring Behavior with VAST Offloading

| Scenario | Tier | Score Multiplier | Effect |
|---|---|---|---|
| Tensors in GPU VRAM | GpuHot | 1.0 (full credit) | Highest routing priority |
| Tensors offloaded to VAST | StorageWarm | 0.6 (discounted) | Medium routing priority; vLLM reloads on hit |
| No tensors cached | (none) | 0.0 | Falls back to configured selection strategy |
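The guide does not spell out the exact scoring formula, so the worked example below combines the configured weights in the most direct way — a weighted sum, with min_overlap_threshold gating the tier-weighted overlap. Treat both the formula and the gating choice as assumptions:

```python
def backend_score(overlap, tier_weight, load, health,
                  overlap_weight=0.6, load_weight=0.3, health_weight=0.1,
                  min_overlap_threshold=0.30):
    """Score one backend; None means fall back to the selection strategy."""
    effective_overlap = overlap * tier_weight  # GpuHot: x1.0, StorageWarm: x0.6
    if effective_overlap < min_overlap_threshold:
        return None
    return (overlap_weight * effective_overlap
            + load_weight * (1.0 - load)      # less loaded scores higher
            + health_weight * health)

# 80% prefix overlap, 50% load, healthy backend:
gpu = backend_score(0.8, 1.0, 0.5, 1.0)   # GPU-resident:     0.73
vast = backend_score(0.8, 0.6, 0.5, 1.0)  # VAST-offloaded:   0.538
```

The same 80% overlap scores noticeably lower when the tensors sit in VAST, reflecting the reload latency the storage_tier_weight discount encodes.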

Example: Disaggregated Prefill/Decode with VAST

This configuration separates prefill (prompt processing) from decode (token generation). VAST Data is the KV tensor transit layer: prefill workers write tensors to VAST, decode workers read them. The router orchestrates the two-phase flow.

How it works

  1. Router checks KV index for the request's prefix hash
  2. If a decode worker holds GPU-resident tensors (GpuHot): route directly to that worker (fast decode)
  3. If tensors are in VAST (StorageWarm): route to the least-loaded decode worker (it loads from VAST)
  4. If no cache: select a prefill worker, run the prefill phase, tensors are written to VAST, then route to a decode worker for token generation
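The decision tree above can be sketched as a routing function. The worker records and the "least-loaded" tie-break are illustrative; the router's actual selection within a pool follows its configured strategy:

```python
def route(prefix_tier, decode_workers, prefill_workers, holder=None):
    """Pick a routing path from the prefix's cache tier (sketch of steps 2-4)."""
    if prefix_tier == "GpuHot" and holder is not None:
        return ("fast_decode", holder)            # step 2: go to the holder
    if prefix_tier == "StorageWarm":
        target = min(decode_workers, key=lambda w: w["load"])
        return ("fast_decode", target)            # step 3: any decode worker reloads
    target = min(prefill_workers, key=lambda w: w["load"])
    return ("prefill_then_decode", target)        # step 4: prefill first
```

The first element of the result corresponds to the X-Continuum-Routing-Path response header.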

Configuration

disaggregated_serving:
  enabled: true

  # Timeout for the prefill computation phase
  prefill_timeout: "60s"

  # Timeout for KV tensor transfer between VAST and workers
  kv_transfer_timeout: "15s"

  # Fall back to unified backends if disaggregated backends are unavailable
  fallback_to_unified: true

  # Default VAST storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://vast-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # Credentials optional if VAST is configured for anonymous access
    # credentials: "${VAST_CREDENTIALS}"

backends:
  # Prefill workers — compute KV tensors for input prompts
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://vast-cluster:8080"
      kv_namespace: "inference/kv-cache"

  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    # Inherits default_external_storage when external_storage is omitted

  # Decode workers — generate tokens from cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2

  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2

  # Unified fallback — handles both phases when disaggregated pool is unavailable
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified

vLLM Worker Launch Commands

Prefill worker:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_producer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

Decode worker:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --kv-transfer-config '{"kv_connector":"VastKVConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"vast_endpoint":"http://vast-cluster:8080","kv_namespace":"inference/kv-cache"}}'

The kv_role field distinguishes producer (prefill) from consumer (decode) workers.

Routing Path Headers

| Header | Values | Meaning |
|---|---|---|
| X-Continuum-Routing-Path | prefill_then_decode, fast_decode, unified, fallback | Which routing path was taken |
| X-Continuum-Prefill-Backend | backend name | Which prefill worker ran the prefill phase |
| X-Continuum-Decode-Backend | backend name | Which decode worker generated the tokens |

Pool Sizing Recommendations

  • Start with a 3:1 decode-to-prefill ratio — decode workers spend more time generating tokens than prefill workers spend processing prompts
  • Increase the prefill pool for workloads with high prompt diversity (many unique system prompts)
  • Increase the decode pool for workloads with long output sequences or high concurrency
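The 3:1 starting point follows from the time each phase occupies a worker. A back-of-envelope sketch, with the per-request durations as placeholder assumptions:

```python
def decode_to_prefill_ratio(avg_prefill_s: float, avg_decode_s: float) -> float:
    """Size pools proportionally to the time each phase holds a worker busy."""
    return avg_decode_s / avg_prefill_s

# e.g. 0.5 s of prefill vs 1.5 s of decode per request suggests a 3:1 ratio
```

Measure the actual phase durations for your workload and resize the pools from there.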

Example: Full Production Setup

This configuration combines all three VAST integration points: tiered response cache (Redis L1 + VAST S3 L2), KV cache index with storage offloading awareness, and disaggregated prefill/decode serving.

server:
  host: "0.0.0.0"
  port: 8080

selection_strategy: LeastLatency

# ─── Prefix-Aware Routing (Tier 1) ───────────────────────────────────────────
prefix_routing:
  enabled: true
  max_prefix_length: 2048
  load_factor_epsilon: 0.20
  virtual_nodes: 200

# ─── Tiered Response Cache (Tier 2 + VAST S3 L2) ─────────────────────────────
response_cache:
  enabled: true
  backend: tiered
  ttl: "24h"
  max_response_size: 2097152       # 2 MiB
  max_stream_buffer_size: 20971520 # 20 MiB

  # L1: Redis (hot, fast, limited capacity)
  l1:
    type: redis
    max_value_size: 524288   # Values >512 KiB go directly to L2

  redis:
    url: "redis://redis-cluster:6379"
    pool_size: 32
    key_prefix: "cr:resp:"
    connect_timeout_ms: 3000
    command_timeout_ms: 1000
    fallback_to_memory: true

  # L2: VAST Data S3 (warm, durable, high capacity)
  l2:
    type: s3
    endpoint: "https://vast-s3.prod.example.com"
    bucket: "prod-llm-response-cache"
    key_prefix: "response-cache/"
    region: "us-east-1"
    access_key: "${VAST_ACCESS_KEY}"
    secret_key: "${VAST_SECRET_KEY}"
    ttl_override: "7d"

  tiered:
    promote_on_hit: true
    l1_promotion_ttl: "30m"

# ─── KV Cache Index with VAST Offloading Awareness (Tier 4) ──────────────────
kv_cache_index:
  enabled: true
  backend: redis   # Reuses the redis pool from response_cache.redis
  max_entries: 1000000
  entry_ttl_seconds: 1800

  scoring:
    overlap_weight: 0.6
    load_weight: 0.3
    health_weight: 0.1
    min_overlap_threshold: 0.25
    gpu_tier_weight: 1.0
    storage_tier_weight: 0.6

  storage_offloading:
    enabled: true
    treat_eviction_as_offload: true

  event_sources:
    - backend_name: prefill-worker-1
      endpoint: "http://vllm-prefill-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: prefill-worker-2
      endpoint: "http://vllm-prefill-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-1
      endpoint: "http://vllm-decode-1:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-2
      endpoint: "http://vllm-decode-2:8000/v1/kv_events"
      reconnect_interval_ms: 5000
    - backend_name: decode-worker-3
      endpoint: "http://vllm-decode-3:8000/v1/kv_events"
      reconnect_interval_ms: 5000

# ─── Disaggregated Prefill/Decode Serving ────────────────────────────────────
disaggregated_serving:
  enabled: true
  prefill_timeout: "60s"
  kv_transfer_timeout: "15s"
  fallback_to_unified: true
  default_external_storage:
    endpoint: "http://vast-cluster.prod.example.com:8080"
    kv_namespace: "prod/inference/kv-cache"

# ─── Backends ─────────────────────────────────────────────────────────────────
backends:
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: prefill-worker-2
    url: "http://vllm-prefill-2:8000"
    role: prefill
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: decode-worker-3
    url: "http://vllm-decode-3:8000"
    role: decode
    weight: 2
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified
    models: ["meta-llama/Llama-3.1-70B-Instruct"]

# ─── Health Checks ────────────────────────────────────────────────────────────
health_checks:
  interval: "15s"
  timeout: "5s"
  unhealthy_threshold: 3
  healthy_threshold: 2

# ─── Circuit Breaker ──────────────────────────────────────────────────────────
circuit_breaker:
  enabled: true
  failure_threshold: 5
  recovery_timeout: "30s"

vLLM Launch Commands for Full Production Setup

Prefill workers (2 instances):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --kv-transfer-config '{
    "kv_connector": "VastKVConnector",
    "kv_role": "kv_producer",
    "kv_buffer_device": "cpu",
    "kv_buffer_size": 8e9,
    "kv_connector_extra_config": {
      "vast_endpoint": "http://vast-cluster.prod.example.com:8080",
      "kv_namespace": "prod/inference/kv-cache"
    }
  }'

Decode workers (3 instances):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --kv-transfer-config '{
    "kv_connector": "VastKVConnector",
    "kv_role": "kv_consumer",
    "kv_buffer_device": "cpu",
    "kv_buffer_size": 8e9,
    "kv_connector_extra_config": {
      "vast_endpoint": "http://vast-cluster.prod.example.com:8080",
      "kv_namespace": "prod/inference/kv-cache"
    }
  }'

Unified fallback (1 instance):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-model-len 32768

The unified fallback does not need VAST configuration since it handles both phases independently.


Performance Benchmarks

The following figures are illustrative estimates based on typical LLM serving workloads. Actual results depend on prompt diversity, GPU memory capacity, network bandwidth to VAST, and concurrent request volume.

Response Cache Hit Rate (with VAST L2)

| Workload Pattern | L1 (Redis) Hit Rate | L2 (VAST S3) Hit Rate | Combined Hit Rate |
|---|---|---|---|
| Document QA (fixed doc + varied questions) | 30–50% | 20–35% | 50–75% |
| RAG pipeline (fixed system prompt + retrieval) | 15–30% | 10–20% | 25–45% |
| API with repeated identical requests | 70–95% | 3–10% | 75–98% |
| Fully unique requests | < 5% | < 2% | < 7% |

L2 benefits workloads where identical requests recur over hours or days, outliving the L1 TTL.

Disaggregated Serving Latency

Latency improvements compared to unified (single-worker) inference on a 70B-parameter model:

| Routing Path | TTFT (Time to First Token) | Relative to Unified |
|---|---|---|
| fast_decode (GPU-resident KV) | 50–100 ms | 40–60% reduction |
| fast_decode (VAST-loaded KV) | 100–200 ms | 15–35% reduction |
| prefill_then_decode | 200–500 ms | 0–10% overhead (one-time prefill cost) |
| unified (fallback) | 300–600 ms | baseline |

The prefill_then_decode path adds overhead on the first request for a prefix, but subsequent requests with the same prefix benefit from the fast_decode path.

KV Cache Index Routing Effectiveness

| Scenario | KV-Aware Routing Rate | TTFT Reduction |
|---|---|---|
| High prefix overlap (same system prompt, many users) | 70–85% | 20–40% |
| Medium prefix overlap (varied system prompts) | 40–60% | 10–25% |
| Low prefix overlap (fully unique prompts) | < 10% | < 5% |

VAST Data Access Latency

| Operation | Latency Range | Bandwidth |
|---|---|---|
| S3 GET (response cache read) | 1–10 ms | Up to 10 Gbps per connection |
| S3 PUT (response cache write) | 2–15 ms | Up to 10 Gbps per connection |
| KV tensor write (prefill → VAST) | 5–50 ms | Network-limited; use RDMA for < 5 ms |
| KV tensor read (VAST → decode) | 5–50 ms | Network-limited; use RDMA for < 5 ms |

For production deployments with strict latency requirements on the KV transfer path, consider placing the VAST cluster in the same rack, or accessing it from vLLM directly over NVMe-oF with RDMA-capable network adapters.