Disaggregated Prefill/Decode Serving¶
Continuum Router supports disaggregated inference, a deployment architecture where the prefill phase (prompt processing) and the decode phase (token generation) run on separate GPU workers. KV tensors computed during prefill are transferred between workers via external storage, eliminating redundant computation when the same prompt prefix is reused.
Overview¶
In standard (unified) inference, a single GPU worker handles both phases of a request:
1. Prefill — compute key-value attention tensors for the full input prompt
2. Decode — autoregressively generate output tokens, reading the KV cache computed in step 1
Disaggregated serving separates these phases across specialized workers:
- Prefill workers — high-throughput GPUs optimized for batch KV computation
- Decode workers — GPUs holding warm KV caches optimized for low-latency token generation
- External storage — high-bandwidth object storage used as the KV tensor transfer layer between workers (e.g., VAST Data, MinIO, AWS S3)
This separation is particularly effective when:
- Many requests share the same long system prompt (e.g., RAG documents, tool definitions)
- Prefill and decode workloads have different GPU memory requirements
- Token generation latency is the primary optimization target
Request Flow¶
The `DisaggregatedOrchestrator` selects a routing path for each incoming chat completion request:
Routing Paths¶
| Path | Description | Response Header Value |
|---|---|---|
| `FastDecode` | KV data already in decode worker GPU (`GpuHot`) or loaded from external storage (`StorageWarm`) | `fast_decode` |
| `PrefillThenDecode` | Full prefill phase executed, KV tensors written to external storage, then decode phase | `prefill_then_decode` |
| `Unified` | No disaggregated backends configured; standard single-backend serving | `unified` |
| `Fallback` | Disaggregated backends unavailable; fell back to unified serving | `fallback` |
The active routing path is reported in the `X-Continuum-Routing-Path` response header.
Response Headers¶
| Header | Description | Example |
|---|---|---|
| `X-Continuum-Routing-Path` | Routing path taken | `prefill_then_decode` |
| `X-Continuum-Prefill-Backend` | Backend that executed the prefill phase | `prefill-worker-1` |
| `X-Continuum-Decode-Backend` | Backend that executed the decode phase | `decode-worker-2` |
Backend Roles¶
Each backend in the configuration can be assigned a role:
| Role | Description |
|---|---|
| `unified` | Default. Backend handles both prefill and decode phases. Participates in all routing paths. |
| `prefill` | Backend handles prefill computation only. Ineligible for decode-only routing. |
| `decode` | Backend handles token generation only. Ineligible for prefill routing. |
Role assignment is enforced by the `RoleFilterScorer`, which assigns `f64::NEG_INFINITY` scores to backends incompatible with the current inference phase. Backends with the `unified` role are always eligible.
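The scorer's effect can be illustrated with a small sketch. This is not the router's actual (Rust) implementation; the eligibility table and the neutral score of `0.0` are assumptions for the example, mirroring `f64::NEG_INFINITY` with Python's `-math.inf`.

```python
import math

# Which backend roles are eligible for each inference phase
ELIGIBLE = {
    "prefill": {"prefill", "unified"},
    "decode": {"decode", "unified"},
}

def role_filter_score(backend_role: str, phase: str) -> float:
    """Negative infinity knocks an incompatible backend out of any
    score-based ranking; eligible backends get a neutral 0.0."""
    return 0.0 if backend_role in ELIGIBLE[phase] else -math.inf

print(role_filter_score("decode", "prefill"))  # -inf
print(role_filter_score("unified", "decode"))  # 0.0
```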
External Storage Integration¶
External storage is used as the KV tensor transfer layer between prefill and decode workers. Any S3-compatible or HTTP-accessible storage can be used (e.g., VAST Data, MinIO, AWS S3).
During a `PrefillThenDecode` flow:
- The prefill worker computes KV tensors for the prompt
- Tensors are written to external storage at a path derived from the prefix hash
- The decode worker loads the tensors from external storage before beginning token generation
The `KvReference` structure carries the storage path, prefix hash, token count, and tensor format between the orchestrator and workers.
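A sketch of what such a reference might carry. The field names, the `safetensors` format value, and the two-level path layout are all illustrative assumptions; the document does not specify the actual path scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KvReference:
    """Illustrative KV tensor handle passed between orchestrator and workers."""
    prefix_hash: str     # hash of the tokenized prompt prefix
    token_count: int     # number of tokens covered by the tensors
    tensor_format: str   # e.g. "safetensors" (hypothetical value)
    kv_namespace: str    # storage namespace from the backend config

    @property
    def storage_path(self) -> str:
        # Hypothetical layout: <namespace>/<first 2 hash chars>/<hash>
        return f"{self.kv_namespace}/{self.prefix_hash[:2]}/{self.prefix_hash}"

ref = KvReference("ab12cd34", 4096, "safetensors", "inference/kv-cache")
print(ref.storage_path)  # inference/kv-cache/ab/ab12cd34
```

Sharding by a hash prefix, as sketched here, is a common way to avoid unbounded flat directories in object storage.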
Configuration¶
Top-Level Disaggregated Serving¶
```yaml
disaggregated_serving:
  enabled: false              # Enable disaggregated prefill/decode serving
  prefill_timeout: "30s"      # Timeout for the prefill phase
  kv_transfer_timeout: "10s"  # Timeout for KV tensor transfer via external storage
  fallback_to_unified: true   # Fall back to unified when disaggregated unavailable

  # Default external storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://storage-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # credentials:            # Optional access credentials
    #   access_key: "${STORAGE_ACCESS_KEY}"
    #   secret_key: "${STORAGE_SECRET_KEY}"
```
Per-Backend Role Assignment¶
Add `role` and optionally `external_storage` to each backend:
```yaml
backends:
  # Prefill worker - computes KV tensors
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://storage-cluster:8080"
      kv_namespace: "inference/kv-cache"

  # Decode workers - generate tokens using cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2
  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2

  # Unified fallback backend (optional)
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified  # or omit - unified is the default
```
Minimal Example¶
```yaml
disaggregated_serving:
  enabled: true
  default_external_storage:
    endpoint: "http://storage-cluster:8080"

backends:
  - name: prefill-gpu
    url: "http://vllm-prefill:8000"
    role: prefill
  - name: decode-gpu-1
    url: "http://vllm-decode-1:8000"
    role: decode
  - name: decode-gpu-2
    url: "http://vllm-decode-2:8000"
    role: decode
```
Configuration Reference¶
disaggregated_serving¶
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable disaggregated serving |
| `prefill_timeout` | string | `"30s"` | Timeout for the prefill phase (supports `ms`, `s`, `m` suffixes) |
| `kv_transfer_timeout` | string | `"10s"` | Timeout for KV tensor transfer between workers |
| `fallback_to_unified` | bool | `true` | Fall back to unified serving when disaggregated backends are unavailable |
| `default_external_storage` | object | `null` | Default external storage config used by backends that do not define their own |
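The timeout fields accept `ms`, `s`, and `m` suffixes. A minimal parser sketch for that format; the router's real parser may accept additional units or stricter syntax.

```python
def parse_duration(value: str) -> float:
    """Parse a duration string like "500ms", "30s", or "2m" into seconds."""
    # Check "ms" before "s" so the longer suffix wins
    for suffix, factor in (("ms", 0.001), ("s", 1.0), ("m", 60.0)):
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * factor
    raise ValueError(f"unsupported duration: {value!r}")

print(parse_duration("30s"))   # 30.0
print(parse_duration("500ms")) # 0.5
```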
external_storage (per-backend)¶
| Field | Type | Default | Description |
|---|---|---|---|
| `endpoint` | string | required | External storage endpoint URL |
| `kv_namespace` | string | `"inference/kv-cache"` | Namespace path for KV tensors |
| `credentials` | object | `null` | Optional access credentials (redacted in logs and debug output) |
role (per-backend)¶
| Value | Description |
|---|---|
| `unified` | Default. Backend participates in both prefill and decode routing. |
| `prefill` | Backend receives prefill-phase requests only. |
| `decode` | Backend receives decode-phase requests only. |
Metrics¶
Disaggregated serving metrics use the prefix `disaggregated_`. Label values are sanitized to prevent cardinality explosion.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `disaggregated_requests_total` | Counter | `routing_path` | Total requests by routing path (`prefill_then_decode`, `fast_decode`, `unified`, `fallback`) |
| `disaggregated_prefill_duration_seconds` | Histogram | `backend` | Prefill phase duration in seconds |
| `disaggregated_decode_duration_seconds` | Histogram | `backend` | Decode phase duration in seconds |
| `disaggregated_kv_transfer_duration_seconds` | Histogram | — | KV tensor transfer duration in seconds |
| `disaggregated_fallback_total` | Counter | — | Total fallback events (disaggregated → unified) |
| `disaggregated_errors_total` | Counter | `phase` | Errors by phase (`prefill`, `decode`, `kv_transfer`, `orchestration`) |
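Label sanitization can be sketched as an allowlist check. The allowlist below is taken from the `routing_path` values documented above; the fixed `"other"` bucket is an assumption about how the router collapses unknown values.

```python
# Known routing_path label values from the metrics table
ALLOWED_ROUTING_PATHS = {"prefill_then_decode", "fast_decode",
                         "unified", "fallback"}

def sanitize_routing_path(value: str) -> str:
    """Collapse unknown label values into one fixed bucket so a bug or a
    crafted request cannot mint unbounded label values (cardinality explosion)."""
    return value if value in ALLOWED_ROUTING_PATHS else "other"

print(sanitize_routing_path("fast_decode"))     # fast_decode
print(sanitize_routing_path("../../evil"))      # other
```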
Example PromQL:

```promql
# Fraction of requests using the fast decode path
# (sum() aggregates away non-matching labels so the division is valid)
  sum(rate(disaggregated_requests_total{routing_path="fast_decode"}[5m]))
/ sum(rate(disaggregated_requests_total[5m]))

# Prefill P95 latency per backend
histogram_quantile(0.95,
  sum by (backend, le) (rate(disaggregated_prefill_duration_seconds_bucket[5m]))
)

# KV transfer P99 latency
histogram_quantile(0.99,
  rate(disaggregated_kv_transfer_duration_seconds_bucket[5m])
)

# Alert: high fallback rate
rate(disaggregated_fallback_total[5m]) > 0.1
```
Integration with KV Cache Index¶
Disaggregated serving works alongside the KV Cache Index (Tier 4). The KV index tracks which decode workers have warm GPU caches for specific prefix hashes, enabling the orchestrator to skip external storage transfer entirely when the data is already resident in a decode worker's GPU memory (`GpuHot` tier).
When `storage_offloading.enabled` is true in the KV cache index configuration, the orchestrator can also route to decode workers holding data in the `StorageWarm` tier (offloaded from GPU to external storage) and request an on-demand reload.
Backend Selection for Load Balancing¶
Within each phase, the orchestrator selects the least-loaded healthy backend:
- Prefill selection: Iterates backends with
role: prefillorrole: unified; picks the one with the lowestin_flightrequest count. - Decode selection: Iterates backends with
role: decodeorrole: unified; picks the one with the lowestin_flightrequest count.
This ensures even GPU utilization across prefill and decode pools independently.
Fallback Behavior¶
When `fallback_to_unified: true` (the default):
- If no healthy prefill or decode backends are available, the orchestrator routes to any healthy `unified` backend.
- If no unified backends are available either, the request is served through the standard backend pool without phase separation.
- The routing path is reported as `fallback` in response headers and metrics.
When `fallback_to_unified: false`, requests fail with an error if disaggregated backends are unavailable.
Deployment Recommendations¶
- GPU allocation: Prefill workers benefit from high memory bandwidth; decode workers benefit from large GPU memory for KV cache residence.
- Storage sizing: Estimate KV tensor size as
2 * num_layers * num_heads * head_dim * seq_len * 2 bytes(fp16). For a 7B model with 32 layers and 4096-token context: ~256 MB per request. - Health checks: Configure backend health checks to detect GPU OOM and driver errors quickly; the circuit breaker activates fallback on repeated failures.
- Decode pool size: More decode workers reduce queuing latency for the token generation phase. A 3:1 decode-to-prefill ratio is a common starting point.
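The sizing formula above can be wrapped in a small helper. Note the result scales with the number of heads that actually store KV data, so models using grouped-query attention need far less than full multi-head attention; the dimensions in the example are illustrative, not taken from a specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """2 (K and V) * layers * KV heads * head_dim * tokens * bytes (fp16 = 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative: 32 layers, 4 KV heads of dim 128, 4096-token prompt, fp16
size = kv_cache_bytes(32, 4, 128, 4096)
print(f"{size / 2**20:.0f} MiB")  # 256 MiB
```

Multiplying the per-request estimate by expected concurrent requests and a retention window gives a rough lower bound for external storage capacity.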