
Disaggregated Prefill/Decode Serving

Continuum Router supports disaggregated inference, a deployment architecture where the prefill phase (prompt processing) and the decode phase (token generation) run on separate GPU workers. KV tensors computed during prefill are transferred between workers via external storage, eliminating redundant computation when the same prompt prefix is reused.

Overview

In standard (unified) inference, a single GPU worker handles both phases of a request:

  1. Prefill — compute key-value attention tensors for the full input prompt
  2. Decode — autoregressively generate output tokens, reading the KV cache computed in step 1

Disaggregated serving separates these phases across specialized workers:

  • Prefill workers — high-throughput GPUs optimized for batch KV computation
  • Decode workers — GPUs holding warm KV caches optimized for low-latency token generation
  • External storage — high-bandwidth object storage used as the KV tensor transfer layer between workers (e.g., VAST Data, MinIO, AWS S3)

This separation is particularly effective when:

  • Many requests share the same long system prompt (e.g., RAG documents, tool definitions)
  • Prefill and decode workloads have different GPU memory requirements
  • Token generation latency is the primary optimization target

Request Flow

The DisaggregatedOrchestrator selects a routing path for each incoming chat completion request:

Routing Paths

| Path | Description | Response Header Value |
|---|---|---|
| FastDecode | KV data is already in the decode worker's GPU (GpuHot) or is loaded from external storage (StorageWarm) | fast_decode |
| PrefillThenDecode | Full prefill phase executed, KV tensors written to external storage, then the decode phase runs | prefill_then_decode |
| Unified | No disaggregated backends configured; standard single-backend serving | unified |
| Fallback | Disaggregated backends unavailable; fell back to unified serving | fallback |

The active routing path is reported in the X-Continuum-Routing-Path response header.
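
The decision between these paths can be sketched as follows. This is a minimal illustration, not the orchestrator's actual API: the function and parameter names are invented here, and the real implementation also weighs backend health and KV index state in more detail.

```python
from enum import Enum

class RoutingPath(Enum):
    FAST_DECODE = "fast_decode"
    PREFILL_THEN_DECODE = "prefill_then_decode"
    UNIFIED = "unified"
    FALLBACK = "fallback"

def select_path(kv_tier, has_disaggregated_backends, disaggregated_healthy,
                fallback_to_unified=True):
    """Illustrative routing decision.

    kv_tier: "GpuHot", "StorageWarm", or None (no cached KV for this prefix).
    """
    if not has_disaggregated_backends:
        return RoutingPath.UNIFIED          # nothing to disaggregate
    if not disaggregated_healthy:
        if fallback_to_unified:
            return RoutingPath.FALLBACK     # degrade to unified serving
        raise RuntimeError("disaggregated backends unavailable")
    if kv_tier in ("GpuHot", "StorageWarm"):
        return RoutingPath.FAST_DECODE      # skip prefill entirely
    return RoutingPath.PREFILL_THEN_DECODE  # full prefill, then decode

print(select_path("GpuHot", True, True).value)  # fast_decode
```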

Response Headers

| Header | Description | Example |
|---|---|---|
| X-Continuum-Routing-Path | Routing path taken | prefill_then_decode |
| X-Continuum-Prefill-Backend | Backend that executed the prefill phase | prefill-worker-1 |
| X-Continuum-Decode-Backend | Backend that executed the decode phase | decode-worker-2 |
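
Clients can read these headers from any HTTP response. A small sketch (the helper function here is illustrative, not part of any client library):

```python
def routing_info(headers):
    """Extract Continuum routing headers from an HTTP response's header map."""
    return {
        "path": headers.get("X-Continuum-Routing-Path"),
        "prefill_backend": headers.get("X-Continuum-Prefill-Backend"),
        "decode_backend": headers.get("X-Continuum-Decode-Backend"),
    }

# e.g. with the requests library: routing_info(requests.post(url, json=body).headers)
info = routing_info({
    "X-Continuum-Routing-Path": "prefill_then_decode",
    "X-Continuum-Prefill-Backend": "prefill-worker-1",
    "X-Continuum-Decode-Backend": "decode-worker-2",
})
print(info["path"])  # prefill_then_decode
```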

Backend Roles

Each backend in the configuration can be assigned a role:

| Role | Description |
|---|---|
| unified | Default. Backend handles both prefill and decode phases. Participates in all routing paths. |
| prefill | Backend handles prefill computation only. Ineligible for decode-only routing. |
| decode | Backend handles token generation only. Ineligible for prefill routing. |

Role assignment is enforced by the RoleFilterScorer, which assigns a score of f64::NEG_INFINITY to any backend incompatible with the current inference phase. Backends with the unified role are always eligible.
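
The scoring rule amounts to a veto: an infinitely negative score excludes the backend from the current phase, while a neutral score leaves the ranking to the other scorers. A sketch (in Python rather than the router's Rust, with invented names):

```python
NEG_INF = float("-inf")

def role_filter_score(backend_role, phase):
    """Sketch of the RoleFilterScorer rule.

    backend_role and phase are strings ("prefill", "decode", "unified");
    unified backends are compatible with every phase.
    """
    if backend_role == "unified" or backend_role == phase:
        return 0.0    # neutral: other scorers decide the final ranking
    return NEG_INF    # veto: backend is excluded from this phase

print(role_filter_score("prefill", "decode"))  # -inf
print(role_filter_score("unified", "decode"))  # 0.0
```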

External Storage Integration

External storage is used as the KV tensor transfer layer between prefill and decode workers. Any S3-compatible or HTTP-accessible storage can be used (e.g., VAST Data, MinIO, AWS S3).

During a PrefillThenDecode flow:

  1. The prefill worker computes KV tensors for the prompt
  2. Tensors are written to external storage at a path derived from the prefix hash:
    {endpoint}/{kv_namespace}/{prefix_hash}
    
  3. The decode worker loads the tensors from external storage before beginning token generation

The KvReference structure carries the storage path, prefix hash, token count, and tensor format between orchestrator and workers.
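
The path scheme and the reference structure can be sketched as below. The KvReference fields follow the description above, but this Python rendering, the example hash derivation, and the "fp16" format value are illustrative assumptions, not the router's actual types.

```python
import hashlib
from dataclasses import dataclass

def storage_path(endpoint, kv_namespace, prefix_hash):
    """Object key for KV tensors: {endpoint}/{kv_namespace}/{prefix_hash}."""
    return f"{endpoint}/{kv_namespace}/{prefix_hash}"

@dataclass
class KvReference:
    """Carried between orchestrator and workers (fields per the text above)."""
    path: str
    prefix_hash: str
    token_count: int
    tensor_format: str  # e.g. "fp16" -- illustrative value

# Hypothetical prefix hash: a digest over the token IDs of the shared prefix
tokens = [1, 15043, 29892, 3186]
prefix_hash = hashlib.sha256(str(tokens).encode()).hexdigest()[:16]
ref = KvReference(
    path=storage_path("http://storage-cluster:8080", "inference/kv-cache", prefix_hash),
    prefix_hash=prefix_hash,
    token_count=len(tokens),
    tensor_format="fp16",
)
print(ref.path)
```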

Configuration

Top-Level Disaggregated Serving

disaggregated_serving:
  enabled: false                # Enable disaggregated prefill/decode serving
  prefill_timeout: "30s"        # Timeout for the prefill phase
  kv_transfer_timeout: "10s"    # Timeout for KV tensor transfer via external storage
  fallback_to_unified: true     # Fall back to unified when disaggregated unavailable

  # Default external storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://storage-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # credentials:             # Optional access credentials
    #   access_key: "${STORAGE_ACCESS_KEY}"
    #   secret_key: "${STORAGE_SECRET_KEY}"

Per-Backend Role Assignment

Add role and optionally external_storage to each backend:

backends:
  # Prefill worker - computes KV tensors
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://storage-cluster:8080"
      kv_namespace: "inference/kv-cache"

  # Decode workers - generate tokens using cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2

  # Unified fallback backend (optional)
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified   # or omit - unified is the default

Minimal Example

disaggregated_serving:
  enabled: true
  default_external_storage:
    endpoint: "http://storage-cluster:8080"

backends:
  - name: prefill-gpu
    url: "http://vllm-prefill:8000"
    role: prefill
  - name: decode-gpu-1
    url: "http://vllm-decode-1:8000"
    role: decode
  - name: decode-gpu-2
    url: "http://vllm-decode-2:8000"
    role: decode

Configuration Reference

disaggregated_serving

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable disaggregated serving |
| prefill_timeout | string | "30s" | Timeout for the prefill phase (supports ms, s, m suffixes) |
| kv_transfer_timeout | string | "10s" | Timeout for KV tensor transfer between workers |
| fallback_to_unified | bool | true | Fall back to unified serving when disaggregated backends are unavailable |
| default_external_storage | object | null | Default external storage config used by backends that do not define their own |

external_storage (per-backend)

| Field | Type | Default | Description |
|---|---|---|---|
| endpoint | string | required | External storage endpoint URL |
| kv_namespace | string | "inference/kv-cache" | Namespace path for KV tensors |
| credentials | object | null | Optional access credentials (redacted in logs and debug output) |

role (per-backend)

| Value | Description |
|---|---|
| unified | Default. Backend participates in both prefill and decode routing. |
| prefill | Backend receives prefill-phase requests only. |
| decode | Backend receives decode-phase requests only. |

Metrics

Disaggregated serving metrics use the prefix disaggregated_. Label values are sanitized to prevent cardinality explosion.

| Metric | Type | Labels | Description |
|---|---|---|---|
| disaggregated_requests_total | Counter | routing_path | Total requests by routing path (prefill_then_decode, fast_decode, unified, fallback) |
| disaggregated_prefill_duration_seconds | Histogram | backend | Prefill phase duration in seconds |
| disaggregated_decode_duration_seconds | Histogram | backend | Decode phase duration in seconds |
| disaggregated_kv_transfer_duration_seconds | Histogram | — | KV tensor transfer duration in seconds |
| disaggregated_fallback_total | Counter | — | Total fallback events (disaggregated → unified) |
| disaggregated_errors_total | Counter | phase | Errors by phase (prefill, decode, kv_transfer, orchestration) |

Example PromQL:

# Fraction of requests using the fast decode path
# (sum() is required: without it the routing_path labels on the two
# sides of the division do not match and the query returns no result)
  sum(rate(disaggregated_requests_total{routing_path="fast_decode"}[5m]))
/ sum(rate(disaggregated_requests_total[5m]))

# Prefill P95 latency per backend
histogram_quantile(0.95,
  rate(disaggregated_prefill_duration_seconds_bucket[5m])
)

# KV transfer P99 latency
histogram_quantile(0.99,
  rate(disaggregated_kv_transfer_duration_seconds_bucket[5m])
)

# Alert: high fallback rate
rate(disaggregated_fallback_total[5m]) > 0.1

Integration with KV Cache Index

Disaggregated serving works alongside the KV Cache Index (Tier 4). The KV index tracks which decode workers have warm GPU caches for specific prefix hashes, enabling the orchestrator to skip external storage transfer entirely when the data is already resident in a decode worker's GPU memory (GpuHot tier).

When storage_offloading.enabled is true in the KV cache index configuration, the orchestrator can also route to decode workers holding data in the StorageWarm tier (offloaded from GPU to external storage) and request an on-demand reload.

Backend Selection for Load Balancing

Within each phase, the orchestrator selects the least-loaded healthy backend:

  • Prefill selection: Iterates backends with role: prefill or role: unified; picks the one with the lowest in_flight request count.
  • Decode selection: Iterates backends with role: decode or role: unified; picks the one with the lowest in_flight request count.

This ensures even GPU utilization across prefill and decode pools independently.
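
The selection rule can be sketched as a filter plus a minimum over in-flight counts. The dict shape and field names here are illustrative, not the router's internal types:

```python
def select_backend(backends, phase):
    """Pick the least-loaded healthy backend eligible for the given phase.

    Each backend is a dict with "role", "healthy", and "in_flight" keys.
    Backends with role "unified" are eligible for every phase.
    """
    eligible = [
        b for b in backends
        if b["healthy"] and b["role"] in (phase, "unified")
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda b: b["in_flight"])

pool = [
    {"name": "decode-worker-1", "role": "decode", "healthy": True, "in_flight": 4},
    {"name": "decode-worker-2", "role": "decode", "healthy": True, "in_flight": 1},
    {"name": "prefill-worker-1", "role": "prefill", "healthy": True, "in_flight": 0},
]
print(select_backend(pool, "decode")["name"])  # decode-worker-2
```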

Fallback Behavior

When fallback_to_unified: true (the default):

  1. If no healthy prefill or decode backends are available, the orchestrator routes to any healthy unified backend.
  2. If no unified backends are available either, the request is served through the standard backend pool without phase separation.
  3. The routing path is reported as fallback in response headers and metrics.

When fallback_to_unified: false, requests fail with an error if disaggregated backends are unavailable.

Deployment Recommendations

  • GPU allocation: Prefill workers benefit from high memory bandwidth; decode workers benefit from large GPU memory for KV cache residence.
  • Storage sizing: Estimate KV tensor size as 2 * num_layers * num_kv_heads * head_dim * seq_len * 2 bytes (fp16). For a 7B model with 32 layers, 32 KV heads of dimension 128 (full multi-head attention), a 4096-token context works out to ~2 GiB per request; grouped-query attention reduces num_kv_heads and shrinks this proportionally.
  • Health checks: Configure backend health checks to detect GPU OOM and driver errors quickly; the circuit breaker activates fallback on repeated failures.
  • Decode pool size: More decode workers reduce queuing latency for the token generation phase. A 3:1 decode-to-prefill ratio is a common starting point.
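
The storage-sizing arithmetic above can be checked numerically. Note how strongly the per-request size depends on the number of KV heads; the model shapes below are illustrative:

```python
def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 for the K and V tensors per layer; fp16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, head_dim 128, 4096-token context:
print(kv_bytes(32, 32, 128, 4096) / 2**20)  # full MHA, 32 KV heads: 2048.0 MiB
print(kv_bytes(32, 8, 128, 4096) / 2**20)   # GQA, 8 KV heads: 512.0 MiB
print(kv_bytes(32, 4, 128, 4096) / 2**20)   # GQA, 4 KV heads: 256.0 MiB
```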