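Disaggregated Prefill/Decode Serving

Continuum Router supports a disaggregated inference architecture, in which the prefill phase (prompt processing) and the decode phase (token generation) run on separate GPU workers. KV tensors computed during prefill are transferred between workers through VAST Data object storage, which eliminates redundant computation when the same prompt prefix is reused.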

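Overview

In standard (unified) inference, a single GPU worker handles both phases of a request:

Prefill: computes the key-value attention tensors for the entire input prompt
Decode: reads the KV cache computed during prefill and generates output tokens autoregressively

Disaggregated serving splits these phases across specialized workers:

Prefill workers: high-throughput GPUs optimized for batched KV computation
Decode workers: GPUs optimized for low-latency token generation that keep a warm KV cache
VAST Data: high-bandwidth object storage used as the KV tensor transfer layer between workers

This approach is especially effective when:

Many requests share the same long system prompt (for example, RAG documents or tool definitions)
Prefill and decode workloads have different GPU memory requirements
Token generation latency is the primary optimization target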

Request Flow

The `DisaggregatedOrchestrator` selects a routing path for each incoming chat completion request:

[Figure: disaggregated flow diagram]

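Routing Paths

Path	Description	Response header value
`FastDecode`	KV data is already on the decode worker's GPU (GpuHot) or is loaded from VAST (StorageWarm)	`fast_decode`
`PrefillThenDecode`	Runs the full prefill phase, writes the KV tensors to VAST, then runs the decode phase	`prefill_then_decode`
`Unified`	No disaggregated backends are configured; standard single-backend serving	`unified`
`Fallback`	Disaggregated backends are unavailable; falls back to unified serving	`fallback`

The active routing path is reported in the `X-Continuum-Routing-Path` response header.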
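
The four paths above can be read as a small decision function. The following is a minimal sketch, not the router's actual implementation: the names `KvTier` and `select_routing_path`, the boolean inputs, and the order in which the conditions are checked are all assumptions made for illustration. It also does not model the case where fallback is disabled.

```python
from enum import Enum
from typing import Optional


class KvTier(Enum):
    """Where the KV data for a prefix currently lives (hypothetical type)."""
    GPU_HOT = "GpuHot"            # resident in a decode worker's GPU memory
    STORAGE_WARM = "StorageWarm"  # offloaded to VAST, reloadable on demand


def select_routing_path(
    disaggregated_configured: bool,
    disaggregated_available: bool,
    kv_tier: Optional[KvTier],
) -> str:
    """Return the response header value for the chosen routing path.

    ASSUMPTION: configuration is checked first, then availability,
    then KV cache state. The source only lists the four outcomes.
    """
    if not disaggregated_configured:
        return "unified"
    if not disaggregated_available:
        return "fallback"
    if kv_tier in (KvTier.GPU_HOT, KvTier.STORAGE_WARM):
        # KV data already exists, so the prefill phase can be skipped.
        return "fast_decode"
    return "prefill_then_decode"
```

A request whose prefix has no cached KV data (`kv_tier=None`) takes the full `prefill_then_decode` path in this sketch.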

Response Headers

Header	Description	Example
`X-Continuum-Routing-Path`	The selected routing path	`prefill_then_decode`
`X-Continuum-Prefill-Backend`	The backend that ran the prefill phase	`prefill-worker-1`
`X-Continuum-Decode-Backend`	The backend that ran the decode phase	`decode-worker-2`
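
On the client side, these headers can be collected into one small record for logging or debugging. This is a sketch under assumptions: `RoutingInfo` and `read_routing_headers` are hypothetical names, and every field is optional because the source does not say which headers are present on each path. HTTP header names are case-insensitive, so the lookup normalizes them.

```python
from typing import Mapping, NamedTuple, Optional


class RoutingInfo(NamedTuple):
    path: Optional[str]
    prefill_backend: Optional[str]
    decode_backend: Optional[str]


def read_routing_headers(headers: Mapping[str, str]) -> RoutingInfo:
    """Extract the Continuum routing headers from a response header mapping."""
    # Normalize names, since HTTP header names are case-insensitive.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RoutingInfo(
        path=lowered.get("x-continuum-routing-path"),
        prefill_backend=lowered.get("x-continuum-prefill-backend"),
        decode_backend=lowered.get("x-continuum-decode-backend"),
    )
```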

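Backend Roles

Each backend in the configuration can be assigned a role:

Role	Description
`unified`	Default. The backend handles both the prefill and decode phases and participates in every routing path.
`prefill`	The backend handles prefill computation only. It is ineligible for decode-only routing.
`decode`	The backend handles token generation only. It is ineligible for prefill routing.

Role assignments are enforced by the `RoleFilterScorer`, which assigns a score of `f64::NEG_INFINITY` to any backend that is incompatible with the current inference phase. Backends with the `unified` role are always eligible.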
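
The filtering rule can be illustrated in a few lines. This is a Python sketch of the idea, not the `RoleFilterScorer` itself: `role_filter_score` and `ELIGIBLE_ROLES` are hypothetical names, and the neutral score of `0.0` for compatible backends is an assumption, since the source only specifies the negative-infinity case.

```python
import math

# Roles that may serve each inference phase; `unified` is always eligible.
ELIGIBLE_ROLES = {
    "prefill": {"prefill", "unified"},
    "decode": {"decode", "unified"},
}


def role_filter_score(backend_role: str, phase: str) -> float:
    """Score a backend for a phase: -inf if incompatible, else neutral."""
    if backend_role in ELIGIBLE_ROLES[phase]:
        return 0.0  # ASSUMPTION: compatible backends get a neutral score
    return -math.inf


# Adding -inf to any finite score keeps it at -inf, so an incompatible
# backend can never outrank a compatible one when scores are summed.
combined = role_filter_score("decode", "prefill") + 100.0
```

Using negative infinity rather than a large negative constant means no combination of finite scores from other scorers can accidentally re-admit a filtered backend.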

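VAST Data Integration

VAST Data serves as the KV tensor transfer layer between prefill and decode workers.

During the `PrefillThenDecode` flow:

The prefill worker computes the KV tensors for the prompt
The tensors are written to VAST at a path derived from the prefix hash: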
```
{endpoint}/{kv_namespace}/{prefix_hash}
```
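The decode worker loads the tensors from VAST before it starts generating tokens

The `KvReference` struct carries the storage path, prefix hash, token count, and tensor format between the orchestrator and the workers.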
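
To make the path template and the reference record concrete, here is a minimal sketch. Only the four `KvReference` fields and the `{endpoint}/{kv_namespace}/{prefix_hash}` layout come from the text above. The field names, the helper names, the use of SHA-256 over token IDs as the prefix hash, and the `"fp16"` format string are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class KvReference:
    """Python stand-in for the KvReference struct (field names assumed)."""
    storage_path: str
    prefix_hash: str
    token_count: int
    tensor_format: str


def compute_prefix_hash(token_ids: Sequence[int]) -> str:
    # ASSUMPTION: the real hash function is not documented; SHA-256 over
    # the comma-joined token IDs is used here only as a placeholder.
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def kv_storage_path(endpoint: str, kv_namespace: str, prefix_hash: str) -> str:
    """Build {endpoint}/{kv_namespace}/{prefix_hash} without doubled slashes."""
    return f"{endpoint.rstrip('/')}/{kv_namespace.strip('/')}/{prefix_hash}"


tokens = [101, 2023, 2003, 1037, 25732]
digest = compute_prefix_hash(tokens)
ref = KvReference(
    storage_path=kv_storage_path("http://vast-cluster:8080", "inference/kv-cache", digest),
    prefix_hash=digest,
    token_count=len(tokens),
    tensor_format="fp16",  # ASSUMPTION: format identifiers are not documented
)
```

Because the path is derived only from the prefix hash, two requests that share a prompt prefix resolve to the same object, which is what allows the stored tensors to be reused.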

Configuration

Top-Level Disaggregated Serving

disaggregated_serving:
  enabled: false                # Enable disaggregated prefill/decode serving
  prefill_timeout: "30s"        # Timeout for the prefill phase
  kv_transfer_timeout: "10s"    # Timeout for KV tensor transfer through external storage
  fallback_to_unified: true     # Fall back to unified serving when disaggregated serving is unavailable

  # Default external storage for backends that do not specify their own
  default_external_storage:
    endpoint: "http://vast-cluster:8080"
    kv_namespace: "inference/kv-cache"
    # credentials:             # Optional access credentials
    #   access_key: "${STORAGE_ACCESS_KEY}"
    #   secret_key: "${STORAGE_SECRET_KEY}"

Per-Backend Role Assignment

Add a `role`, and optionally an `external_storage` block, to each backend:

backends:
  # Prefill worker: computes KV tensors
  - name: prefill-worker-1
    url: "http://vllm-prefill-1:8000"
    role: prefill
    external_storage:
      endpoint: "http://vast-cluster:8080"
      kv_namespace: "inference/kv-cache"

  # Decode workers: generate tokens from cached KV data
  - name: decode-worker-1
    url: "http://vllm-decode-1:8000"
    role: decode
    weight: 2

  - name: decode-worker-2
    url: "http://vllm-decode-2:8000"
    role: decode
    weight: 2

  # Unified fallback backend (optional)
  - name: unified-fallback
    url: "http://vllm-unified:8000"
    role: unified   # or omit it; unified is the default

Minimal Configuration Example

disaggregated_serving:
  enabled: true
  default_external_storage:
    endpoint: "http://vast-cluster:8080"

backends:
  - name: prefill-gpu
    url: "http://vllm-prefill:8000"
    role: prefill
  - name: decode-gpu-1
    url: "http://vllm-decode-1:8000"
    role: decode
  - name: decode-gpu-2
    url: "http://vllm-decode-2:8000"
    role: decode

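Configuration Reference

`disaggregated_serving`

Field	Type	Default	Description
`enabled`	bool	`false`	Enables disaggregated serving
`prefill_timeout`	string	`"30s"`	Prefill phase timeout (supports the `ms`, `s`, and `m` suffixes)
`kv_transfer_timeout`	string	`"10s"`	Timeout for KV tensor transfer between workers
`fallback_to_unified`	bool	`true`	Falls back to unified serving when disaggregated backends are unavailable
`default_external_storage`	object	`null`	Default external storage settings for backends that do not define their own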
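
The timeout fields are strings with a unit suffix. A minimal sketch of how such values could be interpreted follows; `parse_duration_ms` is a hypothetical helper, and restricting the numeric part to whole numbers is an assumption, since the source lists only the supported suffixes.

```python
import re

# Milliseconds per supported unit suffix.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000}
_DURATION_RE = re.compile(r"^(\d+)(ms|s|m)$")


def parse_duration_ms(text: str) -> int:
    """Convert a duration such as "30s" or "500ms" to whole milliseconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit]


prefill_timeout_ms = parse_duration_ms("30s")      # default prefill_timeout
kv_transfer_timeout_ms = parse_duration_ms("10s")  # default kv_transfer_timeout
```

Returning integer milliseconds avoids floating-point rounding when sub-second values such as `"500ms"` are compared or summed.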

`external_storage` (per backend)

Field	Type	Default	Description
`endpoint`	string	required	External storage endpoint URL
`kv_namespace`	string	`"inference/kv-cache"`	Namespace path for KV tensors
`credentials`	object	`null`	Optional access credentials (redacted in logs and debug output)
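
The relationship between a backend's own `external_storage` and `default_external_storage` can be sketched as a simple lookup with defaults. `resolve_external_storage` is a hypothetical helper. Treating the backend block as a whole-object override, rather than merging it field by field with the default, is an assumption.

```python
from typing import Any, Dict, Mapping, Optional

DEFAULT_KV_NAMESPACE = "inference/kv-cache"


def resolve_external_storage(
    backend: Mapping[str, Any],
    default_storage: Optional[Mapping[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Pick the backend's own storage settings, else the global default."""
    # ASSUMPTION: a backend-level block replaces the default entirely.
    storage = backend.get("external_storage") or default_storage
    if storage is None:
        return None
    if "endpoint" not in storage:
        raise ValueError("external_storage.endpoint is required")
    resolved = dict(storage)
    # kv_namespace falls back to its documented default.
    resolved.setdefault("kv_namespace", DEFAULT_KV_NAMESPACE)
    return resolved


default = {"endpoint": "http://vast-cluster:8080"}
decode_backend = {"name": "decode-gpu-1", "role": "decode"}
storage = resolve_external_storage(decode_backend, default)
```

With the minimal configuration, where only `endpoint` is given, `storage` ends up with the default namespace filled in.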

`role` (per backend)

Value	Description
`unified`	Default. The backend participates in both prefill and decode routing.
`prefill`	The backend receives prefill-phase requests only.
`decode`	The backend receives decode-phase requests only.

Metrics

Disaggregated serving metrics use the `disaggregated_` prefix. Label values are validated to prevent cardinality explosion.

Metric	Type	Labels	Description
`disaggregated_requests_total`	Counter	`routing_path`	Total requests by routing path (`prefill_then_decode`, `fast_decode`, `unified`, `fallback`)
`disaggregated_prefill_duration_seconds`	Histogram	`backend`	Prefill phase duration in seconds
`disaggregated_decode_duration_seconds`	Histogram	`backend`	Decode phase duration in seconds
`disaggregated_kv_transfer_duration_seconds`	Histogram	(none)	KV tensor transfer duration in seconds
`disaggregated_fallback_total`	Counter	(none)	Total fallback events (disaggregated to unified)
`disaggregated_errors_total`	Counter	`phase`	Errors by phase (`prefill`, `decode`, `kv_transfer`, `orchestration`)

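PromQL examples:

# Fraction of requests that take the fast decode path
# (sum() removes the routing_path label so both sides of the division match)
sum(rate(disaggregated_requests_total{routing_path="fast_decode"}[5m]))
/ sum(rate(disaggregated_requests_total[5m]))

# Prefill P95 latency per backend
histogram_quantile(0.95,
  sum by (le, backend) (rate(disaggregated_prefill_duration_seconds_bucket[5m]))
)

# KV transfer P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(disaggregated_kv_transfer_duration_seconds_bucket[5m]))
)

# Alert: high fallback rate (more than 0.1 fallbacks per second)
rate(disaggregated_fallback_total[5m]) > 0.1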

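Integration with the KV Cache Index

Disaggregated serving works together with the KV Cache Index (Tier 4). The KV index tracks which decode workers hold a warm GPU cache for a given prefix hash, so the VAST transfer can be skipped entirely when the data already resides in a decode worker's GPU memory (the GpuHot tier).

When `storage_offloading.enabled` is `true` in the KV cache index configuration, the orchestrator can also route to a decode worker that holds the data in the StorageWarm tier (offloaded from GPU to VAST) and request an on-demand reload.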
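
The lookup described above can be sketched as follows. This is an illustration under assumptions, not the KV Cache Index API: the index is modeled as a plain dictionary, `find_warm_decode_worker` is a hypothetical name, and preferring GpuHot over StorageWarm with alphabetical tie-breaking is an assumed policy.

```python
from typing import Dict, Optional, Tuple

# prefix_hash -> {decode worker name -> tier}; a stand-in for the real index.
KvIndex = Dict[str, Dict[str, str]]


def find_warm_decode_worker(
    index: KvIndex,
    prefix_hash: str,
    storage_offloading_enabled: bool,
) -> Optional[Tuple[str, str]]:
    """Return (worker, tier) for a worker that can skip prefill, if any."""
    holders = index.get(prefix_hash, {})
    # Prefer workers that already hold the data in GPU memory.
    for worker, tier in sorted(holders.items()):
        if tier == "GpuHot":
            return worker, tier
    # StorageWarm workers qualify only when storage offloading is enabled.
    if storage_offloading_enabled:
        for worker, tier in sorted(holders.items()):
            if tier == "StorageWarm":
                return worker, tier
    return None


index: KvIndex = {
    "abc123": {"decode-worker-1": "StorageWarm", "decode-worker-2": "GpuHot"},
    "def456": {"decode-worker-1": "StorageWarm"},
}
```

A `None` result corresponds to a cache miss, in which case the request would need the full prefill phase.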

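Backend Selection for Load Balancing

Within each phase, the orchestrator selects the least-loaded healthy backend:

Prefill selection: iterates over backends with `role: prefill` or `role: unified` and picks the one with the fewest `in_flight` requests.
Decode selection: iterates over backends with `role: decode` or `role: unified` and picks the one with the fewest `in_flight` requests.

This keeps GPU utilization balanced across the prefill and decode pools.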
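
The selection rule for both phases reduces to a filtered minimum. Below is a minimal sketch: the `Backend` record and `pick_backend` are hypothetical, ties go to the first backend in list order, and the `weight` field from the configuration is not modeled because the source does not say how it interacts with `in_flight`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Backend:
    name: str
    role: str = "unified"  # unified is the documented default role
    healthy: bool = True
    in_flight: int = 0


def pick_backend(backends: Iterable[Backend], phase: str) -> Optional[Backend]:
    """Pick the healthy backend with the fewest in-flight requests.

    `phase` is "prefill" or "decode"; unified backends qualify for both.
    """
    eligible = [b for b in backends if b.healthy and b.role in (phase, "unified")]
    return min(eligible, key=lambda b: b.in_flight, default=None)


pool = [
    Backend("prefill-worker-1", role="prefill", in_flight=4),
    Backend("decode-worker-1", role="decode", in_flight=7),
    Backend("decode-worker-2", role="decode", in_flight=2),
    Backend("unified-fallback", in_flight=3),
]
```

With this pool, decode selection picks `decode-worker-2`, while prefill selection picks `unified-fallback` because it has fewer in-flight requests than the dedicated prefill worker.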

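Fallback Behavior

When `fallback_to_unified: true` (the default):

If no healthy prefill or decode backend is available, the orchestrator routes to a healthy unified backend.
If no unified backend is available either, the request is served through the standard backend pool without phase separation.
The routing path is reported as `fallback` in the response headers and metrics.

When `fallback_to_unified: false`, the request fails with an error if no disaggregated backend is available.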

Deployment Recommendations

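GPU allocation: Prefill is typically compute-bound, so prefill workers benefit most from high compute throughput. Decode is typically memory-bandwidth-bound, so decode workers benefit from high memory bandwidth and from large GPU memory for KV cache residency.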
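VAST sizing: Estimate the per-request KV tensor size as 2 × layers × KV heads × head dimension × sequence length × 2 bytes (fp16). For a 7B model with 32 layers and a 4096-token context, assuming 32 KV heads of dimension 128 (the Llama 2 7B shape), this comes to about 2 GiB per request. Models that use grouped-query attention with fewer KV heads need proportionally less.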
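Health checks: Configure backend health checks to detect GPU OOM and driver errors quickly. The circuit breaker activates fallback on repeated failures.
Decode pool size: More decode workers reduce queueing delay in the token generation phase. A 3:1 decode-to-prefill ratio is a common starting point.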
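
The VAST sizing formula in the recommendations above is easy to check numerically. The sketch below applies it as stated. The head count and head dimension are assumptions, because the text gives only the layer count and context length: 32 KV heads of dimension 128 matches the Llama 2 7B shape, and the 8-KV-head variant illustrates grouped-query attention.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # fp16
) -> int:
    """Per-request KV size: 2 (K and V) x layers x heads x dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
MIB = 1024 ** 2

# ASSUMPTION: 32 KV heads x 128 dims (multi-head attention, Llama 2 7B shape).
mha_bytes = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)

# ASSUMPTION: same model shape but with 8 KV heads (grouped-query attention).
gqa_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha_bytes / GIB:.1f} GiB, GQA: {gqa_bytes / MIB:.0f} MiB")
```

Multiplying the per-request figure by the number of distinct prefixes expected to stay resident gives a first-order capacity estimate for the KV namespace.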