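Performance Guide

This guide covers the performance characteristics of Continuum Router, along with optimization strategies, benchmarking, and tuning.

Performance Characteristics

Target Metrics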

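Metric	Target	Notes
Routing latency	< 10ms	Overhead added by the router
Throughput	1000+ req/s	Per instance
Memory usage	< 100MB	Excluding caches
CPU usage	< 20%	Under moderate load
Connection pool	500 connections	Per backend
Concurrent requests	1000+	With appropriate tuning
Cache hit rate	> 80%	Model list caching
Startup time	< 5s	Cold start
Health check latency	< 100ms	Per backend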
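
The throughput and concurrency targets in the table are linked by Little's Law: the average number of in-flight requests equals the arrival rate multiplied by the mean time each request spends in the system. The sketch below is only a sizing aid; the function name and the example latency are illustrative assumptions, not values taken from the router.

```python
def required_concurrency(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x mean time in system."""
    return throughput_rps * mean_latency_s


# Assumed example: 1000 req/s with a 0.5 s mean end-to-end latency
# (the latency figure is hypothetical and dominated by the backend, not the router).
in_flight = required_concurrency(1000, 0.5)
print(f"average in-flight requests: {in_flight:.0f}")
```

Because LLM backends often take seconds to respond, the in-flight count can exceed the request rate, which is worth keeping in mind when sizing connection pools and file descriptor limits.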

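Scalability Limits

Backends: support for multiple backends has been tested
Models: can aggregate and cache hundreds of models
Request size: configurable maximum request size
Response streaming: no buffering, minimal memory overhead
Concurrent connections: supports thousands of concurrent connections when tuned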

Resource Requirements

Minimum (Development)

CPU: 1 core
Memory: 256MB
Network: 100 Mbps
Disk: 10GB

Recommended (Production)

CPU: 4 cores
Memory: 2GB
Network: 1 Gbps
Disk: 50GB SSD

High Performance

CPU: 16+ cores
Memory: 8GB
Network: 10 Gbps
Disk: 100GB NVMe SSD

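Benchmarks

Running Benchmarks

The project includes performance benchmarks in the benches/ directory:

# Run all benchmarks
cargo bench

# Run a specific benchmark
cargo bench performance

# Save the results as a baseline named "main" for later comparison
cargo bench -- --save-baseline main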

Throughput Test Example

Throughput testing with wrk:

# Install wrk
brew install wrk  # macOS
sudo apt-get install wrk  # Ubuntu

# Basic throughput test (options first, URL last)
wrk -t12 -c400 -d30s --latency \
  -s scripts/chat_completion.lua \
  http://localhost:8080/v1/chat/completions

Example Lua script (scripts/chat_completion.lua):

wrk.method = "POST"
wrk.body   = '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}'
wrk.headers["Content-Type"] = "application/json"

Latency Test Example

Latency testing with hey:

# Install hey
go install github.com/rakyll/hey@latest

# Latency test
hey -z 30s -c 100 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://localhost:8080/v1/chat/completions
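
Load tools such as hey and wrk report a latency distribution rather than a single number. As a rough illustration of why tail percentiles matter more than the mean, here is a minimal nearest-rank percentile sketch. The function and the sample values are hypothetical, and the tools themselves may use a different interpolation method.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: nine fast requests and one slow outlier.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

A single slow request barely moves the median but dominates the tail, which is the behavior a p95 target is meant to catch.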

Performance Tuning

Operating System Tuning

Linux Kernel Parameters

# /etc/sysctl.conf

# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

# File descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152

# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

# Apply the settings (run this in a shell; it is not part of sysctl.conf)
sudo sysctl -p

User Limits

# /etc/security/limits.conf
continuum soft nofile 65535
continuum hard nofile 65535
continuum soft nproc 32768
continuum hard nproc 32768
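
Limits set in limits.conf apply only to new login sessions, so it can help to verify what a process actually sees. The following sketch uses Python's standard `resource` module (Unix only) to read the open-file limit. The helper names and the 65535 requirement are illustrative assumptions that mirror the values above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(required, soft):
    """True when the soft limit is unlimited or at least the required value."""
    return soft == resource.RLIM_INFINITY or soft >= required


soft, hard = nofile_limits()
print(f"soft={soft} hard={hard}")
if not has_headroom(65535, soft):
    print("soft nofile limit is below 65535; check limits.conf or the service unit")
```

Run it as the same user (and under the same service manager) as the router, since limits can differ between an interactive shell and a managed service.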

Application Tuning

High-Throughput Configuration

# config-high-throughput.yaml
server:
  bind_address: "0.0.0.0:8080"
  connection_pool_size: 1000
  keepalive_timeout: 75

request:
  timeout: 30
  max_retries: 1

cache:
  model_cache_ttl: 900  # 15 minutes

http_client:
  pool_idle_timeout: 90
  pool_max_idle_per_host: 100

Low-Latency Configuration

# config-low-latency.yaml
server:
  bind_address: "0.0.0.0:8080"

routing:
  strategy: "LeastLatency"
  fallback_strategy: "RoundRobin"

request:
  timeout: 10
  max_retries: 0  # No retries, for the lowest latency

cache:
  model_cache_ttl: 3600  # 1-hour cache for consistency

health_checks:
  interval: 10  # Frequent checks for accurate routing

Memory-Constrained Configuration

# config-low-memory.yaml
server:
  connection_pool_size: 10

cache:
  model_cache_ttl: 60  # Short TTL to reduce memory usage

request:
  timeout: 30

logging:
  level: "error"  # Reduce log volume

Optimization Strategies

Request Optimization

Request Deduplication

deduplication:
  enabled: true
  ttl: 60  # seconds
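
To make the idea behind deduplication concrete, here is a minimal sketch of a TTL-based scheme: identical requests seen within the TTL window reuse the earlier response. This is not the router's implementation. The class name, the SHA-256 fingerprint over canonical JSON, and the injected clock are all assumptions made for illustration, and the sketch does not coalesce requests that are still in flight.

```python
import hashlib
import json
import time


class Deduplicator:
    """Reuse the response of an identical request seen within ttl_seconds (hypothetical sketch)."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # fingerprint -> (expires_at, response)

    @staticmethod
    def fingerprint(request: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_compute(self, request, compute):
        now = self.clock()
        key = self.fingerprint(request)
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the backend call
        response = compute(request)
        self._entries[key] = (now + self.ttl, response)
        return response
```

A production version would also need eviction of expired entries and a size bound; those are left out to keep the sketch short.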

Connection Pooling

server:
  connection_pool_size: 500  # Adjust based on load

Timeout Settings

request:
  timeout: 30  # seconds
  streaming_timeout: 300  # 5 minutes for streaming

Response Optimization

Streaming

SSE streaming is enabled automatically for requests with "stream": true
No buffering, which keeps memory overhead minimal
Chunks are forwarded immediately

Model Caching

```
cache:
  model_cache_ttl: 300  # 5 minutes
```

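Backend Optimization

Load Balancing Strategies

Available strategies:

RoundRobin: simple round-robin selection
WeightedRoundRobin: weight-based distribution
LeastLatency: routes to the fastest backend
Random: random backend selection
ConsistentHash: hash-based routing for session affinity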
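
Two of the strategies listed above can be sketched in a few lines to show what they do conceptually. This is a simplified illustration, not the router's algorithm: the function names are hypothetical, the weighted variant simply repeats each backend according to its weight (a real implementation may interleave picks more smoothly), and the latency figures are made up.

```python
import itertools


def weighted_round_robin(weights):
    """Cycle through backends in proportion to their integer weights."""
    expanded = [backend for backend, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_latency(latencies_ms, fallback):
    """Pick the backend with the lowest observed latency; call fallback() when there is no data."""
    if not latencies_ms:
        return fallback()
    return min(latencies_ms, key=latencies_ms.get)


# Illustrative weights: backend1 receives three picks for every pick of backend2.
picker = weighted_round_robin({"http://backend1:11434": 3, "http://backend2:11434": 1})
print([next(picker) for _ in range(4)])

# Illustrative latency observations in milliseconds.
print(least_latency({"http://backend1:11434": 120.0, "http://backend2:11434": 45.0},
                    fallback=lambda: next(picker)))
```

The `fallback` argument mirrors the idea of a fallback strategy: when no latency data exists yet, selection defers to another policy.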

Configuration example:

routing:
  strategy: "LeastLatency"
  fallback_strategy: "RoundRobin"

  # For WeightedRoundRobin
  weights:
    "http://backend1:11434": 3
    "http://backend2:11434": 1

Health Checks

health_checks:
  enabled: true
  interval: 30  # seconds
  timeout: 5    # seconds
  unhealthy_threshold: 3
  healthy_threshold: 2
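
Threshold settings like these are commonly interpreted as counts of consecutive check results needed to flip a backend's state. The sketch below models that interpretation. Whether the router counts consecutive or cumulative results is not stated here, so treat the semantics, the class name, and the method names as assumptions.

```python
class BackendHealth:
    """Flip state only after enough consecutive contradicting check results (assumed semantics)."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        threshold = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

With an interval of 30 seconds and an unhealthy threshold of 3, this model would take roughly 90 seconds of failed checks before a backend is removed from rotation, which is the trade-off between fast detection and tolerance for transient errors.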

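TTFB Optimization

Time To First Byte (TTFB) matters for streaming LLM responses. The router implements several optimizations to minimize TTFB overhead.

Connection Pre-Warming

At startup, the router pre-warms connections to all backends:

// Automatic connection pre-warming at startup
// - Establishes HTTP/2 connections in advance
// - Reduces cold-start latency on the first request
// - Includes authentication headers so connections are in the proper state

Pre-warming behavior by backend type: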

Backend type	Pre-warming endpoint	Headers
OpenAI	`GET /v1/models`	`Authorization: Bearer`
Anthropic	`POST /v1/messages` (empty request)	`x-api-key`, `anthropic-version`
Gemini	`GET /v1/models`	`Authorization: Bearer`

Streaming Client Optimization

For streaming requests, the router uses an optimized HTTP client:

// HttpClientFactory::optimized_streaming()
// - HTTP/2 with aggressive keep-alive
// - Large connection pool (100 per host)
// - 600-second timeout for extended-thinking models
// - TCP keepalive enabled

TTFB Test Scripts

The project includes TTFB comparison scripts in tests/scripts/:

# Test individual backends
./tests/scripts/test_anthropic_ttfb.sh claude-haiku-4-5 5
./tests/scripts/test_openai_ttfb.sh gpt-4o-mini 5
./tests/scripts/test_gemini_ttfb.sh gemini-2.5-flash 5

# Test all backends
./tests/scripts/test_all_ttfb.sh 5

TTFB Metrics

Backend	Direct API	Via router	Overhead
Anthropic (claude-haiku-4-5)	~1.0s	~1.1s	~0.1s
OpenAI (gpt-4o-mini)	~0.7s	~0.3s*	-0.4s
Gemini (gemini-2.5-flash)	~0.9s	~0.8s	-0.1s

*The router is often faster than a direct connection because of connection pooling and HTTP/2 reuse
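
For readers who want to reproduce this kind of comparison in their own tooling, the measurement itself is simple: start a timer when the request is sent and stop it when the first streamed chunk arrives. The sketch below works on any iterable of chunks, so it is independent of a particular HTTP client. The function names are hypothetical and it is not derived from the project's test scripts.

```python
import time


def measure_ttfb(chunks, clock=time.perf_counter):
    """Return (ttfb_seconds, total_seconds, chunk_count) for an iterable of streamed chunks.

    ttfb_seconds is None when the stream yields nothing.
    """
    start = clock()
    ttfb = None
    count = 0
    for _ in chunks:
        if ttfb is None:
            ttfb = clock() - start  # time until the first chunk arrived
        count += 1
    total = clock() - start
    return ttfb, total, count


def overhead(direct_s, via_router_s):
    """Router overhead in seconds; negative means the routed path was faster."""
    return via_router_s - direct_s
```

Create the chunk iterable lazily (for example, a generator that issues the request when first iterated) so that connection setup is included in the measured time. Averaging several runs per backend also helps, because single TTFB samples for LLM APIs vary widely.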

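Reducing TTFB

Enable connection pre-warming: enabled by default at startup
Use HTTP/2: all backends support HTTP/2; enabled by default
Minimize TLS handshakes: connection pooling reuses TLS sessions
Backend selection: use the LeastLatency routing strategy for the best TTFB

Load Testing

Using k6

Install k6: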

# macOS
brew install k6

# Linux
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update
sudo apt-get install k6

Create a test script (load-test.js):

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up to 100 virtual users
    { duration: '5m', target: 100 },  // Hold at 100
    { duration: '2m', target: 200 },  // Ramp up to 200
    { duration: '5m', target: 200 },  // Hold at 200
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% of requests under 500ms
    http_req_failed: ['rate<0.1'],     // Error rate below 10%
  },
};

export default function() {
  let payload = JSON.stringify({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: 'Hello' }],
  });

  let params = {
    headers: { 'Content-Type': 'application/json' },
  };

  let res = http.post('http://localhost:8080/v1/chat/completions', payload, params);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);
}

Run the test:

k6 run load-test.js

Performance Monitoring

Prometheus Metrics

The router exposes Prometheus metrics at the /metrics endpoint (when enabled):

metrics:
  enabled: true
  endpoint: "/metrics"
  auth:
    enabled: true
    username: "metrics"
    password: "secure_password"

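Key metrics to monitor:

http_requests_total: total number of requests
http_request_duration_seconds: request latency histogram
backend_request_duration_seconds: backend latency
backend_health_status: backend health status
active_connections: currently active connections

Prometheus Query Examples

# Request rate
rate(http_requests_total[5m])

# Error rate (share of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Backend health
backend_health_status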
Performance Troubleshooting

High Latency

Check backend latency

# Test the backend directly
time curl -X POST http://backend:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"test"}]}'

Enable debug logging

RUST_LOG=continuum_router=debug cargo run

Check health status

curl http://localhost:8080/admin/health

High Memory Usage

Check cache size

# Monitor memory usage
ps aux | grep continuum-router

Reduce cache TTL

cache:
  model_cache_ttl: 60  # Reduced from the default

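Low Throughput

Check the connection pool

server:
  connection_pool_size: 1000  # Increase if needed

Check system limits

# Check the file descriptor limit
ulimit -n

# Check TCP settings
sysctl net.core.somaxconn

Test the backend directly

# Bypass the router to test backend performance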
ab -n 1000 -c 100 http://backend:11434/v1/models

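Best Practices

Development

Run benchmarks with cargo bench before merging performance-critical changes
Use profiling tools during development
Set up performance regression tests
Monitor resource usage in staging

Production

Start with conservative settings and tune gradually
Monitor key metrics continuously
Set up alerts for performance degradation
Plan capacity based on peak load plus a 20% buffer
Use horizontal scaling for high availability
Implement graceful degradation under load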
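
The capacity rule in the list above (peak load plus a 20% buffer) reduces to a short calculation. The sketch below is one way to apply it. The function name and the example figures are hypothetical, and per-instance throughput should come from your own load tests rather than from the target metrics.

```python
import math


def instances_needed(peak_rps, per_instance_rps, buffer_percent=20):
    """Instances required to serve peak load plus a safety buffer, rounded up."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps * (100 + buffer_percent) / (100 * per_instance_rps))


# Assumed example: 2500 req/s at peak, 1000 req/s measured per instance.
print(instances_needed(2500, 1000))
```

For high availability, consider adding at least one instance beyond the computed count so that losing a single node does not drop capacity below peak demand.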

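Testing

Test with realistic workloads
Include both streaming and non-streaming requests
Test with a variety of model configurations
Simulate network issues and backend failures
Run regular load tests in staging

Future Improvements

The following features are planned for future releases:

Redis-based distributed caching (L2 cache)
Advanced cache warming strategies
Location-aware routing
WebSocket support for bidirectional streaming
Built-in distributed tracing
Automatic scaling based on load metrics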