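Metrics and Monitoring

This document describes the metrics and monitoring features of Continuum Router.

Overview

Continuum Router exposes Prometheus-compatible metrics for monitoring system health, performance, and usage patterns. The metrics system is designed to be:

Lightweight: minimal performance overhead
Comprehensive: covers every important aspect of the router
Production-ready: includes cardinality limits and consistent labeling
Easy to integrate: works with standard Prometheus/Grafana setups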

Quick Start

1. Enable Metrics

Metrics are enabled by default. The metrics endpoint is available at /metrics:

# View metrics
curl http://localhost:8000/metrics

2. Configure Prometheus

Add the router as a target in prometheus.yml:

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['localhost:8000']
    scrape_interval: 15s

3. Import the Grafana Dashboard

Import the provided dashboard from monitoring/grafana/dashboards/router-overview.json.

Configuration

Metrics are configured in the main configuration file:

metrics:
  # Enable or disable metrics collection
  enabled: true

  # Path of the metrics endpoint
  endpoint: "/metrics"

  # Cardinality limits to prevent metric explosion
  cardinality_limit:
    max_labels_per_metric: 100
    max_unique_label_values: 1000

  # Optional metrics (body-size metrics are disabled by default for performance)
  optional_metrics:
    enable_request_body_size: false
    enable_response_body_size: false
    enable_detailed_errors: true

Environment Variables

Metrics can also be configured through environment variables:

# Enable or disable metrics
METRICS_ENABLED=true

# Change the metrics endpoint
METRICS_ENDPOINT=/custom/metrics

# Enable optional metrics
METRICS_ENABLE_BODY_SIZE=true
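
A monitoring script that runs with the same environment as the router can use METRICS_ENDPOINT to work out which URL to scrape. The following is a minimal sketch under that assumption; `metrics_url` and the default base URL are hypothetical, and how the router itself parses these variables is not described here.

```python
import os


def metrics_url(base_url="http://localhost:8000", env=None):
    """Build the scrape URL from METRICS_ENDPOINT, defaulting to /metrics.

    Hypothetical helper for client-side scripts; not part of the router.
    """
    env = os.environ if env is None else env
    path = env.get("METRICS_ENDPOINT", "/metrics")
    if not path.startswith("/"):
        path = "/" + path
    return base_url.rstrip("/") + path


if __name__ == "__main__":
    print(metrics_url())
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.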

Available Metrics

HTTP Metrics

Metric	Type	Description	Labels
`http_requests_total`	Counter	Total number of HTTP requests	`method`, `endpoint`, `status`
`http_request_duration_seconds`	Histogram	Request latency	`method`, `endpoint`
`http_active_connections`	Gauge	Currently active connections	-
`http_request_size_bytes`	Histogram	Request body size	`method`, `endpoint`
`http_response_size_bytes`	Histogram	Response body size	`method`, `endpoint`

Backend Metrics

Metric	Type	Description	Labels
`backend_health_status`	Gauge	Backend health (1 = healthy, 0 = unhealthy)	`backend_id`, `backend_url`
`backend_health_check_duration_seconds`	Histogram	Health check duration	`backend_id`
`backend_health_check_failures_total`	Counter	Total health check failures	`backend_id`, `error_type`
`backend_request_latency_seconds`	Histogram	Backend request latency	`backend_id`, `endpoint`
`backend_connection_pool_size`	Gauge	Connection pool size	`backend_id`
`backend_connection_pool_active`	Gauge	Active connections in the pool	`backend_id`

Routing Metrics

Metric	Type	Description	Labels
`routing_decisions_total`	Counter	Total routing decisions	`strategy`, `selected_backend`
`routing_backend_selection_duration_seconds`	Histogram	Backend selection time	`strategy`
`routing_model_availability`	Gauge	Model availability per backend	`model`, `backend_id`
`routing_retries_total`	Counter	Total retries	`backend_id`, `reason`
`routing_circuit_breaker_state`	Gauge	Circuit breaker state	`backend_id`

Model Service Metrics

Metric	Type	Description	Labels
`model_cache_hits_total`	Counter	Model cache hits	`operation`
`model_cache_misses_total`	Counter	Model cache misses	`operation`
`model_refresh_duration_seconds`	Histogram	Model list refresh duration	`backend_id`
`model_discovery_errors_total`	Counter	Model discovery errors	`backend_id`, `error_type`

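Cache Stampede Prevention Metrics

These metrics help monitor the cache stampede prevention mechanism:

Metric	Type	Description	Labels
`model_stale_while_revalidate_total`	Counter	Requests that returned stale data while a refresh was in progress	-
`model_coalesced_requests_total`	Counter	Requests that waited for an in-flight aggregation instead of triggering a new one	-
`model_background_refreshes_total`	Counter	Background refresh tasks started	-
`model_background_refresh_successes_total`	Counter	Background refresh tasks that succeeded	-
`model_background_refresh_failures_total`	Counter	Background refresh tasks that failed	-
`model_singleflight_lock_acquired_total`	Counter	Number of times the aggregation lock was acquired for singleflight	-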

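Understanding the Cache Stampede Metrics

High model_coalesced_requests_total: the singleflight pattern is effectively preventing duplicate aggregations
High model_stale_while_revalidate_total: the stale-while-revalidate pattern is serving cached data while refreshes run
Low model_background_refresh_failures_total: background refreshes are working correctly
No blocking on cache miss: when model_background_refreshes_total > 0, requests should rarely have to wait for a cache refresh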
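
The raw counters are easier to judge as ratios. The sketch below derives two ratios from a single scrape; `stampede_summary` is a hypothetical helper, and treating each singleflight lock acquisition as one aggregation is an assumption rather than documented router behavior.

```python
def stampede_summary(counters):
    """Derive ratios from a dict of counter name -> value taken from one scrape."""
    started = counters.get("model_background_refreshes_total", 0.0)
    failed = counters.get("model_background_refresh_failures_total", 0.0)
    coalesced = counters.get("model_coalesced_requests_total", 0.0)
    locks = counters.get("model_singleflight_lock_acquired_total", 0.0)
    return {
        # Share of background refreshes that failed (lower is better).
        "refresh_failure_ratio": failed / started if started else 0.0,
        # Assumption: one lock acquisition corresponds to one aggregation, so
        # this is the average number of waiters each aggregation absorbed.
        "coalesced_per_aggregation": coalesced / locks if locks else 0.0,
    }


if __name__ == "__main__":
    print(stampede_summary({
        "model_background_refreshes_total": 10,
        "model_background_refresh_failures_total": 1,
        "model_coalesced_requests_total": 30,
        "model_singleflight_lock_acquired_total": 10,
    }))
```

Because counters only grow, ratios over the whole process lifetime smooth out recent changes; comparing two scrapes gives a more current picture.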

Streaming Metrics

Metric	Type	Description	Labels
`streaming_active_connections`	Gauge	Active streaming connections	`endpoint`
`streaming_events_sent_total`	Counter	Total SSE events sent	`endpoint`, `event_type`
`streaming_connection_duration_seconds`	Histogram	Streaming connection duration	`endpoint`
`streaming_errors_total`	Counter	Streaming errors	`endpoint`, `error_type`

Fallback Metrics

Metric	Type	Description	Labels
`fallback_attempts_total`	Counter	Total fallback attempts	`original_model`, `fallback_model`, `reason`
`fallback_success_total`	Counter	Successful fallbacks	`original_model`, `fallback_model`
`fallback_exhausted_total`	Counter	Exhausted fallback chains	`original_model`
`fallback_cross_provider_total`	Counter	Cross-provider fallbacks	`from_provider`, `to_provider`
`fallback_duration_seconds`	Histogram	Fallback operation duration	`original_model`

Business Metrics

Metric	Type	Description	Labels
`model_usage_total`	Counter	Model usage count	`model`, `backend_id`
`tokens_consumed_total`	Counter	Total tokens consumed	`model`, `operation`

Integration

Prometheus Configuration

A complete Prometheus configuration example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['router1:8000', 'router2:8000']
    metric_relabel_configs:
      # Drop high-cardinality metrics if needed.
      # Note: without these buckets, latency quantile queries stop working.
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop

Kubernetes Integration

For Kubernetes deployments, use a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: continuum-router
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: continuum-router
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

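Grafana Dashboard

The provided Grafana dashboard includes the following:

Overview Panels

Request rate and error rate
P50, P95, and P99 latency
Active connections
Backend health status

Backend Performance

Latency per backend
Health check success rate
Connection pool utilization
Circuit breaker state

Model Usage

Model request distribution
Cache hit rate
Token consumption
Model availability matrix

Alert Overview

Active alerts
Alert history
SLO compliance

To import the dashboard:

Open Grafana
Go to Dashboards -> Import
Upload monitoring/grafana/dashboards/router-overview.json
Select the Prometheus data source
Click Import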

Alerts

Preconfigured alert rules are available in monitoring/prometheus/alerts.yml:

Critical Alerts

- alert: BackendDown
  expr: backend_health_status == 0
  for: 1m
  annotations:
    summary: "Backend {{ $labels.backend_id }} is down"

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  annotations:
    summary: "High error rate: {{ $value | humanizePercentage }}"

Warning Alerts

- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
  for: 5m
  annotations:
    summary: "P95 latency above 1s: {{ $value | humanizeDuration }}"

- alert: LowCacheHitRate
  expr: sum(rate(model_cache_hits_total[5m])) / (sum(rate(model_cache_hits_total[5m])) + sum(rate(model_cache_misses_total[5m]))) < 0.8
  for: 10m
  annotations:
    summary: "Cache hit rate below 80%: {{ $value | humanizePercentage }}"

Examples

Query Examples

Request Rate by Status

sum(rate(http_requests_total[5m])) by (status)

P95 Latency by Endpoint

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
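
To make clear what the quantile query computes, the following sketch reproduces the linear interpolation that PromQL applies to classic histogram buckets. It is an illustration, not router code; the bucket values in the usage example are made up, and the function assumes the lowest bucket starts at 0 and the last bucket is +Inf.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Uses linear interpolation inside the bucket that contains the target
    rank, as PromQL does for classic histograms.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Illustrative buckets: le=0.1, 0.5, 1.0 and +Inf with cumulative counts.
    example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

The result is only as precise as the bucket boundaries, which is why the estimate improves when buckets are dense around the latencies that matter.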

Backend Health Overview

sum(backend_health_status) by (backend_id)

Top Models by Usage

topk(10, sum(rate(model_usage_total[1h])) by (model))

Error Rate Percentage

sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
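
The same percentage can be approximated outside Prometheus from two snapshots of `http_requests_total` summed per status code. This is a minimal sketch; `error_rate_percent` is a hypothetical helper, and the reset handling is a simplification of what `rate()` does.

```python
def error_rate_percent(before, after):
    """Percentage of 5xx requests between two counter snapshots.

    before/after map a status code string to the http_requests_total value
    summed over all other labels.
    """
    total = 0.0
    errors = 0.0
    for status, value in after.items():
        delta = value - before.get(status, 0.0)
        if delta < 0:
            # Counter reset (for example a router restart): count from zero.
            delta = value
        total += delta
        if len(status) == 3 and status.startswith("5"):
            errors += delta
    return 100.0 * errors / total if total else 0.0


if __name__ == "__main__":
    earlier = {"200": 100, "500": 5}
    later = {"200": 190, "500": 15}
    print(error_rate_percent(earlier, later))
```

Working on deltas rather than raw totals matters because a counter's absolute value reflects the whole process lifetime, not the recent window.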

Programmatic Access

Metrics can also be accessed programmatically:

import requests
from prometheus_client.parser import text_string_to_metric_families

# Fetch the metrics (fail fast on network or HTTP errors)
response = requests.get('http://localhost:8000/metrics', timeout=10)
response.raise_for_status()
metrics = text_string_to_metric_families(response.text)

# Process the metrics
for family in metrics:
    for sample in family.samples:
        if sample.name == 'http_requests_total':
            print(f"Endpoint: {sample.labels['endpoint']}, count: {sample.value}")
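
Where installing prometheus_client is not an option, simple exposition text can be parsed with the standard library. The sketch below is deliberately limited: it assumes label values contain no escaped quotes and ignores timestamps, and `parse_samples` and `requests_by_status` are hypothetical helper names.

```python
import re
from collections import defaultdict

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def requests_by_status(text):
    """Sum http_requests_total across all labels except status."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "http_requests_total":
            totals[labels.get("status", "unknown")] += value
    return dict(totals)


if __name__ == "__main__":
    sample_text = 'http_requests_total{method="GET",status="200"} 3\n'
    print(requests_by_status(sample_text))
```

For anything beyond quick scripts, the prometheus_client parser shown above remains the safer choice because it implements the full text format.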

Custom Metrics Collection

#!/bin/bash
# Collect metrics every 30 seconds and save them to a file

while true; do
  timestamp=$(date +%s)
  curl -s http://localhost:8000/metrics > "metrics_${timestamp}.txt"
  sleep 30
done

Best Practices

1. Label Cardinality

Keep label cardinality low to prevent metric explosion:

# Good: low cardinality
labels:
  status: "200"  # ~5 possible values
  method: "GET"  # ~7 possible values

# Bad: high cardinality
labels:
  user_id: "12345"  # unbounded
  request_id: "abc-123"  # unique per request
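
One way to spot a high-cardinality label in practice is to count the distinct label sets each metric produces in a scrape. The following is a minimal sketch; `series_per_metric`, `over_limit`, and the default threshold are hypothetical and unrelated to the router's own cardinality_limit enforcement.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets per metric name from (name, labels) pairs."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def over_limit(samples, limit=1000):
    """Return the metric names whose series count exceeds the threshold."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > limit)


if __name__ == "__main__":
    # A user_id label creates one series per user, which grows without bound.
    scraped = [("http_requests_total", {"status": "200", "user_id": str(i)})
               for i in range(5)]
    print(series_per_metric(scraped))
```

Running such a check against a staging scrape before adding a new label is cheaper than discovering the growth in Prometheus memory usage later.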

2. Metric Naming

Follow Prometheus naming conventions:

Use snake_case
Include the unit in the metric name (_seconds, _bytes, _total)
Use standard prefixes (http_, backend_, model_)

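3. Dashboard Design

Group related metrics together
Use appropriate visualization types (gauges for current values, graphs for time series)
Include both absolute values and rates
Set appropriate refresh intervals (15-30 seconds for real-time views, 1-5 minutes for historical views)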

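4. Alert Configuration

Use an appropriate evaluation period to prevent flapping (for: 5m)
Include context in alert descriptions
Route alerts according to severity
Test alerts in staging before production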
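
The effect of a `for` clause on flapping can be seen in a small simulation. This sketch models only the pending-to-firing transition: an alert fires once its condition has held continuously for the `for` duration, and any false evaluation resets the timer. `firing_states` and its parameters are hypothetical names.

```python
def firing_states(condition, interval_s=15, for_s=300):
    """Return, per rule evaluation, whether the alert is firing.

    condition: one boolean per evaluation (True = expression matched).
    interval_s: seconds between evaluations (evaluation_interval).
    for_s: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for index, matched in enumerate(condition):
        now = index * interval_s
        if not matched:
            active_since = None  # condition cleared: back to inactive
            states.append(False)
            continue
        if active_since is None:
            active_since = now  # condition just became true: pending
        states.append(now - active_since >= for_s)
    return states


if __name__ == "__main__":
    # A brief dip after three matches resets the timer, so firing is delayed.
    history = [True] * 3 + [False] + [True] * 5
    print(firing_states(history, interval_s=60, for_s=180))
```

A longer `for` suppresses short spikes at the cost of slower notification, so the value is a trade-off per alert rather than a single global setting.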

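5. Performance Considerations

Disable optional metrics when they are not needed
Use recording rules for complex queries
Implement an appropriate metrics retention policy
Consider remote storage for long-term retention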

6. Security

Protect the metrics endpoint if it exposes sensitive data
Use TLS for Prometheus scraping in production
Require authentication for Grafana dashboards
Audit metrics access logs

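Troubleshooting

Metrics Not Appearing

Check that metrics are enabled in the configuration
Check that the metrics endpoint is reachable
Check the Prometheus target status
Review the router logs for metrics initialization errors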

High Memory Usage

Review the cardinality limits
Check for unbounded labels
Reduce histogram buckets if needed
Enable metric expiration
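
A rough upper bound on the number of time series a histogram creates helps when deciding whether to trim buckets or labels. This is a back-of-the-envelope sketch; `histogram_series` and the label counts in the example are hypothetical, and real usage is lower when not every label combination occurs.

```python
from math import prod


def histogram_series(label_cardinalities, finite_buckets):
    """Upper bound on series for one histogram metric.

    label_cardinalities: label name -> number of distinct values.
    finite_buckets: number of buckets excluding +Inf.
    """
    combinations = prod(label_cardinalities.values())
    # Each label combination has the finite buckets plus +Inf, _sum and _count.
    return combinations * (finite_buckets + 3)


if __name__ == "__main__":
    # Illustrative numbers: 7 methods, 20 endpoints, 10 finite buckets.
    print(histogram_series({"method": 7, "endpoint": 20}, 10))
```

Because the bucket count multiplies every label combination, removing a few buckets from a widely labeled histogram can save more series than dropping an entire low-cardinality metric.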

Incorrect Values

Check the metric type (counter vs. gauge)
Check the aggregation functions
Review the label selectors
Verify the time range