Metrics and Monitoring¶

This document describes the metrics and monitoring capabilities of the Continuum Router.

Table of Contents¶

Overview
Quick Start
Configuration
Available Metrics
Integration
Grafana Dashboard
Alerting
Examples
Best Practices
Troubleshooting

Overview¶

The Continuum Router exposes Prometheus-compatible metrics for monitoring system health, performance, and usage patterns. The metrics system is designed to be:

Lightweight: Minimal performance overhead
Broad coverage: Covers HTTP, backend, routing, model, streaming, and cache subsystems
Production-ready: Includes cardinality limits and proper labeling
Easy to integrate: Works with standard Prometheus/Grafana setups

For metric history that survives restarts without standing up a full Prometheus stack, see the Persistent Metrics Log. It snapshots the registry to a local SQLite store and exposes recent history via GET /admin/metrics/history.

Quick Start¶

1. Enable Metrics¶

Metrics are enabled by default. The metrics endpoint is available at /metrics:

# View metrics
curl http://localhost:8000/metrics

2. Configure Prometheus¶

Add the router as a target in your prometheus.yml:

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['localhost:8000']
    scrape_interval: 15s

3. Import Grafana Dashboard¶

Import the provided dashboard from monitoring/grafana/dashboards/router-overview.json.

Configuration¶

Metrics configuration is done through the main config file:

metrics:
  # Enable/disable metrics collection
  enabled: true

  # Metrics endpoint path
  endpoint: "/metrics"

  # Cardinality limits to prevent metric explosion
  cardinality_limit:
    max_labels_per_metric: 100
    max_unique_label_values: 1000

  # Optional metrics (body-size metrics are off by default for performance)
  optional_metrics:
    enable_request_body_size: false
    enable_response_body_size: false
    enable_detailed_errors: true

Environment Variables¶

You can also configure metrics using environment variables:

# Enable/disable metrics
METRICS_ENABLED=true

# Change metrics endpoint
METRICS_ENDPOINT=/custom/metrics

# Enable optional metrics
METRICS_ENABLE_BODY_SIZE=true

Available Metrics¶

HTTP Metrics¶

Metric	Type	Description	Labels
`http_requests_total`	Counter	Total number of HTTP requests	`method`, `endpoint`, `status`
`http_request_duration_seconds`	Histogram	Request latency	`method`, `endpoint`
`http_active_connections`	Gauge	Current active connections	-
`http_request_size_bytes`	Histogram	Request body size	`method`, `endpoint`
`http_response_size_bytes`	Histogram	Response body size	`method`, `endpoint`

Backend Metrics¶

Metric	Type	Description	Labels
`backend_health_status`	Gauge	Backend health (1=healthy, 0=unhealthy)	`backend_id`, `backend_url`
`backend_health_check_duration_seconds`	Histogram	Health check duration	`backend_id`
`backend_health_check_failures_total`	Counter	Total health check failures	`backend_id`, `error_type`
`backend_request_latency_seconds`	Histogram	Backend request latency	`backend_id`, `endpoint`
`backend_connection_pool_size`	Gauge	Connection pool size	`backend_id`
`backend_connection_pool_active`	Gauge	Active connections in pool	`backend_id`

Routing Metrics¶

Metric	Type	Description	Labels
`routing_decisions_total`	Counter	Total routing decisions	`strategy`, `selected_backend`
`routing_backend_selection_duration_seconds`	Histogram	Time to select backend	`strategy`
`routing_model_availability`	Gauge	Model availability per backend	`model`, `backend_id`
`routing_retries_total`	Counter	Total retry attempts	`backend_id`, `reason`
`routing_circuit_breaker_state`	Gauge	Circuit breaker state	`backend_id`

Model Service Metrics¶

Metric	Type	Description	Labels
`model_cache_hits_total`	Counter	Model cache hits	`operation`
`model_cache_misses_total`	Counter	Model cache misses	`operation`
`model_refresh_duration_seconds`	Histogram	Model list refresh duration	`backend_id`
`model_discovery_errors_total`	Counter	Model discovery errors	`backend_id`, `error_type`

Cache Stampede Prevention Metrics¶

These metrics help monitor the cache stampede prevention mechanisms:

Metric	Type	Description	Labels
`model_stale_while_revalidate_total`	Counter	Requests that returned stale data while refresh was in progress	-
`model_coalesced_requests_total`	Counter	Requests that waited for ongoing aggregation instead of triggering new one	-
`model_background_refreshes_total`	Counter	Background refresh operations initiated	-
`model_background_refresh_successes_total`	Counter	Successful background refresh operations	-
`model_background_refresh_failures_total`	Counter	Failed background refresh operations	-
`model_singleflight_lock_acquired_total`	Counter	Times the aggregation lock was acquired for singleflight	-

Understanding Cache Stampede Metrics¶

High coalesced_requests: Indicates the singleflight pattern is effectively preventing duplicate aggregations
High stale_while_revalidate: Shows the stale-while-revalidate pattern is returning cached data during refresh
Low background_refresh_failures: Confirms background refresh is working correctly
Minimal blocking on cache miss: When model_background_refreshes_total is increasing, requests should rarely block on a cache refresh
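The guidance above can be reduced to two ratios. The following is a minimal sketch, assuming the counter values have already been read from /metrics into a dictionary. The function name `stampede_summary`, the ratio names, and the choice of `coalesced / (coalesced + lock acquisitions)` as the coalescing measure are illustrative assumptions, not something the router provides.

```python
from typing import Dict, Optional


def stampede_summary(counters: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Derive two health ratios from the cache stampede counters.

    `counters` maps metric names to their current values. A ratio is None
    when its denominator is zero, i.e. there is nothing to measure yet.
    """
    def ratio(numerator: float, denominator: float) -> Optional[float]:
        return numerator / denominator if denominator > 0 else None

    coalesced = counters.get("model_coalesced_requests_total", 0.0)
    acquired = counters.get("model_singleflight_lock_acquired_total", 0.0)
    refreshes = counters.get("model_background_refreshes_total", 0.0)
    failures = counters.get("model_background_refresh_failures_total", 0.0)

    return {
        # Assumed measure: share of requests that waited on an in-flight
        # aggregation instead of acquiring the lock and starting their own.
        "coalesced_share": ratio(coalesced, coalesced + acquired),
        # Share of background refreshes that failed; should stay near zero.
        "refresh_failure_share": ratio(failures, refreshes),
    }
```

A high `coalesced_share` together with a low `refresh_failure_share` matches the healthy pattern described above.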

Streaming Metrics¶

Metric	Type	Description	Labels
`streaming_active_connections`	Gauge	Active streaming connections	`endpoint`
`streaming_events_sent_total`	Counter	Total SSE events sent	`endpoint`, `event_type`
`streaming_connection_duration_seconds`	Histogram	Streaming connection duration	`endpoint`
`streaming_errors_total`	Counter	Streaming errors	`endpoint`, `error_type`

Mid-Stream Fallback Metrics¶

These metrics are emitted when the mid-stream fallback feature is enabled (streaming.mid_stream_fallback.enabled: true).

Metric	Type	Description	Labels
`streaming_fallback_total`	Counter	Total mid-stream fallback attempts	`reason`
`streaming_fallback_success_total`	Counter	Successful mid-stream fallback recoveries	`original_backend`, `fallback_backend`
`streaming_fallback_accumulated_tokens`	Histogram	Estimated tokens accumulated before fallback	`outcome` (`success`, `failure`)

Reason Label Values for `streaming_fallback_total`¶

Value	Description
`timeout`	Backend inactivity timeout exceeded
`connection_error`	TCP/TLS connection error
`stream_read_error`	Error reading bytes from stream
`stream_ended_unexpectedly`	Stream closed without `[DONE]` marker
`too_many_stream_errors`	Consecutive error event threshold reached
`other`	Other failure reason

Key PromQL Queries¶

# Mid-stream fallback rate
rate(streaming_fallback_total[5m])

# Fallback recovery success rate
sum(rate(streaming_fallback_success_total[5m])) /
sum(rate(streaming_fallback_total[5m]))

# Median accumulated tokens at fallback trigger
histogram_quantile(0.5, rate(streaming_fallback_accumulated_tokens_bucket[5m]))

Fallback Metrics¶

Metric	Type	Description	Labels
`fallback_attempts_total`	Counter	Total fallback attempts	`original_model`, `fallback_model`, `reason`
`fallback_success_total`	Counter	Successful fallbacks	`original_model`, `fallback_model`
`fallback_exhausted_total`	Counter	Exhausted fallback chains	`original_model`
`fallback_cross_provider_total`	Counter	Cross-provider fallbacks	`from_provider`, `to_provider`
`fallback_duration_seconds`	Histogram	Fallback operation duration	`original_model`

Response Cache Metrics¶

Metric	Type	Description	Labels
`continuum_response_cache_requests_total`	Counter	Cache lookups by result	`result` (`hit`, `miss`, `skip`)
`continuum_response_cache_entries`	Gauge	Current number of cached entries	--
`continuum_response_cache_size_bytes`	Gauge	Approximate cache memory usage	--
`continuum_response_cache_evictions_total`	Counter	LRU evictions	--
`continuum_response_cache_hit_rate`	Gauge	Rolling cache hit rate (0.0--1.0)	--
`continuum_cache_backend_type`	Gauge	Active cache backend (1 = active)	`backend` (`memory`, `redis`)

Redis Cache Backend Metrics¶

These metrics are populated when the Redis cache backend is active (backend: redis).

Metric	Type	Description	Labels
`continuum_cache_redis_connections_active`	Gauge	Active Redis connections in the pool	--
`continuum_cache_redis_connections_idle`	Gauge	Idle Redis connections in the pool	--
`continuum_cache_redis_latency_seconds`	Histogram	Redis operation latency	`operation` (`get`, `set`, `delete`)
`continuum_cache_redis_errors_total`	Counter	Redis errors by type	`type` (`connection`, `timeout`, `other`)
`continuum_cache_fallback_active`	Gauge	Whether in-memory fallback is active (0 or 1)	--

KV Event Consumer Metrics¶

These metrics are populated when vLLM KV event consumers are active (src/infrastructure/kv_index/). All backend label values are sanitized to prevent cardinality explosion.

Metric	Type	Description	Labels
`continuum_kv_event_received_total`	Counter	KV cache events received from each backend	`backend`
`continuum_kv_event_processed_total`	Counter	KV cache events successfully forwarded via channel	`backend`
`continuum_kv_event_dropped_total`	Counter	KV cache events dropped due to backpressure	`backend`
`continuum_kv_consumer_connected`	Gauge	Whether the KV event consumer is connected (1 = connected, 0 = disconnected)	`backend`
`continuum_kv_consumer_reconnects_total`	Counter	Total reconnection attempts for each backend consumer	`backend`

Prefix Routing Metrics¶

These metrics track prefix-aware sticky routing decisions and backend distribution.

Metric	Type	Description	Labels
`continuum_prefix_routing_requests_total`	Counter	Total prefix routing decisions by strategy type	`strategy` (`prefix_hash`, `overflow`, `fallback`, `unknown`)
`continuum_prefix_routing_backend_distribution`	Gauge	In-flight requests per backend (for load balancing)	`backend`
`continuum_prefix_routing_prefix_cardinality`	Gauge	Approximate number of unique prefix keys seen	--

Key PromQL Queries¶

# Prefix routing hit rate (% of requests using prefix hash vs fallback)
sum(rate(continuum_prefix_routing_requests_total{strategy="prefix_hash"}[5m])) /
sum(rate(continuum_prefix_routing_requests_total[5m]))

# Overflow rate (CHWBL load balancing activations)
rate(continuum_prefix_routing_requests_total{strategy="overflow"}[5m])

# Backend load distribution (should be roughly even)
continuum_prefix_routing_backend_distribution
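The overflow strategy refers to consistent hashing with bounded loads (CHWBL). This document does not describe the router's implementation, so the following is only a sketch of the general technique: a prefix key is hashed onto a ring, and if the preferred backend is already at its load bound the request walks to the next backend on the ring. The class name `BoundedLoadRing`, the `load_factor` of 1.25, the replica count, and the mapping of outcomes onto the `prefix_hash` and `overflow` label values are all assumptions.

```python
import bisect
import hashlib
import math
from typing import Dict, List, Tuple


def _hash(value: str) -> int:
    """Stable 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class BoundedLoadRing:
    """Consistent-hash ring that caps each backend's in-flight load."""

    def __init__(self, backends: List[str], load_factor: float = 1.25, replicas: int = 64):
        self.load_factor = load_factor  # must be >= 1 so a backend is always available
        self.inflight: Dict[str, int] = {b: 0 for b in backends}
        # Place several virtual points per backend on the ring.
        points = sorted((_hash(f"{b}#{i}"), b) for b in backends for i in range(replicas))
        self._hashes = [h for h, _ in points]
        self._owners = [b for _, b in points]

    def pick(self, prefix_key: str) -> Tuple[str, str]:
        """Return (backend, strategy) and count the request as in flight."""
        total = sum(self.inflight.values())
        # Per-backend bound: load_factor times the average load after this request.
        bound = math.ceil(self.load_factor * (total + 1) / len(self.inflight))
        start = bisect.bisect_left(self._hashes, _hash(prefix_key)) % len(self._hashes)
        preferred = self._owners[start]
        for step in range(len(self._owners)):
            backend = self._owners[(start + step) % len(self._owners)]
            if self.inflight[backend] < bound:
                self.inflight[backend] += 1
                strategy = "prefix_hash" if backend == preferred else "overflow"
                return backend, strategy
        raise RuntimeError("no backend under the load bound (load_factor < 1?)")

    def release(self, backend: str) -> None:
        """Call when a request finishes so the backend's load drops again."""
        self.inflight[backend] -= 1
```

With two backends and no releases, the same prefix key lands on its preferred backend twice and then overflows on the third request, which is the kind of event the overflow rate query counts.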

KV Cache Index Metrics¶

These metrics track the KV cache index subsystem including index state, query performance, routing decisions, and overlap scoring.

Metric	Type	Description	Labels
`continuum_kv_index_entries`	Gauge	Current number of entries in the KV cache index	--
`continuum_kv_index_events_total`	Counter	KV cache index mutation events (created/evicted)	`backend`, `type` (`created`, `evicted`)
`continuum_kv_index_query_latency_seconds`	Histogram	Latency of KV index query operations	--
`continuum_kv_index_routing_decisions_total`	Counter	KV-aware routing decisions by outcome	`decision` (`kv_aware`, `fallback`)
`continuum_kv_index_overlap_score`	Histogram	Distribution of overlap scores for routed requests	--
`continuum_kv_index_event_source_status`	Gauge	Event source connection status (1 = connected, 0 = disconnected)	`backend`, `status`

Key PromQL Queries¶

# KV-aware routing ratio
sum(rate(continuum_kv_index_routing_decisions_total{decision="kv_aware"}[5m])) /
sum(rate(continuum_kv_index_routing_decisions_total[5m]))

# Median overlap score for routed requests
histogram_quantile(0.5, rate(continuum_kv_index_overlap_score_bucket[5m]))

# KV index query P99 latency
histogram_quantile(0.99, rate(continuum_kv_index_query_latency_seconds_bucket[5m]))

# Event source connection health
continuum_kv_index_event_source_status{status="connected"}

Smart Routing Metrics¶

These metrics cover the smart routing pipeline, including the LLM-based classifier.

Classification and Routing¶

Metric	Type	Description	Labels
`smart_routing_classifications_total`	Counter	Total classifications performed	`complexity`, `domain`, `classifier_type`
`smart_routing_decisions_total`	Counter	Total routing decisions made	`source_model`, `target_model`, `policy`, `tier`
`smart_routing_classifier_duration_seconds`	Histogram	Classifier latency	`classifier_type`
`smart_routing_policy_no_match_total`	Counter	Requests with no matching policy	-
`smart_routing_tier_no_model_total`	Counter	Policy matched but no model available in tier	`tier`

Load Management¶

Metric	Type	Description	Labels
`smart_routing_load_state`	Gauge	Current load state: 0=Normal, 1=Warning, 2=Critical	-
`smart_routing_tier_degradation_total`	Counter	Routing degraded due to load	`load_state`
`smart_routing_load_transitions_total`	Counter	Load state transitions	`from_state`, `to_state`

LLM Classifier¶

Metric	Type	Description	Labels
`smart_routing_llm_classifier_calls_total`	Counter	Total LLM classifier invocations	-
`smart_routing_llm_classifier_cache_hits_total`	Counter	Classification results served from cache	-
`smart_routing_llm_classifier_duration_seconds`	Histogram	End-to-end LLM classification latency (buckets: 50ms–5s)	-
`smart_routing_llm_classifier_fallbacks_total`	Counter	Times the LLM result was discarded and rule-based result used	-
`smart_routing_llm_classifier_parse_errors_total`	Counter	Response parse failures before retry	-
`smart_routing_llm_classifier_retries_total`	Counter	Retry attempts after initial parse failure	-

Aggregate and Operational¶

Metric	Type	Description	Labels
`smart_routing_requests_total`	Counter	Total smart-routed requests	`source_model`, `target_model`, `policy`, `load_state`
`smart_routing_tier_usage_total`	Counter	Tier usage distribution	`tier`, `domain`
`smart_routing_cost_estimate_total`	Counter	Estimated cost from tier optimization	`tier`
`smart_routing_policy_evaluations_total`	Counter	Policy evaluation frequency	`policy_name`, `result`
`smart_routing_model_availability`	Gauge	Available models per tier	`model`, `tier`

Key PromQL Queries¶

# Smart routing request rate by policy
sum by (policy) (rate(smart_routing_requests_total[5m]))

# Tier usage distribution
sum by(tier) (rate(smart_routing_tier_usage_total[5m]))

# LLM classifier cache hit rate
rate(smart_routing_llm_classifier_cache_hits_total[5m]) /
rate(smart_routing_llm_classifier_calls_total[5m])

# LLM classifier P95 latency
histogram_quantile(0.95, rate(smart_routing_llm_classifier_duration_seconds_bucket[5m]))

# LLM classifier fallback rate (reliability indicator)
rate(smart_routing_llm_classifier_fallbacks_total[5m]) /
rate(smart_routing_llm_classifier_calls_total[5m])

# Fraction of requests classified by LLM vs rule-based
sum(rate(smart_routing_classifications_total{classifier_type="llm_based"}[5m])) /
sum(rate(smart_routing_classifications_total[5m]))

# Policy match rate (matched evaluations per second) by policy
sum by(policy_name) (rate(smart_routing_policy_evaluations_total{result="matched"}[5m]))

Business Metrics¶

Metric	Type	Description	Labels
`model_usage_total`	Counter	Model usage count	`model`, `backend_id`
`tokens_consumed_total`	Counter	Total tokens consumed	`model`, `operation`

Guardrail Metrics¶

Exported when guardrails are configured and the metrics feature is enabled. Every guardrail decision is recorded so operators can observe what a policy does (or would do, in monitor mode) before and after enforcement.

Metric	Type	Description	Labels
`guardrail_checks_total`	Counter	Per-provider checks by stage and verdict result	`stage`, `provider`, `result`
`guardrail_blocks_total`	Counter	Block verdicts by stage, provider, and category	`stage`, `provider`, `category`
`guardrail_check_duration_seconds`	Histogram	Per-provider check latency in seconds	`stage`, `provider`
`guardrail_errors_total`	Counter	Provider errors (timeout / hard failure)	`provider`, `kind`
`guardrail_fail_open_total`	Counter	Provider failures resolved fail-open (request allowed)	`provider`
`guardrail_fail_closed_total`	Counter	Provider failures resolved fail-closed (request blocked)	`provider`
`guardrail_verdicts_total`	Counter	Aggregated verdict per request after applying mode semantics	`stage`, `mode`, `result`

Label values:

stage is input, output, or streaming.
result is allow, block, transform, or flag.
kind is timeout or error.
mode is monitor or enforce. Because guardrail_verdicts_total carries mode, monitor-mode verdicts are visible even though they never gate a request, which is what makes the monitor-then-enforce rollout observable.
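The difference between the two modes can be sketched as follows. This is a minimal illustration, assuming that a verdict is always recorded with its mode and that only an enforce-mode block actually stops the request. The function name `apply_mode` is hypothetical, and the handling of transform and flag results is left out.

```python
from typing import Dict, Tuple


def apply_mode(stage: str, mode: str, result: str) -> Tuple[Dict[str, str], bool]:
    """Return the labels to record and whether the request is blocked."""
    if mode not in ("monitor", "enforce"):
        raise ValueError(f"unknown guardrail mode: {mode}")
    # The verdict is recorded in both modes, so monitor-mode blocks stay visible.
    labels = {"stage": stage, "mode": mode, "result": result}
    # Only an enforce-mode block gates the request.
    request_blocked = mode == "enforce" and result == "block"
    return labels, request_blocked
```

In monitor mode a block verdict yields the labels `mode="monitor", result="block"` while the request still proceeds, which is the series to watch before switching to enforce.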

Key PromQL Queries¶

# What would be blocked, broken down by category (monitor-mode tuning)
sum by (category) (rate(guardrail_blocks_total[1h]))

# Block rate per stage after enforcement
sum by (stage) (rate(guardrail_verdicts_total{result="block", mode="enforce"}[5m]))

# Provider error rate (timeouts vs hard failures)
sum by (provider, kind) (rate(guardrail_errors_total[5m]))

# P95 guardrail check latency per provider
histogram_quantile(0.95, sum by (le, provider) (rate(guardrail_check_duration_seconds_bucket[5m])))

For the full guardrail guide (concepts, providers, configuration, and the threshold-tuning workflow), see Guardrails.

Per-API-Key LLM Token Usage¶

The router publishes a per-API-key breakdown of LLM token consumption so operators can answer questions like "which key consumed the most completion tokens last hour?" or "how many prompt tokens did team X spend on model Y today?". This data is independent of the legacy aggregate counter and is intended for capacity planning, fair-use enforcement, and (eventually) cost attribution.

Metric Definition¶

Metric	Type	Description	Labels
`llm_tokens_total`	Counter	LLM tokens consumed per request	`api_key_id`, `model`, `backend`, `kind`
`api_key_info`	Gauge (constant 1)	Info-metric exposing configured API-key annotations as labels	`api_key_id`, plus the configured annotation allowlist

kind is one of:

prompt — tokens in the upstream request prompt
completion — tokens in the upstream response completion

Both OpenAI-compatible (prompt_tokens / completion_tokens) and Anthropic (input_tokens / output_tokens) response shapes are normalized into the same counter. The router also injects stream_options.include_usage=true on OpenAI-compat streaming requests so usage data arrives in the final SSE chunk regardless of client behavior.
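The normalization can be pictured as a small mapping from either response shape onto the two kind values. This is an illustrative sketch, not the router's code: the function name `normalize_usage` and the choice to treat missing fields as zero are assumptions.

```python
from typing import Dict


def normalize_usage(usage: Dict[str, int]) -> Dict[str, int]:
    """Map an upstream `usage` object onto the `prompt` and `completion` kinds."""
    # OpenAI-compatible field names take precedence; Anthropic names are the fallback.
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return {"prompt": int(prompt), "completion": int(completion)}
```

Both `{"prompt_tokens": 12, "completion_tokens": 34}` and `{"input_tokens": 12, "output_tokens": 34}` produce the same pair of increments for llm_tokens_total.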

`api_key_id` Derivation¶

api_key_id is never the raw API key. The router derives a stable, non-reversible identifier in this priority order:

If the request's bearer token matches a configured API-key entry, the entry's id field is used (e.g., key-production-1).
Otherwise, the router computes SHA-256 over the raw token and uses the first 12 hex characters prefixed with k_ (e.g., k_3f5a7c9b1e2d).
If no token is presented, the literal value anonymous is used.
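The priority order above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `derive_api_key_id` and the `configured_ids` mapping (raw token to configured id) are hypothetical, an empty token is treated the same as a missing one, and a production implementation would likely compare tokens in constant time.

```python
import hashlib
from typing import Dict, Optional


def derive_api_key_id(token: Optional[str], configured_ids: Dict[str, str]) -> str:
    """Derive a stable, non-reversible identifier for a bearer token."""
    if not token:
        # No token presented.
        return "anonymous"
    if token in configured_ids:
        # The token matches a configured API-key entry: use its id field.
        return configured_ids[token]
    # Unknown token: first 12 hex characters of its SHA-256, prefixed with k_.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return "k_" + digest[:12]
```

The derived identifier is deterministic, so the same unknown token always maps to the same series without the raw key ever appearing in /metrics.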

All label values flow through the existing CardinalityManager so a runaway/rotating-key attack cannot exhaust Prometheus series.

Annotation Labels and `api_key_info`¶

Each configured API key may carry an optional free-form annotations: { key: value } map. Operators declare which annotation keys become Prometheus labels via the global metrics.annotation_labels allowlist; everything else stays internal.

Configuration schema (under the existing api_keys block):

api_keys:
  api_keys:
      - key: "${API_KEY_1}"
        id: "key-production-1"
        user_id: "user-admin"
        organization_id: "org-main"
        annotations:
            email: "ops@example.com"
            team: "platform"
            environment: "prod"
            owner: "alice"

metrics:
    enabled: true
    annotation_labels: [email, team]            # Allowlist of label keys

Reserved annotation keys (recommended canonical names, not enforced): email, uuid, owner, team, environment. Operators may add custom keys.

When metrics.annotation_labels is non-empty, the router publishes api_key_info{api_key_id, email, team, ...} = 1 once per known key. Use PromQL joins to project the metadata onto llm_tokens_total without bloating its label set:

# Tokens per email (sums prompt + completion, last 24h)
sum by (email) (
  increase(llm_tokens_total[24h])
  * on (api_key_id) group_left(email) api_key_info
)
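To make the join concrete, the sketch below mimics `* on (api_key_id) group_left(<label>)` followed by `sum by (<label>)` using plain dictionaries. The function name `tokens_by_annotation` and the sample values in the usage note are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tokens_by_annotation(
    token_increase: List[Tuple[Dict[str, str], float]],
    key_info: Dict[str, Dict[str, str]],
    label: str,
) -> Dict[str, float]:
    """Sum token increases by an annotation label copied from api_key_info.

    token_increase: (labels, value) pairs for llm_tokens_total.
    key_info: annotation labels of each api_key_info series, keyed by api_key_id.
    """
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in token_increase:
        info = key_info.get(labels["api_key_id"])
        if info is None:
            # No matching info series: the join drops the sample.
            continue
        # api_key_info is constant 1, so multiplying leaves the value unchanged;
        # a missing annotation groups under the empty label value.
        totals[info.get(label, "")] += value * 1.0
    return dict(totals)
```

For example, increases of 100 and 50 tokens for `key-production-1` annotated with `email="ops@example.com"` sum to 150 under that email, while a hashed `k_...` key with no info series contributes nothing.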

Cardinality and Hot-Reload¶

api_key_id cardinality is bounded at 1000 by default.
Hot-reload of API-key annotations is supported via the existing config-reload pipeline. The api_key_info info-metric is republished atomically on every reload; counter values for llm_tokens_total are never reset.
The label set on api_key_info (i.e., the contents of annotation_labels) is frozen at startup. Adding or removing keys from the allowlist requires a restart, because the label names of an already registered metric cannot be changed.
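This document does not show how the CardinalityManager enforces the bound. One common approach, sketched below, is to admit new label values until the limit is reached and collapse everything after that into a single overflow value. The class name `LabelLimiter` and the overflow value `other` are assumptions, not the router's behavior.

```python
from typing import Set


class LabelLimiter:
    """Bound the number of distinct values a label may take."""

    def __init__(self, max_unique_values: int = 1000, overflow_value: str = "other"):
        self.max_unique_values = max_unique_values
        self.overflow_value = overflow_value
        self._seen: Set[str] = set()

    def admit(self, value: str) -> str:
        """Return the label value to record for `value`."""
        if value in self._seen:
            return value  # already tracked, keep its own series
        if len(self._seen) < self.max_unique_values:
            self._seen.add(value)
            return value  # still under the limit, start a new series
        return self.overflow_value  # limit reached, share one overflow series
```

With this shape, a stream of rotating keys can add at most `max_unique_values + 1` series per label, however many distinct tokens are presented.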

Example PromQL Queries¶

# Total prompt tokens consumed per API key in the last hour
sum by (api_key_id) (
  increase(llm_tokens_total{kind="prompt"}[1h])
)

# Top 10 keys by completion tokens in the last 24h
topk(10,
  sum by (api_key_id) (
    increase(llm_tokens_total{kind="completion"}[24h])
  )
)

# Tokens grouped by team (requires team in annotation_labels)
sum by (team) (
  increase(llm_tokens_total[24h])
  * on (api_key_id) group_left(team) api_key_info
)

# Combined prompt+completion rate per model (tokens/sec)
sum by (model) (rate(llm_tokens_total[5m]))

# Per-key consumption by backend (useful for cost attribution)
sum by (api_key_id, backend) (
  increase(llm_tokens_total[24h])
)

Grafana Panel Example¶

A simple Grafana stat panel showing the top 10 teams by completion tokens over the last 24 hours:

{
    "title": "Top 10 teams by completion tokens (24h)",
    "type": "stat",
    "targets": [
        {
            "expr": "topk(10, sum by (team) (increase(llm_tokens_total{kind=\"completion\"}[24h]) * on (api_key_id) group_left(team) api_key_info))",
            "legendFormat": "{{team}}"
        }
    ],
    "options": {
        "reduceOptions": {
            "values": false,
            "calcs": ["lastNotNull"]
        }
    }
}

For tracking spend trends, pair this with a time-series panel using rate(llm_tokens_total[5m]) grouped by team or model.

Verification Steps¶

After enabling the feature:

Issue a chat-completion request with a configured API key.
Scrape /metrics and confirm llm_tokens_total{...} and api_key_info{...} series appear.
For streaming, verify the counter still increments — usage is captured from the final SSE chunk. The router injects stream_options.include_usage=true automatically for OpenAI-compat backends so this works regardless of client behavior.
Inspect /metrics cardinality on a typical workload (e.g., curl -s http://localhost:8000/metrics | wc -l) to confirm no regression versus the prior baseline.
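A raw line count hides which metric is responsible for growth. The sketch below counts series per metric name from saved exposition text, so two scrapes can be compared. The function name `count_series` is illustrative, and note that histogram `_bucket`, `_sum`, and `_count` samples are counted under their own names.

```python
from collections import Counter


def count_series(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition format."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comments
        # The metric name ends at the first '{' or at the first whitespace.
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts
```

Calling `count_series(text).most_common(5)` on the output of a scrape lists the five metric names with the most series.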

Integration¶

Prometheus Configuration¶

Complete Prometheus configuration example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['router1:8000', 'router2:8000']
    metric_relabel_configs:
      # Drop high-cardinality metrics if needed
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop

Kubernetes Integration¶

For Kubernetes deployments, use ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: continuum-router
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: continuum-router
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Grafana Dashboard¶

The provided Grafana dashboard includes:

Overview Panel¶

Request rate and error rate
P50, P95, P99 latencies
Active connections
Backend health status

Backend Performance¶

Backend-specific latencies
Health check success rate
Connection pool utilization
Circuit breaker status

Model Usage¶

Model request distribution
Cache hit rates
Token consumption
Model availability matrix

Alerts Overview¶

Active alerts
Alert history
SLO compliance

To import the dashboard:

Open Grafana
Go to Dashboards → Import
Upload monitoring/grafana/dashboards/router-overview.json
Select your Prometheus data source
Click Import

Alerting¶

Pre-configured alert rules are available in monitoring/prometheus/alerts.yml:

Critical Alerts¶

- alert: BackendDown
  expr: backend_health_status == 0
  for: 1m
  annotations:
    summary: "Backend {{ $labels.backend_id }} is down"

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  annotations:
    summary: "High error rate: {{ $value | humanizePercentage }}"

Warning Alerts¶

- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m
  annotations:
    summary: "P95 latency above 1s: {{ $value | humanizeDuration }}"

- alert: LowCacheHitRate
  expr: sum(rate(model_cache_hits_total[5m])) / (sum(rate(model_cache_hits_total[5m])) + sum(rate(model_cache_misses_total[5m]))) < 0.8
  for: 10m
  annotations:
    summary: "Cache hit rate below 80%: {{ $value | humanizePercentage }}"

Examples¶

Query Examples¶

Request Rate by Status¶

sum(rate(http_requests_total[5m])) by (status)

P95 Latency by Endpoint¶

histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
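For readers unsure what histogram_quantile computes, the sketch below reproduces its linear interpolation over cumulative buckets. It is a simplified illustration that assumes 0 < q <= 1, positive bucket bounds, and a +Inf bucket; it is not a replacement for the PromQL function.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

The estimate can never be more precise than the bucket boundaries, which is why latency buckets should be chosen around the thresholds that matter for alerting.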

Backend Health Overview¶

sum(backend_health_status) by (backend_id)

Model Usage Ranking¶

topk(10, sum(rate(model_usage_total[1h])) by (model))

Error Rate Percentage¶

sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

Programmatic Access¶

You can also access metrics programmatically:

import requests
from prometheus_client.parser import text_string_to_metric_families

# Fetch the exposition text; fail fast on network or HTTP errors
response = requests.get('http://localhost:8000/metrics', timeout=10)
response.raise_for_status()

# Parse and print the request count for every http_requests_total series
for family in text_string_to_metric_families(response.text):
    for sample in family.samples:
        if sample.name == 'http_requests_total':
            endpoint = sample.labels.get('endpoint', '')
            print(f"Endpoint: {endpoint}, Count: {sample.value}")

Custom Metrics Collection¶

#!/bin/bash
# Collect metrics every 30 seconds and save to file

while true; do
  timestamp=$(date +%s)
  curl -s http://localhost:8000/metrics > "metrics_${timestamp}.txt"
  sleep 30
done

Best Practices¶

1. Label Cardinality¶

Keep label cardinality low to prevent metric explosion:

# Good: Low cardinality
labels:
  status: "200"  # ~5 possible values
  method: "GET"  # ~7 possible values

# Bad: High cardinality
labels:
  user_id: "12345"  # Unbounded
  request_id: "abc-123"  # Unique per request
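The reason unbounded labels are dangerous is that the worst-case number of series for one metric is the product of the distinct values of each label. A small worked example, using made-up cardinalities and a hypothetical helper named `max_series`:

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Bounded labels: 7 methods x 20 endpoints x 5 status codes.
bounded = max_series({"method": 7, "endpoint": 20, "status": 5})

# Adding a user_id label with 10,000 users multiplies the bound by 10,000.
unbounded = max_series({"method": 7, "endpoint": 20, "status": 5, "user_id": 10_000})

print(bounded, unbounded)
```

Here the bounded label set allows at most 700 series, while the single extra `user_id` label raises the ceiling to 7,000,000.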

2. Metric Naming¶

Follow Prometheus naming conventions:

Use snake_case
Include units in metric names (_seconds, _bytes, _total)
Use standard prefixes (http_, backend_, model_)

3. Dashboard Design¶

Group related metrics together
Use appropriate visualization types (gauge for current values, graph for time series)
Include both absolute values and rates
Set reasonable refresh intervals (15-30s for real-time, 1-5m for historical)

4. Alert Configuration¶

Use appropriate evaluation periods (for: 5m to avoid flapping)
Include context in alert descriptions
Set up alert routing based on severity
Test alerts in staging before production

5. Performance Considerations¶

Disable optional metrics if not needed
Use recording rules for complex queries
Implement proper metric retention policies
Consider using remote storage for long-term retention

6. Security¶

Protect metrics endpoint if sensitive data is exposed
Use TLS for Prometheus scraping in production
Implement authentication for Grafana dashboards
Audit metric access logs

Troubleshooting¶

Metrics Not Appearing¶

Check if metrics are enabled in configuration
Verify the metrics endpoint is accessible
Check Prometheus target status
Review router logs for metric initialization errors

High Memory Usage¶

Review cardinality limits
Check for unbounded labels
Reduce histogram buckets if needed
Enable metric expiration

Incorrect Values¶

Verify metric types (counter vs gauge)
Check aggregation functions
Review label selectors
Validate time ranges
