Metrics and Monitoring

This document describes the metrics and monitoring capabilities of the Continuum Router.

Overview

The Continuum Router exposes Prometheus-compatible metrics for monitoring system health, performance, and usage patterns. The metrics system is designed to be:

  • Lightweight: Minimal performance overhead
  • Broad coverage: Covers HTTP, backend, routing, model, streaming, and cache subsystems
  • Production-ready: Includes cardinality limits and proper labeling
  • Easy to integrate: Works with standard Prometheus/Grafana setups

For metrics history that survives restarts without standing up a full Prometheus stack, see the Persistent Metrics Log. It snapshots the registry to a local SQLite store and exposes recent history via GET /admin/metrics/history.

Quick Start

1. Enable Metrics

Metrics are enabled by default. The metrics endpoint is available at /metrics:

# View metrics
curl http://localhost:8000/metrics

2. Configure Prometheus

Add the router as a target in your prometheus.yml:

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['localhost:8000']
    scrape_interval: 15s

3. Import Grafana Dashboard

Import the provided dashboard from monitoring/grafana/dashboards/router-overview.json.

Configuration

Metrics configuration is done through the main config file:

metrics:
  # Enable/disable metrics collection
  enabled: true

  # Metrics endpoint path
  endpoint: "/metrics"

  # Cardinality limits to prevent metric explosion
  cardinality_limit:
    max_labels_per_metric: 100
    max_unique_label_values: 1000

  # Optional metrics (body-size histograms are off by default for performance)
  optional_metrics:
    enable_request_body_size: false
    enable_response_body_size: false
    enable_detailed_errors: true
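
This page does not describe how the router enforces these limits. As a rough illustration of what a cap on unique label values does, the sketch below admits new values until the cap is reached and then folds any further values into a single bucket. The `LabelLimiter` class and the `other` overflow value are assumptions made for this example, not the router's actual implementation.

```python
class LabelLimiter:
    """Hypothetical cap on distinct label values (illustration only)."""

    def __init__(self, max_unique_values, overflow="other"):
        self.max_unique_values = max_unique_values
        self.overflow = overflow  # assumed bucket name for values over the cap
        self._seen = set()

    def admit(self, value):
        """Return the value if it fits under the cap, else the overflow bucket."""
        if value in self._seen:
            return value
        if len(self._seen) < self.max_unique_values:
            self._seen.add(value)
            return value
        return self.overflow


limiter = LabelLimiter(max_unique_values=2)
labels = [limiter.admit(v) for v in ["/v1/chat", "/v1/models", "/v1/embed", "/v1/chat"]]
print(labels)  # ['/v1/chat', '/v1/models', 'other', '/v1/chat']
```

With a cap like this, the number of time series per metric stays bounded no matter how many distinct values clients send.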

Environment Variables

You can also configure metrics using environment variables:

# Enable/disable metrics
METRICS_ENABLED=true

# Change metrics endpoint
METRICS_ENDPOINT=/custom/metrics

# Enable optional metrics
METRICS_ENABLE_BODY_SIZE=true

Available Metrics

HTTP Metrics

Metric Type Description Labels
http_requests_total Counter Total number of HTTP requests method, endpoint, status
http_request_duration_seconds Histogram Request latency method, endpoint
http_active_connections Gauge Current active connections -
http_request_size_bytes Histogram Request body size method, endpoint
http_response_size_bytes Histogram Response body size method, endpoint

Backend Metrics

Metric Type Description Labels
backend_health_status Gauge Backend health (1=healthy, 0=unhealthy) backend_id, backend_url
backend_health_check_duration_seconds Histogram Health check duration backend_id
backend_health_check_failures_total Counter Total health check failures backend_id, error_type
backend_request_latency_seconds Histogram Backend request latency backend_id, endpoint
backend_connection_pool_size Gauge Connection pool size backend_id
backend_connection_pool_active Gauge Active connections in pool backend_id

Routing Metrics

Metric Type Description Labels
routing_decisions_total Counter Total routing decisions strategy, selected_backend
routing_backend_selection_duration_seconds Histogram Time to select backend strategy
routing_model_availability Gauge Model availability per backend model, backend_id
routing_retries_total Counter Total retry attempts backend_id, reason
routing_circuit_breaker_state Gauge Circuit breaker state backend_id

Model Service Metrics

Metric Type Description Labels
model_cache_hits_total Counter Model cache hits operation
model_cache_misses_total Counter Model cache misses operation
model_refresh_duration_seconds Histogram Model list refresh duration backend_id
model_discovery_errors_total Counter Model discovery errors backend_id, error_type

Cache Stampede Prevention Metrics

These metrics help monitor the cache stampede prevention mechanisms:

Metric Type Description Labels
model_stale_while_revalidate_total Counter Requests that returned stale data while refresh was in progress -
model_coalesced_requests_total Counter Requests that waited for ongoing aggregation instead of triggering new one -
model_background_refreshes_total Counter Background refresh operations initiated -
model_background_refresh_successes_total Counter Successful background refresh operations -
model_background_refresh_failures_total Counter Failed background refresh operations -
model_singleflight_lock_acquired_total Counter Times the aggregation lock was acquired for singleflight -

Understanding Cache Stampede Metrics

  • High coalesced_requests: Indicates the singleflight pattern is effectively preventing duplicate aggregations
  • High stale_while_revalidate: Shows the stale-while-revalidate pattern is returning cached data during refresh
  • Low background_refresh_failures: Confirms background refresh is working correctly
  • Minimal blocking on cache refresh: When model_background_refreshes_total is increasing, requests should rarely block waiting for a refresh
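
To make the coalescing counter easier to interpret, here is a minimal asyncio sketch of the singleflight pattern: the first caller starts the expensive aggregation, and concurrent callers wait on that same in-flight task. The `SingleFlight` class and its `loads` and `coalesced` attributes are hypothetical names chosen for this example, and the sketch does not cover stale-while-revalidate or reflect the router's internal code.

```python
import asyncio


class SingleFlight:
    """Hypothetical helper: concurrent callers share one in-flight load."""

    def __init__(self):
        self._task = None
        self.loads = 0      # times the expensive load actually ran
        self.coalesced = 0  # callers that waited on an existing load

    async def do(self, load):
        if self._task is None:
            self.loads += 1
            self._task = asyncio.create_task(self._run(load))
        else:
            self.coalesced += 1
        return await self._task

    async def _run(self, load):
        try:
            return await load()
        finally:
            # Allow the next caller to start a fresh load.
            self._task = None


async def main():
    flight = SingleFlight()

    async def aggregate_models():
        await asyncio.sleep(0.05)  # stand-in for querying every backend
        return ["model-a", "model-b"]

    results = await asyncio.gather(*(flight.do(aggregate_models) for _ in range(5)))
    return flight, results


flight, results = asyncio.run(main())
print(flight.loads, flight.coalesced)  # 1 4
```

In this analogy, `loads` plays the role of model_singleflight_lock_acquired_total and `coalesced` the role of model_coalesced_requests_total: five concurrent requests produce one aggregation and four coalesced waits.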

Streaming Metrics

Metric Type Description Labels
streaming_active_connections Gauge Active streaming connections endpoint
streaming_events_sent_total Counter Total SSE events sent endpoint, event_type
streaming_connection_duration_seconds Histogram Streaming connection duration endpoint
streaming_errors_total Counter Streaming errors endpoint, error_type

Mid-Stream Fallback Metrics

These metrics are emitted when the mid-stream fallback feature is enabled (streaming.mid_stream_fallback.enabled: true).

Metric Type Description Labels
streaming_fallback_total Counter Total mid-stream fallback attempts reason
streaming_fallback_success_total Counter Successful mid-stream fallback recoveries original_backend, fallback_backend
streaming_fallback_accumulated_tokens Histogram Estimated tokens accumulated before fallback outcome (success, failure)

Reason Label Values for streaming_fallback_total

Value Description
timeout Backend inactivity timeout exceeded
connection_error TCP/TLS connection error
stream_read_error Error reading bytes from stream
stream_ended_unexpectedly Stream closed without [DONE] marker
too_many_stream_errors Consecutive error event threshold reached
other Other failure reason

Key PromQL Queries

# Mid-stream fallback rate
rate(streaming_fallback_total[5m])

# Fallback recovery success rate
sum(rate(streaming_fallback_success_total[5m])) /
sum(rate(streaming_fallback_total[5m]))

# Median accumulated tokens at fallback trigger
histogram_quantile(0.5, rate(streaming_fallback_accumulated_tokens_bucket[5m]))

Fallback Metrics

Metric Type Description Labels
fallback_attempts_total Counter Total fallback attempts original_model, fallback_model, reason
fallback_success_total Counter Successful fallbacks original_model, fallback_model
fallback_exhausted_total Counter Exhausted fallback chains original_model
fallback_cross_provider_total Counter Cross-provider fallbacks from_provider, to_provider
fallback_duration_seconds Histogram Fallback operation duration original_model

Response Cache Metrics

Metric Type Description Labels
continuum_response_cache_requests_total Counter Cache lookups by result result (hit, miss, skip)
continuum_response_cache_entries Gauge Current number of cached entries --
continuum_response_cache_size_bytes Gauge Approximate cache memory usage --
continuum_response_cache_evictions_total Counter LRU evictions --
continuum_response_cache_hit_rate Gauge Rolling cache hit rate (0.0 to 1.0) --
continuum_cache_backend_type Gauge Active cache backend (1 = active) backend (memory, redis)

Redis Cache Backend Metrics

These metrics are populated when the Redis cache backend is active (backend: redis).

Metric Type Description Labels
continuum_cache_redis_connections_active Gauge Active Redis connections in the pool --
continuum_cache_redis_connections_idle Gauge Idle Redis connections in the pool --
continuum_cache_redis_latency_seconds Histogram Redis operation latency operation (get, set, delete)
continuum_cache_redis_errors_total Counter Redis errors by type type (connection, timeout, other)
continuum_cache_fallback_active Gauge Whether in-memory fallback is active (0 or 1) --

KV Event Consumer Metrics

These metrics are populated when vLLM KV event consumers are active (src/infrastructure/kv_index/). All backend label values are sanitized to prevent cardinality explosion.

Metric Type Description Labels
continuum_kv_event_received_total Counter KV cache events received from each backend backend
continuum_kv_event_processed_total Counter KV cache events successfully forwarded via channel backend
continuum_kv_event_dropped_total Counter KV cache events dropped due to backpressure backend
continuum_kv_consumer_connected Gauge Whether the KV event consumer is connected (1 = connected, 0 = disconnected) backend
continuum_kv_consumer_reconnects_total Counter Total reconnection attempts for each backend consumer backend

Prefix Routing Metrics

These metrics track prefix-aware sticky routing decisions and backend distribution.

Metric Type Description Labels
continuum_prefix_routing_requests_total Counter Total prefix routing decisions by strategy type strategy (prefix_hash, overflow, fallback, unknown)
continuum_prefix_routing_backend_distribution Gauge In-flight requests per backend (for load balancing) backend
continuum_prefix_routing_prefix_cardinality Gauge Approximate number of unique prefix keys seen --

Key PromQL Queries

# Prefix routing hit rate (% of requests using prefix hash vs fallback)
sum(rate(continuum_prefix_routing_requests_total{strategy="prefix_hash"}[5m])) /
sum(rate(continuum_prefix_routing_requests_total[5m]))

# Overflow rate (CHWBL load balancing activations)
rate(continuum_prefix_routing_requests_total{strategy="overflow"}[5m])

# Backend load distribution (should be roughly even)
continuum_prefix_routing_backend_distribution

KV Cache Index Metrics

These metrics track the KV cache index subsystem including index state, query performance, routing decisions, and overlap scoring.

Metric Type Description Labels
continuum_kv_index_entries Gauge Current number of entries in the KV cache index --
continuum_kv_index_events_total Counter KV cache index mutation events (created/evicted) backend, type (created, evicted)
continuum_kv_index_query_latency_seconds Histogram Latency of KV index query operations --
continuum_kv_index_routing_decisions_total Counter KV-aware routing decisions by outcome decision (kv_aware, fallback)
continuum_kv_index_overlap_score Histogram Distribution of overlap scores for routed requests --
continuum_kv_index_event_source_status Gauge Event source connection status (1 = connected, 0 = disconnected) backend, status

Key PromQL Queries

# KV-aware routing ratio
sum(rate(continuum_kv_index_routing_decisions_total{decision="kv_aware"}[5m])) /
sum(rate(continuum_kv_index_routing_decisions_total[5m]))

# Median overlap score for routed requests
histogram_quantile(0.5, rate(continuum_kv_index_overlap_score_bucket[5m]))

# KV index query P99 latency
histogram_quantile(0.99, rate(continuum_kv_index_query_latency_seconds_bucket[5m]))

# Event source connection health
continuum_kv_index_event_source_status{status="connected"}

Smart Routing Metrics

These metrics cover the smart routing pipeline, including the LLM-based classifier.

Classification and Routing

Metric Type Description Labels
smart_routing_classifications_total Counter Total classifications performed complexity, domain, classifier_type
smart_routing_decisions_total Counter Total routing decisions made source_model, target_model, policy, tier
smart_routing_classifier_duration_seconds Histogram Classifier latency classifier_type
smart_routing_policy_no_match_total Counter Requests with no matching policy -
smart_routing_tier_no_model_total Counter Policy matched but no model available in tier tier

Load Management

Metric Type Description Labels
smart_routing_load_state Gauge Current load state: 0=Normal, 1=Warning, 2=Critical -
smart_routing_tier_degradation_total Counter Routing degraded due to load load_state
smart_routing_load_transitions_total Counter Load state transitions from_state, to_state

LLM Classifier

Metric Type Description Labels
smart_routing_llm_classifier_calls_total Counter Total LLM classifier invocations -
smart_routing_llm_classifier_cache_hits_total Counter Classification results served from cache -
smart_routing_llm_classifier_duration_seconds Histogram End-to-end LLM classification latency (buckets: 50ms–5s) -
smart_routing_llm_classifier_fallbacks_total Counter Times the LLM result was discarded and rule-based result used -
smart_routing_llm_classifier_parse_errors_total Counter Response parse failures before retry -
smart_routing_llm_classifier_retries_total Counter Retry attempts after initial parse failure -

Aggregate and Operational

Metric Type Description Labels
smart_routing_requests_total Counter Total smart-routed requests source_model, target_model, policy, load_state
smart_routing_tier_usage_total Counter Tier usage distribution tier, domain
smart_routing_cost_estimate_total Counter Estimated cost from tier optimization tier
smart_routing_policy_evaluations_total Counter Policy evaluation frequency policy_name, result
smart_routing_model_availability Gauge Available models per tier model, tier

Key PromQL Queries

# Smart routing request rate by policy
sum by (policy) (rate(smart_routing_requests_total[5m]))

# Tier usage distribution
sum by(tier) (rate(smart_routing_tier_usage_total[5m]))

# LLM classifier cache hit rate
rate(smart_routing_llm_classifier_cache_hits_total[5m]) /
rate(smart_routing_llm_classifier_calls_total[5m])

# LLM classifier P95 latency
histogram_quantile(0.95, rate(smart_routing_llm_classifier_duration_seconds_bucket[5m]))

# LLM classifier fallback rate (reliability indicator)
rate(smart_routing_llm_classifier_fallbacks_total[5m]) /
rate(smart_routing_llm_classifier_calls_total[5m])

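# Fraction of requests classified by LLM vs rule-based
# (both sides are summed; without sum(), the division matches series label-for-label and always yields 1)
sum(rate(smart_routing_classifications_total{classifier_type="llm_based"}[5m])) /
sum(rate(smart_routing_classifications_total[5m]))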

# Matched policy evaluations per second, by policy
sum by(policy_name) (rate(smart_routing_policy_evaluations_total{result="matched"}[5m]))

Business Metrics

Metric Type Description Labels
model_usage_total Counter Model usage count model, backend_id
tokens_consumed_total Counter Total tokens consumed model, operation

Guardrail Metrics

Exported when guardrails are configured and the metrics feature is enabled. Every guardrail decision is recorded so operators can observe what a policy does (or would do, in monitor mode) before and after enforcement.

Metric Type Description Labels
guardrail_checks_total Counter Per-provider checks by stage and verdict result stage, provider, result
guardrail_blocks_total Counter Block verdicts by stage, provider, and category stage, provider, category
guardrail_check_duration_seconds Histogram Per-provider check latency in seconds stage, provider
guardrail_errors_total Counter Provider errors (timeout / hard failure) provider, kind
guardrail_fail_open_total Counter Provider failures resolved fail-open (request allowed) provider
guardrail_fail_closed_total Counter Provider failures resolved fail-closed (request blocked) provider
guardrail_verdicts_total Counter Aggregated verdict per request after applying mode semantics stage, mode, result

Label values:

  • stage is input, output, or streaming.
  • result is allow, block, transform, or flag.
  • kind is timeout or error.
  • mode is monitor or enforce. Because guardrail_verdicts_total carries mode, monitor-mode verdicts are visible even though they never gate a request, which is what makes the monitor-then-enforce rollout observable.

Key PromQL Queries

# What would be blocked, broken down by category (monitor-mode tuning)
sum by (category) (rate(guardrail_blocks_total[1h]))

# Block rate per stage after enforcement
sum by (stage) (rate(guardrail_verdicts_total{result="block", mode="enforce"}[5m]))

# Provider error rate (timeouts vs hard failures)
sum by (provider, kind) (rate(guardrail_errors_total[5m]))

# P95 guardrail check latency per provider
histogram_quantile(0.95, sum by (le, provider) (rate(guardrail_check_duration_seconds_bucket[5m])))

For the full guardrail guide (concepts, providers, configuration, and the threshold-tuning workflow), see Guardrails.

Per-API-Key LLM Token Usage

The router publishes a per-API-key breakdown of LLM token consumption so operators can answer questions like "which key consumed the most completion tokens last hour?" or "how many prompt tokens did team X spend on model Y today?". This data is independent of the legacy aggregate counter and is intended for capacity planning, fair-use enforcement, and (eventually) cost attribution.

Metric Definition

Metric Type Description Labels
llm_tokens_total Counter LLM tokens consumed per request api_key_id, model, backend, kind
api_key_info Gauge (constant 1) Info-metric exposing configured API-key annotations as labels api_key_id, plus the configured annotation allowlist

kind is one of:

  • prompt — tokens in the upstream request prompt
  • completion — tokens in the upstream response completion

Both OpenAI-compatible (prompt_tokens / completion_tokens) and Anthropic (input_tokens / output_tokens) response shapes are normalized into the same counter. The router also injects stream_options.include_usage=true on OpenAI-compat streaming requests so usage data arrives in the final SSE chunk regardless of client behavior.
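
As an illustration of this normalization, the sketch below maps either response shape onto the two `kind` values. The function name `normalize_usage` and the choice to default missing fields to zero are assumptions for this example; the router's own handling may differ.

```python
def normalize_usage(usage):
    """Map OpenAI-style or Anthropic-style usage fields onto prompt/completion."""
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return {"prompt": int(prompt), "completion": int(completion)}


# Made-up usage payloads in the two shapes named above.
openai_usage = {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}
anthropic_usage = {"input_tokens": 12, "output_tokens": 30}

print(normalize_usage(openai_usage))     # {'prompt': 12, 'completion': 30}
print(normalize_usage(anthropic_usage))  # {'prompt': 12, 'completion': 30}
```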

api_key_id Derivation

api_key_id is never the raw API key. The router derives a stable, non-reversible identifier in this priority order:

  1. If the request's bearer token matches a configured API-key entry, the entry's id field is used (e.g., key-production-1).
  2. Otherwise, the router computes SHA-256 over the raw token and uses the first 12 hex characters prefixed with k_ (e.g., k_3f5a7c9b1e2d).
  3. If no token is presented, the literal value anonymous is used.

All label values flow through the existing CardinalityManager so a runaway/rotating-key attack cannot exhaust Prometheus series.
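
The priority order above can be sketched as follows. The function name `derive_api_key_id`, the `configured` mapping from raw key to configured `id`, and treating an empty token as absent are all assumptions made for this example.

```python
import hashlib


def derive_api_key_id(token, configured):
    """Derive a stable identifier; `configured` maps raw keys to their `id` (hypothetical shape)."""
    if not token:
        return "anonymous"  # no token presented
    if token in configured:
        return configured[token]  # configured entry wins
    # Unknown token: first 12 hex characters of SHA-256, prefixed with k_.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return "k_" + digest[:12]


configured = {"secret-prod-token": "key-production-1"}  # made-up raw key
print(derive_api_key_id("secret-prod-token", configured))
print(derive_api_key_id("some-unknown-token", configured))
print(derive_api_key_id(None, configured))
```

Because the fallback identifier is a truncated hash, the same unknown token always maps to the same `api_key_id` without the raw key ever appearing in a label.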

Annotation Labels and api_key_info

Each configured API key may carry an optional free-form annotations: { key: value } map. Operators declare which annotation keys become Prometheus labels via the global metrics.annotation_labels allowlist; everything else stays internal.

Configuration schema (under the existing api_keys block):

api_keys:
  api_keys:
      - key: "${API_KEY_1}"
        id: "key-production-1"
        user_id: "user-admin"
        organization_id: "org-main"
        annotations:
            email: "ops@example.com"
            team: "platform"
            environment: "prod"
            owner: "alice"

metrics:
    enabled: true
    annotation_labels: [email, team]            # Allowlist of label keys

Reserved annotation keys (recommended canonical names, not enforced): email, uuid, owner, team, environment. Operators may add custom keys.

When metrics.annotation_labels is non-empty, the router publishes api_key_info{api_key_id, email, team, ...} = 1 once per known key. Use PromQL joins to project the metadata onto llm_tokens_total without bloating its label set:

# Tokens per email (sums prompt + completion, last 24h)
sum by (email) (
  increase(llm_tokens_total[24h])
  * on (api_key_id) group_left(email) api_key_info
)
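
For readers less familiar with `group_left`, the following sketch emulates the join above on a few made-up samples: each token sample is multiplied by the matching info sample (always 1), picks up that sample's `email` label, and is then summed by email. The key `key-staging-1`, the address `dev@example.com`, and all values are invented for illustration.

```python
from collections import defaultdict

# Made-up samples: 24h token increase per (api_key_id, kind).
token_increase = [
    ({"api_key_id": "key-production-1", "kind": "prompt"}, 1200.0),
    ({"api_key_id": "key-production-1", "kind": "completion"}, 800.0),
    ({"api_key_id": "key-staging-1", "kind": "prompt"}, 300.0),
]
# Made-up info series: one per key, value always 1.
api_key_info = [
    ({"api_key_id": "key-production-1", "email": "ops@example.com"}, 1.0),
    ({"api_key_id": "key-staging-1", "email": "dev@example.com"}, 1.0),
]


def tokens_by_email(tokens, info):
    """Emulate: sum by (email) (tokens * on (api_key_id) group_left(email) info)."""
    info_by_key = {labels["api_key_id"]: (labels["email"], value) for labels, value in info}
    totals = defaultdict(float)
    for labels, value in tokens:
        match = info_by_key.get(labels["api_key_id"])
        if match is None:
            continue  # as in PromQL, samples without a match are dropped
        email, one = match
        totals[email] += value * one
    return dict(totals)


print(tokens_by_email(token_increase, api_key_info))
# {'ops@example.com': 2000.0, 'dev@example.com': 300.0}
```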

Cardinality and Hot-Reload

  • api_key_id cardinality is bounded at 1000 by default.
  • Hot-reload of API-key annotations is supported via the existing config-reload pipeline. The api_key_info info-metric is republished atomically on every reload; counter values for llm_tokens_total are never reset.
  • The label set on api_key_info (i.e., the contents of annotation_labels) is frozen at startup. Adding or removing keys from the allowlist requires a restart — Prometheus does not allow renaming labels on a registered metric.

Example PromQL Queries

# Total prompt tokens consumed per API key in the last hour
sum by (api_key_id) (
  increase(llm_tokens_total{kind="prompt"}[1h])
)

# Top 10 keys by completion tokens in the last 24h
topk(10,
  sum by (api_key_id) (
    increase(llm_tokens_total{kind="completion"}[24h])
  )
)

# Tokens grouped by team (requires team in annotation_labels)
sum by (team) (
  increase(llm_tokens_total[24h])
  * on (api_key_id) group_left(team) api_key_info
)

# Combined prompt+completion rate per model (tokens/sec)
sum by (model) (rate(llm_tokens_total[5m]))

# Per-key consumption by backend (useful for cost attribution)
sum by (api_key_id, backend) (
  increase(llm_tokens_total[24h])
)

Grafana Panel Example

A simple Grafana stat panel showing the top 10 teams by completion tokens over the last 24 hours:

{
    "title": "Top 10 teams by completion tokens (24h)",
    "type": "stat",
    "targets": [
        {
            "expr": "topk(10, sum by (team) (increase(llm_tokens_total{kind=\"completion\"}[24h]) * on (api_key_id) group_left(team) api_key_info))",
            "legendFormat": "{{team}}"
        }
    ],
    "options": {
        "reduceOptions": {
            "values": false,
            "calcs": ["lastNotNull"]
        }
    }
}

For tracking spend trends, pair this with a time-series panel using rate(llm_tokens_total[5m]) grouped by team or model.

Verification Steps

After enabling the feature:

  1. Issue a chat-completion request with a configured API key.
  2. Scrape /metrics and confirm llm_tokens_total{...} and api_key_info{...} series appear.
  3. For streaming, verify the counter still increments — usage is captured from the final SSE chunk. The router injects stream_options.include_usage=true automatically for OpenAI-compat backends so this works regardless of client behavior.
  4. Inspect /metrics cardinality on a typical workload (e.g., curl -s http://localhost:8000/metrics | wc -l) to confirm no regression versus the prior baseline.
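
Step 2 can be automated with a small check like the one below, which uses only the standard library. The helper names (`series_names`, `missing_series`, `scrape`) are hypothetical, the sample text is made up, and the default URL assumes the router listens on localhost:8000 as in the Quick Start.

```python
import urllib.request

REQUIRED = ("llm_tokens_total", "api_key_info")


def series_names(exposition_text):
    """Collect metric names from Prometheus text format, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_series(exposition_text, required=REQUIRED):
    """Return the required metric names that have no series in the scrape."""
    present = series_names(exposition_text)
    return [name for name in required if name not in present]


def scrape(url="http://localhost:8000/metrics"):
    """Fetch the raw exposition text from a running router."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Made-up scrape output that lacks api_key_info.
sample = (
    "# TYPE llm_tokens_total counter\n"
    'llm_tokens_total{api_key_id="key-production-1",model="m",backend="b",kind="prompt"} 42\n'
)
print(missing_series(sample))  # ['api_key_info']
```

Against a live router, `missing_series(scrape())` should return an empty list once a request has been served and metrics.annotation_labels is non-empty.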

Integration

Prometheus Configuration

Complete Prometheus configuration example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'continuum-router'
    static_configs:
      - targets: ['router1:8000', 'router2:8000']
    metric_relabel_configs:
      # Drop high-cardinality metrics if needed
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop

Kubernetes Integration

For Kubernetes deployments, use ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: continuum-router
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: continuum-router
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Grafana Dashboard

The provided Grafana dashboard includes:

Overview Panel

  • Request rate and error rate
  • P50, P95, P99 latencies
  • Active connections
  • Backend health status

Backend Performance

  • Backend-specific latencies
  • Health check success rate
  • Connection pool utilization
  • Circuit breaker status

Model Usage

  • Model request distribution
  • Cache hit rates
  • Token consumption
  • Model availability matrix

Alerts Overview

  • Active alerts
  • Alert history
  • SLO compliance

To import the dashboard:

  1. Open Grafana
  2. Go to Dashboards → Import
  3. Upload monitoring/grafana/dashboards/router-overview.json
  4. Select your Prometheus data source
  5. Click Import

Alerting

Pre-configured alert rules are available in monitoring/prometheus/alerts.yml:

Critical Alerts

- alert: BackendDown
  expr: backend_health_status == 0
  for: 1m
  annotations:
    summary: "Backend {{ $labels.backend_id }} is down"

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  annotations:
    summary: "High error rate: {{ $value | humanizePercentage }}"

Warning Alerts

- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m
  annotations:
    summary: "P95 latency above 1s: {{ $value | humanizeDuration }}"

- alert: LowCacheHitRate
  expr: sum(rate(model_cache_hits_total[5m])) / (sum(rate(model_cache_hits_total[5m])) + sum(rate(model_cache_misses_total[5m]))) < 0.8
  for: 10m
  annotations:
    summary: "Cache hit rate below 80%: {{ $value | humanizePercentage }}"

Examples

Query Examples

Request Rate by Status

sum(rate(http_requests_total[5m])) by (status)

P95 Latency by Endpoint

histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
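
To show roughly how a quantile is estimated from bucket counters, the sketch below applies linear interpolation inside the bucket that contains the target rank. It is a simplified model of the technique for classic histograms, not Prometheus's exact implementation, and the bucket bounds and counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound, end with +Inf, and include at least
    one finite bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up latency histogram: 100 observations in total.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(round(histogram_quantile(0.5, buckets), 3))  # 0.42
print(histogram_quantile(0.95, buckets))           # 1.0
```

The second result illustrates a practical limit: once the target rank lands in the +Inf bucket, the estimate is capped at the largest finite bucket bound, so bucket boundaries should extend past the latencies you alert on.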

Backend Health Overview

sum(backend_health_status) by (backend_id)

Model Usage Ranking

topk(10, sum(rate(model_usage_total[1h])) by (model))

Error Rate Percentage

sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

Programmatic Access

You can also access metrics programmatically:

import requests
from prometheus_client.parser import text_string_to_metric_families

# Fetch metrics
response = requests.get('http://localhost:8000/metrics')
metrics = text_string_to_metric_families(response.text)

# Process metrics
for family in metrics:
    for sample in family.samples:
        if sample.name == 'http_requests_total':
            print(f"Endpoint: {sample.labels['endpoint']}, Count: {sample.value}")

Custom Metrics Collection

#!/bin/bash
# Collect metrics every 30 seconds and save to file

while true; do
  timestamp=$(date +%s)
  curl -s http://localhost:8000/metrics > "metrics_${timestamp}.txt"
  sleep 30
done

Best Practices

1. Label Cardinality

Keep label cardinality low to prevent metric explosion:

# Good: Low cardinality
labels:
  status: "200"  # ~5 possible values
  method: "GET"  # ~7 possible values

# Bad: High cardinality
labels:
  user_id: "12345"  # Unbounded
  request_id: "abc-123"  # Unique per request

2. Metric Naming

Follow Prometheus naming conventions:

  • Use snake_case
  • Include units in metric names (_seconds, _bytes, _total)
  • Use standard prefixes (http_, backend_, model_)

3. Dashboard Design

  • Group related metrics together
  • Use appropriate visualization types (gauge for current values, graph for time series)
  • Include both absolute values and rates
  • Set reasonable refresh intervals (15-30s for real-time, 1-5m for historical)

4. Alert Configuration

  • Use appropriate evaluation periods (for: 5m to avoid flapping)
  • Include context in alert descriptions
  • Set up alert routing based on severity
  • Test alerts in staging before production

5. Performance Considerations

  • Disable optional metrics if not needed
  • Use recording rules for complex queries
  • Implement proper metric retention policies
  • Consider using remote storage for long-term retention

6. Security

  • Protect metrics endpoint if sensitive data is exposed
  • Use TLS for Prometheus scraping in production
  • Implement authentication for Grafana dashboards
  • Audit metric access logs

Troubleshooting

Metrics Not Appearing

  1. Check if metrics are enabled in configuration
  2. Verify the metrics endpoint is accessible
  3. Check Prometheus target status
  4. Review router logs for metric initialization errors

High Memory Usage

  1. Review cardinality limits
  2. Check for unbounded labels
  3. Reduce histogram buckets if needed
  4. Enable metric expiration

Incorrect Values

  1. Verify metric types (counter vs gauge)
  2. Check aggregation functions
  3. Review label selectors
  4. Validate time ranges

Additional Resources