Persistent Local Metrics Log

Continuum Router can persist its Prometheus registry to a local store so that recent metric history survives restarts. The feature is embedded — no external time-series database required for the default deployment.

This page covers what the persistent log is, how to configure it, the disk-usage tradeoff, and how to query history through the admin API.

Overview

The Prometheus counters and gauges that back the /metrics endpoint normally live only in process memory. Restarting the binary wipes them. The persistent metrics log fixes that by snapshotting the entire registry to disk at a configurable interval. Recent history is then queryable via a dedicated admin endpoint.

What this is NOT

  • It is not a replacement for Prometheus / Grafana / Thanos. Operators who already run a TSDB should keep doing so — the log targets single-node deployments and the "what happened just before the restart" diagnostic case.
  • It does not restore counter or gauge state into the live registry on startup. Counters are required to be monotonic by Prometheus semantics, and clients detect resets. The persistent log is a separate read path consumed via GET /admin/metrics/history.
  • It is not PromQL. The query interface is a simple time-range filter.

Default behaviour

Persistence defaults to enabled: true whenever the metrics-persistence Cargo feature is compiled in (it is included in the default full build). To opt out at runtime, set metrics.persistence.enabled: false.

Configuration

Add a persistence block under metrics: in your config.yaml:

metrics:
  persistence:
    enabled: true
    backend: sqlite
    path: ./data/metrics.db
    snapshot_interval_seconds: 60
    retention_days: 15
    compaction:
      enabled: true
      schedule: "0 3 * * *"

Field reference

| Field | Default | Notes |
| --- | --- | --- |
| enabled | true | Setting to false skips opening the DB and spawning the background task. |
| backend | sqlite | sqlite is the only supported backend; redb and duckdb are reserved values. |
| path | ./data/metrics.db | Parent directories are created automatically. |
| snapshot_interval_seconds | 60 | Matches typical Prometheus scrape cadence. Range 1..=86_400. |
| retention_days | 15 | Range 1..=365. See disk-usage formula below. |
| compaction.enabled | true | Toggles backend-specific compaction (SQLite VACUUM). |
| compaction.schedule | "0 3 * * *" | Subset-of-cron syntax: only minute hour * * * is honored. |
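
The ranges in the field reference can be checked before a config is deployed. The following is a minimal sketch, not the router's own validation code: the function name validate_persistence and the exact error strings are hypothetical, and it assumes the persistence block has already been parsed into a Python dict.

```python
import re

# Hypothetical pre-deploy check mirroring the documented ranges.
# Only "minute hour * * *" schedules are accepted, as the field reference states.
CRON_SUBSET = re.compile(r"(\d{1,2}) (\d{1,2}) \* \* \*")


def validate_persistence(cfg: dict) -> list:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []

    backend = cfg.get("backend", "sqlite")
    if backend != "sqlite":
        problems.append(f"backend {backend!r} is not supported (only 'sqlite')")

    interval = cfg.get("snapshot_interval_seconds", 60)
    if not 1 <= interval <= 86_400:
        problems.append("snapshot_interval_seconds must be in 1..=86_400")

    retention = cfg.get("retention_days", 15)
    if not 1 <= retention <= 365:
        problems.append("retention_days must be in 1..=365")

    schedule = cfg.get("compaction", {}).get("schedule", "0 3 * * *")
    match = CRON_SUBSET.fullmatch(schedule)
    if match is None or int(match.group(1)) > 59 or int(match.group(2)) > 23:
        problems.append("compaction.schedule must look like 'minute hour * * *'")

    return problems


print(validate_persistence({"retention_days": 400}))
```

Missing keys fall back to the documented defaults, so an empty dict validates cleanly.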

Hot-reload

The following fields support hot-reload via src/admin_config/:

  • snapshot_interval_seconds
  • retention_days
  • compaction.enabled and compaction.schedule

Retention updates atomically rebuild the prune cutoff without dropping in-flight snapshots.

Changing backend, path, or toggling enabled requires a restart.

Disk usage

The persistent log writes one row per sample on each snapshot tick. Histograms and summaries explode into multiple rows (sum, count, and per-bucket / per-quantile). Counters and gauges emit a single row per series.

Formula

Rough estimate:

bytes ≈ series × (86_400 / snapshot_interval_seconds) × retention_days × bytes_per_sample

Empirically on the SQLite backend, bytes_per_sample is roughly 70 to 120 bytes once the WAL has been checkpointed (measured on a synthetic workload of 100 series × 10 snapshots; see tests/metrics_persistence_test.rs::disk_usage_smoke_check_under_synthetic_load). Real workloads should land in or near that band; because labels are stored as JSON text on every row, larger label sets push toward the upper end.

Worked example

For a deployment with 5,000 active series, 60s snapshots, and 15 days of retention:

5_000 × (86_400 / 60) × 15 × 100 = ~10.8 GB

If 15 days of retention is too aggressive, drop retention_days to 7 (≈ 5 GB at the same series count). If snapshot frequency matters less than disk cost, lengthen snapshot_interval_seconds to 120 or 300.
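
The formula is easy to script when comparing retention and interval settings. This is a sketch of the rough estimate only; the function name estimate_bytes is hypothetical, and the 100-byte default is simply the midpoint used in the worked example. Note that series here means stored rows per snapshot, so each histogram or summary counts once per bucket or quantile plus its sum and count.

```python
def estimate_bytes(series, snapshot_interval_seconds, retention_days,
                   bytes_per_sample=100):
    """Rough on-disk size of the persistent log, per the formula above."""
    snapshots_per_day = 86_400 / snapshot_interval_seconds
    return series * snapshots_per_day * retention_days * bytes_per_sample


# Compare the two retention settings discussed above at 5,000 series, 60s snapshots.
for days in (15, 7):
    gb = estimate_bytes(5_000, 60, days) / 1e9
    print(f"{days} days: {gb:.2f} GB")
```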

Admin endpoint

GET /admin/metrics/history

Query the persistent log for a specific metric over a time range.

Query parameters:

  • metric (required): metric family name, e.g. http_requests_total. Histograms and summaries return multiple kind rows per family (see below).
  • from (optional): inclusive lower bound, either Unix milliseconds (int) or RFC 3339 (2026-05-11T00:00:00Z). Defaults to 24 hours ago.
  • to (optional): exclusive upper bound, same encoding as from. Defaults to now.
  • limit (optional): cap returned rows. Defaults to 10,000; hard ceiling is 100,000.

Example requests

# Last 24 hours of http_requests_total
curl -s 'http://localhost:8080/admin/metrics/history?metric=http_requests_total' | jq .

# Specific RFC 3339 range, capped at 500 rows
curl -s 'http://localhost:8080/admin/metrics/history?metric=model_tokens_processed&from=2026-05-10T00:00:00Z&to=2026-05-11T00:00:00Z&limit=500' | jq .

# Unix-millis range
curl -s 'http://localhost:8080/admin/metrics/history?metric=errors_total&from=1715385600000&to=1715472000000' | jq .
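
The same requests can be built from a script. The sketch below assumes only the query parameters documented above; the helper names to_unix_ms and history_url are hypothetical, and capping limit on the client side is a choice made here rather than documented server behaviour.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to Unix milliseconds (integer arithmetic)."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def history_url(base, metric, start=None, end=None, limit=None):
    """Build a GET /admin/metrics/history URL; omitted bounds use server defaults."""
    params = {"metric": metric}
    if start is not None:
        params["from"] = to_unix_ms(start)   # inclusive lower bound
    if end is not None:
        params["to"] = to_unix_ms(end)       # exclusive upper bound
    if limit is not None:
        params["limit"] = min(limit, 100_000)  # documented hard ceiling
    return f"{base}/admin/metrics/history?{urlencode(params)}"


url = history_url(
    "http://localhost:8080",
    "errors_total",
    start=datetime(2024, 5, 11, tzinfo=timezone.utc),
    end=datetime(2024, 5, 12, tzinfo=timezone.utc),
)
print(url)
```

The printed URL matches the Unix-millis curl example above and can be passed to any HTTP client.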

Response shape

{
  "metric": "http_requests_total",
  "from_ms": 1715385600000,
  "to_ms": 1715472000000,
  "row_count": 1,
  "limit": 10000,
  "samples": [
    {
      "ts_ms": 1715385600000,
      "labels": {"backend": "openai", "endpoint": "/v1/chat/completions"},
      "value": 42.0,
      "kind": "counter"
    }
  ]
}
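
Because the log does not restore counter state on startup, a counter series in the history drops back toward zero at each restart. The sketch below shows one way a client might compute the total increase per label set from the samples array, treating any drop as a reset. The function name counter_increase is hypothetical, and the input is assumed to be the already-parsed JSON response.

```python
import json
from collections import defaultdict


def counter_increase(samples):
    """Total increase per label set, treating any drop in value as a reset to zero."""
    by_series = defaultdict(list)
    for s in samples:
        if s["kind"] == "counter":
            # Sorted-key JSON gives a stable grouping key for each label set.
            key = json.dumps(s["labels"], sort_keys=True)
            by_series[key].append((s["ts_ms"], s["value"]))

    increases = {}
    for key, points in by_series.items():
        points.sort()  # order by timestamp
        total = 0.0
        for (_, prev), (_, cur) in zip(points, points[1:]):
            # After a restart the counter starts again from zero,
            # so the whole current value counts as new increase.
            total += (cur - prev) if cur >= prev else cur
        increases[key] = total
    return increases


labels = {"backend": "openai", "endpoint": "/v1/chat/completions"}
samples = [
    {"ts_ms": 1000, "labels": labels, "value": 40.0, "kind": "counter"},
    {"ts_ms": 2000, "labels": labels, "value": 42.0, "kind": "counter"},
    {"ts_ms": 3000, "labels": labels, "value": 3.0, "kind": "counter"},   # restart
    {"ts_ms": 4000, "labels": labels, "value": 5.0, "kind": "counter"},
]
print(counter_increase(samples))
```

Increments that happened between the last snapshot before a restart and the restart itself are not recorded, so the result is a lower bound.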

Sample kinds

| kind | Source | Notes |
| --- | --- | --- |
| counter | Counter family | Cumulative monotonic value. |
| gauge | Gauge family | Instantaneous value. |
| histogram_sum | Histogram family | Sum of all observations. |
| histogram_count | Histogram family | Count of all observations. |
| histogram_bucket | Histogram family | Cumulative count per bucket. labels.le carries the upper bound. |
| summary_sum | Summary family | Sum of all observations. |
| summary_count | Summary family | Count of all observations. |
| summary_quantile | Summary family | Per-quantile value. labels.quantile carries the q. |
| untyped | Unknown / future kinds | Forward-compat catch-all. |
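
Since histogram_bucket values are cumulative, a client that wants the number of observations falling inside each bucket has to subtract neighbouring buckets. A minimal sketch follows, assuming the samples belong to a single series (one label set apart from le) and that le values are numeric strings with "+Inf" for the last bucket, as in the Prometheus exposition format. The function name per_bucket_counts is hypothetical.

```python
def per_bucket_counts(samples, ts_ms):
    """Turn cumulative histogram_bucket rows at one timestamp into per-bucket counts."""
    rows = [
        s for s in samples
        if s["kind"] == "histogram_bucket" and s["ts_ms"] == ts_ms
    ]
    # float() accepts "+Inf", so the overflow bucket sorts last.
    rows.sort(key=lambda s: float(s["labels"]["le"]))

    counts, prev = [], 0.0
    for s in rows:
        counts.append((s["labels"]["le"], s["value"] - prev))
        prev = s["value"]
    return counts


samples = [
    {"ts_ms": 1000, "labels": {"le": "+Inf"}, "value": 10.0, "kind": "histogram_bucket"},
    {"ts_ms": 1000, "labels": {"le": "0.1"}, "value": 3.0, "kind": "histogram_bucket"},
    {"ts_ms": 1000, "labels": {"le": "0.5"}, "value": 7.0, "kind": "histogram_bucket"},
    {"ts_ms": 1000, "labels": {}, "value": 10.0, "kind": "histogram_count"},
]
print(per_bucket_counts(samples, 1000))
```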

Error responses

  • 400 Bad Request: metric is missing, empty, oversized, or the time range is non-positive.
  • 404 Not Found: persistence is disabled (metrics.persistence.enabled: false).
  • 500 Internal Server Error: storage error; details in router logs.
  • 503 Service Unavailable: metrics-persistence feature was not compiled in.

Storage layout

The SQLite schema is intentionally minimal, keeping the row shape portable to other storage engines:

CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,   -- Unix milliseconds
    metric TEXT    NOT NULL,   -- metric family name
    labels TEXT    NOT NULL,   -- canonical sorted-key JSON
    value  REAL    NOT NULL,   -- sample value (cumulative for buckets)
    kind   TEXT    NOT NULL    -- counter | gauge | histogram_* | summary_* | untyped
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
CREATE INDEX IF NOT EXISTS idx_metric_samples_ts
    ON metric_samples (ts);

The labels column stores label sets as canonical sorted-key JSON (e.g. {"backend":"openai","model":"gpt-4o"}). This makes (metric, labels) a deterministic equality key for ad-hoc joins.

PRAGMA journal_mode = WAL is set so the admin endpoint can read concurrently with the snapshot writer.
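
To experiment with the row shape offline, the schema can be loaded into an in-memory SQLite database. The sketch below is illustrative only: canonical_labels and query_range are hypothetical helpers, the compact JSON separators are inferred from the example above rather than confirmed, and the ORDER BY ts is an assumption about how a time-range read might be written, not the router's actual query.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,
    metric TEXT    NOT NULL,
    labels TEXT    NOT NULL,
    value  REAL    NOT NULL,
    kind   TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
"""


def canonical_labels(labels: dict) -> str:
    """Serialize a label set as sorted-key JSON so equal sets compare equal as text."""
    return json.dumps(labels, sort_keys=True, separators=(",", ":"))


def query_range(conn, metric, from_ms, to_ms, limit=10_000):
    """Half-open range [from_ms, to_ms), mirroring the admin endpoint's bounds."""
    cur = conn.execute(
        "SELECT ts, labels, value, kind FROM metric_samples "
        "WHERE metric = ? AND ts >= ? AND ts < ? ORDER BY ts LIMIT ?",
        (metric, from_ms, to_ms, limit),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
labels = canonical_labels({"model": "gpt-4o", "backend": "openai"})
conn.executemany(
    "INSERT INTO metric_samples VALUES (?, ?, ?, ?, ?)",
    [
        (1000, "http_requests_total", labels, 40.0, "counter"),
        (2000, "http_requests_total", labels, 42.0, "counter"),
        (3000, "http_requests_total", labels, 45.0, "counter"),
    ],
)
rows = query_range(conn, "http_requests_total", 1000, 3000)
print(rows)
```

The row at ts = 3000 is excluded because the upper bound is exclusive.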

Operational notes

  • First-run permissions: ensure the process can create the parent directory of path. The router will log an error and continue serving /metrics without history if it cannot.
  • Backup: with PRAGMA journal_mode = WAL, copy the .db, .db-wal, and .db-shm files together (or run sqlite3 metrics.db .backup) for a consistent snapshot.
  • Inspection: sqlite3 ./data/metrics.db 'select count(*) from metric_samples;' is safe to run against a live router.
  • Disabling at runtime: toggling metrics.persistence.enabled requires a restart (see Hot-reload). A hot-reload alone does not tear down the snapshot task or release the database file handle, so set enabled: false and then restart the router.
  • Metrics and Monitoring covers the live /metrics endpoint, Grafana/Prometheus integration, and the per-API-key token-usage metrics whose long-term analysis this persistence layer enables.
  • The admin stats snapshot (admin.stats.persistence in the Admin API) separately persists aggregate /admin/stats counters across restarts; it stores a different shape and serves a different read path.
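
For the backup note under Operational notes, the online backup that sqlite3 metrics.db .backup performs is also available from Python's standard library through Connection.backup. The sketch below runs against a throwaway database standing in for ./data/metrics.db; the function name backup_metrics_db and the file names are hypothetical.

```python
import os
import sqlite3
import tempfile


def backup_metrics_db(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # reads through SQLite, so committed WAL content is included
    finally:
        dest.close()
        src.close()


# Demo: a temporary WAL-mode database plays the role of the router's metrics.db.
with tempfile.TemporaryDirectory() as tmp:
    live = os.path.join(tmp, "metrics.db")
    writer = sqlite3.connect(live)
    writer.execute("PRAGMA journal_mode = WAL")
    writer.execute(
        "CREATE TABLE metric_samples "
        "(ts INTEGER, metric TEXT, labels TEXT, value REAL, kind TEXT)"
    )
    writer.execute("INSERT INTO metric_samples VALUES (1, 'up', '{}', 1.0, 'gauge')")
    writer.commit()  # the writer stays open, as it would in a running router

    copy = os.path.join(tmp, "metrics-backup.db")
    backup_metrics_db(live, copy)

    check = sqlite3.connect(copy)
    copied_rows = check.execute("SELECT count(*) FROM metric_samples").fetchone()[0]
    check.close()
    writer.close()

print(copied_rows)
```

Unlike copying the .db, .db-wal, and .db-shm files by hand, this produces a single self-contained file.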