Persistent Local Metrics Log¶

Continuum Router can persist its Prometheus registry to a local store so that recent metric history survives restarts. The store is embedded, so the default deployment needs no external time-series database.

This page covers what the persistent log is, how to configure it, the disk-usage tradeoff, and how to query history through the admin API.

Overview¶

The Prometheus counters and gauges that back the /metrics endpoint normally live only in process memory. Restarting the binary wipes them. The persistent metrics log fixes that by snapshotting the entire registry to disk at a configurable interval. Recent history is then queryable via a dedicated admin endpoint.

What this is NOT¶

It is not a replacement for Prometheus / Grafana / Thanos. Operators who already run a TSDB should keep doing so — the log targets single-node deployments and the "what happened just before the restart" diagnostic case.
It does not restore counter or gauge state into the live registry on startup. Prometheus semantics require counters to be monotonic, and clients already treat a drop to a lower value as a reset, so the live registry starts fresh after a restart. The persistent log is a separate read path, consumed via GET /admin/metrics/history.
It is not PromQL. The query interface is a simple time-range filter.
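
Because counters are not restored, a counter series read back from the history endpoint can drop to a low value at the point where the router restarted. A consumer that wants the total increase over a window has to handle that itself. The sketch below shows one common approach (treat any decrease as a reset). The function name `counter_increase` is illustrative and is not part of the router.

```python
from typing import Sequence


def counter_increase(values: Sequence[float]) -> float:
    """Total increase of a cumulative counter, tolerating resets.

    `values` are successive samples of one series, oldest first.
    A sample lower than its predecessor is treated as a restart:
    the counter began again at zero, so the whole new value counts.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total
```

For the samples `[10, 15, 20, 3, 8]` this yields 18: 10 before the restart, then 3 and 5 after it. Increments that happened between the last snapshot and the restart were never recorded, so the result is a lower bound.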

Default behaviour¶

The feature defaults to enabled: true when the metrics-persistence Cargo feature is compiled in (which is part of the full default). To opt out at runtime, set metrics.persistence.enabled: false.

Configuration¶

Add a persistence block under metrics: in your config.yaml:

metrics:
  persistence:
    enabled: true
    backend: sqlite
    path: ./data/metrics.db
    snapshot_interval_seconds: 60
    retention_days: 15
    compaction:
      enabled: true
      schedule: "0 3 * * *"

Field reference¶

Field	Default	Notes
`enabled`	`true`	Setting to `false` skips opening the DB and spawning the background task.
`backend`	`sqlite`	`sqlite` is the only supported backend; `redb` and `duckdb` are reserved values.
`path`	`./data/metrics.db`	Parent directories are created automatically.
`snapshot_interval_seconds`	`60`	Matches typical Prometheus scrape cadence. Range `1..=86_400`.
`retention_days`	`15`	Range `1..=365`. See disk-usage formula below.
`compaction.enabled`	`true`	Toggles backend-specific compaction (SQLite `VACUUM`).
`compaction.schedule`	`"0 3 * * *"`	Subset-of-cron syntax: only `minute hour * * *` is honored.
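
To make the `minute hour * * *` restriction concrete, here is a minimal sketch of a parser for that subset and of the resulting daily run time. It is an illustration only: the helper names are hypothetical, and rejecting other expressions is an assumption, since this page does not say how the router treats them.

```python
from datetime import datetime, timedelta


def parse_daily_schedule(expr: str) -> tuple[int, int]:
    """Parse 'minute hour * * *' into (hour, minute); reject anything else."""
    fields = expr.split()
    if len(fields) != 5 or fields[2:] != ["*", "*", "*"]:
        raise ValueError(f"unsupported schedule: {expr!r}")
    minute, hour = int(fields[0]), int(fields[1])  # non-numeric fields raise ValueError
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError(f"out-of-range schedule: {expr!r}")
    return hour, minute


def next_run(expr: str, now: datetime) -> datetime:
    """First time strictly after `now` that matches the daily schedule."""
    hour, minute = parse_daily_schedule(expr)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

With the default `"0 3 * * *"`, a call at 02:00 returns 03:00 the same day, and a call at exactly 03:00 returns 03:00 the next day.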

Hot-reload¶

The following fields support hot-reload via src/admin_config/:

snapshot_interval_seconds
retention_days
compaction.enabled and compaction.schedule

Retention updates atomically rebuild the prune cutoff without dropping in-flight snapshots.

Changing backend, path, or toggling enabled requires a restart.

Disk usage¶

The persistent log writes one row per sample on each snapshot tick. Counters and gauges emit a single row per series. Histograms and summaries expand into multiple rows per series: one for the sum, one for the count, and one per bucket or quantile.

Formula¶

Rough estimate:

bytes ≈ series × (86_400 / snapshot_interval_seconds) × retention_days × bytes_per_sample

Empirically on the SQLite backend, bytes_per_sample ≈ 70-120 bytes once the WAL has been checkpointed (measured on a synthetic workload of 100 series × 10 snapshots; see tests/metrics_persistence_test.rs::disk_usage_smoke_check_under_synthetic_load). Real workloads should land near that band, with larger label sets pushing toward the upper end.

Worked example¶

For a deployment with 5,000 active series, 60s snapshots, and 15 days of retention:

5_000 × (86_400 / 60) × 15 × 100 = ~10.8 GB

If 15 days of retention costs too much disk, drop retention_days to 7 (≈ 5 GB at the same series count). If snapshot frequency matters less than disk cost, lengthen snapshot_interval_seconds to 120 or 300, which halves the estimate or cuts it to one fifth.
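
The formula is easy to script when sizing a volume. The helper below is a direct transcription of it; the function name and the default of 100 bytes per sample (the figure used in the worked example) are choices made for this sketch.

```python
def estimate_bytes(series, snapshot_interval_seconds, retention_days,
                   bytes_per_sample=100):
    """Rough on-disk size of the persistent metrics log, in bytes."""
    snapshots_per_day = 86_400 / snapshot_interval_seconds
    return series * snapshots_per_day * retention_days * bytes_per_sample


# The worked example, plus the low and high ends of the measured band.
for bps in (70, 100, 120):
    gb = estimate_bytes(5_000, 60, 15, bps) / 1e9
    print(f"{bps} bytes/sample: {gb:.2f} GB")
```

For the worked example this prints 7.56 GB, 10.80 GB, and 12.96 GB, which gives a sense of how much the label-set size can move the estimate.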

Admin endpoint¶

`GET /admin/metrics/history`¶

Query the persistent log for a specific metric over a time range.

Query parameters:

metric (required): metric family name, e.g. http_requests_total. Histograms and summaries return rows of several kinds per family (see Sample kinds below).
from (optional): inclusive lower bound, either Unix milliseconds (int) or RFC 3339 (2026-05-11T00:00:00Z). Defaults to 24 hours ago.
to (optional): exclusive upper bound, same encoding as from. Defaults to now.
limit (optional): cap returned rows. Defaults to 10,000; hard ceiling is 100,000.

Example requests¶

# Last 24 hours of http_requests_total
curl -s 'http://localhost:8080/admin/metrics/history?metric=http_requests_total' | jq .

# Specific RFC 3339 range, capped at 500 rows
curl -s 'http://localhost:8080/admin/metrics/history?metric=model_tokens_processed&from=2026-05-10T00:00:00Z&to=2026-05-11T00:00:00Z&limit=500' | jq .

# Unix-millis range
curl -s 'http://localhost:8080/admin/metrics/history?metric=errors_total&from=1715385600000&to=1715472000000' | jq .
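
The same requests can be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the client-side check against the 100,000 ceiling is an assumption made here, since this page does not say whether the server rejects or clamps larger values.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MAX_LIMIT = 100_000  # hard ceiling documented for the `limit` parameter


def history_url(base, metric, start=None, end=None, limit=None):
    """Build a GET /admin/metrics/history URL from the documented parameters."""
    if not metric:
        raise ValueError("metric is required")
    params = {"metric": metric}
    if start is not None:
        params["from"] = start  # Unix milliseconds or an RFC 3339 string
    if end is not None:
        params["to"] = end
    if limit is not None:
        if limit > MAX_LIMIT:
            raise ValueError(f"limit exceeds the ceiling of {MAX_LIMIT}")
        params["limit"] = limit
    return f"{base.rstrip('/')}/admin/metrics/history?{urlencode(params)}"


def fetch_history(base, metric, **kwargs):
    """Fetch and decode one history response; raises on HTTP errors."""
    with urlopen(history_url(base, metric, **kwargs), timeout=10) as resp:
        return json.load(resp)
```

`urlencode` percent-encodes the colons in RFC 3339 timestamps, which keeps the query string valid without manual quoting.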

Response shape¶

{
  "metric": "http_requests_total",
  "from_ms": 1715385600000,
  "to_ms": 1715472000000,
  "row_count": 1,
  "limit": 10000,
  "samples": [
    {
      "ts_ms": 1715385600000,
      "labels": {"backend": "openai", "endpoint": "/v1/chat/completions"},
      "value": 42.0,
      "kind": "counter"
    }
  ]
}

Sample kinds¶

`kind`	Source	Notes
`counter`	Counter family	Cumulative monotonic value.
`gauge`	Gauge family	Instantaneous value.
`histogram_sum`	Histogram family	Sum of all observations.
`histogram_count`	Histogram family	Count of all observations.
`histogram_bucket`	Histogram family	Cumulative count per bucket. `labels.le` carries the upper bound.
`summary_sum`	Summary family	Sum of all observations.
`summary_count`	Summary family	Count of all observations.
`summary_quantile`	Summary family	Per-quantile value. `labels.quantile` carries the q.
`untyped`	Unknown / future kinds	Forward-compat catch-all.
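
Since `histogram_bucket` values are cumulative, a consumer that wants the number of observations falling inside each bucket has to subtract neighbouring buckets. The sketch below does that for one snapshot of one histogram series. It assumes `labels.le` is a string in the Prometheus text style (for example `"0.5"` or `"+Inf"`) and that the samples passed in share all other labels; both are assumptions of this example, not guarantees stated on this page.

```python
def bucket_counts(samples, ts_ms):
    """Per-bucket (non-cumulative) counts for the snapshot taken at `ts_ms`.

    `samples` is the "samples" array of a history response, restricted to a
    single label set apart from `le`. Returns {upper_bound: count}.
    """
    cumulative = sorted(
        (float(s["labels"]["le"]), s["value"])  # float("+Inf") parses to infinity
        for s in samples
        if s["kind"] == "histogram_bucket" and s["ts_ms"] == ts_ms
    )
    counts, previous = {}, 0.0
    for upper_bound, value in cumulative:
        counts[upper_bound] = value - previous
        previous = value
    return counts
```

The cumulative value of the `+Inf` bucket should equal the `histogram_count` row of the same snapshot, which makes a convenient sanity check.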

Error responses¶

400 Bad Request — metric is missing, empty, oversized, or the time range is non-positive.
404 Not Found — persistence is disabled (metrics.persistence.enabled: false).
500 Internal Server Error — storage error; details in router logs.
503 Service Unavailable — metrics-persistence feature was not compiled in.

Storage layout¶

The SQLite schema is intentionally minimal, keeping the row shape portable to other storage engines:

CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,   -- Unix milliseconds
    metric TEXT    NOT NULL,   -- metric family name
    labels TEXT    NOT NULL,   -- canonical sorted-key JSON
    value  REAL    NOT NULL,   -- sample value (cumulative for buckets)
    kind   TEXT    NOT NULL    -- counter | gauge | histogram_* | summary_* | untyped
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
CREATE INDEX IF NOT EXISTS idx_metric_samples_ts
    ON metric_samples (ts);

The labels column stores label sets as canonical sorted-key JSON (e.g. {"backend":"openai","model":"gpt-4o"}). This makes (metric, labels) a deterministic equality key for ad-hoc joins.

PRAGMA journal_mode = WAL is set so the admin endpoint can read concurrently with the snapshot writer.
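
The schema can be exercised against an in-memory database to see how canonical labels and the time-range filter behave. This is a sketch under stated assumptions: the compact `json.dumps` settings are inferred from the example label string above, and the `SELECT` mirrors the documented endpoint semantics (inclusive `from`, exclusive `to`) rather than the router's actual query.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metric_samples (
    ts INTEGER NOT NULL, metric TEXT NOT NULL, labels TEXT NOT NULL,
    value REAL NOT NULL, kind TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
"""


def canonical_labels(labels):
    """Serialize a label set as sorted-key JSON with no extra whitespace."""
    return json.dumps(labels, sort_keys=True, separators=(",", ":"))


def query_range(conn, metric, from_ms, to_ms, limit=10_000):
    """Rows for `metric` with from_ms <= ts < to_ms, oldest first."""
    return conn.execute(
        "SELECT ts, labels, value, kind FROM metric_samples "
        "WHERE metric = ? AND ts >= ? AND ts < ? ORDER BY ts LIMIT ?",
        (metric, from_ms, to_ms, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
labels = canonical_labels({"model": "gpt-4o", "backend": "openai"})
for ts, value in [(1_000, 1.0), (2_000, 2.0), (3_000, 3.0)]:
    conn.execute(
        "INSERT INTO metric_samples VALUES (?, ?, ?, ?, ?)",
        (ts, "http_requests_total", labels, value, "counter"),
    )
rows = query_range(conn, "http_requests_total", 1_000, 3_000)
```

Here `rows` holds the samples at 1000 and 2000 ms but not the one at 3000 ms, because the upper bound is exclusive. Sorting the keys before serializing is what lets two snapshots of the same series compare equal on the `labels` column.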

Operational notes¶

First-run permissions: ensure the process can create the parent directory of path. The router will log an error and continue serving /metrics without history if it cannot.
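Backup: with PRAGMA journal_mode = WAL, either stop the router and copy the .db, .db-wal, and .db-shm files together, or run sqlite3 metrics.db ".backup metrics-backup.db" against the live database for a consistent snapshot. Copying the files while the router is writing can produce an inconsistent copy.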
Inspection: sqlite3 ./data/metrics.db 'select count(*) from metric_samples;' is safe to run against a live router.
Disabling at runtime: set metrics.persistence.enabled: false and restart the router. Toggling enabled is not hot-reloadable: a hot-reload alone does not tear down the snapshot task, so the file handle stays open until the process restarts.
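
For scheduled backups from a script, Python's `sqlite3` module exposes SQLite's online backup API, which takes a consistent copy of a database that is being written to. The sketch below is one way to use it; the function name and paths are placeholders chosen for this example.

```python
import sqlite3


def backup_metrics_db(source_path, dest_path):
    """Copy a (possibly live) SQLite database to `dest_path`.

    Uses sqlite3.Connection.backup, so the copy is consistent even while
    another process is writing, and the WAL contents are included.
    """
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A call such as `backup_metrics_db("./data/metrics.db", "./backups/metrics.db")` produces a single self-contained file, so there are no `-wal` or `-shm` companions to keep track of. The destination directory must already exist.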

Metrics and Monitoring covers the live /metrics endpoint, Grafana/Prometheus integration, and the per-API-key token-usage metrics whose long-term analysis this persistence layer enables.
The admin stats snapshot (admin.stats.persistence in the Admin API) separately persists aggregate /admin/stats counters across restarts; it stores a different shape and serves a different read path.