Persistent Local Metrics Log¶
Continuum Router can persist its Prometheus registry to a local store so that recent metric history survives restarts. The feature is embedded — no external time-series database required for the default deployment.
This page covers what the persistent log is, how to configure it, the disk-usage tradeoff, and how to query history through the admin API.
Overview¶
The Prometheus counters and gauges that back the /metrics endpoint normally live only in process memory. Restarting the binary wipes them. The persistent metrics log fixes that by snapshotting the entire registry to disk at a configurable interval. Recent history is then queryable via a dedicated admin endpoint.
What this is NOT¶
- It is not a replacement for Prometheus / Grafana / Thanos. Operators who already run a TSDB should keep doing so — the log targets single-node deployments and the "what happened just before the restart" diagnostic case.
- It does not restore counter or gauge state into the live registry on startup. Prometheus semantics require counters to be monotonic within a process lifetime, and clients already detect resets, so live counters simply start again from zero after a restart. The persistent log is a separate read path consumed via GET /admin/metrics/history.
- It is not PromQL. The query interface is a simple time-range filter.
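Because counter values are not restored, a counter read back from the history log can drop to a lower value across a restart. A minimal sketch of reset-aware accumulation, assuming the values of one series are already ordered by time (the function name `increase_with_resets` is hypothetical and not part of the router):

```python
def increase_with_resets(values):
    """Total increase of a counter series, treating any drop as a reset to zero."""
    total = 0.0
    prev = None
    for v in values:
        if prev is not None:
            if v >= prev:
                total += v - prev  # normal monotonic growth
            else:
                total += v         # reset: count growth since the restart
        prev = v
    return total


if __name__ == "__main__":
    # The counter restarts between 15 and 3.
    print(increase_with_resets([10, 15, 3, 7]))
```

The sketch cannot see increments that happened between the last snapshot before a restart and the restart itself, so the result is a lower bound.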
Default behaviour¶
Persistence defaults to enabled: true whenever the metrics-persistence Cargo feature is compiled in (that feature is part of the full default). To opt out at runtime, set metrics.persistence.enabled: false.
Configuration¶
Add a persistence block under metrics: in your config.yaml:
metrics:
  persistence:
    enabled: true
    backend: sqlite
    path: ./data/metrics.db
    snapshot_interval_seconds: 60
    retention_days: 15
    compaction:
      enabled: true
      schedule: "0 3 * * *"
Field reference¶
| Field | Default | Notes |
|---|---|---|
| enabled | true | Setting to false skips opening the DB and spawning the background task. |
| backend | sqlite | sqlite is the only supported backend; redb and duckdb are reserved values. |
| path | ./data/metrics.db | Parent directories are created automatically. |
| snapshot_interval_seconds | 60 | Matches typical Prometheus scrape cadence. Range 1..=86_400. |
| retention_days | 15 | Range 1..=365. See disk-usage formula below. |
| compaction.enabled | true | Toggles backend-specific compaction (SQLite VACUUM). |
| compaction.schedule | "0 3 * * *" | Subset-of-cron syntax: only minute hour * * * is honored. |
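The ranges in the table can be checked before a config is deployed. The following is a sketch only: `validate_persistence` and its messages are hypothetical, and it assumes the minute and hour fields of the schedule are plain integers, which the table does not spell out.

```python
import re

# Assumption: "minute hour * * *" with integer minute (0-59) and hour (0-23).
_SCHEDULE = re.compile(r"^(\d{1,2}) (\d{1,2}) \* \* \*$")


def validate_persistence(cfg):
    """Return a list of problems found in a metrics.persistence mapping.

    An empty list means the block is within the documented ranges.
    Missing keys fall back to the documented defaults.
    """
    problems = []
    if cfg.get("backend", "sqlite") != "sqlite":
        problems.append("backend: only 'sqlite' is supported")
    interval = cfg.get("snapshot_interval_seconds", 60)
    if not 1 <= interval <= 86_400:
        problems.append("snapshot_interval_seconds: must be in 1..=86_400")
    retention = cfg.get("retention_days", 15)
    if not 1 <= retention <= 365:
        problems.append("retention_days: must be in 1..=365")
    schedule = cfg.get("compaction", {}).get("schedule", "0 3 * * *")
    match = _SCHEDULE.match(schedule)
    if match is None or int(match.group(1)) > 59 or int(match.group(2)) > 23:
        problems.append("compaction.schedule: expected 'minute hour * * *'")
    return problems


if __name__ == "__main__":
    print(validate_persistence({"retention_days": 400}))
```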
Hot-reload¶
The following fields support hot-reload via src/admin_config/:
- snapshot_interval_seconds
- retention_days
- compaction.enabled and compaction.schedule
Retention updates atomically rebuild the prune cutoff without dropping in-flight snapshots.
Changing backend, path, or toggling enabled requires a restart.
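To make the prune cutoff concrete, here is a sketch of how a retention window maps onto the ts column. It is an illustration under assumptions, not the router's code: the `prune` helper and the DELETE statement are hypothetical, and only the table layout is taken from the Storage layout section.

```python
import sqlite3

DAY_MS = 86_400_000  # milliseconds per day

SCHEMA = """
CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,
    metric TEXT    NOT NULL,
    labels TEXT    NOT NULL,
    value  REAL    NOT NULL,
    kind   TEXT    NOT NULL
);
"""


def prune(conn, now_ms, retention_days):
    """Delete rows older than the retention window; return how many were removed."""
    cutoff_ms = now_ms - retention_days * DAY_MS
    cur = conn.execute("DELETE FROM metric_samples WHERE ts < ?", (cutoff_ms,))
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    now = 20 * DAY_MS
    for age_days in (1, 10, 16):
        conn.execute(
            "INSERT INTO metric_samples VALUES (?, 'up', '{}', 1.0, 'gauge')",
            (now - age_days * DAY_MS,),
        )
    print(prune(conn, now, retention_days=15))
```

Computing the cutoff from the current retention_days on every run means a hot-reloaded value takes effect at the next prune without touching rows that are still inside the window.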
Disk usage¶
The persistent log writes one row per sample on each snapshot tick. Histograms and summaries explode into multiple rows (sum, count, and per-bucket / per-quantile). Counters and gauges emit a single row per series.
Formula¶
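Rough estimate:
disk_bytes ≈ active_series × (86,400 / snapshot_interval_seconds) × retention_days × bytes_per_sample
Here active_series counts every row written per snapshot, so each histogram bucket and summary quantile counts separately.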
Empirically on the SQLite backend, bytes_per_sample ≈ 70-120 bytes once the WAL has been checkpointed (measured on a synthetic workload of 100 series × 10 snapshots; see tests/metrics_persistence_test.rs::disk_usage_smoke_check_under_synthetic_load). Real workloads will land in that band depending on label-set size.
Worked example¶
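For a deployment with 5,000 active series, 60s snapshots, and 15 days of retention:
5,000 series × 1,440 snapshots/day × 15 days = 108,000,000 rows
108,000,000 rows × 70-120 bytes ≈ 7.6-13.0 GB (≈ 10.8 GB at 100 bytes per sample)
If 15 days of retention costs too much disk, drop retention_days to 7 (≈ 5 GB at the same series count). If snapshot frequency matters less than disk cost, lengthen snapshot_interval_seconds to 120 or 300.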
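The estimate is easy to script when comparing retention and interval settings. This sketch assumes the size is simply rows written over the retention window multiplied by bytes_per_sample; the function name `estimate_disk_bytes` is hypothetical.

```python
def estimate_disk_bytes(active_series, snapshot_interval_seconds,
                        retention_days, bytes_per_sample):
    """Rough on-disk size: rows kept in the retention window times bytes per row."""
    snapshots_per_day = 86_400 / snapshot_interval_seconds
    total_rows = active_series * snapshots_per_day * retention_days
    return total_rows * bytes_per_sample


if __name__ == "__main__":
    # Sweep the empirical 70-120 byte band for the worked example above.
    for bytes_per_sample in (70, 100, 120):
        size_gb = estimate_disk_bytes(5_000, 60, 15, bytes_per_sample) / 1e9
        print(f"{bytes_per_sample} B/sample: {size_gb:.2f} GB")
```

Passing the row count after histogram and summary expansion as active_series, rather than the number of metric families, keeps the estimate from running low.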
Admin endpoint¶
GET /admin/metrics/history¶
Query the persistent log for a specific metric over a time range.
Query parameters:
- metric (required): metric family name, e.g. http_requests_total. Histograms and summaries return multiple kind rows per family (see below).
- from (optional): inclusive lower bound, either Unix milliseconds (int) or RFC 3339 (2026-05-11T00:00:00Z). Defaults to 24 hours ago.
- to (optional): exclusive upper bound, same encoding as from. Defaults to now.
- limit (optional): cap returned rows. Defaults to 10,000; hard ceiling is 100,000.
Example requests¶
# Last 24 hours of http_requests_total
curl -s 'http://localhost:8080/admin/metrics/history?metric=http_requests_total' | jq .
# Specific RFC 3339 range, capped at 500 rows
curl -s 'http://localhost:8080/admin/metrics/history?metric=model_tokens_processed&from=2026-05-10T00:00:00Z&to=2026-05-11T00:00:00Z&limit=500' | jq .
# Unix-millis range
curl -s 'http://localhost:8080/admin/metrics/history?metric=errors_total&from=1715385600000&to=1715472000000' | jq .
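The same requests can be built programmatically. A minimal sketch that only constructs the URL, using the Unix-millisecond encoding for from and to; `history_url` is a hypothetical helper, and rejecting an over-ceiling limit on the client side is a choice made here, not documented server behaviour.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_LIMIT = 100_000  # hard ceiling from the parameter list


def to_unix_ms(dt):
    """Convert a timezone-aware datetime to integer Unix milliseconds."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def history_url(base_url, metric, start=None, end=None, limit=None):
    """Build a GET /admin/metrics/history URL; omitted bounds use server defaults."""
    if not metric:
        raise ValueError("metric is required")
    params = {"metric": metric}
    if start is not None:
        params["from"] = to_unix_ms(start)   # inclusive lower bound
    if end is not None:
        params["to"] = to_unix_ms(end)       # exclusive upper bound
    if limit is not None:
        if not 1 <= limit <= MAX_LIMIT:
            raise ValueError(f"limit must be in 1..={MAX_LIMIT}")
        params["limit"] = limit
    return f"{base_url}/admin/metrics/history?{urlencode(params)}"


if __name__ == "__main__":
    print(history_url(
        "http://localhost:8080",
        "errors_total",
        start=datetime(2024, 5, 11, tzinfo=timezone.utc),
        end=datetime(2024, 5, 12, tzinfo=timezone.utc),
    ))
```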
Response shape¶
{
  "metric": "http_requests_total",
  "from_ms": 1715385600000,
  "to_ms": 1715472000000,
  "row_count": 1,
  "limit": 10000,
  "samples": [
    {
      "ts_ms": 1715385600000,
      "labels": {"backend": "openai", "endpoint": "/v1/chat/completions"},
      "value": 42.0,
      "kind": "counter"
    }
  ]
}
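A response mixes samples from every label set of the requested family, so clients usually regroup them into one series per label set. A sketch of that step, assuming the response shape shown above; `group_by_labels` is a hypothetical client-side helper.

```python
import json
from collections import defaultdict


def group_by_labels(payload):
    """Split a history response into {canonical_labels_json: [(ts_ms, value), ...]}."""
    series = defaultdict(list)
    for sample in payload["samples"]:
        # Sorted-key JSON gives each label set a stable dictionary key.
        key = json.dumps(sample["labels"], sort_keys=True, separators=(",", ":"))
        series[key].append((sample["ts_ms"], sample["value"]))
    for points in series.values():
        points.sort()  # order each series by timestamp
    return dict(series)


if __name__ == "__main__":
    body = json.loads("""
    {"metric": "http_requests_total", "samples": [
      {"ts_ms": 2000, "labels": {"backend": "openai"}, "value": 7.0, "kind": "counter"},
      {"ts_ms": 1000, "labels": {"backend": "openai"}, "value": 5.0, "kind": "counter"},
      {"ts_ms": 1000, "labels": {"backend": "local"}, "value": 1.0, "kind": "counter"}
    ]}
    """)
    for labels, points in group_by_labels(body).items():
        print(labels, points)
```

For histogram and summary families, filter on kind first, otherwise sum, count, and bucket rows that share a label set end up in the same series.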
Sample kinds¶
| kind | Source | Notes |
|---|---|---|
| counter | Counter family | Cumulative monotonic value. |
| gauge | Gauge family | Instantaneous value. |
| histogram_sum | Histogram family | Sum of all observations. |
| histogram_count | Histogram family | Count of all observations. |
| histogram_bucket | Histogram family | Cumulative count per bucket. labels.le carries the upper bound. |
| summary_sum | Summary family | Sum of all observations. |
| summary_count | Summary family | Count of all observations. |
| summary_quantile | Summary family | Per-quantile value. labels.quantile carries the q. |
| untyped | Unknown / future kinds | Forward-compat catch-all. |
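Since histogram_bucket values are cumulative, the number of observations that fell into each individual bucket is the difference between adjacent buckets. A sketch of that conversion, assuming the input rows all share one timestamp and one label set apart from le; `per_bucket_counts` is a hypothetical helper.

```python
def per_bucket_counts(bucket_rows):
    """Convert cumulative histogram_bucket rows into per-bucket counts.

    Returns a list of (le, count) pairs ordered by upper bound.
    """
    # float() accepts "+Inf", so the overflow bucket sorts last.
    ordered = sorted(bucket_rows, key=lambda row: float(row["labels"]["le"]))
    counts = []
    previous = 0.0
    for row in ordered:
        counts.append((row["labels"]["le"], row["value"] - previous))
        previous = row["value"]
    return counts


if __name__ == "__main__":
    rows = [
        {"labels": {"le": "+Inf"}, "value": 6.0, "kind": "histogram_bucket"},
        {"labels": {"le": "0.1"}, "value": 2.0, "kind": "histogram_bucket"},
        {"labels": {"le": "0.5"}, "value": 5.0, "kind": "histogram_bucket"},
    ]
    print(per_bucket_counts(rows))
```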
Error responses¶
- 400 Bad Request: metric is missing, empty, oversized, or the time range is non-positive.
- 404 Not Found: persistence is disabled (metrics.persistence.enabled: false).
- 500 Internal Server Error: storage error; details in router logs.
- 503 Service Unavailable: metrics-persistence feature was not compiled in.
Storage layout¶
The SQLite schema is intentionally minimal, keeping the row shape portable to other storage engines:
CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,  -- Unix milliseconds
    metric TEXT    NOT NULL,  -- metric family name
    labels TEXT    NOT NULL,  -- canonical sorted-key JSON
    value  REAL    NOT NULL,  -- sample value (cumulative for buckets)
    kind   TEXT    NOT NULL   -- counter | gauge | histogram_* | summary_* | untyped
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
CREATE INDEX IF NOT EXISTS idx_metric_samples_ts
    ON metric_samples (ts);
The labels column stores label sets as canonical sorted-key JSON (e.g. {"backend":"openai","model":"gpt-4o"}). This makes (metric, labels) a deterministic equality key for ad-hoc joins.
PRAGMA journal_mode = WAL is set so the admin endpoint can read concurrently with the snapshot writer.
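The layout can be exercised with Python's built-in sqlite3 module. This is a sketch for ad-hoc experiments against a scratch database, not the router's writer: `open_store`, `canonical_labels`, `insert_sample`, and `query_range` are hypothetical names, and the range query mirrors the endpoint's inclusive from and exclusive to only by assumption.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metric_samples (
    ts     INTEGER NOT NULL,
    metric TEXT    NOT NULL,
    labels TEXT    NOT NULL,
    value  REAL    NOT NULL,
    kind   TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_metric_samples_metric_ts
    ON metric_samples (metric, ts);
CREATE INDEX IF NOT EXISTS idx_metric_samples_ts
    ON metric_samples (ts);
"""


def open_store(path):
    """Open (or create) a database with the schema above in WAL mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")  # no effect for ':memory:'
    conn.executescript(SCHEMA)
    return conn


def canonical_labels(labels):
    """Serialize a label set as compact JSON with sorted keys."""
    return json.dumps(labels, sort_keys=True, separators=(",", ":"))


def insert_sample(conn, ts_ms, metric, labels, value, kind):
    conn.execute(
        "INSERT INTO metric_samples (ts, metric, labels, value, kind) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts_ms, metric, canonical_labels(labels), value, kind),
    )


def query_range(conn, metric, from_ms, to_ms, limit=10_000):
    """Rows for one family with from_ms <= ts < to_ms, oldest first."""
    return conn.execute(
        "SELECT ts, labels, value, kind FROM metric_samples "
        "WHERE metric = ? AND ts >= ? AND ts < ? ORDER BY ts LIMIT ?",
        (metric, from_ms, to_ms, limit),
    ).fetchall()


if __name__ == "__main__":
    conn = open_store(":memory:")
    labels = {"model": "gpt-4o", "backend": "openai"}
    for ts, value in ((1000, 1.0), (2000, 2.0), (3000, 3.0)):
        insert_sample(conn, ts, "http_requests_total", labels, value, "counter")
    print(query_range(conn, "http_requests_total", 1000, 3000))
```

The WHERE clause filters on metric and then a ts range, which is the access pattern the (metric, ts) index is shaped for.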
Operational notes¶
- First-run permissions: ensure the process can create the parent directory of path. The router will log an error and continue serving /metrics without history if it cannot.
- Backup: with PRAGMA journal_mode = WAL, copy the .db, .db-wal, and .db-shm files together (or run sqlite3 metrics.db ".backup metrics-backup.db", with a destination file name of your choice) for a consistent snapshot.
- Inspection: sqlite3 ./data/metrics.db 'select count(*) from metric_samples;' is safe to run against a live router.
- Disabling at runtime: set metrics.persistence.enabled: false and hot-reload. The snapshot task is not torn down by hot-reload alone, so pair the change with a restart if you need to fully release the file handle.
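The backup step can also be scripted with SQLite's online backup API, which Python exposes as sqlite3.Connection.backup. A minimal sketch, assuming the script can read the database file; `backup_metrics_db` and both paths are placeholders.

```python
import sqlite3


def backup_metrics_db(src_path, dest_path):
    """Copy a SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies all pages in one pass; the result is a standalone .db file.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src_file = os.path.join(tmp, "metrics.db")
        dest_file = os.path.join(tmp, "metrics-backup.db")
        conn = sqlite3.connect(src_file)
        conn.execute("CREATE TABLE metric_samples (ts INTEGER NOT NULL)")
        conn.execute("INSERT INTO metric_samples VALUES (1)")
        conn.commit()
        conn.close()
        backup_metrics_db(src_file, dest_file)
        print(os.path.exists(dest_file))
```

Unlike copying the three files by hand, the backup API produces a single file that does not depend on the -wal and -shm companions.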
Related¶
- Metrics and Monitoring covers the live /metrics endpoint, Grafana/Prometheus integration, and the per-API-key token-usage metrics whose long-term analysis this persistence layer enables.
- The admin stats snapshot (admin.stats.persistence in the Admin API) separately persists aggregate /admin/stats counters across restarts; it stores a different shape and serves a different read path.