Guardrails¶
Guardrails are content-safety policies that inspect request input and model output, then allow, block, transform, or flag the content before it reaches the backend or the client. They give the router a single place to enforce moderation, prompt-injection defense, PII redaction, and custom allow/deny rules across every provider and every API surface (OpenAI chat, the /v1/responses bridge, and native Anthropic Messages).
Guardrails are off by default. When the guardrails block is absent or enabled: false, the router adds no behavior and no overhead. A request that matches no configured guardrail flows through byte-for-byte unchanged.
Concepts¶
Verdicts¶
Every guardrail check returns one of four verdicts:
| Verdict | Meaning | Effect in enforce mode |
|---|---|---|
Allow |
Content is permitted. | Proceed unchanged. |
Block |
Content violates policy. Carries a category, a confidence score in [0.0, 1.0], and a reason. |
The request/response is gated per block_behavior. |
Transform |
Content should be replaced (for example PII redaction). Carries the replacement text. | The router substitutes the sanitized text and continues. |
Flag |
Content is noted for observation but not blocked. Carries a category and score. | Recorded only; the request proceeds. |
When several guardrails run at the same stage, the service aggregates their verdicts using a most-severe-wins rule: Allow (0) < Flag (1) < Transform (2) < Block (3).
Categories¶
Verdicts carry a safety category drawn from a fixed taxonomy modeled on the MLCommons hazard categories and OpenAI-style moderation labels: violence, hate_speech, sexual_content, self_harm, harassment, dangerous, jailbreak, pii, and profanity. A provider-specific label that does not match a known category is preserved verbatim. Category labels are the keys you use in category_thresholds.
Input gating versus output gating¶
A guardrail runs at one or both stages:
- Input (
input): the inbound prompt or messages are inspected before the backend is called. A block short-circuits the request without ever dispatching to the model; a transform rewrites the prompt before dispatch. - Output (
output): the model-generated text is inspected before it is returned to the client. A block replaces the response; a transform substitutes sanitized text.
When stages is omitted, a provider runs at both stages.
Lifecycle hooks¶
The router drives input-stage guardrails at two points so that classify-only providers add no serial latency:
- pre-call: run before the backend dispatch. A blocking verdict here means the backend is never called. Use this for cheap local checks (deny lists, PII, prompt-injection screens) where you want to avoid spending a backend call on a request that will be blocked anyway.
- during-call: run concurrently with the backend call via
tokio::join!. The guardrail latency overlaps the model latency, so a remote classifier (OpenAI Moderation, a cloud guardrail) adds little wall-clock cost. If the verdict blocks, the in-flight backend response is discarded and the block response is returned instead.
Output-stage guardrails run post-call: after the backend returns, over the assistant's generated text, before the response (or the cached copy) is returned.
Monitor versus enforce¶
The mode setting decides whether verdicts change request handling:
- monitor (default): every verdict is computed, recorded in metrics, and written to the audit log, but never alters the request or response. This is the logging-only mode used to observe what a policy would do before turning it on.
- enforce: blocking verdicts gate the request/response according to
block_behavior. Enforce mode requires at least one configured provider when guardrails are enabled.
The recommended rollout is monitor first, then enforce. See Threshold tuning workflow.
Block behavior¶
In enforce mode, block_behavior selects how a blocked request is rendered:
block_behavior |
OpenAI / Responses surface | Anthropic surface |
|---|---|---|
content_filter (default) |
A chat.completion whose choices[0].finish_reason is content_filter and whose assistant message carries a filtered-content placeholder. |
A Messages object with a single refusal text block and stop_reason: end_turn. |
refusal_message |
Same shape as content_filter, with a canned refusal string as the message content. |
Same Messages shape, with the refusal string as the text block. |
error |
An OpenAI error envelope: {"error": {"type": "content_filter", "code": "content_filter", ...}}. |
An Anthropic error envelope: {"type": "error", "error": {"type": "invalid_request_error", ...}}. |
Every blocked response also carries annotation headers: x-guardrail-action, x-guardrail-category, x-guardrail-score, and (when a single provider produced the verdict) x-guardrail-provider. Block responses are never cached.
Fail-open versus fail-closed¶
on_error decides what happens when a guardrail errors or exceeds its timeout:
- fail_open (default): the request proceeds. Availability is favored over strictness; a moderation outage does not take down the router.
- fail_closed: the request is blocked. Strictness is favored over availability.
The policy can be set globally and overridden per provider.
Timeouts¶
timeout_ms (default 2000) bounds each provider check. A provider can override it with its own timeout_ms. A check that exceeds its deadline is treated as an error and resolved per on_error.
Per-route policy¶
The routes map overrides the global policy per route or model name. Any field left unset inherits the global value. A route can switch mode (for example, enforce on a customer-facing model while the rest of the deployment stays in monitor), restrict to a subset of providers, set its own category_thresholds, and add route-specific allow/deny lists.
Per-category thresholds¶
category_thresholds maps a category label to a score floor in [0.0, 1.0]. A provider reports a confidence score per category; a category blocks only when its score is at or above the threshold. A category with no threshold never blocks. Thresholds are set per provider and per route.
Allow and deny lists¶
allow and deny are match lists, each with exact (literal strings) and regex (patterns validated to compile at config load) entries. They apply globally and can be extended per route. Use them for deterministic rules that do not need a model: a deny list to hard-block known forbidden terms, an allow list to exempt known-safe phrases.
Bypass allowlist¶
bypass_api_keys lists API keys that skip guardrails entirely. A request authenticated with a bypassed key runs no guardrail check at any stage. Use this sparingly, for trusted internal automation that must not be gated.
Providers¶
Five provider types ship with the router. Each is referenced by a stable name (used in route overrides) and a type. Credentials are always supplied by environment variable name (api_key_env), never inline.
OpenAI Moderation (openai_moderation)¶
Calls POST /v1/moderations with the free, multimodal omni-moderation-latest model and maps the returned per-category scores against your thresholds. The moderation model does not count against usage limits.
- name: openai-moderation
type: openai_moderation
enabled: true
endpoint: "https://api.openai.com/v1/moderations"
api_key_env: OPENAI_API_KEY
stages: [input, output]
category_thresholds:
violence: 0.8
hate_speech: 0.7
sexual_content: 0.9
timeout_ms: 1000
on_error: fail_open
Because it is a remote call, run it during-call (the default input lifecycle) so its latency overlaps the backend.
Self-hosted classifier (self_hosted_classifier / classifier)¶
Runs an open guardrail model served as an ordinary backend (Ollama, vLLM, or any OpenAI-compatible chat or completion endpoint) and maps its verdict onto a category. The prompt never leaves your deployment. The call reuses the router's HTTP client and circuit breaker; it does not open a separate HTTP stack. The two type names self_hosted_classifier and classifier are equivalent.
The template option selects the model family and its prompt/parser:
template |
Model family | License / notes |
|---|---|---|
granite_guardian (default) |
IBM Granite Guardian. Replies Yes / No with an optional risk_name dimension (harm, social_bias, groundedness, jailbreak, ...). |
Apache-2.0. Recommended default. |
llama_guard |
Llama Guard 3 / 4. Replies safe / unsafe plus S1..S14 hazard codes, mapped to router categories. A categories subset restricts which codes can block. |
Gated license; Llama Guard 4 (12B) is GPU-heavy. |
shieldgemma |
Google ShieldGemma. Per-policy Yes / No. |
Gemma license. |
Serving a classifier model¶
- Pull the guardrail model into a backend you already run. For example, with Ollama:
ollama pull granite-guardian(or a Llama Guard / ShieldGemma image on vLLM). - Confirm the model answers on an OpenAI-compatible endpoint, for example
http://127.0.0.1:11434/v1/chat/completionsfor Ollama. - Point the provider's
endpointat that URL and settemplateto the model family. Setmodelif you want a non-default model name.
- name: self-hosted-guard
type: classifier
enabled: true
endpoint: "http://127.0.0.1:11434/v1/chat/completions"
stages: [input, output]
options:
template: granite_guardian
task: content # `content` (default) or `injection` (input-stage jailbreak screen)
model: "granite-guardian:5b"
api_format: chat # `chat` (default) or `completion`
risk_name: harm # Granite Guardian only
# categories: ["S1", "S10", "S11"] # Llama Guard only: restrict to these hazard codes
category_thresholds:
dangerous: 0.5
Llama Guard 4 is gated on Hugging Face and needs a GPU with enough memory for a 12B model; Granite Guardian is the lighter, permissively licensed default. Set task: injection for a lightweight prompt-injection / jailbreak screen intended for the input stage.
PII detection and redaction (pii)¶
Detects personally identifiable information and high-value secrets, then redacts them in place (a Transform verdict) or blocks the request. Built-in scanners run locally with no external dependency. An optional Microsoft Presidio-compatible analyzer can be added for richer NER-based PII; its spans are merged with the built-in findings. Raw detected values are never logged.
This provider is documented in full, with its options table and entity types, in Security and Admin → Guardrails: PII Detection and Redaction. A minimal example:
- name: pii-redaction
type: pii
enabled: true
stages: [input, output]
options:
default_action: mask
actions:
email: mask
ssn: block
credit_card: block
api_key: block
placeholder_format: "<REDACTED:{TYPE}>"
on_error: fail_open
AWS Bedrock Guardrails (bedrock_guardrail)¶
Calls the Bedrock ApplyGuardrail API, which evaluates content independently of any model invocation, so it works for any backend (including OpenAI, Gemini, and self-hosted). It covers content filters (including Prompt Attack), denied topics, and sensitive-information (PII) policies: a PII block becomes a Block verdict and a PII mask becomes a Transform that substitutes the redacted text. Requests are signed with AWS SigV4.
Cloud-side setup¶
- Create a guardrail in the Bedrock console and note its identifier and version.
-
Supply configuration through environment variables (no account identifiers in the config file):
AWS_REGION: for exampleus-east-1.CONTINUUM_BEDROCK_GUARDRAIL_ID: the guardrail identifier.CONTINUUM_BEDROCK_GUARDRAIL_VERSION: version (defaultDRAFT).
-
Provide AWS credentials through the standard environment variables (
AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY, and optionallyAWS_SESSION_TOKEN).
- name: bedrock-guardrail
type: bedrock_guardrail
enabled: true
stages: [input, output]
timeout_ms: 1500
on_error: fail_open
endpoint is optional and only needed to override the derived regional URL (for example a private proxy).
Azure AI Content Safety / Prompt Shields (azure_content_safety)¶
Text analysis returns a 0-7 severity for Hate, Sexual, Violence, and SelfHarm, normalized to [0.0, 1.0] and compared against your thresholds. On the input stage the provider also runs Prompt Shields, which adds jailbreak and direct/indirect (cross-prompt) injection detection; a detection blocks under the jailbreak category. The output stage runs text analysis only.
Cloud-side setup¶
- Create an Azure AI Content Safety resource.
- Set
endpointto its base URL (https://<resource>.cognitiveservices.azure.com). - Store the subscription key in the environment variable named by
api_key_env.
- name: azure-content-safety
type: azure_content_safety
enabled: true
endpoint: "https://my-resource.cognitiveservices.azure.com"
api_key_env: AZURE_CONTENT_SAFETY_KEY
stages: [input, output]
category_thresholds:
violence: 0.7
hate_speech: 0.7
sexual_content: 0.7
self_harm: 0.7
timeout_ms: 1500
on_error: fail_open
Streaming output gating¶
Streaming responses cannot be checked all at once, so streaming_mode selects how the output stage handles a streamed response. The choice is a tradeoff between time-to-first-token (TTFT) and how much unsafe text can reach the client.
streaming_mode |
How it works | TTFT | Safety |
|---|---|---|---|
buffer_full (default) |
Buffer the whole streamed response, then run the output check once at end of stream. | Worst (the client sees nothing until the check passes). | Strongest: nothing unsafe is ever streamed. |
chunked |
Run incremental checks over a rolling window as the stream progresses. A violation cuts the stream and emits a content-filter terminal chunk. | Good. | Strong: a violation is caught mid-stream, though a small prefix may already have been seen. |
passthrough |
Stream chunks through with no output checking. | Best. | None on the streamed output (input gating still applies). |
chunked is tuned by three fields, modeled on NeMo Guardrails:
streaming_chunk_size(default200): characters of new text to accumulate before each incremental check.streaming_context_size(default50): trailing characters carried into each check so a violation spanning a chunk boundary is still seen.streaming_stream_first(defaultfalse): whentrue, each window is emitted to the client before it is checked (lowest latency, a violating chunk can be partially seen); whenfalse, a window is checked before it is released (safer, adds the check latency to each window).
In monitor mode, streaming verdicts are computed and logged but never cut the stream.
Configuration¶
The canonical, fully commented reference is the guardrails: block in config.yaml.example. Below is a compact end-to-end example combining several providers, a per-route override, and tuning. Do not store secrets inline; reference them by environment variable name.
guardrails:
enabled: true
mode: monitor # start in monitor; switch to enforce after observing metrics
providers:
- name: openai-moderation
type: openai_moderation
endpoint: "https://api.openai.com/v1/moderations"
api_key_env: OPENAI_API_KEY
stages: [input, output]
category_thresholds:
violence: 0.8
hate_speech: 0.7
- name: pii-redaction
type: pii
stages: [input, output]
options:
default_action: mask
actions:
ssn: block
credit_card: block
routes:
"gpt-5.4":
mode: enforce
providers: ["openai-moderation", "pii-redaction"]
category_thresholds:
pii: 0.95
deny:
exact: ["forbidden-term"]
regex: ['(?i)\bclassified\b']
bypass_api_keys: []
timeout_ms: 2000
on_error: fail_open
block_behavior: content_filter
streaming_mode: buffer_full
streaming_chunk_size: 200
streaming_context_size: 50
streaming_stream_first: false
deny:
exact: ["badword"]
regex: ['\bssn\b', '\d{3}-\d{2}-\d{4}']
audit:
enabled: true
log_level: info
Top-level fields¶
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
boolean | false |
Master switch for the subsystem. |
mode |
string | monitor |
monitor or enforce. |
providers |
array | [] |
Provider definitions (see Providers). |
routes |
map | {} |
Per-route overrides keyed by route/model name. |
bypass_api_keys |
array | [] |
API keys that skip all guardrail checks. |
timeout_ms |
integer | 2000 |
Global per-provider check timeout (must be positive). |
on_error |
string | fail_open |
fail_open or fail_closed. |
block_behavior |
string | content_filter |
content_filter, refusal_message, or error. |
streaming_mode |
string | buffer_full |
buffer_full, chunked, or passthrough. |
streaming_chunk_size |
integer | 200 |
chunked: characters per incremental check. |
streaming_context_size |
integer | 50 |
chunked: trailing context per check. |
streaming_stream_first |
boolean | false |
chunked: emit-then-check (true) or check-then-emit (false). |
allow |
match list | {} |
Global allow list (exact + regex). |
deny |
match list | {} |
Global deny list (exact + regex). |
audit |
object | enabled | Audit-log configuration (see Audit logging). |
Provider fields¶
| Field | Type | Default | Description |
|---|---|---|---|
name |
string | required | Stable provider name (unique; referenced by routes). |
type |
string | required | Provider implementation type. |
enabled |
boolean | true |
Whether this provider runs. |
endpoint |
string | - | Provider HTTP endpoint, where applicable. |
api_key_env |
string | - | Name of the environment variable holding the credential. |
stages |
array | both | input, output, or both. |
category_thresholds |
map | {} |
Per-category score floors in [0.0, 1.0]. |
timeout_ms |
integer | global | Per-provider timeout override. |
on_error |
string | global | Per-provider error policy override. |
options |
object | - | Provider-specific options (PII actions, classifier template, ...). |
Configuration is validated at load and on hot-reload: provider names must be non-empty and unique, thresholds must fall in [0.0, 1.0], timeouts must be positive, every regex must compile, and enforce mode with guardrails enabled requires at least one provider.
Threshold tuning workflow¶
Roll out a policy without surprising your users:
- Start in monitor mode. Set
mode: monitor(globally or per route) with the providers and thresholds you intend to use. Verdicts are computed and recorded but never gate traffic. - Observe the metrics. Watch
guardrail_verdicts_total{mode="monitor"}andguardrail_blocks_totalto see what would have been blocked, broken down by stage, provider, and category. Check the audit log for the specific categories and scores. - Tune thresholds. Raise a threshold for a category that produces false positives; lower one that lets real violations through. Adjust per provider and per route.
- Enforce. Switch the route (or the global default) to
mode: enforce. The same thresholds now gate traffic. Keep monitoringguardrail_blocks_totalandguardrail_fail_open_total/guardrail_fail_closed_total.
Because mode and thresholds are per-route, you can enforce on one model while keeping the rest in monitor, and you can change any of this at runtime through the admin API without a restart.
Operations¶
Admin runtime controls¶
The Admin API exposes the live guardrail policy and lets you change it without a restart. All endpoints require admin authentication (see Admin REST API).
| Endpoint | Method | Description |
|---|---|---|
/admin/guardrails |
GET | View the effective guardrail policy. |
/admin/guardrails |
PATCH | Partially update top-level policy (mode, timeouts, fail policy, block behavior, lists). |
/admin/guardrails/providers/{name} |
PUT | Toggle or tune a single provider. |
/admin/guardrails/routes/{route} |
PUT | Create or replace a per-route override. |
/admin/guardrails/routes/{route} |
DELETE | Remove a per-route override (the route falls back to the global policy). |
/admin/guardrails/test |
POST | Dry-run sample text against the configured providers and return the verdicts. |
Use /admin/guardrails/test to check a policy against representative prompts before enforcing it, and the routes endpoints to enforce on a single model first.
Metrics¶
When the metrics feature is enabled, every guardrail decision is exported as Prometheus series:
| Metric | Type | Labels | Description |
|---|---|---|---|
guardrail_checks_total |
counter | stage, provider, result |
Per-provider checks by stage and verdict result. stage is input / output / streaming; result is allow / block / transform / flag. |
guardrail_blocks_total |
counter | stage, provider, category |
Block verdicts by stage, provider, and safety category. |
guardrail_check_duration_seconds |
histogram | stage, provider |
Per-provider check latency. |
guardrail_errors_total |
counter | provider, kind |
Provider errors; kind is timeout or error. |
guardrail_fail_open_total |
counter | provider |
Provider failures resolved fail-open (allowed). |
guardrail_fail_closed_total |
counter | provider |
Provider failures resolved fail-closed (blocked). |
guardrail_verdicts_total |
counter | stage, mode, result |
The single aggregated verdict per request after applying mode semantics. mode is monitor / enforce, so monitor-mode verdicts are visible even though they never gate. |
See Metrics and Monitoring → Guardrail Metrics for the full reference.
Audit logging¶
Every verdict (block, transform, or flag) is logged via structured tracing with the provider, stage, category, score, mode, and action. The audit log never carries raw prompt or response text or secrets: request metadata is redacted before logging, and only category, score, stage, and mode metadata is recorded. Audit logging is on by default and configured under guardrails.audit:
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
boolean | true |
Whether guardrail-decision audit logging is on. |
log_level |
string | info |
Level at which audit events are emitted: debug, info, or warn. |
See also¶
- Security and Admin: API keys, the PII provider reference, and admin authentication.
- Architecture: where the guardrail layer sits in the request lifecycle.
- Metrics and Monitoring: the
guardrail_*metric series. - Admin REST API: the admin endpoint reference.