Guardrails¶

Guardrails are content-safety policies that inspect request input and model output, then allow, block, transform, or flag the content before it reaches the backend or the client. They give the router a single place to enforce moderation, prompt-injection defense, PII redaction, and custom allow/deny rules across every provider and every API surface (OpenAI chat, the /v1/responses bridge, and native Anthropic Messages).

Guardrails are off by default. When the guardrails block is absent or sets enabled: false, the router adds no behavior and no overhead. When guardrails are enabled, a request that matches no configured guardrail still flows through byte-for-byte unchanged.

Concepts¶

Verdicts¶

Every guardrail check returns one of four verdicts:

Verdict	Meaning	Effect in enforce mode
`Allow`	Content is permitted.	Proceed unchanged.
`Block`	Content violates policy. Carries a category, a confidence score in `[0.0, 1.0]`, and a reason.	The request/response is gated per `block_behavior`.
`Transform`	Content should be replaced (for example PII redaction). Carries the replacement text.	The router substitutes the sanitized text and continues.
`Flag`	Content is noted for observation but not blocked. Carries a category and score.	Recorded only; the request proceeds.

When several guardrails run at the same stage, the service aggregates their verdicts using a most-severe-wins rule: Allow (0) < Flag (1) < Transform (2) < Block (3).
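
The most-severe-wins rule can be illustrated with a minimal Python sketch. The `Severity`, `Verdict`, and `aggregate` names are hypothetical and are not the router's internal types; the tie-breaking choice (the first verdict at the highest severity wins) and the empty-stage result are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Ordering taken from the rule above: Allow < Flag < Transform < Block.
    ALLOW = 0
    FLAG = 1
    TRANSFORM = 2
    BLOCK = 3


@dataclass
class Verdict:
    severity: Severity
    provider: str
    category: Optional[str] = None
    score: Optional[float] = None


def aggregate(verdicts):
    """Return the most severe verdict of one stage.

    Assumption: an empty stage aggregates to Allow, and on a tie the
    first verdict with the highest severity is kept.
    """
    if not verdicts:
        return Verdict(Severity.ALLOW, provider="none")
    return max(verdicts, key=lambda v: v.severity)


stage = [
    Verdict(Severity.FLAG, "deny-list", "profanity", 0.4),
    Verdict(Severity.BLOCK, "openai-moderation", "violence", 0.9),
    Verdict(Severity.TRANSFORM, "pii-redaction", "pii"),
]
winner = aggregate(stage)
print(winner.severity.name, winner.provider)
```

Modeling severity as an ordered integer keeps the aggregation a single `max` call, whatever the number of providers at the stage.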

Categories¶

Verdicts carry a safety category drawn from a fixed taxonomy modeled on the MLCommons hazard categories and OpenAI-style moderation labels: violence, hate_speech, sexual_content, self_harm, harassment, dangerous, jailbreak, pii, and profanity. A provider-specific label that does not match a known category is preserved verbatim. Category labels are the keys you use in category_thresholds.

Input gating versus output gating¶

A guardrail runs at one or both stages:

Input (input): the inbound prompt or messages are inspected before the backend is called. A block short-circuits the request without ever dispatching to the model; a transform rewrites the prompt before dispatch.
Output (output): the model-generated text is inspected before it is returned to the client. A block replaces the response; a transform substitutes sanitized text.

When stages is omitted, a provider runs at both stages.

Lifecycle hooks¶

The router drives input-stage guardrails at two points so that classify-only providers add no serial latency:

pre-call: run before the backend dispatch. A blocking verdict here means the backend is never called. Use this for cheap local checks (deny lists, PII, prompt-injection screens) where you want to avoid spending a backend call on a request that will be blocked anyway.
during-call: run concurrently with the backend call via tokio::join!. The guardrail latency overlaps the model latency, so a remote classifier (OpenAI Moderation, a cloud guardrail) adds little wall-clock cost. If the verdict blocks, the in-flight backend response is discarded and the block response is returned instead.

Output-stage guardrails run post-call: after the backend returns, over the assistant's generated text, before the response (or the cached copy) is returned.
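
The during-call hook can be pictured with a small Python analogue. The router itself uses tokio::join!; `asyncio.gather` plays the same role here. The `call_backend` and `classify` coroutines are hypothetical stand-ins, and the marker word and `[blocked]` string are placeholders for illustration only.

```python
import asyncio


async def call_backend(prompt):
    # Stand-in for the model call (hypothetical latency).
    await asyncio.sleep(0.05)
    return f"echo: {prompt}"


async def classify(prompt):
    # Stand-in for a remote classifier; blocks on a hypothetical marker word.
    await asyncio.sleep(0.03)
    return "block" if "forbidden" in prompt else "allow"


async def during_call(prompt):
    # Run the guardrail and the backend concurrently, so the wall-clock cost
    # is roughly max(guardrail, backend) rather than their sum.
    verdict, response = await asyncio.gather(classify(prompt), call_backend(prompt))
    if verdict == "block":
        # The backend response that was in flight is discarded.
        return "[blocked]"
    return response


print(asyncio.run(during_call("hello")))
print(asyncio.run(during_call("a forbidden request")))
```

A pre-call check would instead await `classify` first and skip `call_backend` entirely on a block, which saves the backend call at the price of serial latency.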

Monitor versus enforce¶

The mode setting decides whether verdicts change request handling:

monitor (default): every verdict is computed, recorded in metrics, and written to the audit log, but never alters the request or response. This is the logging-only mode used to observe what a policy would do before turning it on.
enforce: blocking verdicts gate the request/response according to block_behavior. Enforce mode requires at least one configured provider when guardrails are enabled.

The recommended rollout is monitor first, then enforce. See Threshold tuning workflow.

Block behavior¶

In enforce mode, block_behavior selects how a blocked request is rendered:

`block_behavior`	OpenAI / Responses surface	Anthropic surface
`content_filter` (default)	A `chat.completion` whose `choices[0].finish_reason` is `content_filter` and whose assistant message carries a filtered-content placeholder.	A Messages object with a single refusal text block and `stop_reason: end_turn`.
`refusal_message`	Same shape as `content_filter`, with a canned refusal string as the message content.	Same Messages shape, with the refusal string as the text block.
`error`	An OpenAI error envelope: `{"error": {"type": "content_filter", "code": "content_filter", ...}}`.	An Anthropic error envelope: `{"type": "error", "error": {"type": "invalid_request_error", ...}}`.

Every blocked response also carries annotation headers: x-guardrail-action, x-guardrail-category, x-guardrail-score, and (when a single provider produced the verdict) x-guardrail-provider. Block responses are never cached.
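
On the client side, the annotation headers can be read with a few lines of Python. This is a sketch only: the `guardrail_annotation` helper is hypothetical, the header values in the example are made up, and it assumes the score header is a plain decimal string.

```python
def guardrail_annotation(headers):
    """Extract guardrail annotation headers from a response, if present.

    Returns None when the response carries no x-guardrail-action header.
    Header names are matched case-insensitively, as HTTP requires.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-guardrail-action" not in lowered:
        return None
    score = lowered.get("x-guardrail-score")
    return {
        "action": lowered["x-guardrail-action"],
        "category": lowered.get("x-guardrail-category"),
        # Assumption: the score is serialized as a decimal string.
        "score": float(score) if score is not None else None,
        # Present only when a single provider produced the verdict.
        "provider": lowered.get("x-guardrail-provider"),
    }


# Hypothetical header values, for illustration only.
example = guardrail_annotation({
    "Content-Type": "application/json",
    "X-Guardrail-Action": "block",
    "X-Guardrail-Category": "violence",
    "X-Guardrail-Score": "0.93",
})
print(example)
```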

Fail-open versus fail-closed¶

on_error decides what happens when a guardrail errors or exceeds its timeout:

fail_open (default): the request proceeds. Availability is favored over strictness; a moderation outage does not take down the router.
fail_closed: the request is blocked. Strictness is favored over availability.

The policy can be set globally and overridden per provider.

Timeouts¶

timeout_ms (default 2000) bounds each provider check. A provider can override it with its own timeout_ms. A check that exceeds its deadline is treated as an error and resolved per on_error.
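
How timeout_ms and on_error interact can be sketched in Python. The `run_check` helper and the three stand-in checks are hypothetical; the sketch only shows that a timeout is handled like any other provider error and resolved by the error policy.

```python
import asyncio


async def run_check(check, timeout_ms=2000, on_error="fail_open"):
    """Run one provider check under a deadline and resolve failures per on_error."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout_ms / 1000)
    except Exception:
        # A timeout and a provider error are resolved the same way.
        return "allow" if on_error == "fail_open" else "block"


async def slow_check():
    await asyncio.sleep(0.2)  # longer than the deadline used below
    return "allow"


async def failing_check():
    raise RuntimeError("moderation service unavailable")


async def healthy_check():
    return "flag"


print(asyncio.run(run_check(slow_check, timeout_ms=10)))
print(asyncio.run(run_check(slow_check, timeout_ms=10, on_error="fail_closed")))
print(asyncio.run(run_check(healthy_check)))
```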

Per-route policy¶

The routes map overrides the global policy per route or model name. Any field left unset inherits the global value. A route can switch mode (for example, enforce on a customer-facing model while the rest of the deployment stays in monitor), restrict to a subset of providers, set its own category_thresholds, and add route-specific allow/deny lists.
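
The inheritance rule can be sketched as a merge of a route override onto the global policy. The `effective_policy` helper is hypothetical. It assumes scalar fields and category_thresholds are replaced by the route value, while allow/deny entries are appended to the global lists; the router's actual merge may differ in detail.

```python
def effective_policy(global_policy, route_override):
    """Merge a per-route override onto the global policy.

    Fields missing from the override inherit the global value.
    Assumption: allow/deny lists are extended, everything else is replaced.
    """
    merged = dict(global_policy)
    for key, value in route_override.items():
        if key in ("allow", "deny"):
            base = global_policy.get(key, {})
            merged[key] = {
                kind: list(base.get(kind, [])) + list(value.get(kind, []))
                for kind in ("exact", "regex")
            }
        else:
            merged[key] = value
    return merged


global_policy = {
    "mode": "monitor",
    "timeout_ms": 2000,
    "deny": {"exact": ["badword"], "regex": []},
}
override = {"mode": "enforce", "deny": {"exact": ["forbidden-term"]}}
policy = effective_policy(global_policy, override)
print(policy["mode"], policy["timeout_ms"], policy["deny"]["exact"])
```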

Per-category thresholds¶

category_thresholds maps a category label to a score floor in [0.0, 1.0]. A provider reports a confidence score per category; a category blocks only when its score is at or above the threshold. A category with no threshold never blocks. Thresholds are set per provider and per route.
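
The threshold comparison is small enough to show directly. In this Python sketch, `blocking_categories` is a hypothetical helper and the scores are invented; the logic follows the rule above: a score at or above its floor blocks, and a category without a floor never does.

```python
def blocking_categories(scores, thresholds):
    """Return the categories whose score is at or above the configured floor."""
    return sorted(
        category
        for category, score in scores.items()
        if category in thresholds and score >= thresholds[category]
    )


thresholds = {"violence": 0.8, "hate_speech": 0.7}
scores = {"violence": 0.8, "hate_speech": 0.65, "profanity": 0.99}
print(blocking_categories(scores, thresholds))
```

Here violence blocks because it sits exactly on its floor, hate_speech stays below its floor, and profanity is ignored because no threshold is configured for it.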

Allow and deny lists¶

allow and deny are match lists, each with exact (literal strings) and regex (patterns validated to compile at config load) entries. They apply globally and can be extended per route. Use them for deterministic rules that do not need a model: a deny list to hard-block known forbidden terms, an allow list to exempt known-safe phrases.
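
A match list can be pictured with the following Python sketch. The helper names are hypothetical, and two assumptions are worth flagging: exact entries are treated as substring matches, and Python's `re` syntax stands in for the router's regex engine, which may differ in details. Precedence between allow and deny is not modeled.

```python
import re


def compile_match_list(exact=(), regex=()):
    """Compile a match list up front.

    re.compile raises re.error on an invalid pattern, which mirrors
    rejecting a bad regex at config load rather than at request time.
    """
    return {"exact": list(exact), "regex": [re.compile(p) for p in regex]}


def matches(match_list, text):
    # Assumption: an exact entry matches when it occurs anywhere in the text.
    if any(term in text for term in match_list["exact"]):
        return True
    return any(pattern.search(text) for pattern in match_list["regex"])


deny = compile_match_list(
    exact=["badword"],
    regex=[r"\bssn\b", r"\d{3}-\d{2}-\d{4}"],
)
print(matches(deny, "my number is 123-45-6789"))
print(matches(deny, "hello world"))
```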

Bypass allowlist¶

bypass_api_keys lists API keys that skip guardrails entirely. A request authenticated with a bypassed key runs no guardrail check at any stage. Use this sparingly, for trusted internal automation that must not be gated.

Providers¶

Five provider types ship with the router. Each is referenced by a stable name (used in route overrides) and a type. Credentials are always supplied by environment variable name (api_key_env), never inline.

OpenAI Moderation (`openai_moderation`)¶

Calls POST /v1/moderations with the free, multimodal omni-moderation-latest model and compares the returned per-category scores against your thresholds. The moderation model does not count against usage limits.

- name: openai-moderation
  type: openai_moderation
  enabled: true
  endpoint: "https://api.openai.com/v1/moderations"
  api_key_env: OPENAI_API_KEY
  stages: [input, output]
  category_thresholds:
    violence: 0.8
    hate_speech: 0.7
    sexual_content: 0.9
  timeout_ms: 1000
  on_error: fail_open

Because it is a remote call, run it during-call (the default input lifecycle) so its latency overlaps the backend.

Self-hosted classifier (`self_hosted_classifier` / `classifier`)¶

Runs an open guardrail model served as an ordinary backend (Ollama, vLLM, or any OpenAI-compatible chat or completion endpoint) and maps its verdict onto a category. The prompt never leaves your deployment. The call reuses the router's HTTP client and circuit breaker; it does not open a separate HTTP stack. The two type names self_hosted_classifier and classifier are equivalent.

The template option selects the model family and its prompt/parser:

`template`	Model family	License / notes
`granite_guardian` (default)	IBM Granite Guardian. Replies `Yes` / `No` with an optional `risk_name` dimension (`harm`, `social_bias`, `groundedness`, `jailbreak`, ...).	Apache-2.0. Recommended default.
`llama_guard`	Llama Guard 3 / 4. Replies `safe` / `unsafe` plus `S1`..`S14` hazard codes, mapped to router categories. A `categories` subset restricts which codes can block.	Gated license; Llama Guard 4 (12B) is GPU-heavy.
`shieldgemma`	Google ShieldGemma. Per-policy `Yes` / `No`.	Gemma license.

Serving a classifier model¶

Pull the guardrail model into a backend you already run. For example, with Ollama: ollama pull granite-guardian (or a Llama Guard / ShieldGemma image on vLLM).
Confirm the model answers on an OpenAI-compatible endpoint, for example http://127.0.0.1:11434/v1/chat/completions for Ollama.
Point the provider's endpoint at that URL and set template to the model family. Set model if you want a non-default model name.

- name: self-hosted-guard
  type: classifier
  enabled: true
  endpoint: "http://127.0.0.1:11434/v1/chat/completions"
  stages: [input, output]
  options:
    template: granite_guardian
    task: content          # `content` (default) or `injection` (input-stage jailbreak screen)
    model: "granite-guardian:5b"
    api_format: chat        # `chat` (default) or `completion`
    risk_name: harm         # Granite Guardian only
    # categories: ["S1", "S10", "S11"]   # Llama Guard only: restrict to these hazard codes
  category_thresholds:
    dangerous: 0.5

Llama Guard 4 is gated on Hugging Face and needs a GPU with enough memory for a 12B model; Granite Guardian is the lighter, permissively licensed default. Set task: injection for a lightweight prompt-injection / jailbreak screen intended for the input stage.
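
To make the llama_guard template concrete, here is a rough Python sketch of parsing a Llama Guard style reply. It assumes the commonly documented layout (a first line of `safe` or `unsafe`, followed by a comma-separated list of hazard codes); `parse_llama_guard` is a hypothetical helper, not the router's parser, and the mapping from codes to router categories is not shown.

```python
def parse_llama_guard(reply, categories=None):
    """Return the hazard codes that may block, or an empty list for `safe`.

    `categories` mirrors the provider option of the same name: when given,
    only codes in that subset are kept.
    """
    lines = [line.strip() for line in reply.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() not in ("safe", "unsafe"):
        # Treat an unparsable reply as an error so on_error can decide.
        raise ValueError("unrecognized classifier reply")
    if lines[0].lower() == "safe":
        return []
    codes = [code.strip() for code in ",".join(lines[1:]).split(",") if code.strip()]
    if categories is not None:
        codes = [code for code in codes if code in categories]
    return codes


print(parse_llama_guard("safe"))
print(parse_llama_guard("unsafe\nS1,S10"))
print(parse_llama_guard("unsafe\nS1,S10", categories={"S10", "S11"}))
```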

PII detection and redaction (`pii`)¶

Detects personally identifiable information and high-value secrets, then redacts them in place (a Transform verdict) or blocks the request. Built-in scanners run locally with no external dependency. An optional Microsoft Presidio-compatible analyzer can be added for richer NER-based PII; its spans are merged with the built-in findings. Raw detected values are never logged.

This provider is documented in full, with its options table and entity types, in Security and Admin → Guardrails: PII Detection and Redaction. A minimal example:

- name: pii-redaction
  type: pii
  enabled: true
  stages: [input, output]
  options:
    default_action: mask
    actions:
      email: mask
      ssn: block
      credit_card: block
      api_key: block
    placeholder_format: "<REDACTED:{TYPE}>"
  on_error: fail_open
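
The effect of a mask action with placeholder_format can be sketched in Python. This is not the built-in scanner: the email pattern is deliberately simple, `mask_emails` is a hypothetical helper, and the assumption that `{TYPE}` expands to the upper-cased entity name is made for illustration.

```python
import re

# Deliberately simple email pattern for the example; a real scanner is stricter.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text, placeholder_format="<REDACTED:{TYPE}>"):
    """Replace each detected email address with the formatted placeholder."""
    # Assumption: {TYPE} becomes the upper-cased entity type name.
    placeholder = placeholder_format.replace("{TYPE}", "EMAIL")
    return EMAIL.sub(placeholder, text)


print(mask_emails("contact alice@example.com now"))
```

The result corresponds to a Transform verdict: the sanitized text replaces the original and the request continues.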

AWS Bedrock Guardrails (`bedrock_guardrail`)¶

Calls the Bedrock ApplyGuardrail API, which evaluates content independently of any model invocation, so it works for any backend (including OpenAI, Gemini, and self-hosted). It covers content filters (including Prompt Attack), denied topics, and sensitive-information (PII) policies: a PII block becomes a Block verdict and a PII mask becomes a Transform that substitutes the redacted text. Requests are signed with AWS SigV4.

Cloud-side setup¶

1. Create a guardrail in the Bedrock console and note its identifier and version.
2. Supply configuration through environment variables (no account identifiers in the config file):
   - AWS_REGION: for example us-east-1.
   - CONTINUUM_BEDROCK_GUARDRAIL_ID: the guardrail identifier.
   - CONTINUUM_BEDROCK_GUARDRAIL_VERSION: version (default DRAFT).
3. Provide AWS credentials through the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN).

- name: bedrock-guardrail
  type: bedrock_guardrail
  enabled: true
  stages: [input, output]
  timeout_ms: 1500
  on_error: fail_open

endpoint is optional and only needed to override the derived regional URL (for example a private proxy).

Azure AI Content Safety / Prompt Shields (`azure_content_safety`)¶

Text analysis returns a 0-7 severity for Hate, Sexual, Violence, and SelfHarm, normalized to [0.0, 1.0] and compared against your thresholds. On the input stage the provider also runs Prompt Shields, which adds jailbreak and direct/indirect (cross-prompt) injection detection; a detection blocks under the jailbreak category. The output stage runs text analysis only.
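
As a worked example of the normalization, the Python sketch below assumes a simple linear mapping (severity divided by 7). The exact formula the router applies is not stated here, so treat both the helper name and the division as assumptions.

```python
def normalize_severity(severity):
    """Map an Azure severity in 0..7 onto [0.0, 1.0] (assumed linear)."""
    if not 0 <= severity <= 7:
        raise ValueError("severity must be between 0 and 7")
    return severity / 7


threshold = 0.7
for severity in (4, 5):
    score = normalize_severity(severity)
    print(severity, round(score, 3), score >= threshold)
```

Under that assumption, a threshold of 0.7 lets severity 4 (about 0.571) through and blocks from severity 5 (about 0.714) upward.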

Cloud-side setup¶

Create an Azure AI Content Safety resource.
Set endpoint to its base URL (https://<resource>.cognitiveservices.azure.com).
Store the subscription key in the environment variable named by api_key_env.

- name: azure-content-safety
  type: azure_content_safety
  enabled: true
  endpoint: "https://my-resource.cognitiveservices.azure.com"
  api_key_env: AZURE_CONTENT_SAFETY_KEY
  stages: [input, output]
  category_thresholds:
    violence: 0.7
    hate_speech: 0.7
    sexual_content: 0.7
    self_harm: 0.7
  timeout_ms: 1500
  on_error: fail_open

Streaming output gating¶

Streaming responses cannot be checked all at once, so streaming_mode selects how the output stage handles a streamed response. The choice is a tradeoff between time-to-first-token (TTFT) and how much unsafe text can reach the client.

`streaming_mode`	How it works	TTFT	Safety
`buffer_full` (default)	Buffer the whole streamed response, then run the output check once at end of stream.	Worst (the client sees nothing until the check passes).	Strongest: nothing unsafe is ever streamed.
`chunked`	Run incremental checks over a rolling window as the stream progresses. A violation cuts the stream and emits a content-filter terminal chunk.	Good.	Strong: a violation is caught mid-stream, though a small prefix may already have been seen.
`passthrough`	Stream chunks through with no output checking.	Best.	None on the streamed output (input gating still applies).

chunked is tuned by three fields, modeled on NeMo Guardrails:

streaming_chunk_size (default 200): characters of new text to accumulate before each incremental check.
streaming_context_size (default 50): trailing characters carried into each check so a violation spanning a chunk boundary is still seen.
streaming_stream_first (default false): when true, each window is emitted to the client before it is checked (lowest latency, a violating chunk can be partially seen); when false, a window is checked before it is released (safer, adds the check latency to each window).
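
The rolling-window idea behind chunked mode can be sketched in Python for the check-then-emit case (streaming_stream_first: false). The `gate_stream` generator, the `[content_filter]` marker, and the toy `check` function are hypothetical; the small sizes are chosen only so the boundary effect is visible.

```python
def gate_stream(chunks, check, chunk_size=200, context_size=50):
    """Yield windows of text that passed `check`; stop at the first violation."""
    context = ""  # trailing characters of text that was already released
    pending = ""  # new text that has not been checked yet
    for chunk in chunks:
        pending += chunk
        while len(pending) >= chunk_size:
            window, pending = pending[:chunk_size], pending[chunk_size:]
            if not check(context + window):
                yield "[content_filter]"  # stand-in for the terminal chunk
                return
            yield window
            context = (context + window)[-context_size:] if context_size else ""
    if pending:
        # End of stream: check whatever is left before releasing it.
        if not check(context + pending):
            yield "[content_filter]"
            return
        yield pending


def check(text):
    # Toy policy: the hypothetical word "bomb" is a violation.
    return "bomb" not in text


stream = ["aaaabo", "mbcccc"]  # the violating word straddles a window boundary
with_context = list(gate_stream(stream, check, chunk_size=5, context_size=3))
without_context = list(gate_stream(stream, check, chunk_size=5, context_size=0))
print(with_context)
print(without_context)
```

With three characters of trailing context the violation is caught when the second window is checked; with no context each window looks clean on its own and the text passes, which is why streaming_context_size exists.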

In monitor mode, streaming verdicts are computed and logged but never cut the stream.

Configuration¶

The canonical, fully commented reference is the guardrails: block in config.yaml.example. Below is a compact end-to-end example combining several providers, a per-route override, and tuning. Do not store secrets inline; reference them by environment variable name.

guardrails:
  enabled: true
  mode: monitor          # start in monitor; switch to enforce after observing metrics

  providers:
    - name: openai-moderation
      type: openai_moderation
      endpoint: "https://api.openai.com/v1/moderations"
      api_key_env: OPENAI_API_KEY
      stages: [input, output]
      category_thresholds:
        violence: 0.8
        hate_speech: 0.7

    - name: pii-redaction
      type: pii
      stages: [input, output]
      options:
        default_action: mask
        actions:
          ssn: block
          credit_card: block

  routes:
    "gpt-5.4":
      mode: enforce
      providers: ["openai-moderation", "pii-redaction"]
      category_thresholds:
        pii: 0.95
      deny:
        exact: ["forbidden-term"]
        regex: ['(?i)\bclassified\b']

  bypass_api_keys: []

  timeout_ms: 2000
  on_error: fail_open
  block_behavior: content_filter

  streaming_mode: buffer_full
  streaming_chunk_size: 200
  streaming_context_size: 50
  streaming_stream_first: false

  deny:
    exact: ["badword"]
    regex: ['\bssn\b', '\d{3}-\d{2}-\d{4}']

  audit:
    enabled: true
    log_level: info

Top-level fields¶

Field	Type	Default	Description
`enabled`	boolean	`false`	Master switch for the subsystem.
`mode`	string	`monitor`	`monitor` or `enforce`.
`providers`	array	`[]`	Provider definitions (see Providers).
`routes`	map	`{}`	Per-route overrides keyed by route/model name.
`bypass_api_keys`	array	`[]`	API keys that skip all guardrail checks.
`timeout_ms`	integer	`2000`	Global per-provider check timeout (must be positive).
`on_error`	string	`fail_open`	`fail_open` or `fail_closed`.
`block_behavior`	string	`content_filter`	`content_filter`, `refusal_message`, or `error`.
`streaming_mode`	string	`buffer_full`	`buffer_full`, `chunked`, or `passthrough`.
`streaming_chunk_size`	integer	`200`	`chunked`: characters per incremental check.
`streaming_context_size`	integer	`50`	`chunked`: trailing context per check.
`streaming_stream_first`	boolean	`false`	`chunked`: emit-then-check (`true`) or check-then-emit (`false`).
`allow`	match list	`{}`	Global allow list (`exact` + `regex`).
`deny`	match list	`{}`	Global deny list (`exact` + `regex`).
`audit`	object	enabled	Audit-log configuration (see Audit logging).

Provider fields¶

Field	Type	Default	Description
`name`	string	required	Stable provider name (unique; referenced by routes).
`type`	string	required	Provider implementation type.
`enabled`	boolean	`true`	Whether this provider runs.
`endpoint`	string	-	Provider HTTP endpoint, where applicable.
`api_key_env`	string	-	Name of the environment variable holding the credential.
`stages`	array	both	`input`, `output`, or both.
`category_thresholds`	map	`{}`	Per-category score floors in `[0.0, 1.0]`.
`timeout_ms`	integer	global	Per-provider timeout override.
`on_error`	string	global	Per-provider error policy override.
`options`	object	-	Provider-specific options (PII actions, classifier template, ...).

Configuration is validated at load and on hot-reload: provider names must be non-empty and unique, thresholds must fall in [0.0, 1.0], timeouts must be positive, every regex must compile, and enforce mode with guardrails enabled requires at least one provider.
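
The validation rules listed above can be approximated by a short pre-flight check, for example in a CI step before deploying a config. This Python sketch operates on an already parsed config dictionary; `validate_guardrails` is a hypothetical helper, it does not cover route overrides, and Python's `re` may accept or reject patterns differently from the router's regex engine.

```python
import re


def validate_guardrails(cfg):
    """Return a list of problems; an empty list means the checks passed."""
    errors = []
    providers = cfg.get("providers", [])
    names = [p.get("name", "") for p in providers]
    if any(not name for name in names):
        errors.append("provider names must be non-empty")
    if len(set(names)) != len(names):
        errors.append("provider names must be unique")

    timeouts = [cfg.get("timeout_ms", 2000)]
    timeouts += [p["timeout_ms"] for p in providers if "timeout_ms" in p]
    if any(t <= 0 for t in timeouts):
        errors.append("timeouts must be positive")

    for p in providers:
        for category, floor in p.get("category_thresholds", {}).items():
            if not 0.0 <= floor <= 1.0:
                errors.append(f"threshold for {category} must be in [0.0, 1.0]")

    for list_name in ("allow", "deny"):
        for pattern in cfg.get(list_name, {}).get("regex", []):
            try:
                re.compile(pattern)
            except re.error:
                errors.append(f"{list_name} regex does not compile: {pattern}")

    enforcing = cfg.get("enabled", False) and cfg.get("mode", "monitor") == "enforce"
    if enforcing and not providers:
        errors.append("enforce mode requires at least one provider")
    return errors


bad = {"enabled": True, "mode": "enforce", "providers": [],
       "timeout_ms": 0, "deny": {"regex": ["("]}}
for problem in validate_guardrails(bad):
    print(problem)
```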

Threshold tuning workflow¶

Roll out a policy without surprising your users:

Start in monitor mode. Set mode: monitor (globally or per route) with the providers and thresholds you intend to use. Verdicts are computed and recorded but never gate traffic.
Observe the metrics. Watch guardrail_verdicts_total{mode="monitor"} and guardrail_blocks_total to see what would have been blocked, broken down by stage, provider, and category. Check the audit log for the specific categories and scores.
Tune thresholds. Raise a threshold for a category that produces false positives; lower one that lets real violations through. Adjust per provider and per route.
Enforce. Switch the route (or the global default) to mode: enforce. The same thresholds now gate traffic. Keep monitoring guardrail_blocks_total and guardrail_fail_open_total / guardrail_fail_closed_total.

Because mode and thresholds are per-route, you can enforce on one model while keeping the rest in monitor, and you can change any of this at runtime through the admin API without a restart.

Operations¶

Admin runtime controls¶

The Admin API exposes the live guardrail policy and lets you change it without a restart. All endpoints require admin authentication (see Admin REST API).

Endpoint	Method	Description
`/admin/guardrails`	GET	View the effective guardrail policy.
`/admin/guardrails`	PATCH	Partially update top-level policy (mode, timeouts, fail policy, block behavior, lists).
`/admin/guardrails/providers/{name}`	PUT	Toggle or tune a single provider.
`/admin/guardrails/routes/{route}`	PUT	Create or replace a per-route override.
`/admin/guardrails/routes/{route}`	DELETE	Remove a per-route override (the route falls back to the global policy).
`/admin/guardrails/test`	POST	Dry-run sample text against the configured providers and return the verdicts.

Use /admin/guardrails/test to check a policy against representative prompts before enforcing it, and the routes endpoints to enforce on a single model first.
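
Enforcing on a single model through the routes endpoint might look like the following Python sketch, which only builds the request and does not send it. The base URL, the bearer-token scheme, and the JSON body shape (mirroring the YAML route fields) are all assumptions; check the Admin REST API reference for the real authentication and schema.

```python
import json
import urllib.parse
import urllib.request


def route_override_request(base_url, route, override, token):
    """Build a PUT request that creates or replaces a per-route override."""
    quoted = urllib.parse.quote(route, safe="")
    url = f"{base_url.rstrip('/')}/admin/guardrails/routes/{quoted}"
    return urllib.request.Request(
        url,
        data=json.dumps(override).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: admin authentication uses a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical address and token, for illustration only.
request = route_override_request(
    "http://127.0.0.1:8080", "gpt-5.4", {"mode": "enforce"}, token="example-token"
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```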

Metrics¶

When the metrics feature is enabled, every guardrail decision is exported as Prometheus series:

Metric	Type	Labels	Description
`guardrail_checks_total`	counter	`stage`, `provider`, `result`	Per-provider checks by stage and verdict result. `stage` is `input` / `output` / `streaming`; `result` is `allow` / `block` / `transform` / `flag`.
`guardrail_blocks_total`	counter	`stage`, `provider`, `category`	Block verdicts by stage, provider, and safety category.
`guardrail_check_duration_seconds`	histogram	`stage`, `provider`	Per-provider check latency.
`guardrail_errors_total`	counter	`provider`, `kind`	Provider errors; `kind` is `timeout` or `error`.
`guardrail_fail_open_total`	counter	`provider`	Provider failures resolved fail-open (allowed).
`guardrail_fail_closed_total`	counter	`provider`	Provider failures resolved fail-closed (blocked).
`guardrail_verdicts_total`	counter	`stage`, `mode`, `result`	The single aggregated verdict per request after applying mode semantics. `mode` is `monitor` / `enforce`, so monitor-mode verdicts are visible even though they never gate.

See Metrics and Monitoring → Guardrail Metrics for the full reference.

Audit logging¶

Every block, transform, or flag verdict is logged via structured tracing with the provider, stage, category, score, mode, and action. The audit log never carries raw prompt text, response text, or secrets: request metadata is redacted before logging, and only that verdict metadata is recorded. Audit logging is on by default and is configured under guardrails.audit:

Field	Type	Default	Description
`enabled`	boolean	`true`	Whether guardrail-decision audit logging is on.
`log_level`	string	`info`	Level at which audit events are emitted: `debug`, `info`, or `warn`.
