Guardrails

Guardrails are content-safety policies that inspect request input and model output, then allow, block, transform, or flag the content before it reaches the backend or the client. They give the router a single place to enforce moderation, prompt-injection defense, PII redaction, and custom allow/deny rules across every provider and every API surface (OpenAI chat, the /v1/responses bridge, and native Anthropic Messages).

Guardrails are off by default. When the guardrails block is absent or enabled: false, the router adds no behavior and no overhead. A request that matches no configured guardrail flows through byte-for-byte unchanged.

Concepts

Verdicts

Every guardrail check returns one of four verdicts:

| Verdict | Meaning | Effect in enforce mode |
| --- | --- | --- |
| Allow | Content is permitted. | Proceed unchanged. |
| Block | Content violates policy. Carries a category, a confidence score in [0.0, 1.0], and a reason. | The request/response is gated per block_behavior. |
| Transform | Content should be replaced (for example PII redaction). Carries the replacement text. | The router substitutes the sanitized text and continues. |
| Flag | Content is noted for observation but not blocked. Carries a category and score. | Recorded only; the request proceeds. |

When several guardrails run at the same stage, the service aggregates their verdicts using a most-severe-wins rule: Allow (0) < Flag (1) < Transform (2) < Block (3).
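
The most-severe-wins rule can be sketched in a few lines. This is an illustration only: the `Action` and `Verdict` names are hypothetical, and the tie-breaking behavior (the first verdict wins among equals) is an assumption that the text above does not specify.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Action(IntEnum):
    # Severity order from the rule above: the higher value wins.
    ALLOW = 0
    FLAG = 1
    TRANSFORM = 2
    BLOCK = 3


@dataclass
class Verdict:
    action: Action
    category: Optional[str] = None
    score: Optional[float] = None


def aggregate(verdicts: List[Verdict]) -> Verdict:
    """Return the most severe verdict; an empty list is treated as Allow."""
    if not verdicts:
        return Verdict(Action.ALLOW)
    # max() keeps the first maximal item, so ties resolve to the earliest
    # verdict (an assumption, not documented behavior).
    return max(verdicts, key=lambda v: v.action)
```

For example, a stage that yields one Flag, one Block, and one Allow aggregates to the Block verdict.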

Categories

Verdicts carry a safety category drawn from a fixed taxonomy modeled on the MLCommons hazard categories and OpenAI-style moderation labels: violence, hate_speech, sexual_content, self_harm, harassment, dangerous, jailbreak, pii, and profanity. A provider-specific label that does not match a known category is preserved verbatim. Category labels are the keys you use in category_thresholds.

Input gating versus output gating

A guardrail runs at one or both stages:

  • Input (input): the inbound prompt or messages are inspected before the backend is called. A block short-circuits the request without ever dispatching to the model; a transform rewrites the prompt before dispatch.
  • Output (output): the model-generated text is inspected before it is returned to the client. A block replaces the response; a transform substitutes sanitized text.

When stages is omitted, a provider runs at both stages.

Lifecycle hooks

The router drives input-stage guardrails at two points so that classify-only providers add no serial latency:

  • pre-call: run before the backend dispatch. A blocking verdict here means the backend is never called. Use this for cheap local checks (deny lists, PII, prompt-injection screens) where you want to avoid spending a backend call on a request that will be blocked anyway.
  • during-call: run concurrently with the backend call via tokio::join!. The guardrail latency overlaps the model latency, so a remote classifier (OpenAI Moderation, a cloud guardrail) adds little wall-clock cost. If the verdict blocks, the in-flight backend response is discarded and the block response is returned instead.

Output-stage guardrails run post-call: after the backend returns, over the assistant's generated text, before the response (or the cached copy) is returned.
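
To illustrate why a during-call check adds little wall-clock latency, here is a minimal Python analogue of the concurrent dispatch, using `asyncio.gather` in place of `tokio::join!`. The `call_backend` and `classify` functions are stand-ins with made-up delays, and the sketch assumes enforce mode.

```python
import asyncio


async def call_backend(prompt: str) -> str:
    # Stand-in for the model call.
    await asyncio.sleep(0.05)
    return f"echo: {prompt}"


async def classify(prompt: str) -> str:
    # Stand-in for a remote classifier; returns "allow" or "block".
    await asyncio.sleep(0.03)
    return "block" if "forbidden" in prompt else "allow"


async def during_call(prompt: str) -> str:
    # Both coroutines run concurrently, so the elapsed time is roughly the
    # slower of the two rather than their sum.
    response, verdict = await asyncio.gather(call_backend(prompt), classify(prompt))
    if verdict == "block":
        # The backend response is discarded and a block response is returned.
        return "[blocked]"
    return response
```

A pre-call check would instead await `classify` first and skip `call_backend` entirely on a block, trading added serial latency for a saved backend call.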

Monitor versus enforce

The mode setting decides whether verdicts change request handling:

  • monitor (default): every verdict is computed, recorded in metrics, and written to the audit log, but never alters the request or response. This is the logging-only mode used to observe what a policy would do before turning it on.
  • enforce: blocking verdicts gate the request/response according to block_behavior. Enforce mode requires at least one configured provider when guardrails are enabled.

The recommended rollout is monitor first, then enforce. See Threshold tuning workflow.

Block behavior

In enforce mode, block_behavior selects how a blocked request is rendered:

| block_behavior | OpenAI / Responses surface | Anthropic surface |
| --- | --- | --- |
| content_filter (default) | A chat.completion whose choices[0].finish_reason is content_filter and whose assistant message carries a filtered-content placeholder. | A Messages object with a single refusal text block and stop_reason: end_turn. |
| refusal_message | Same shape as content_filter, with a canned refusal string as the message content. | Same Messages shape, with the refusal string as the text block. |
| error | An OpenAI error envelope: {"error": {"type": "content_filter", "code": "content_filter", ...}}. | An Anthropic error envelope: {"type": "error", "error": {"type": "invalid_request_error", ...}}. |

Every blocked response also carries annotation headers: x-guardrail-action, x-guardrail-category, x-guardrail-score, and (when a single provider produced the verdict) x-guardrail-provider. Block responses are never cached.
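
On the client side, the annotation headers make a block easy to detect. The helper below is a hypothetical sketch for the OpenAI surface: the function name is invented, and it assumes that the presence of `x-guardrail-action` is a sufficient signal and that the score header parses as a float.

```python
from typing import Optional


def guardrail_block_info(headers: dict, body: dict) -> Optional[dict]:
    """Return annotation details if the response carries guardrail headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    if "x-guardrail-action" not in h:
        return None
    choices = body.get("choices") or []
    finish = choices[0].get("finish_reason") if choices else None
    return {
        "action": h["x-guardrail-action"],
        "category": h.get("x-guardrail-category"),
        "score": float(h["x-guardrail-score"]) if "x-guardrail-score" in h else None,
        # Present only when a single provider produced the verdict.
        "provider": h.get("x-guardrail-provider"),
        "finish_reason": finish,
        # True for block_behavior: error, where the body is an error envelope.
        "is_error": "error" in body,
    }
```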

Fail-open versus fail-closed

on_error decides what happens when a guardrail errors or exceeds its timeout:

  • fail_open (default): the request proceeds. Availability is favored over strictness; a moderation outage does not take down the router.
  • fail_closed: the request is blocked. Strictness is favored over availability.

The policy can be set globally and overridden per provider.

Timeouts

timeout_ms (default 2000) bounds each provider check. A provider can override it with its own timeout_ms. A check that exceeds its deadline is treated as an error and resolved per on_error.
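
The interaction between `timeout_ms` and `on_error` can be sketched as follows. The `run_check` helper is hypothetical; it only models the rule that a timed-out or failed check resolves to allow under fail_open and to block under fail_closed.

```python
import asyncio


async def run_check(check, timeout_ms: int = 2000, on_error: str = "fail_open") -> str:
    """Run one provider check under a deadline; return "allow" or "block"."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout_ms / 1000)
    except Exception:
        # A timeout is treated like any other provider error and is
        # resolved by the error policy.
        return "allow" if on_error == "fail_open" else "block"
```

With a 10 ms deadline, a check that takes 200 ms resolves to `"allow"` under the default policy and to `"block"` under `fail_closed`.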

Per-route policy

The routes map overrides the global policy per route or model name. Any field left unset inherits the global value. A route can switch mode (for example, enforce on a customer-facing model while the rest of the deployment stays in monitor), restrict to a subset of providers, set its own category_thresholds, and add route-specific allow/deny lists.
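
Inheritance for scalar fields can be pictured as a simple overlay. This sketch uses a hypothetical `effective_policy` helper and covers only fields that replace the global value (such as mode or timeout_ms); how list-valued fields like allow and deny are extended is not modeled here.

```python
def effective_policy(global_policy: dict, routes: dict, route: str) -> dict:
    """Overlay a route override on the global policy; unset fields inherit."""
    override = routes.get(route, {})
    merged = dict(global_policy)
    # Only fields the route actually sets replace the global value.
    merged.update({k: v for k, v in override.items() if v is not None})
    return merged
```

Given a global policy in monitor mode and a route override of `{"mode": "enforce"}`, the route enforces while every other field, and every other route, keeps the global setting.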

Per-category thresholds

category_thresholds maps a category label to a score floor in [0.0, 1.0]. A provider reports a confidence score per category; a category blocks only when its score is at or above the threshold. A category with no threshold never blocks. Thresholds are set per provider and per route.
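
As a worked example of the threshold rule, the hypothetical helper below returns the categories that would block for a given set of provider scores. Note that the comparison is inclusive and that a category without a threshold is ignored.

```python
def blocking_categories(scores: dict, thresholds: dict) -> list:
    """Return the categories whose score is at or above their threshold."""
    return sorted(
        category
        for category, score in scores.items()
        # A category with no configured threshold never blocks.
        if category in thresholds and score >= thresholds[category]
    )
```

With thresholds `{"violence": 0.8, "hate_speech": 0.7}` and scores `{"violence": 0.8, "hate_speech": 0.69, "self_harm": 0.99}`, only `violence` blocks: `hate_speech` is just under its floor and `self_harm` has no threshold.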

Allow and deny lists

allow and deny are match lists, each with exact (literal strings) and regex (patterns validated to compile at config load) entries. They apply globally and can be extended per route. Use them for deterministic rules that do not need a model: a deny list to hard-block known forbidden terms, an allow list to exempt known-safe phrases.
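
A match list can be modeled as below. The `MatchList` class is a hypothetical sketch: it assumes that exact entries match as case-sensitive literal substrings, which the text above does not specify, and it uses Python's `re` module, whose syntax differs in places from the regex engine the router itself uses.

```python
import re


class MatchList:
    def __init__(self, exact=(), regex=()):
        self.exact = list(exact)
        # re.compile raises re.error on a bad pattern, mirroring the idea
        # of rejecting invalid patterns when the config is loaded.
        self.patterns = [re.compile(p) for p in regex]

    def matches(self, text: str) -> bool:
        # Assumption: exact entries match as literal substrings.
        if any(term in text for term in self.exact):
            return True
        return any(p.search(text) for p in self.patterns)
```

Built from `exact=["forbidden-term"]` and `regex=[r"(?i)\bclassified\b"]`, the list matches "this is CLASSIFIED material" but not "declassified", because the word boundary prevents a match inside a longer word.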

Bypass allowlist

bypass_api_keys lists API keys that skip guardrails entirely. A request authenticated with a bypassed key runs no guardrail check at any stage. Use this sparingly, for trusted internal automation that must not be gated.

Providers

Five provider types ship with the router. Each is referenced by a stable name (used in route overrides) and a type. Credentials are always supplied by environment variable name (api_key_env), never inline.

OpenAI Moderation (openai_moderation)

Calls POST /v1/moderations with the free, multimodal omni-moderation-latest model and maps the returned per-category scores against your thresholds. The moderation model does not count against usage limits.

- name: openai-moderation
  type: openai_moderation
  enabled: true
  endpoint: "https://api.openai.com/v1/moderations"
  api_key_env: OPENAI_API_KEY
  stages: [input, output]
  category_thresholds:
    violence: 0.8
    hate_speech: 0.7
    sexual_content: 0.9
  timeout_ms: 1000
  on_error: fail_open

Because it is a remote call, run it during-call (the default input lifecycle) so its latency overlaps the backend.

Self-hosted classifier (self_hosted_classifier / classifier)

Runs an open guardrail model served as an ordinary backend (Ollama, vLLM, or any OpenAI-compatible chat or completion endpoint) and maps its verdict onto a category. The prompt never leaves your deployment. The call reuses the router's HTTP client and circuit breaker; it does not open a separate HTTP stack. The two type names self_hosted_classifier and classifier are equivalent.

The template option selects the model family and its prompt/parser:

| template | Model family | License / notes |
| --- | --- | --- |
| granite_guardian (default) | IBM Granite Guardian. Replies Yes / No with an optional risk_name dimension (harm, social_bias, groundedness, jailbreak, ...). | Apache-2.0. Recommended default. |
| llama_guard | Llama Guard 3 / 4. Replies safe / unsafe plus S1..S14 hazard codes, mapped to router categories. A categories subset restricts which codes can block. | Gated license; Llama Guard 4 (12B) is GPU-heavy. |
| shieldgemma | Google ShieldGemma. Per-policy Yes / No. | Gemma license. |
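
To make the llama_guard row concrete, here is a minimal parser for the reply format, where the first line is `safe` or `unsafe` and an optional second line lists comma-separated hazard codes. The function name is hypothetical, the mapping from codes to router categories is left out, and raising on an unrecognized reply is an assumption about how a malformed answer would be handled.

```python
def parse_llama_guard(reply: str, categories=None):
    """Parse a Llama Guard style reply into (unsafe, hazard_codes)."""
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() not in ("safe", "unsafe"):
        # Assumption: an unparseable reply is surfaced as a provider error.
        raise ValueError(f"unrecognized classifier reply: {reply!r}")
    if lines[0].lower() == "safe":
        return False, []
    codes = []
    if len(lines) > 1:
        codes = [c.strip() for c in lines[1].split(",") if c.strip()]
    if categories is not None:
        # Only codes in the configured subset may block.
        codes = [c for c in codes if c in categories]
        return bool(codes), codes
    return True, codes
```

With `categories=["S1", "S10", "S11"]`, a reply of `unsafe` followed by `S2` does not block, because `S2` is outside the configured subset.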

Serving a classifier model

  1. Pull the guardrail model into a backend you already run. For example, with Ollama: ollama pull granite-guardian (or a Llama Guard / ShieldGemma image on vLLM).
  2. Confirm the model answers on an OpenAI-compatible endpoint, for example http://127.0.0.1:11434/v1/chat/completions for Ollama.
  3. Point the provider's endpoint at that URL and set template to the model family. Set model if you want a non-default model name.

- name: self-hosted-guard
  type: classifier
  enabled: true
  endpoint: "http://127.0.0.1:11434/v1/chat/completions"
  stages: [input, output]
  options:
    template: granite_guardian
    task: content          # `content` (default) or `injection` (input-stage jailbreak screen)
    model: "granite-guardian:5b"
    api_format: chat        # `chat` (default) or `completion`
    risk_name: harm         # Granite Guardian only
    # categories: ["S1", "S10", "S11"]   # Llama Guard only: restrict to these hazard codes
  category_thresholds:
    dangerous: 0.5

Llama Guard 4 is gated on Hugging Face and needs a GPU with enough memory for a 12B model; Granite Guardian is the lighter, permissively licensed default. Set task: injection for a lightweight prompt-injection / jailbreak screen intended for the input stage.

PII detection and redaction (pii)

Detects personally identifiable information and high-value secrets, then redacts them in place (a Transform verdict) or blocks the request. Built-in scanners run locally with no external dependency. An optional Microsoft Presidio-compatible analyzer can be added for richer NER-based PII; its spans are merged with the built-in findings. Raw detected values are never logged.

This provider is documented in full, with its options table and entity types, in Security and Admin → Guardrails: PII Detection and Redaction. A minimal example:

- name: pii-redaction
  type: pii
  enabled: true
  stages: [input, output]
  options:
    default_action: mask
    actions:
      email: mask
      ssn: block
      credit_card: block
      api_key: block
    placeholder_format: "<REDACTED:{TYPE}>"
  on_error: fail_open
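
As a rough illustration of the mask action and placeholder_format, the sketch below redacts email addresses in place. It is not the router's scanner: the regular expression is deliberately simple, and the uppercase `EMAIL` value substituted for `{TYPE}` is an assumption.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder_format: str = "<REDACTED:{TYPE}>") -> str:
    """Replace each detected email address with the formatted placeholder."""
    placeholder = placeholder_format.format(TYPE="EMAIL")
    # A function replacement avoids any backslash interpretation in the
    # placeholder string.
    return EMAIL.sub(lambda _match: placeholder, text)
```

For example, `mask_emails("mail alice@example.com now")` returns `"mail <REDACTED:EMAIL> now"`, which is the text a Transform verdict would carry.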

AWS Bedrock Guardrails (bedrock_guardrail)

Calls the Bedrock ApplyGuardrail API, which evaluates content independently of any model invocation, so it works for any backend (including OpenAI, Gemini, and self-hosted). It covers content filters (including Prompt Attack), denied topics, and sensitive-information (PII) policies: a PII block becomes a Block verdict and a PII mask becomes a Transform that substitutes the redacted text. Requests are signed with AWS SigV4.

Cloud-side setup

  1. Create a guardrail in the Bedrock console and note its identifier and version.
  2. Supply configuration through environment variables (no account identifiers in the config file):

    • AWS_REGION: for example us-east-1.
    • CONTINUUM_BEDROCK_GUARDRAIL_ID: the guardrail identifier.
    • CONTINUUM_BEDROCK_GUARDRAIL_VERSION: version (default DRAFT).
  3. Provide AWS credentials through the standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN).

- name: bedrock-guardrail
  type: bedrock_guardrail
  enabled: true
  stages: [input, output]
  timeout_ms: 1500
  on_error: fail_open

endpoint is optional and only needed to override the derived regional URL (for example a private proxy).

Azure AI Content Safety / Prompt Shields (azure_content_safety)

Text analysis returns a 0-7 severity for Hate, Sexual, Violence, and SelfHarm, normalized to [0.0, 1.0] and compared against your thresholds. On the input stage the provider also runs Prompt Shields, which adds jailbreak and direct/indirect (cross-prompt) injection detection; a detection blocks under the jailbreak category. The output stage runs text analysis only.
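
The normalization step might look like the sketch below. Two things here are assumptions: the linear mapping `severity / 7`, since the exact formula is not stated above, and the mapping from Azure category names to router labels. The input mirrors the list of `category` / `severity` items in the text-analysis response.

```python
# Assumed mapping from Azure category names to the router's labels.
AZURE_TO_ROUTER = {
    "Hate": "hate_speech",
    "Sexual": "sexual_content",
    "Violence": "violence",
    "SelfHarm": "self_harm",
}
MAX_SEVERITY = 7


def normalize_severities(categories_analysis: list) -> dict:
    """Map 0-7 severities onto [0.0, 1.0] scores keyed by router category."""
    return {
        # Assumption: linear scaling; the real normalization may differ.
        AZURE_TO_ROUTER[item["category"]]: item["severity"] / MAX_SEVERITY
        for item in categories_analysis
        if item["category"] in AZURE_TO_ROUTER
    }
```

Under this linear assumption, a 0.7 threshold blocks severity 5 and above (5/7 is about 0.71), while severity 4 (about 0.57) passes.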

Cloud-side setup

  1. Create an Azure AI Content Safety resource.
  2. Set endpoint to its base URL (https://<resource>.cognitiveservices.azure.com).
  3. Store the subscription key in the environment variable named by api_key_env.

- name: azure-content-safety
  type: azure_content_safety
  enabled: true
  endpoint: "https://my-resource.cognitiveservices.azure.com"
  api_key_env: AZURE_CONTENT_SAFETY_KEY
  stages: [input, output]
  category_thresholds:
    violence: 0.7
    hate_speech: 0.7
    sexual_content: 0.7
    self_harm: 0.7
  timeout_ms: 1500
  on_error: fail_open

Streaming output gating

Streaming responses cannot be checked all at once, so streaming_mode selects how the output stage handles a streamed response. The choice is a tradeoff between time-to-first-token (TTFT) and how much unsafe text can reach the client.

| streaming_mode | How it works | TTFT | Safety |
| --- | --- | --- | --- |
| buffer_full (default) | Buffer the whole streamed response, then run the output check once at end of stream. | Worst (the client sees nothing until the check passes). | Strongest: nothing unsafe is ever streamed. |
| chunked | Run incremental checks over a rolling window as the stream progresses. A violation cuts the stream and emits a content-filter terminal chunk. | Good. | Strong: a violation is caught mid-stream, though a small prefix may already have been seen. |
| passthrough | Stream chunks through with no output checking. | Best. | None on the streamed output (input gating still applies). |

chunked is tuned by three fields, modeled on NeMo Guardrails:

  • streaming_chunk_size (default 200): characters of new text to accumulate before each incremental check.
  • streaming_context_size (default 50): trailing characters carried into each check so a violation spanning a chunk boundary is still seen.
  • streaming_stream_first (default false): when true, each window is emitted to the client before it is checked (lowest latency, a violating chunk can be partially seen); when false, a window is checked before it is released (safer, adds the check latency to each window).

In monitor mode, streaming verdicts are computed and logged but never cut the stream.
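
The chunked mode in its check-then-emit form (streaming_stream_first: false) can be sketched as a generator over a rolling window. This is a simplified model in enforce mode, not the router's implementation: `gate_stream` and `is_violation` are hypothetical names, and the terminal chunk is represented by a plain marker string.

```python
from typing import Callable, Iterable, Iterator


def gate_stream(
    chunks: Iterable[str],
    is_violation: Callable[[str], bool],
    chunk_size: int = 200,
    context_size: int = 50,
) -> Iterator[str]:
    """Release text window by window, checking each window before emitting it."""
    context = ""  # trailing characters of text that was already released
    pending = ""  # new text that has not been checked yet
    for chunk in chunks:
        pending += chunk
        while len(pending) >= chunk_size:
            window, pending = pending[:chunk_size], pending[chunk_size:]
            # The carried context lets a match that spans a boundary be seen.
            if is_violation(context + window):
                yield "[content_filter]"  # stand-in for the terminal chunk
                return
            yield window
            context = (context + window)[-context_size:] if context_size else ""
    # Check whatever remains at end of stream.
    if pending:
        if is_violation(context + pending):
            yield "[content_filter]"
            return
        yield pending
```

With a window of 6 characters and 6 characters of context, a stream spelling "hello world badword here" is cut once the window completing "badword" arrives, after the prefix "hello world badwor" has been released. That is the small already-seen prefix the table above refers to; with too little context, a term split across windows can slip through entirely.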

Configuration

The canonical, fully commented reference is the guardrails: block in config.yaml.example. Below is a compact end-to-end example combining several providers, a per-route override, and tuning. Do not store secrets inline; reference them by environment variable name.

guardrails:
  enabled: true
  mode: monitor          # start in monitor; switch to enforce after observing metrics

  providers:
    - name: openai-moderation
      type: openai_moderation
      endpoint: "https://api.openai.com/v1/moderations"
      api_key_env: OPENAI_API_KEY
      stages: [input, output]
      category_thresholds:
        violence: 0.8
        hate_speech: 0.7

    - name: pii-redaction
      type: pii
      stages: [input, output]
      options:
        default_action: mask
        actions:
          ssn: block
          credit_card: block

  routes:
    "gpt-5.4":
      mode: enforce
      providers: ["openai-moderation", "pii-redaction"]
      category_thresholds:
        pii: 0.95
      deny:
        exact: ["forbidden-term"]
        regex: ['(?i)\bclassified\b']

  bypass_api_keys: []

  timeout_ms: 2000
  on_error: fail_open
  block_behavior: content_filter

  streaming_mode: buffer_full
  streaming_chunk_size: 200
  streaming_context_size: 50
  streaming_stream_first: false

  deny:
    exact: ["badword"]
    regex: ['\bssn\b', '\d{3}-\d{2}-\d{4}']

  audit:
    enabled: true
    log_level: info

Top-level fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Master switch for the subsystem. |
| mode | string | monitor | monitor or enforce. |
| providers | array | [] | Provider definitions (see Providers). |
| routes | map | {} | Per-route overrides keyed by route/model name. |
| bypass_api_keys | array | [] | API keys that skip all guardrail checks. |
| timeout_ms | integer | 2000 | Global per-provider check timeout (must be positive). |
| on_error | string | fail_open | fail_open or fail_closed. |
| block_behavior | string | content_filter | content_filter, refusal_message, or error. |
| streaming_mode | string | buffer_full | buffer_full, chunked, or passthrough. |
| streaming_chunk_size | integer | 200 | chunked: characters per incremental check. |
| streaming_context_size | integer | 50 | chunked: trailing context per check. |
| streaming_stream_first | boolean | false | chunked: emit-then-check (true) or check-then-emit (false). |
| allow | match list | {} | Global allow list (exact + regex). |
| deny | match list | {} | Global deny list (exact + regex). |
| audit | object | enabled | Audit-log configuration (see Audit logging). |

Provider fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Stable provider name (unique; referenced by routes). |
| type | string | required | Provider implementation type. |
| enabled | boolean | true | Whether this provider runs. |
| endpoint | string | - | Provider HTTP endpoint, where applicable. |
| api_key_env | string | - | Name of the environment variable holding the credential. |
| stages | array | both | input, output, or both. |
| category_thresholds | map | {} | Per-category score floors in [0.0, 1.0]. |
| timeout_ms | integer | global | Per-provider timeout override. |
| on_error | string | global | Per-provider error policy override. |
| options | object | - | Provider-specific options (PII actions, classifier template, ...). |

Configuration is validated at load and on hot-reload: provider names must be non-empty and unique, thresholds must fall in [0.0, 1.0], timeouts must be positive, every regex must compile, and enforce mode with guardrails enabled requires at least one provider.
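
The validation rules listed above can be expressed as a small checker over the parsed config. This is a sketch under assumptions: `validate_guardrails` is a hypothetical name, the input is the `guardrails:` block already parsed into a dict, and the error messages are invented.

```python
import re


def validate_guardrails(cfg: dict) -> list:
    """Return a list of validation errors (empty when the config is valid)."""
    errors = []
    providers = cfg.get("providers", [])
    names = [p.get("name", "") for p in providers]
    if any(not n for n in names):
        errors.append("provider names must be non-empty")
    if len(set(names)) != len(names):
        errors.append("provider names must be unique")

    def check_thresholds(owner, thresholds):
        for cat, value in thresholds.items():
            if not 0.0 <= value <= 1.0:
                errors.append(f"{owner}: threshold for {cat} outside [0.0, 1.0]")

    def check_lists(owner, section):
        for kind in ("allow", "deny"):
            for pattern in section.get(kind, {}).get("regex", []):
                try:
                    re.compile(pattern)
                except re.error as exc:
                    errors.append(f"{owner}: bad {kind} regex {pattern!r}: {exc}")

    if cfg.get("timeout_ms", 2000) <= 0:
        errors.append("timeout_ms must be positive")
    for p in providers:
        owner = p.get("name", "?")
        check_thresholds(owner, p.get("category_thresholds", {}))
        # An unset per-provider timeout inherits the global value.
        if p.get("timeout_ms", 1) <= 0:
            errors.append(f"{owner}: timeout_ms must be positive")
    check_lists("global", cfg)
    for route, override in cfg.get("routes", {}).items():
        check_thresholds(route, override.get("category_thresholds", {}))
        check_lists(route, override)
    if cfg.get("enabled") and cfg.get("mode", "monitor") == "enforce" and not providers:
        errors.append("enforce mode requires at least one provider")
    return errors
```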

Threshold tuning workflow

Roll out a policy without surprising your users:

  1. Start in monitor mode. Set mode: monitor (globally or per route) with the providers and thresholds you intend to use. Verdicts are computed and recorded but never gate traffic.
  2. Observe the metrics. Watch guardrail_verdicts_total{mode="monitor"} and guardrail_blocks_total to see what would have been blocked, broken down by stage, provider, and category. Check the audit log for the specific categories and scores.
  3. Tune thresholds. Raise a threshold for a category that produces false positives; lower one that lets real violations through. Adjust per provider and per route.
  4. Enforce. Switch the route (or the global default) to mode: enforce. The same thresholds now gate traffic. Keep monitoring guardrail_blocks_total and guardrail_fail_open_total / guardrail_fail_closed_total.

Because mode and thresholds are per-route, you can enforce on one model while keeping the rest in monitor, and you can change any of this at runtime through the admin API without a restart.

Operations

Admin runtime controls

The Admin API exposes the live guardrail policy and lets you change it without a restart. All endpoints require admin authentication (see Admin REST API).

| Endpoint | Method | Description |
| --- | --- | --- |
| /admin/guardrails | GET | View the effective guardrail policy. |
| /admin/guardrails | PATCH | Partially update top-level policy (mode, timeouts, fail policy, block behavior, lists). |
| /admin/guardrails/providers/{name} | PUT | Toggle or tune a single provider. |
| /admin/guardrails/routes/{route} | PUT | Create or replace a per-route override. |
| /admin/guardrails/routes/{route} | DELETE | Remove a per-route override (the route falls back to the global policy). |
| /admin/guardrails/test | POST | Dry-run sample text against the configured providers and return the verdicts. |

Use /admin/guardrails/test to check a policy against representative prompts before enforcing it, and the routes endpoints to enforce on a single model first.
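
The dry-run-then-enforce flow could be scripted along these lines. Only the paths and methods come from the table above; the base URL, the bearer-token header, and both request body shapes are assumptions made for illustration, so check the Admin REST API reference for the real schema. The helpers build requests without sending them.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8080"  # hypothetical router address


def admin_request(method: str, path: str, token: str, body=None) -> urllib.request.Request:
    """Build (but do not send) an authenticated Admin API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    # Assumption: bearer-token admin authentication.
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def dry_run(token: str, text: str) -> urllib.request.Request:
    # Assumption: the test endpoint accepts the sample under a "text" key.
    return admin_request("POST", "/admin/guardrails/test", token, {"text": text})


def enforce_route(token: str, route: str) -> urllib.request.Request:
    # Assumption: the body mirrors the keys of the routes config map.
    path = "/admin/guardrails/routes/" + quote(route, safe="")
    return admin_request("PUT", path, token, {"mode": "enforce"})
```

Passing either request to `urllib.request.urlopen` would send it; a sensible sequence is to run `dry_run` over representative prompts, review the verdicts, and only then send `enforce_route` for a single model.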

Metrics

When the metrics feature is enabled, every guardrail decision is exported as Prometheus series:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| guardrail_checks_total | counter | stage, provider, result | Per-provider checks by stage and verdict result. stage is input / output / streaming; result is allow / block / transform / flag. |
| guardrail_blocks_total | counter | stage, provider, category | Block verdicts by stage, provider, and safety category. |
| guardrail_check_duration_seconds | histogram | stage, provider | Per-provider check latency. |
| guardrail_errors_total | counter | provider, kind | Provider errors; kind is timeout or error. |
| guardrail_fail_open_total | counter | provider | Provider failures resolved fail-open (allowed). |
| guardrail_fail_closed_total | counter | provider | Provider failures resolved fail-closed (blocked). |
| guardrail_verdicts_total | counter | stage, mode, result | The single aggregated verdict per request after applying mode semantics. mode is monitor / enforce, so monitor-mode verdicts are visible even though they never gate. |

See Metrics and Monitoring → Guardrail Metrics for the full reference.

Audit logging

Every non-allow verdict (block, transform, or flag) is logged via structured tracing with the provider, stage, category, score, mode, and action. The audit log never carries raw prompt text, response text, or secrets: request metadata is redacted before logging, and only the verdict metadata listed above is recorded. Audit logging is on by default and configured under guardrails.audit:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Whether guardrail-decision audit logging is on. |
| log_level | string | info | Level at which audit events are emitted: debug, info, or warn. |

See also