Router-Managed Web Search

Continuum Router can transparently provide an OpenAI-style web_search tool to chat completion requests routed to self-hosted LLM backends (vLLM, Ollama, llama.cpp, MLxcel, LM Studio, Generic OpenAI-compatible, Continuum Router remote). This lets users run full agentic workflows on self-hosted models without wiring search integration into every client.

Commercial providers (OpenAI, Azure OpenAI, Gemini, Anthropic) already ship native web search tools, so the router leaves them completely unchanged. A request to gpt-4o or claude-sonnet-4.6 flows through the router exactly as before.

How it works

For each chat completion, the router:

  1. Resolves which backend will serve the model.
  2. If the backend is self-hosted and the web_search feature is enabled, injects a web_search function tool into the outbound tools[] array.
  3. Dispatches the request. If the model responds with a tool_calls entry for web_search, the router:
    • parses the {"query": "..."} arguments,
    • calls the configured search provider,
    • appends a tool-role message with the JSON-encoded results,
    • re-invokes the backend with the enriched conversation.
  4. Repeats up to max_tool_iterations times (default 5), then returns whatever terminal response the model produced.

If the provider fails mid-execution, the router returns a structured {"error": "..."} tool-result to the model instead of failing the request — the model can then apologize, retry, or answer without search.
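The loop described above can be sketched in Python. `call_backend` and `search` are hypothetical callables standing in for the router's backend dispatch and search provider; all names here are illustrative, not the router's actual internals:

```python
import json

def run_tool_loop(call_backend, search, messages, max_tool_iterations=5):
    """Bounded tool-execution loop sketch (illustrative names)."""
    for _ in range(max_tool_iterations):
        response = call_backend(messages)
        tool_calls = response.get("tool_calls") or []
        if not tool_calls:
            return response  # terminal answer: no tool call emitted
        messages.append({"role": "assistant", "tool_calls": tool_calls})
        for call in tool_calls:
            try:
                query = json.loads(call["function"]["arguments"])["query"]
                content = json.dumps(search(query))
            except Exception as exc:
                # Provider failure becomes a structured tool-result,
                # not an HTTP error to the client.
                content = json.dumps({"error": str(exc)})
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": content,
            })
    # Iteration cap reached: one final turn, returned as-is.
    return call_backend(messages)
```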

Anthropic Messages API support

When a client talks to the router through /anthropic/v1/messages, the router serves two shapes of web_search tool:

Custom web_search tool

If the client sends a client-defined tool named web_search (an Anthropic custom tool with a JSON input_schema), the router treats it like any other self-hosted request: translate to OpenAI format, run the bounded tool-execution loop, translate the response back to Anthropic format. This is the path the main Claude Code conversation uses.
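For reference, a client-defined web_search tool in an Anthropic Messages request has roughly this shape (an illustrative fragment; field values are placeholders, not router-specific requirements):

```json
{
  "tools": [
    {
      "name": "web_search",
      "description": "Search the web for up-to-date information",
      "input_schema": {
        "type": "object",
        "properties": { "query": { "type": "string" } },
        "required": ["query"]
      }
    }
  ]
}
```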

Anthropic server tool web_search_20250305

Anthropic's native web search is identified by {"type": "web_search_20250305", "name": "web_search"}. Claude Code's built-in WebSearch uses this shape, plus tool_choice: {type: "tool", name: "web_search"} to force the model to call it. Running the generic tool-execution loop against this shape would never terminate — the forced tool_choice makes the model re-emit the tool call every round — so the router takes a dedicated path:

  1. One backend turn extracts the query from the model's tool call (with a fallback that strips the Claude-Code-specific "Perform a web search for the query: " prefix off the user prompt when the backend does not emit a tool call).
  2. The configured search provider is invoked once (no loop).
  3. The response is assembled as Anthropic's native format:

    • a server_tool_use block carrying the query,
    • a web_search_tool_result block carrying the ranked results, with a base64-encoded snippet stored in each encrypted_content field.
  4. Streaming clients receive the matching SSE sequence (message_start, content_block_start/delta/stop for both blocks, message_delta, message_stop), the same events Claude Code's parser consumes when talking to api.anthropic.com directly.
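The assembled non-streaming response resembles the following (an illustrative sketch of the two emitted blocks; IDs and values are placeholders):

```json
{
  "content": [
    {
      "type": "server_tool_use",
      "id": "srvtoolu_example",
      "name": "web_search",
      "input": { "query": "example query" }
    },
    {
      "type": "web_search_tool_result",
      "tool_use_id": "srvtoolu_example",
      "content": [
        {
          "type": "web_search_result",
          "url": "https://example.com/",
          "title": "Example result",
          "encrypted_content": "<base64-encoded snippet>"
        }
      ]
    }
  ]
}
```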

The commercial native-Anthropic path (forward_to_anthropic_native) is not modified — if you have an Anthropic backend configured, the server tool is proxied through verbatim so Anthropic itself executes it.

Configuration

Add a top-level web_search section to config.yaml:

web_search:
  enabled: true
  provider: serper              # serper | exa | brave
  api_key: "${SERPER_API_KEY}"  # env var substitution supported
  timeout_ms: 5000              # per-request timeout (100-60000 ms)
  max_results: 5                # 1 - 20
  max_tool_iterations: 5        # hard cap on tool rounds (1 - 20)
  inject_policy: auto           # auto | always | never
  tool_name: web_search         # function name advertised to the model
  result_char_cap: 4000         # per-result snippet truncation cap (bytes)
  streaming_enabled: false      # mid-stream tool execution (advanced)
  loop_wall_clock_ms: 60000     # total wall-clock budget for the loop
  max_total_result_bytes: 32768 # combined byte cap on tool-result content

Injection policies

inject_policy decides when the router adds the tool definition to an outbound request:

  • auto (default): inject when the client sends enable_web_search: true OR the client already references a web_search tool by name.
  • always: inject for every self-hosted request (when enabled).
  • never: never inject — lets the operator disable injection without disabling the whole feature (useful for A/B tests).
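The decision can be sketched as a small predicate (illustrative names, assuming the request body is a parsed JSON object):

```python
def should_inject(policy, request, enabled=True):
    """Sketch of the inject_policy decision (illustrative, not the
    router's actual code)."""
    if not enabled or policy == "never":
        return False
    if policy == "always":
        return True
    # policy == "auto": the client opted in, or already references
    # a web_search tool by name.
    if request.get("enable_web_search"):
        return True
    return any(
        t.get("function", {}).get("name") == "web_search"
        for t in request.get("tools", [])
    )
```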

Per-backend overrides

web_search:
  enabled: true
  # ... global settings ...
  per_backend:
    sensitive-vllm:
      enabled: false           # disable for this backend only
    ollama-external:
      inject_policy: always
    # canary-vllm:
    #   enabled: true          # can opt in this backend even when the
    #                          # global enabled flag is false
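Override resolution follows the usual per-backend-wins rule: a key set under per_backend takes precedence for that backend, and anything unset falls back to the global value. A minimal sketch, assuming config is held as plain dictionaries:

```python
def effective_setting(global_cfg, per_backend, backend, key):
    """Per-backend value wins when present; otherwise the global one
    (illustrative sketch, not the router's actual config code)."""
    override = per_backend.get(backend, {})
    return override.get(key, global_cfg.get(key))
```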

Hot reload

All web_search fields, including the API key, are re-read when the router's configuration reloads. You can rotate an API key or toggle the feature on and off without restarting.

Supported providers

| Provider | Status      | Endpoint                                      |
|----------|-------------|-----------------------------------------------|
| Serper   | Implemented | https://google.serper.dev/search              |
| Brave    | Implemented | https://api.search.brave.com/res/v1/web/search |
| Exa      | Implemented | https://api.exa.ai/search                     |

All three providers are live and share the same SearchProvider trait; switching between them is a single provider: line change in the web_search config section.
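As an illustration of what a provider call looks like at the HTTP level, here is a sketch of a Serper-style request; the header and body shape are stated to the best of my knowledge, so treat the provider's own documentation as authoritative:

```python
def build_serper_request(query, api_key, max_results=5):
    """Sketch of a Serper search request (assumed shape: POST with an
    X-API-KEY header and a JSON body; verify against Serper's docs)."""
    return {
        "url": "https://google.serper.dev/search",
        "headers": {
            "X-API-KEY": api_key,
            "Content-Type": "application/json",
        },
        "json": {"q": query, "num": max_results},
    }
```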

Security

  • API keys are expanded from ${ENV} references at load time and are never written to logs or error responses.
  • Debug output for WebSearchConfig, SerperProvider, BraveProvider, and ExaProvider redacts the api_key field.
  • Search-result titles and snippets are sanitized to strip HTML/control characters, then truncated to result_char_cap before being fed back into the model context.
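The sanitize-then-truncate step can be sketched as follows (a minimal illustration, not the router's exact implementation; note the cap is applied in bytes, so a partial trailing UTF-8 sequence is dropped):

```python
import re

def sanitize_snippet(text, result_char_cap=4000):
    """Strip HTML tags and control characters, then byte-cap the
    result (illustrative sketch)."""
    text = re.sub(r"<[^>]*>", "", text)                   # drop HTML tags
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # drop control chars
    # Byte-level cap; errors="ignore" discards a split trailing character.
    return text.encode("utf-8")[:result_char_cap].decode("utf-8", errors="ignore")
```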

Observability

Prometheus metrics emitted by the feature (when the metrics feature is compiled in):

  • web_search_calls_total{provider, outcome} — one per executed tool call.
  • web_search_call_duration_seconds{provider, outcome} — histogram of search execution latency.
  • web_search_iteration_cap_total{component="loop"} — incremented whenever the max_tool_iterations cap is hit.
  • web_search_injections_total{backend_type} — one per request where the router successfully injected its tool.

Safety bounds

The non-streaming tool-execution loop enforces four independent guards:

  • max_tool_iterations caps the number of backend round-trips (hard upper bound, default 5).
  • loop_wall_clock_ms caps total wall-clock time spent in the loop (default 60 s). Once exceeded, the last terminal response from the model is returned unchanged; no error is surfaced to the client.
  • max_total_result_bytes caps the combined byte length of tool-result content appended to the conversation across every iteration (default 32 KiB). Once exceeded, further tool-result payloads are replaced with a short "tool-result budget exhausted" error so the model still sees a closed tool-call cycle and can produce a terminal response.
  • Orphan tool calls (tool calls whose function.name is not the configured web_search tool name) are answered with a structured tool-role error so the backend does not reject the next round for a dangling tool_call_id. The router never attempts to execute tool calls it does not know how to service.
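The combined byte cap can be sketched as a small guard applied per tool result (illustrative names; `used_bytes` is the running total across iterations):

```python
import json

def guard_tool_result(content, used_bytes, max_total_result_bytes=32768):
    """Sketch of the max_total_result_bytes guard: once the budget is
    spent, the payload is replaced with a short error so the model
    still sees a closed tool-call cycle (illustrative, not the
    router's actual code)."""
    size = len(content.encode("utf-8"))
    if used_bytes + size > max_total_result_bytes:
        return json.dumps({"error": "tool-result budget exhausted"}), used_bytes
    return content, used_bytes + size
```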

Model aliasing (Claude Code compatibility)

Clients such as Claude Code hard-code specific model names for their internal features. For example, claude-haiku-4-5-20251001 is used by WebFetch's content summarizer and by the intermediate call WebSearch uses to extract a query. On a router serving only self-hosted models, every such internal call would otherwise fail with ModelNotFound.

The top-level model_aliases section rewrites the incoming model field before backend selection, mirroring the ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL convention popularised by cc-switch:

model_aliases:
  haiku: GLM-5_1          # any incoming name containing "haiku"
  sonnet: GLM-5_1         # any incoming name containing "sonnet"
  opus: GLM-5_1           # any incoming name containing "opus"
  reasoning: GLM-5_1      # used when thinking.type is enabled | adaptive
  default: GLM-5_1        # catch-all; leave unset to disable
  exact:                  # full-name pins, take precedence
    claude-haiku-4-5-20251001: GLM-5_1

Match order is exact → reasoning (when thinking is enabled) → haiku → opus → sonnet → default. Matching is a case-insensitive substring check on the incoming name. Every rewrite is logged at info level so operators can observe the redirection.

Aliasing is applied only on /anthropic/v1/messages; the OpenAI /v1/chat/completions endpoint is unaffected (callers there already specify their own model name). Native Anthropic forwarding round-trips the typed request, so if you route to an Anthropic backend and do not want aliasing to apply, leave model_aliases unset.
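The documented precedence can be sketched as follows (an illustrative function over the `model_aliases` mapping, not the router's actual code):

```python
def resolve_alias(model, aliases, thinking_enabled=False):
    """Sketch of alias resolution: exact pin first, then reasoning
    (when thinking is enabled), then haiku/opus/sonnet substring
    matches, then the default catch-all."""
    if model in aliases.get("exact", {}):
        return aliases["exact"][model]
    if thinking_enabled and "reasoning" in aliases:
        return aliases["reasoning"]
    name = model.lower()  # substring match is case-insensitive
    for key in ("haiku", "opus", "sonnet"):
        if key in name and key in aliases:
            return aliases[key]
    return aliases.get("default", model)  # unset default = no rewrite
```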

Caveats

  • The non-streaming tool-execution loop and the Anthropic web_search_20250305 server-tool emulation path (including its streaming SSE sequence) are fully functional. Streaming tool execution for the generic custom-tool loop is still gated behind streaming_enabled: false by default; streaming requests that hit the generic loop (i.e. neither a custom web_search tool nor the Anthropic server tool) currently pass through to the underlying SSE stream without router-side execution.
  • The router trusts models to emit well-formed {"query": "..."} arguments. Malformed arguments result in a tool-role error message to the model rather than an HTTP error.
  • The injected tool is deliberately OpenAI function-calling shaped (tools[].function.parameters) so self-hosted models trained on OpenAI-style tool use behave predictably.
  • encrypted_content in the emulated web_search_tool_result block is a base64 of the snippet, not a cryptographically signed blob. Anthropic itself uses the field as a signed reference it validates on follow-up turns; the router is the server in this flow, so an opaque base64 payload is sufficient for Claude Code to round-trip.