# Router-Managed Web Search

Continuum Router can transparently provide an OpenAI-style `web_search`
tool to chat completion requests routed to self-hosted LLM backends
(vLLM, Ollama, llama.cpp, MLxcel, LM Studio, Generic OpenAI-compatible,
Continuum Router remote). This lets users run full agentic workflows on
self-hosted models without wiring search integration into every client.

Commercial providers (OpenAI, Azure OpenAI, Gemini, Anthropic) already
ship native web search tools, so the router leaves them completely
unchanged. A request to `gpt-4o` or `claude-sonnet-4.6` flows through
the router exactly as before.
## How it works

For each chat completion, the router:

- Resolves which backend will serve the model.
- If the backend is self-hosted and the `web_search` feature is enabled,
  injects a `web_search` function tool into the outbound `tools[]` array.
- Dispatches the request. If the model responds with a `tool_calls` entry
  for `web_search`, the router:
    - parses the `{"query": "..."}` arguments,
    - calls the configured search provider,
    - appends a tool-role message with the JSON-encoded results,
    - re-invokes the backend with the enriched conversation.
- Repeats up to `max_tool_iterations` times (default 5), then returns
  whatever terminal response the model produced.

If the provider fails mid-execution, the router returns a structured
`{"error": "..."}` tool-result to the model instead of failing the
request; the model can then apologize, retry, or answer without search.
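The bounded loop described above can be sketched as follows. This is a
minimal illustration, not router source: `call_backend` and `run_search`
are hypothetical stand-ins for the router's backend dispatch and search
provider client, which this document does not name.

```python
import json

MAX_TOOL_ITERATIONS = 5  # mirrors max_tool_iterations (default 5)

def run_tool_loop(messages, call_backend, run_search):
    """Bounded web_search tool-execution loop (sketch)."""
    response = {}
    for _ in range(MAX_TOOL_ITERATIONS):
        response = call_backend(messages)
        calls = response.get("tool_calls") or []
        if not calls:
            return response  # terminal answer: nothing to service
        messages.append({"role": "assistant", "tool_calls": calls})
        for call in calls:
            name = call["function"]["name"]
            if name != "web_search":
                # Orphan tool call: answer with a structured error so the
                # tool_call_id is not left dangling on the next round.
                content = json.dumps({"error": f"unknown tool {name}"})
            else:
                try:
                    query = json.loads(call["function"]["arguments"])["query"]
                    content = json.dumps(run_search(query))
                except Exception as exc:
                    # Provider or parse failure: feed an error back to the
                    # model instead of failing the whole request.
                    content = json.dumps({"error": str(exc)})
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": content})
    # Iteration cap hit: return whatever the model last produced.
    return response
```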
## Anthropic Messages API support

When a client talks to the router through `/anthropic/v1/messages`, the
router serves two shapes of `web_search` tool:

### Custom `web_search` tool

If the client sends a client-defined tool named `web_search` (an
Anthropic custom tool with a JSON `input_schema`), the router treats it
like any other self-hosted request: translate to OpenAI format, run the
bounded tool-execution loop, translate the response back to Anthropic
format. This is the path the main Claude Code conversation uses.
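For concreteness, a request on this path might look like the following.
The tool shape follows Anthropic's custom-tool format; the model name,
prompt, and schema details are illustrative.

```python
# Illustrative Anthropic Messages request carrying a client-defined
# web_search custom tool; the router translates it to OpenAI format
# and runs the bounded tool-execution loop.
custom_tool_request = {
    "model": "claude-sonnet-4.6",
    "max_tokens": 1024,
    "tools": [{
        "name": "web_search",           # plain name + JSON schema = custom tool
        "description": "Search the web for a query",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }],
    "messages": [{"role": "user", "content": "What shipped in Rust 1.80?"}],
}
```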
### Anthropic server tool `web_search_20250305`

Anthropic's native web search is identified by
`{"type": "web_search_20250305", "name": "web_search"}`. Claude Code's
built-in WebSearch uses this shape, plus `tool_choice: {"type": "tool",
"name": "web_search"}` to force the model to call it. Running the
generic tool-execution loop against this shape would never terminate,
because the forced `tool_choice` makes the model re-emit the tool call
every round, so the router takes a dedicated path:

- One backend turn extracts the `query` from the model's tool call (with
  a fallback that strips the Claude-Code-specific
  `"Perform a web search for the query: "` prefix off the user prompt
  when the backend does not emit a tool call).
- The configured search provider is invoked once (no loop).
- The response is assembled in Anthropic's native format:
    - a `server_tool_use` block carrying the query,
    - a `web_search_tool_result` block carrying the ranked results, with
      a base64-encoded snippet stored in each `encrypted_content` field.
- Streaming clients receive the matching SSE sequence (`message_start`,
  `content_block_start`/`delta`/`stop` for both blocks, `message_delta`,
  `message_stop`), the same events Claude Code's parser consumes when
  talking to `api.anthropic.com` directly.
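The two emulated content blocks can be illustrated with a small sketch.
The block types and the `encrypted_content` field come from the
description above; the block `id` and the per-result `title`/`url`/
`snippet` field names are illustrative assumptions.

```python
import base64

def build_server_tool_blocks(query, results):
    """Sketch of the emulated web_search_20250305 response content."""
    tool_use = {
        "type": "server_tool_use",
        "id": "srvtoolu_0",            # illustrative id
        "name": "web_search",
        "input": {"query": query},
    }
    tool_result = {
        "type": "web_search_tool_result",
        "tool_use_id": tool_use["id"],
        "content": [{
            "type": "web_search_result",
            "title": r["title"],
            "url": r["url"],
            # Base64 of the snippet, not a signed blob (see Caveats).
            "encrypted_content": base64.b64encode(r["snippet"].encode()).decode(),
        } for r in results],
    }
    return [tool_use, tool_result]
```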
The commercial native-Anthropic path (`forward_to_anthropic_native`) is
not modified: if you have an Anthropic backend configured, the server
tool is proxied through verbatim so Anthropic itself executes it.
## Configuration

Add a top-level `web_search` section to `config.yaml`:

```yaml
web_search:
  enabled: true
  provider: serper              # serper | exa | brave
  api_key: "${SERPER_API_KEY}"  # env var substitution supported
  timeout_ms: 5000              # per-request timeout (100-60000 ms)
  max_results: 5                # 1-20
  max_tool_iterations: 5        # hard cap on tool rounds (1-20)
  inject_policy: auto           # auto | always | never
  tool_name: web_search         # function name advertised to the model
  result_char_cap: 4000         # per-result snippet truncation cap (bytes)
  streaming_enabled: false      # mid-stream tool execution (advanced)
  loop_wall_clock_ms: 60000     # total wall-clock budget for the loop
  max_total_result_bytes: 32768 # combined byte cap on tool-result content
```
### Injection policies

`inject_policy` decides when the router adds the tool definition to an
outbound request:

- `auto` (default): inject when the client sends
  `enable_web_search: true` OR the client already references a
  `web_search` tool by name.
- `always`: inject for every self-hosted request (when enabled).
- `never`: never inject; lets the operator disable injection without
  disabling the whole feature (useful for A/B tests).
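The policy decision reduces to a few lines. This is a sketch with
illustrative function and parameter names, not the router's API, and it
assumes `web_search` is globally enabled for the backend in question.

```python
def should_inject(policy, enable_web_search, client_tool_names):
    """Decide whether to add the web_search tool to an outbound request."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    # auto: explicit opt-in flag OR an existing web_search tool reference
    return enable_web_search or "web_search" in client_tool_names
```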
### Per-backend overrides

```yaml
web_search:
  enabled: true
  # ... global settings ...
  per_backend:
    sensitive-vllm:
      enabled: false         # disable for this backend only
    ollama-external:
      inject_policy: always
    # canary-vllm:
    #   enabled: true        # can opt in this backend even when the
    #                        # global enabled flag is false
```
### Hot reload

All `web_search` fields, including the API key, are re-read when the
router's configuration reloads. You can rotate an API key or toggle the
feature on and off without restarting.
## Supported providers
| Provider | Status | Endpoint |
|---|---|---|
| Serper | Implemented | https://google.serper.dev/search |
| Brave | Implemented | https://api.search.brave.com/res/v1/web/search |
| Exa | Implemented | https://api.exa.ai/search |
All three providers are live and share the same `SearchProvider`
trait; switching between them is a single `provider:` line change in
the `web_search` config section.
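As an illustration of what a provider call looks like, here is a sketch
of building a Serper request with the standard library. The endpoint
comes from the table above; the `X-API-KEY` header and `q`/`num` body
fields follow Serper's public API as commonly documented, but verify
them against current Serper documentation before relying on this.

```python
import json
import urllib.request

def build_serper_request(api_key, query, max_results=5):
    """Build (but do not send) a Serper search request."""
    body = json.dumps({"q": query, "num": max_results}).encode()
    return urllib.request.Request(
        "https://google.serper.dev/search",
        data=body,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Executing it would be `urllib.request.urlopen(req, timeout=5)`, with the
timeout mirroring `timeout_ms`.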
## Security

- API keys are expanded from `${ENV}` references at load time and are
  never written to logs or error responses. `Debug` output for
  `WebSearchConfig`, `SerperProvider`, `BraveProvider`, and
  `ExaProvider` redacts the `api_key` field.
- Search-result titles and snippets are sanitized to strip HTML/control
  characters, then truncated to `result_char_cap` before being fed back
  into the model context.
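The sanitize-then-truncate step can be sketched as follows. This is an
illustrative approximation (tag stripping plus control-character
removal, with the cap applied per character for simplicity), not the
router's exact sanitizer.

```python
import re

RESULT_CHAR_CAP = 4000  # mirrors result_char_cap

def sanitize_snippet(text, cap=RESULT_CHAR_CAP):
    """Strip HTML tags and control characters, then truncate."""
    text = re.sub(r"<[^>]*>", "", text)                  # drop HTML tags
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text) # drop control chars (keeps \t, \n)
    return text[:cap]
```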
## Observability

Prometheus metrics emitted by the feature (when the `metrics` feature
is compiled in):

- `web_search_calls_total{provider, outcome}`: one per executed tool
  call.
- `web_search_call_duration_seconds{provider, outcome}`: histogram of
  search execution latency.
- `web_search_iteration_cap_total{component="loop"}`: incremented
  whenever the `max_tool_iterations` cap is hit.
- `web_search_injections_total{backend_type}`: one per request where
  the router successfully injected its tool.
## Safety bounds

The non-streaming tool-execution loop enforces four independent guards:

- `max_tool_iterations` caps the number of backend round-trips (hard
  upper bound, default 5).
- `loop_wall_clock_ms` caps total wall-clock time spent in the loop
  (default 60 s). Once exceeded, the last terminal response from the
  model is returned unchanged; no error is surfaced to the client.
- `max_total_result_bytes` caps the combined byte length of tool-result
  `content` appended to the conversation across every iteration
  (default 32 KiB). Once exceeded, further tool-result payloads are
  replaced with a short `"tool-result budget exhausted"` error so the
  model still sees a closed tool-call cycle and can produce a terminal
  response.
- Orphan tool calls (tool calls whose `function.name` is not the
  configured `web_search` tool name) are answered with a structured
  tool-role error so the backend does not reject the next round for a
  dangling `tool_call_id`. The router never attempts to execute tool
  calls it does not know how to service.
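The byte-budget guard can be sketched in a few lines (an illustrative
function, not router source; names are assumptions):

```python
MAX_TOTAL_RESULT_BYTES = 32768  # mirrors max_total_result_bytes (32 KiB)

def budget_tool_result(payload, used_bytes, limit=MAX_TOTAL_RESULT_BYTES):
    """Apply the combined tool-result byte budget.

    Returns (content_to_append, new_used_bytes). Once the budget is
    exhausted, a short error replaces the payload so the tool-call
    cycle still closes and the model can produce a terminal response.
    """
    size = len(payload.encode("utf-8"))
    if used_bytes + size > limit:
        return '{"error": "tool-result budget exhausted"}', used_bytes
    return payload, used_bytes + size
```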
## Model aliasing (Claude Code compatibility)

Clients such as Claude Code hard-code specific model names for their
internal features. For example, `claude-haiku-4-5-20251001` is used by
WebFetch's content summarizer and by the intermediate call WebSearch
uses to extract a query. On a router serving only self-hosted models,
every such internal call would otherwise fail with `ModelNotFound`.
The top-level `model_aliases` section rewrites the incoming `model`
field before backend selection, mirroring the
`ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL` convention popularised by
cc-switch:

```yaml
model_aliases:
  haiku: GLM-5_1      # any incoming name containing "haiku"
  sonnet: GLM-5_1     # any incoming name containing "sonnet"
  opus: GLM-5_1       # any incoming name containing "opus"
  reasoning: GLM-5_1  # used when thinking.type is enabled | adaptive
  default: GLM-5_1    # catch-all; leave unset to disable
  exact:              # full-name pins, take precedence
    claude-haiku-4-5-20251001: GLM-5_1
```
Match order is `exact` → `reasoning` (when thinking is enabled) →
`haiku` → `opus` → `sonnet` → `default`. Matching is a
case-insensitive substring check on the incoming name. Every rewrite
is logged at `info` level so operators can observe the redirection.
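The documented match order can be sketched as a small resolver
(illustrative names; not the router's implementation):

```python
def resolve_alias(model, aliases, thinking_enabled=False):
    """Resolve a model alias: exact -> reasoning -> haiku -> opus -> sonnet -> default."""
    if model in aliases.get("exact", {}):
        return aliases["exact"][model]  # full-name pins take precedence
    if thinking_enabled and "reasoning" in aliases:
        return aliases["reasoning"]
    name = model.lower()  # case-insensitive substring match
    for family in ("haiku", "opus", "sonnet"):
        if family in name and family in aliases:
            return aliases[family]
    # Catch-all; with no default configured the name passes through.
    return aliases.get("default", model)
```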
Aliasing is applied only on `/anthropic/v1/messages`; the OpenAI
`/v1/chat/completions` endpoint is unaffected (callers there already
specify their own model name). Native Anthropic forwarding round-trips
the typed request, so if you route to an Anthropic backend and do not
want aliasing to apply, leave `model_aliases` unset.
## Caveats

- The non-streaming tool-execution loop and the Anthropic
  `web_search_20250305` server-tool emulation path (including its
  streaming SSE sequence) are fully functional. Streaming tool
  execution for the generic custom-tool loop is still gated behind
  `streaming_enabled: false` by default; streaming requests that hit
  the generic loop (i.e. neither a custom `web_search` tool nor the
  Anthropic server tool) currently pass through to the underlying SSE
  stream without router-side execution.
- The router trusts models to emit well-formed `{"query": "..."}`
  arguments. Malformed arguments result in a tool-role error message to
  the model rather than an HTTP error.
- The injected tool is deliberately OpenAI function-calling shaped
  (`tools[].function.parameters`) so self-hosted models trained on
  OpenAI-style tool use behave predictably.
- `encrypted_content` in the emulated `web_search_tool_result` block is
  a base64 of the snippet, not a cryptographically signed blob.
  Anthropic itself uses the field as a signed reference it validates on
  follow-up turns; the router is the server in this flow, so an opaque
  base64 payload is sufficient for Claude Code to round-trip.