Backend Passthrough Contract¶
For several backend types, Continuum Router forwards the chat completions request body without backend-specific reshaping once the request reaches the transport layer. This is narrower than an edge-to-upstream byte-equivalence guarantee: earlier handler stages may still modify the payload before make_http_request, make_unix_socket_request, or LlamaCppBackend::execute_chat_completion see it. This document defines which backends offer that transport-layer guarantee, which fields known consumers rely on, and where the boundary is with backends that do transform the request.
Passthrough Backends¶
The following backend types do not apply a backend-specific transform in their chat-completions transport path. Once a request has reached that path, all non-standard top-level fields, extra_body, and unrecognized keys are forwarded unchanged:
OpenAI(with the cloud-boundary exception described below)Llamacpp(llama-server)Mlxcel(reusesLlamaCppBackend; seesrc/infrastructure/backends/factory/backend_factory.rs)OllamavLLMLMStudioLocalAI
The passthrough has three equivalent proxy sites plus one factory-backed backend site:
- HTTP path:
src/proxy/backend.rs::make_http_request, the primary passthrough site. Thepayload.clone()call in theelsebranch (non-Anthropic, non-Gemini) forwards the body unchanged. - Unix-socket path: the sibling
elsebranch inmake_unix_socket_requestwithin the same file, exercised whenbackend.transportisUnixSocket. - Streaming path:
src/http/streaming/handler.rsaround theclient.post(&backend_url).json(¤t_payload)call, which forwardscurrent_payloadverbatim for each attempt in the streaming fallback loop. - Factory-backed llama.cpp / MLxcel path:
src/infrastructure/backends/llamacpp/backend.rs::LlamaCppBackend::execute_chat_completion, reached forBackendTypeConfig::LlamacppandBackendTypeConfig::Mlxcelviasrc/infrastructure/backends/factory/backend_factory.rs.
Apart from the cloud OpenAI field strip described next, none of these sites applies any provider-specific transformation.
Cloud OpenAI exception¶
The OpenAI backend type covers both the OpenAI cloud and local OpenAI-compatible engines, and the two disagree on unknown fields: api.openai.com rejects any unrecognized top-level key with HTTP 400, while local engines consume them. The transport sites therefore apply one gated filter. When a /v1/chat/completions request is bound for a URL containing api.openai.com, the engine-only fields chat_template_kwargs, thinking_budget_tokens, enable_thinking, preserve_thinking, top_k, min_p, and repeat_penalty (NON_OPENAI_FIELDS in src/infrastructure/backends/field_filter.rs) are removed before the request is sent.
- The strip runs at every cloud-OpenAI chat send site: the non-streaming passthrough branch of
make_http_request, the streaming transport, and each hop of the mid-stream and auto-selection fallback loops. A fallback may switch backends, so the gate is re-evaluated per hop. - The same denylist backs the cloud Gemini strip (Google's
/v1beta/openai/chat/completionsendpoint validates field names just as strictly). Azure OpenAI and the Anthropic/v1/messagesbridge are separate surfaces and are not covered by this gate. - Stripping is top-level only.
extra_bodyandreasoning_effortare never touched:extra_bodyis the intentional escape hatch for provider-specific settings, andreasoning_effortis accepted by cloud OpenAI directly (cloud Gemini maps it tothinking_level). - Local engine URLs never contain
api.openai.com, so the filter is a no-op for llama.cpp, vLLM, MLxcel, Ollama, LM Studio, and LocalAI, and the passthrough guarantee for them is unchanged.
Known consumers¶
The following fields are used in production and must survive the passthrough unchanged:
chat_template_kwargs— Jinja template parameters for llama.cpp / MLxcel (e.g.,{"enable_thinking": false, "preserve_thinking": true}). Required for Qwen3-family thinking-mode control.thinking_budget_tokens— per-request thinking-budget cap accepted by llama.cpp. Required for Qwen3 thinking mode via llama-server.extra_body— object of additional parameters passed by OpenAI client libraries (e.g., the PythonopenaiSDK'sextra_bodykwarg). Used to forward vLLM-specific settings such as{"skip_special_tokens": false}.
Transform Backends¶
The following backend types run the request through a provider-specific transformation layer before forwarding it. Non-standard fields may not survive verbatim:
Anthropic—src/http/handlers/anthropic/transform.rs. Among other changes, thethinkingblock (withbudget_tokens) is transformed intoreasoning_effortfor the downstream OpenAI-compatible call.Gemini—src/infrastructure/backends/gemini/transform.rs. Gemini-native request reshaping; consult the transform source for the current field mapping.
Integration Tests¶
The transport-layer passthrough contract is protected by the following integration tests:
tests/llamacpp_passthrough_test.rs— verifies thatchat_template_kwargs,thinking_budget_tokens,extra_body, and arbitrary unknown fields reach the llama-server endpoint unchanged.tests/mlxcel_passthrough_test.rs— keeps separate coverage for theMlxcelbackend type. It includes a proxy-path fixture and a factory-backedBackendFactory -> LlamaCppBackend::execute_chat_completionassertion so a future divergence in MLxcel wiring surfaces as a distinct failing test.tests/anthropic_input_test.rs— includes a negative test (test_anthropic_thinking_budget_tokens_transforms_to_reasoning_effort) that confirmsthinking.budget_tokensis transformed and does not reach the downstream server as a rawthinking_budget_tokensfield.- Unit tests in
src/infrastructure/backends/field_filter.rs— cover the cloud-gated strip: a cloud OpenAI chat request loses every denylisted field, a local-engine request keeps them all, and non-chat endpoints are left alone.
If a refactor breaks any of these tests, treat it as a breaking change to a public contract, not a routine test failure. Update this document to reflect the new behavior before merging.
What the Guarantee Covers¶
The guarantee applies to the transport sites listed above. It does not mean the upstream server always receives a byte-for-byte copy of the client-submitted JSON body.
Before those passthrough sites run, chat_completions can still perform router-level preprocessing:
- Global system prompt injection when
global_promptsis configured, which modifies themessagesarray. transform_payload_for_openai()renamingmax_tokenstomax_completion_tokensforo1,o3, andgpt-5*model IDs.- Router-managed
web_searchinjection for eligible self-hosted backends whenweb_search.enabledapplies, which can add atoolsentry before the upstream call.
At the transport layer itself, no additional top-level JSON keys are injected, and no passthrough fields are filtered or renamed except for the cloud OpenAI strip described above.
Related Documentation¶
- Reasoning Effort — how
reasoning_effort/thinking.budget_tokensis mapped across providers - Model Fallback — fallback chain configuration
- Architecture Overview — main architecture guide