Backend Passthrough Contract

For several backend types, Continuum Router forwards the chat completions request body without backend-specific reshaping once the request reaches the transport layer. This is narrower than an edge-to-upstream byte-equivalence guarantee: earlier handler stages may still modify the payload before make_http_request, make_unix_socket_request, or LlamaCppBackend::execute_chat_completion see it. This document defines which backends offer that transport-layer guarantee, which fields known consumers rely on, and where the boundary is with backends that do transform the request.

Passthrough Backends

The following backend types do not apply a backend-specific transform in their chat-completions transport path. Once a request has reached that path, all non-standard top-level fields, extra_body, and unrecognized keys are forwarded unchanged:

  • OpenAI
  • Llamacpp (llama-server)
  • Mlxcel (reuses LlamaCppBackend; see src/infrastructure/backends/factory/backend_factory.rs)
  • Ollama
  • vLLM
  • LMStudio
  • LocalAI

The passthrough has three equivalent proxy sites plus one factory-backed backend site:

  • HTTP path: src/proxy/backend.rs::make_http_request, the primary passthrough site. The payload.clone() call in the else branch (non-Anthropic, non-Gemini) forwards the body unchanged.
  • Unix-socket path: the sibling else branch in make_unix_socket_request within the same file, exercised when backend.transport is UnixSocket.
  • Streaming path: src/http/streaming/handler.rs around the client.post(&backend_url).json(&current_payload) call, which forwards current_payload verbatim for each attempt in the streaming fallback loop.
  • Factory-backed llama.cpp / MLxcel path: src/infrastructure/backends/llamacpp/backend.rs::LlamaCppBackend::execute_chat_completion, reached for BackendTypeConfig::Llamacpp and BackendTypeConfig::Mlxcel via src/infrastructure/backends/factory/backend_factory.rs.

None of these sites apply any additional provider-specific transformation. The factory-level note about MLxcel inheriting LlamaCppBackend (see src/infrastructure/backends/factory/backend_factory.rs) is still correct and is now covered explicitly by the factory-backed site above.
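Conceptually, the contract at all of these sites can be sketched as follows. This is an illustrative Python model, not the router's Rust code: `forward` stands in for the serialization step at `make_http_request` and its siblings, and the model id and field values are invented for the example.

```python
import json

def forward(payload: dict) -> bytes:
    # Transport-layer passthrough: serialize the payload exactly as received,
    # with no provider-specific reshaping, filtering, or key renaming.
    return json.dumps(payload).encode("utf-8")

request = {
    "model": "qwen3-30b",  # hypothetical model id
    "messages": [{"role": "user", "content": "hi"}],
    "chat_template_kwargs": {"enable_thinking": False},
    "thinking_budget_tokens": 512,
    "some_unknown_field": {"nested": True},  # arbitrary unrecognized key
}

sent = json.loads(forward(request))
assert sent == request  # every non-standard field survives unchanged
```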

Known consumers

The following fields are used in production and must survive the passthrough unchanged:

  • chat_template_kwargs — Jinja template parameters for llama.cpp / MLxcel (e.g., {"enable_thinking": false, "preserve_thinking": true}). Required for Qwen3-family thinking-mode control.
  • thinking_budget_tokens — per-request thinking-budget cap accepted by llama.cpp. Required for Qwen3 thinking mode via llama-server.
  • extra_body — object of additional parameters passed by OpenAI client libraries (e.g., the Python openai SDK's extra_body kwarg). Used to forward vLLM-specific settings such as {"skip_special_tokens": false}.
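For illustration, the snippet below models in plain Python how the openai SDK's `extra_body` kwarg ends up merged into the top-level JSON body before it reaches the router. The helper function and the field values are illustrative, not SDK code; only the merge-into-top-level behavior is the point.

```python
def build_request_body(model, messages, extra_body=None, **standard):
    # Mirrors the SDK behavior this document relies on: extra_body entries
    # are merged into the top-level JSON object next to standard parameters,
    # so the router sees them as ordinary top-level keys.
    body = {"model": model, "messages": messages, **standard}
    if extra_body:
        body.update(extra_body)
    return body

body = build_request_body(
    model="vllm-model",  # hypothetical model id
    messages=[{"role": "user", "content": "hi"}],
    extra_body={"skip_special_tokens": False},  # vLLM-specific setting
)
assert body["skip_special_tokens"] is False  # promoted to top level
```

Because passthrough backends forward these top-level keys verbatim, the vLLM server receives `skip_special_tokens` exactly as the client sent it.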

Transform Backends

The following backend types run the request through a provider-specific transformation layer before forwarding it. Non-standard fields may not survive verbatim:

  • Anthropic (src/http/handlers/anthropic/transform.rs). Among other changes, the thinking block (with budget_tokens) is transformed into reasoning_effort for the downstream OpenAI-compatible call.
  • Gemini (src/infrastructure/backends/gemini/transform.rs). Gemini-native request reshaping; consult the transform source for the current field mapping.
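As a sketch of the Anthropic case: a `thinking` block is consumed and replaced, so it does not reach the upstream verbatim. The bucket thresholds below are invented for demonstration; the real mapping lives in src/http/handlers/anthropic/transform.rs.

```python
def transform_thinking(payload: dict) -> dict:
    # Illustrative only: consume the Anthropic-style thinking block and
    # emit reasoning_effort for the OpenAI-compatible downstream call.
    # The budget thresholds here are invented, not the router's values.
    out = dict(payload)
    thinking = out.pop("thinking", None)
    if thinking and thinking.get("type") == "enabled":
        budget = thinking.get("budget_tokens", 0)
        out["reasoning_effort"] = (
            "high" if budget >= 8192 else "medium" if budget >= 2048 else "low"
        )
    return out

req = {"model": "m", "thinking": {"type": "enabled", "budget_tokens": 4096}}
out = transform_thinking(req)
assert "thinking" not in out and out["reasoning_effort"] == "medium"
```

This is the behavioral opposite of the passthrough contract: a non-standard field is removed and a different field is synthesized in its place.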

Integration Tests

The transport-layer passthrough contract is protected by the following integration tests:

  • tests/llamacpp_passthrough_test.rs — verifies that chat_template_kwargs, thinking_budget_tokens, extra_body, and arbitrary unknown fields reach the llama-server endpoint unchanged.
  • tests/mlxcel_passthrough_test.rs — keeps separate coverage for the Mlxcel backend type. It includes a proxy-path fixture and a factory-backed BackendFactory -> LlamaCppBackend::execute_chat_completion assertion so a future divergence in MLxcel wiring surfaces as a distinct failing test.
  • tests/anthropic_input_test.rs — includes a negative test (test_anthropic_thinking_budget_tokens_transforms_to_reasoning_effort) that confirms thinking.budget_tokens is transformed and does not reach the downstream server as a raw thinking_budget_tokens field.

If a refactor breaks any of these tests, treat it as a breaking change to a public contract, not a routine test failure. Update this document to reflect the new behavior before merging.

What the Guarantee Covers

The guarantee applies to the transport sites listed above. It does not mean the upstream server always receives a byte-for-byte copy of the client-submitted JSON body.

Before those passthrough sites run, chat_completions can still perform router-level preprocessing:

  • Global system prompt injection when global_prompts is configured, which modifies the messages array.
  • transform_payload_for_openai() renaming max_tokens to max_completion_tokens for o1, o3, and gpt-5* model IDs.
  • Router-managed web_search injection for eligible self-hosted backends when web_search.enabled applies, which can add a tools entry before the upstream call.

At the transport layer itself, no additional top-level JSON keys are injected and no passthrough fields are filtered or renamed.
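The max_tokens rename above can be sketched as follows. This is an illustrative Python model of the behavior this document describes for transform_payload_for_openai(); the model-id prefix matching is a simplification, not the router's actual check.

```python
def transform_payload_for_openai(payload: dict) -> dict:
    # Router-level preprocessing (sketch): reasoning-model ids take
    # max_completion_tokens instead of max_tokens. Prefix matching here
    # is a simplification of the real model-id check.
    model = payload.get("model", "")
    if model.startswith(("o1", "o3", "gpt-5")) and "max_tokens" in payload:
        payload = dict(payload)  # leave the caller's dict untouched
        payload["max_completion_tokens"] = payload.pop("max_tokens")
    return payload

before = {"model": "o3-mini", "max_tokens": 256}
after = transform_payload_for_openai(before)
assert "max_tokens" not in after and after["max_completion_tokens"] == 256

untouched = transform_payload_for_openai({"model": "llama", "max_tokens": 64})
assert untouched["max_tokens"] == 64  # non-reasoning models are unaffected
```

Note that this rename happens before the transport sites run, which is exactly why the passthrough guarantee is scoped to the transport layer rather than edge-to-upstream byte equivalence.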