Changelog

All notable changes to Continuum Router are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

Added

  • ChatGPT subscription / Codex backend authentication via OAuth device flow (#551, #592)
  • continuum-router auth login --backend <name> runs the OpenAI Codex three-step headless device-code flow: POST /api/accounts/deviceauth/usercode to mint a one-time user_code, polling POST /api/accounts/deviceauth/token until the code is approved, and a PKCE exchange at /oauth/token. The standards-compliant RFC 8628 device flow remains available for any future provider that implements it; the new OpenAICodexDeviceFlowClient is selected automatically for provider: openai.
  • Tokens are wrapped in SecretString, written to the configured token_store with mode 0600 on Unix using an O_CREAT|O_EXCL open + atomic rename; a random tempfile suffix prevents concurrent saves from colliding, and a partial write is unlinked on failure so secret material does not linger on disk (see the first sketch after this list).
  • Access-token expiry is parsed from the JWT exp claim (with a 1-hour fallback for non-JWT tokens) and clamped to a useful minimum so a degenerate expires_in from the provider cannot trigger a refresh storm.
  • Proactive refresh fires 60 s before expiry, single-flighted with a tokio::sync::Mutex. A 401 from the upstream backend triggers exactly one forced refresh and a single retry; the previous refresh token is preserved race-free when the provider omits refresh_token from a refresh response.
  • The strategy reports an identity_fingerprint() (backend name, client_id, token_store) so that hot-reload rebuilds the strategy when any of those rotate, instead of silently keeping the prior in-memory state.
  • The CLI strips C0/C1 control characters from verification_uri_complete and user_code before printing, so a hostile provider response cannot inject ANSI escapes that rewrite the terminal.
  • Every device-flow and runtime request to auth.openai.com / chatgpt.com/backend-api/codex carries originator: codex_cli_rs (configurable via auth.oauth.originator) and a codex_cli_rs/<version> User-Agent (configurable via auth.oauth.user_agent), matching the official Codex CLI so Cloudflare admits the traffic instead of returning a 403 JS challenge.
  • auth.type: oauth is accepted in YAML alongside the legacy o_auth snake_case rendering. client_id and scope default to the public Codex CLI values; only token_store is required for the ChatGPT-subscription case.
  • Anthropic Messages and Chat Completions surfaces both transparently route to the ChatGPT Codex backend (#592)
  • Any backend whose auth.type is oauth and whose provider uses the Codex flow (currently openai) is forced through the Responses API for every request, regardless of per-model responses_only metadata. chatgpt.com/backend-api/codex exposes /responses only — no /chat/completions — so chat-shaped models (e.g. gpt-5.5, alias-mapped claude-haiku-4-5) and unknown model IDs all dispatch through /v1/responses → …/backend-api/codex/responses. Non-OAuth OpenAI backends continue to honor the per-model responses_only flag.
  • New core::url_utils::compose_backend_url centralizes backend URL composition for the three OpenAI-compatible roots (/v1, /openai, /backend-api/codex). Replaces ad-hoc ends_with("/v1") || ends_with("/openai") checks across proxy/backend.rs, http/handlers/responses.rs, http/streaming/handler.rs, services/responses/stream_service.rs, and the Anthropic handler so the /backend-api/codex rule applies uniformly (see the second sketch after this list).
  • The proxy hot path (proxy/backend.rs, proxy/responses_only.rs, proxy/image_gen.rs, proxy/image_edit.rs) now flows through a backend-name-keyed AuthStrategyRegistry exposed on AppState via src/proxy/oauth_helper.rs. The helper looks up the strategy, calls refresh_if_needed() before sending, replaces the static-bearer header with one derived from the strategy, and force-refreshes + retries once on a 401. Static api_key auth continues to work unchanged when no strategy is registered.
  • The Anthropic-compatible handler (src/http/handlers/anthropic/handler.rs) consults the same registry. Client-supplied Authorization: sk-ant-… and x-api-key headers are dropped when the backend has an OAuth strategy, instead of being forwarded to OpenAI as the bearer.
  • Model fetcher detects OAuth-authed backends and falls back to the configured models list rather than probing /v1/models, since chatgpt.com/backend-api/codex does not expose a models endpoint.
  • Codex-compatible Responses API extensions (#536, #537)
  • POST /v1/responses/compact endpoint for context compaction — passthrough to OpenAI / Azure OpenAI native /v1/responses/compact; other backend types return 501.
  • store field on ResponsesRequest (defaults to true) controls upstream session persistence; Codex sends store: false for ephemeral requests.
  • output_text content part type alongside input_text so converters can differentiate assistant vs. user content in input items. All converters (OpenAI, Anthropic, Gemini) handle the new variant.
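
The token_store bullet above describes a standard atomic-save pattern. Below is a minimal Unix-only sketch under assumed names (SecretString unwrapping is elided and the random tempfile suffix is simplified to the process id); it is not the router's actual code:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Illustrative atomic 0600 token save: O_CREAT|O_EXCL via create_new, write
/// to a temp path, fsync, then rename into place. The real suffix is random;
/// the process id here is a stand-in.
fn save_token_atomically(store: &Path, token: &[u8]) -> std::io::Result<()> {
    let tmp = store.with_extension(format!("tmp.{}", std::process::id()));
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // O_CREAT | O_EXCL: never clobber a concurrent save
        .mode(0o600)      // owner read/write only
        .open(&tmp)?;
    if let Err(e) = file.write_all(token).and_then(|_| file.sync_all()) {
        let _ = fs::remove_file(&tmp); // unlink the partial write on failure
        return Err(e);
    }
    fs::rename(&tmp, store) // atomic replace on the same filesystem
}
```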
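The compose_backend_url bullet centralizes one small rule; here is a hedged sketch of that rule, with an illustrative signature that may not match core::url_utils::compose_backend_url exactly:

```rust
/// Sketch only: join an API path onto a backend base URL without duplicating
/// the version segment when the base already ends in a recognized root.
fn compose_backend_url(base: &str, path: &str) -> String {
    let base = base.trim_end_matches('/');
    // The three OpenAI-compatible roots named in the changelog entry.
    let has_root = ["/v1", "/openai", "/backend-api/codex"]
        .iter()
        .any(|root| base.ends_with(root));
    if has_root {
        // Base already carries the API prefix: drop the leading /v1 of the path.
        format!("{}{}", base, path.trim_start_matches("/v1"))
    } else {
        format!("{}{}", base, path)
    }
}

// compose_backend_url("https://chatgpt.com/backend-api/codex", "/v1/responses")
//   -> "https://chatgpt.com/backend-api/codex/responses"
```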

v1.5.6 - 2026-04-29

Fixed

  • /v1/chat/completions returned HTTP 502 responses_parse_failed for responses_only reasoning models (gpt-5.4-pro, gpt-5.5-pro). OpenAI's /v1/responses payload for these models contains output items shaped like { "id": "rs_...", "type": "reasoning", "summary": [] }, but OutputItem::Reasoning required content and status, so serde rejected the payload with missing field 'content'. The Anthropic Messages surface bypassed the strict variant on a different conversion path, masking the bug until directly tested. content and status are now optional on OutputItem::Reasoning (sketched below); reasoning items are dropped before reaching Chat Completions clients (per existing project policy), so body shape is irrelevant beyond successful deserialization. (#594)
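
A minimal sketch of the relaxed variant, assuming serde derive and illustrative field types (the real OutputItem has more variants and richer content types):

```rust
use serde::Deserialize;

/// Sketch: content and status become Option so that payloads shaped like
/// { "id": "rs_...", "type": "reasoning", "summary": [] } deserialize
/// instead of failing with `missing field 'content'`.
#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum OutputItem {
    Reasoning {
        id: String,
        #[serde(default)]
        summary: Vec<serde_json::Value>,
        content: Option<serde_json::Value>, // was required before the fix
        status: Option<String>,             // was required before the fix
    },
    // other variants elided
}
```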

Changed

  • Revert to gemini-3.1-pro-preview as the canonical metadata id for the Gemini 3.1 Pro family in model-metadata.yaml, demoting gemini-3.1-pro (and the existing -latest / -customtools forms) to aliases. This matches what generativelanguage.googleapis.com actually serves today — the previously canonical gemini-3.1-pro form returns 404 from upstream — and avoids implying GA availability that does not exist yet. The metadata cache still resolves both forms to the same entry. Note: alias-to-canonical rewriting on the upstream-bound payload is out of scope for this release; clients calling with the gemini-3.1-pro alias will still hit upstream 404 until that work lands. (#594)
  • Sample config.yaml registers the newly-available pro / 5.5 family models so the responses_only dispatch path can be exercised end-to-end against real upstreams (gpt-5.4-pro, gpt-5.2-pro, gpt-5.5, gpt-5.5-pro, claude-opus-4-7, gemini-3.1-pro, gemini-3.1-pro-preview); duplicate claude-haiku-4-5 entry removed.

v1.5.5 - 2026-04-27

Added

  • Transparent Responses-API routing for OpenAI Pro models (epic #581)
  • New responses_only: true capability flag in model-metadata.yaml and the built-in OpenAI registry marks gpt-5.2-pro, gpt-5.4-pro, and gpt-5.5-pro as served only on /v1/responses upstream (dispatch sketched after this list) (#574, #582)
  • /v1/chat/completions requests for responses_only models are dispatched to the upstream /v1/responses endpoint and translated back into a strict-mode chat.completion (or chat.completion.chunk for streaming) envelope, transparent to the client. Stream usage is gated by stream_options.include_usage, and per-model latency / success counters are recorded for the responses_only path (#578, #584)
  • /anthropic/v1/messages requests for responses_only models are converted to the Responses API shape, dispatched to /v1/responses, and translated back into Anthropic Messages JSON (or the Anthropic SSE event sequence for streaming) — tool-call round-trips, web-search emulation, and Unix-socket transports all branch on the flag (#575, #577, #583, #585, #586)
  • Anthropic Messages <-> Responses request transformer covers system → instructions, tools, tool_choice (including disable_parallel_tool_use → parallel_tool_calls: false), max_tokens → max_output_tokens, reasoning effort derivation, and multi-turn tool round-trips; the response transformer preserves thinking/text/tool_use ordering and stop-reason fidelity (#575, #583)
  • SSE streaming bridge (AnthropicResponsesStreamTranslator) maps Responses API events to Anthropic Messages events while preserving Anthropic's strict event-ordering invariants (single message_start, paired content_block_start/content_block_stop, terminal message_stop); handles mid-stream error / response.failed / response.cancelled, response.incomplete → stop_reason: max_tokens, deferred input tokens, and graceful early-close synthesis (#576, #585)
  • Only OpenAI and Azure OpenAI backends serve /v1/responses; pairing a responses_only model with another backend type produces a 400 invalid_request_error before any upstream call (rejection fires on both /v1/chat/completions and /anthropic/v1/messages surfaces) (#577, #589)
  • The first dispatch per (backend, model) pair logs at info level so operators can confirm Responses-API routing without enabling debug logs
  • Anthropic Messages → Responses requests explicitly send store: false to avoid upstream side-effects (#589)
  • 22 deterministic, in-process integration tests covering the {Anthropic, Chat} × {gpt-5.4-pro, gpt-5.2-pro} × {non-streaming, streaming} × {plain, tool-call, reasoning} matrix, mid-stream backend-failure negatives on both surfaces, and an upstream byte-fragmentation regression guard (#579, #588)
  • Documented in docs/en/configuration/advanced.md (Responses-API-only Models section split into Models-marked-out-of-the-box, Marking-a-new-model, Dispatch-behavior, and Backend-type-constraint subsections), docs/en/architecture.md (Responses-API Routing data-flow diagram), and the docs/en/api.md Chat Completions and Anthropic Messages surface notes with a Transparent-Responses-API-routing subsection (#580, #587)
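
The dispatch rule spread across the bullets above condenses to a small decision; here is a hypothetical sketch with illustrative type names, not the router's actual types:

```rust
/// Sketch of responses_only dispatch: the capability flag forces translation
/// through /v1/responses on OpenAI/Azure backends and a 400 everywhere else.
enum BackendType { OpenAi, AzureOpenAi, Anthropic, Gemini }

struct ModelMeta { responses_only: bool }

enum Dispatch {
    Passthrough,           // normal /v1/chat/completions path
    TranslateViaResponses, // convert, send to /v1/responses, translate back
    Reject400,             // invalid_request_error before any upstream call
}

fn choose_dispatch(meta: &ModelMeta, backend: &BackendType) -> Dispatch {
    if !meta.responses_only {
        return Dispatch::Passthrough;
    }
    match backend {
        // Only OpenAI and Azure OpenAI serve /v1/responses upstream.
        BackendType::OpenAi | BackendType::AzureOpenAi => Dispatch::TranslateViaResponses,
        _ => Dispatch::Reject400,
    }
}
```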

Fixed

  • Chat Completions responses_only routing now rejects configs whose only candidate backends are incompatible before upstream dispatch, and chooses a compatible OpenAI/Azure Responses backend when one is available (#589)
  • Chat assistant tool_calls[] are preserved as Responses function_call input items for stateless tool-result turns over /v1/chat/completions (#589)

v1.5.4 - 2026-04-25

Changed

  • Refresh model-metadata.yaml for late-April 2026 frontier model releases (#572, #573)
  • Add GPT-5.5 ($5/$30 per 1M, 1M context, knowledge cutoff 2025-12, omnimodal, leads Terminal-Bench 2.0 at 82.7%) and GPT-5.5 Pro ($30/$180 per 1M, Responses API only, deep reasoning) — released 2026-04-23
  • Add DeepSeek V4 Pro (1.6T total / 49B active MoE, 1M context, 384K max output, three reasoning effort modes) and DeepSeek V4 Flash (284B total / 13B active MoE, 1M context, 384K max output) with deepseek-chat and deepseek-reasoner retained as deprecated aliases per official API docs — released 2026-04-24
  • Add gpt-image-2 (token-billed instead of per-image: text $5/$30, image $8/$30 per 1M tokens; 1K/2K/4K resolution tiers; ~99% text accuracy in any language; built-in reasoning before generation; context-aware multi-turn editing; gpt-image-2-latest alias) — released 2026-04-21
  • Add Claude Opus 4.7 ($5/$25 per 1M, 1M context, 128K max output, knowledge cutoff 2026-01, high-resolution image support up to 2576px / 3.75MP, new tokenizer with ~1.0–1.35× token usage vs prior models, new xhigh effort level) — released 2026-04-16
  • Promote Gemini 3.1 series from preview to GA, retaining -preview suffix as alias for fallback compatibility (#573)
  • gemini-3.1-pro-preview → gemini-3.1-pro (with gemini-3.1-pro-preview, gemini-3.1-pro-preview-customtools, and gemini-3.1-pro-latest aliases)
  • gemini-3.1-flash-image-preview → gemini-3.1-flash-image (with gemini-3.1-flash-image-preview, nano-banana-2, and gemini-3.1-flash-image-latest aliases)
  • gemini-3.1-flash-lite-preview → gemini-3.1-flash-lite (with gemini-3.1-flash-lite-preview and gemini-3.1-flash-lite-latest aliases)
  • Updated gemini-3-flash-preview deprecation note to point to the new GA gemini-3.1-pro id

v1.5.3 - 2026-04-23

Added

  • HuggingFace repo-prefix stripping as a new matching phase (phase 5) in src/models/pattern_matching.rs (#555)
  • try_strip_hf_repo_prefix() validates a vendor/repo (or org/team/repo) prefix against a MAX_PREFIX_SEGMENTS = 3 bound, rejects empty segments (/repo, vendor/, vendor//repo), and rejects any ASCII whitespace before returning the residual
  • Phase 5 re-enters phases 1-4 on the stripped residual with a structurally-enforced recursion depth of exactly 1 (the re-entry call clears the allow_prefix_strip gate), so prefix stripping composes with the existing layered suffix peel in a single lookup — the motivating case unsloth/Qwen3.6-35B-A3B-GGUF now resolves to qwen3.6-35b-a3b without any hand-registered alias (validation sketched after this list)
  • Phase 5 runs before the wildcard phase; the blast-radius audit confirmed no *-bearing alias in model-metadata.yaml contains /, so the ordering change is behavior-neutral for existing routing
  • Phase numbering in tracing output realigned to match the documented phase chain (previous code emitted phase = 7 for the namespace fallback while comments called it phase 6)
  • 12 new unit tests covering standard HF form, composition with suffix peel, case-sensitive vendor, registered-alias precedence, unresolvable residual, three-segment form, segment-cap rejection, no-slash input, whitespace rejection, empty segments, re-entry bounding, and alias-phase precedence
  • 9 new integration tests in tests/format_suffix_normalization_test.rs exercising the full RouterConfig / BackendConfig public API through phase 5
  • Pipeline doc updated in docs/en/configuration/advanced.md (and Korean counterpart) with a new "HuggingFace repo-prefix stripping (phase 5)" section covering the composition semantics, security bounds, and out-of-scope list (hyphen prefixes, HF API discovery)
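
A hedged sketch of the phase-5 validation described above; the real try_strip_hf_repo_prefix() in src/models/pattern_matching.rs may differ in detail:

```rust
const MAX_PREFIX_SEGMENTS: usize = 3;

/// Sketch: accept vendor/repo or org/team/repo, reject whitespace, empty
/// segments, and anything deeper than the segment cap; return the residual
/// that phases 1-4 then re-match exactly once.
fn try_strip_hf_repo_prefix(id: &str) -> Option<&str> {
    if id.bytes().any(|b| b.is_ascii_whitespace()) {
        return None; // any ASCII whitespace is rejected outright
    }
    let segments: Vec<&str> = id.split('/').collect();
    if segments.len() < 2
        || segments.len() > MAX_PREFIX_SEGMENTS
        || segments.iter().any(|s| s.is_empty())
    {
        return None; // catches /repo, vendor/, vendor//repo, a/b/c/d
    }
    segments.last().copied()
}

// try_strip_hf_repo_prefix("unsloth/Qwen3.6-35B-A3B-GGUF")
//   -> Some("Qwen3.6-35B-A3B-GGUF"), which the suffix peel then reduces
//      to qwen3.6-35b-a3b.
```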

Changed

  • Replaced the previous phase-6 namespace fallback with the new phase-5 HuggingFace prefix-strip layer. The previous phase was case-sensitive and did not compose with suffix peel; the new phase applies stricter input validation (segment cap, empty-segment rejection, whitespace rejection) but composes with phase 4's case-insensitive peel through the bounded re-entry. Pathological inputs above MAX_PREFIX_SEGMENTS (3) — such as provider/deep/nested/model — are now rejected by phase 5 rather than silently matched via recursive rsplit_once fallback (#555)
  • Aliases classified as vendor-prefix in the #560 audit (e.g., Qwen/Qwen3.6-35B-A3B, MiniMaxAI/MiniMax-M2.5) are effectively peel-coverable post-#555: phase 2 still wins on the explicit alias, but phase 5 + phase 4 together reach the same metadata. Retroactive removal is deferred to a follow-up audit per #555 design section 7

Fixed

  • POST /anthropic/v1/messages now works when the selected backend is configured with a unix:// URL (#567)
  • Native Anthropic backends and OpenAI-compatible backends both work over Unix sockets, for both non-streaming and streaming requests
  • Socket paths containing spaces (e.g. macOS ~/Library/Application Support/...) are handled correctly
  • Auth header selection (x-api-key for Anthropic backends, Authorization: Bearer for OpenAI-compatible backends) is correct on the Unix socket path
  • anthropic-version header is added automatically for Anthropic backends on the Unix socket path, matching the HTTP path behavior

v1.5.2 - 2026-04-21

Added

  • Regression tests locking down the transport-layer passthrough contract for llama.cpp and MLxcel backends (#562)
  • New tests/llamacpp_passthrough_test.rs and tests/mlxcel_passthrough_test.rs covering all four passthrough call sites: direct backend execute_chat_completion, factory-built backend (BackendFactory -> LlamaCppBackend), proxy/backend.rs HTTP handler, and the streaming handler
  • New test_mlxcel_factory_backend_passthrough_nonstandard_fields asserts that BackendFactory -> LlamaCppBackend::execute_chat_completion preserves non-standard fields byte-for-byte at transport time
  • Anthropic input test (tests/anthropic_input_test.rs) extended with explicit passthrough coverage
  • docs/en/architecture/backend-passthrough.md and its Korean counterpart docs/ko/architecture/backend-passthrough.md documenting the passthrough contract, the four guarded call sites, and the list of router-side transforms that run before transport (global_prompts, transform_payload_for_openai for o1/o3/gpt-5*, web_search injection) (#562, #563)
  • docs/reports/alias-audit-2026-04.md classifying every alias in model-metadata.yaml into peel-redundant, peel-redundant-but-kept, and peel-independent categories, with an "aliases vs peel" policy section added to docs/en/configuration/advanced.md (and the Korean counterpart) explaining when to prefer each mechanism (#560)

Changed

  • Narrowed the passthrough contract from an implied "byte-equivalent" global guarantee to a transport-layer scope — the router may still run global_prompts injection, o1/o3/gpt-5* payload transforms, and web_search tool injection before transport, but no provider-specific rewriting happens at the transport boundary (#563)
  • Comment-only clarifications in src/http/streaming/handler.rs, src/infrastructure/backends/factory/backend_factory.rs, src/infrastructure/backends/llamacpp/backend.rs, and src/proxy/backend.rs
  • Audited model-metadata.yaml aliases for peel-normalization redundancy: removed aliases that differ from the canonical ID only by suffixes already handled by the layered peel (-4bit, -q4_k_m, -fp8, -gguf, -mlx, -awq, etc.), while preserving aliases that encode canonical flavor variants (-qat, -instruct) or disambiguate parameter counts (#557)
  • New tests/alias_audit_helper.rs and tests/format_suffix_normalization_test.rs enforce the peel-vs-alias boundary going forward

CI

  • Target Ubuntu 26.04 LTS (Resolute) instead of 25.10 (Questing) in the Debian build workflow
  • Fall back to createdAt when release publishedAt is null in debian/update-changelog.sh to prevent changelog regression when the latest release is still in draft

v1.5.1 - 2026-04-20

Added

  • Built-in web_search tool for self-hosted LLM backends (#553)
  • Router-level tool transparently injected into chat completion requests for vLLM, Ollama, llama.cpp, MLxcel, LM Studio, Continuum Router, and Generic backends
  • Pluggable SearchProvider trait under src/services/search/ with SerperProvider implementation; Exa and Brave scaffolded behind the same trait
  • Configurable inject_policy (auto/always/never) with per-backend overrides; commercial backends (OpenAI, Azure, Gemini, Anthropic) left untouched so their native web_search continues to flow through unchanged
  • Bounded non-streaming tool-execution loop parses web_search tool calls, executes the provider, appends tool-role results, and re-invokes the backend up to max_tool_iterations rounds
  • New BackendTypeConfig::is_self_hosted / is_commercial helpers covered by unit tests enforcing the commercial/self-hosted partition invariant
  • API keys redacted in Debug output and never logged; hot-reload friendly WebSearchConfig with ${ENV} substitution
  • Prometheus counters for tool calls, injections, and iteration-cap hits under src/metrics/web_search
  • Layered quantization and format suffix normalization for model metadata lookup (#549)
  • New layered_format_strip() in src/models/pattern_matching.rs iteratively peels allowlisted quantization/format/flavor tokens from the right side of a model ID, retrying exact-id/alias/date-suffix matches after each peel (peel loop sketched after this list)
  • Token categories: BIT_WIDTH, GGUF_QUANT, FP_FORMAT, INT_FORMAT, LIBRARY, IMATRIX, UNSLOTH, CONTAINER, FLAVOR (all case-insensitive)
  • Parameter-count suffixes preserved: -Nbit stripped as quantization; -Nb, -aNb, -eNb, -0.6b kept as parameter counts
  • Canonical base IDs ending in allowlisted flavors (e.g. gemma-3-12b-qat) win via exact-id match before peel runs
  • Normalization pipeline wired into find_matching_config, BackendConfig::get_model_metadata, RouterConfig::get_model_metadata, RouterConfig::get_thinking_pattern_config, resolve_model_tier (routing), and get_model_profile (admin)
  • Model metadata for GLM 5.1, Qwen 3.6, and MiniMax M2.7 (#548)
  • Teams release notification posted to Microsoft Teams via Power Automate webhook after build and Docker jobs
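
The peel loop described in the layered-normalization bullets reduces to a bounded iteration; a sketch with an abbreviated token table and illustrative names (the two caps come from the Fixed section below):

```rust
const MAX_MODEL_ID_LEN: usize = 256;
const MAX_PEEL_ITERATIONS: usize = 8;

// Abbreviated allowlist; the real table spans BIT_WIDTH, GGUF_QUANT, FP_FORMAT,
// INT_FORMAT, LIBRARY, IMATRIX, UNSLOTH, CONTAINER, and FLAVOR categories.
const PEELABLE: &[&str] = &["-4bit", "-8bit", "-q4_k_m", "-fp8", "-gguf", "-mlx", "-awq"];

/// Sketch of layered_format_strip(): retry the lookup after each peel, bail
/// out on oversized input, and never loop more than MAX_PEEL_ITERATIONS.
fn layered_format_strip(id: &str, lookup: impl Fn(&str) -> bool) -> Option<String> {
    if id.len() > MAX_MODEL_ID_LEN {
        return None; // DoS guard against pathological -4bit-4bit-... inputs
    }
    let mut current = id.to_ascii_lowercase();
    for _ in 0..MAX_PEEL_ITERATIONS {
        if lookup(&current) {
            return Some(current); // exact-id/alias/date-suffix match wins first
        }
        match PEELABLE.iter().copied().find(|tok| current.ends_with(*tok)) {
            Some(tok) => current.truncate(current.len() - tok.len()),
            None => break, // nothing left to peel
        }
    }
    None
}
```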

Changed

  • Migrate documentation toolchain from MkDocs + Material for MkDocs to Zensical — reads mkdocs.yml natively and bundles required extensions

Fixed

  • Security: Cap layered peel phase with MAX_MODEL_ID_LEN=256 and MAX_PEEL_ITERATIONS=8 to eliminate DoS via pathological model IDs (previously O(n²) allocation on inputs like -4bit-4bit-4bit-...)
  • Security: Enforce 256-char model field length at /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/embeddings/sparse (parity with existing /v1/responses check)
  • Consolidate 7-phase metadata matching pipeline into a single implementation (find_matching_config_slice) with thin adapters at each call site, eliminating drift between BackendConfig, Config::get_model_metadata, Config::get_thinking_pattern_config, and find_matching_config
  • Replace cfg.to_ascii_lowercase() == peel with str::eq_ignore_ascii_case on the hot path (~4000 fewer per-request String allocations)
  • Pin Pygments <2.20 to fix MkDocs build failure (superseded by Zensical migration)

CI

  • Bump softprops/action-gh-release from 2 to 3 (#544)
  • Bump actions/github-script from 8 to 9 (#545)
  • Bump actions/upload-pages-artifact from 4 to 5 (#554)

Documentation

  • Document suffix-order ambiguity (-qat-4bit vs -4bit-qat) and internal peel phase bounds in docs/en/configuration/advanced.md
  • Add pattern_matching.rs to Model Aggregation Service module listing in docs/en/architecture.md with cross-reference to suffix normalization section
  • New docs/en/web-search.md feature documentation; config.yaml.example extended with web_search section

v1.5.0 - 2026-04-11

Added

  • Smart routing system with model tier & capability profile registry (#525, #531)
  • Rule-based request classifier & smart routing policy engine (#526, #532)
  • Load-aware dynamic tier adjustment (#527, #533)
  • LLM-based request classifier with hybrid mode (#528, #534)
  • Smart routing observability, admin API & documentation (#529, #535)
  • Codex-compatible Responses API extensions (#536, #537)

Changed

  • Upgrade core dependencies — axum 0.8, sha2 0.11, rand 0.10 (#523)
  • Add Gemma 4 model family metadata (#538)

Fixed

  • Complete smart routing integration gaps
  • Increase DefaultTransformer PDF size limit from 20MB to 32MB (#542)

CI

  • Bump actions/deploy-pages from 4 to 5 (#521)

Dependencies

  • Bump the minor-and-patch dependency group with 4 updates (#539)

Documentation

  • Add Codex-compatible Responses API gap analysis report

v1.4.5 - 2026-03-27

Fixed

  • Return 400 error when file references are used without file service configured (#519)

Changed

  • Add GLM-5-Turbo model metadata (#516)

Documentation

  • Fix Korean anti-AI-slop violations in ko/ documentation
  • Fix slop word and transition word in api.md

v1.4.4 - 2026-03-18

Fixed

  • Fix Anthropic thinking failing for high/xhigh reasoning effort — budget_tokens (32768) exceeded default max_tokens (16384), causing API rejection (#514)
  • Auto-adjust max_tokens to budget_tokens + 4096 when thinking is enabled and the budget exceeds max_tokens (sketched below)
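
A minimal sketch of the adjustment, with hypothetical parameter names (the real code operates on the request struct):

```rust
/// Sketch: when thinking is on and budget_tokens exceeds max_tokens, raise
/// max_tokens so the Anthropic API no longer rejects the request.
fn adjust_max_tokens(thinking_enabled: bool, budget_tokens: u32, max_tokens: u32) -> u32 {
    const HEADROOM: u32 = 4096; // room for the visible answer after thinking
    if thinking_enabled && budget_tokens > max_tokens {
        budget_tokens + HEADROOM // e.g. 32768 + 4096 instead of the 16384 default
    } else {
        max_tokens
    }
}
```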

Changed

  • Add GPT-5.4 model family: gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano with 1M context window (#515)
  • Update Gemini 3 series: add gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview; mark gemini-3-pro-preview as deprecated
  • Recognize Gemini 3 Flash and 3.1 Flash-Lite as thinking models for include_thoughts auto-injection
  • Update Claude 4.6 models: context window to 1M (GA), fix Sonnet 4.6 max_output to 64K, correct knowledge cutoffs
  • Update config examples and documentation with latest model names across 8 files

v1.4.3 - 2026-03-18

Fixed

  • Fix Gemini thinking models (2.5 Pro, 3 Pro, etc.) not returning reasoning_content in streaming responses through the router (#513)
  • Replaced transform_payload_for_gemini() with transform_request_gemini() across all three Gemini streaming paths to ensure include_thoughts: true auto-injection

v1.4.2 - 2026-03-17

Changed

  • Change mid-stream fallback default to enabled for improved streaming reliability (#504)
  • Breaking: Mid-stream fallback is now enabled by default; set mid_stream_fallback.enabled: false to restore previous behavior

Documentation

  • Add failover latency tuning guide for optimizing fallback behavior

v1.4.1 - 2026-03-17

Added

  • Mid-stream fallback for streaming inference (#497) — when a backend fails mid-stream during SSE streaming, the router transparently retries with a fallback backend

Changed

  • Decouple pre-stream fallback from mid-stream fallback (#500) — each can now be independently enabled/disabled
  • Bump dependency versions to latest major releases

Fixed

  • Fix streaming config changes not detected in hot reload system (#503)
  • Fix mid-stream connection errors leaking to client during fallback (#502)
  • Remove unused config crate dependency

CI

  • Bump dorny/paths-filter from 3 to 4 (#493)
  • Bump actions/create-github-app-token from 2 to 3 (#494)

v1.4.0 - 2026-03-14

Added

  • Prefix-aware routing: PrefixAwareHash selection strategy with Consistent Hash with Bounded Loads (CHWBL) (#455, #457, #461)
  • Response caching: SHA256-based cache key computation with streaming response buffering and post-completion caching (key derivation sketched after this list) (#456, #459, #462)
  • Multi-tier CacheStore: in-memory backend (#466), Redis/Valkey backend with connection pooling (#467), and S3-backed tiered L1/L2 cache (#483)
  • KV cache index: shared data structure (#470), KV event consumer for vLLM backend streams (#471), prefix overlap scoring integrated into backend selection (#473), configuration/metrics/admin endpoints (#474)
  • Tiered KV cache with storage-tier awareness (GPU hot / external warm) (#484)
  • Disaggregated prefill/decode orchestration with external KV tensor transfer (#485)
  • Anthropic cache_control breakpoint auto-injection (#460)
  • Multimodal embedding support for Gemini Embedding 2 (#492)
  • Shared cache configuration and operational metrics (#468)
  • 30 new models added to model-metadata.yaml (#472)
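
A hedged sketch of the SHA256 cache-key derivation (the real key likely covers more request fields than the two assumed here):

```rust
use sha2::{Digest, Sha256};

/// Sketch: hash the request parts that determine the completion into a stable
/// hex key. Model id and serialized messages are assumed sufficient here.
fn cache_key(model: &str, messages_json: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(model.as_bytes());
    hasher.update(b"\0"); // delimiter so ("ab", "c") != ("a", "bc")
    hasher.update(messages_json.as_bytes());
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}
```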

Changed

  • Rename VAST-specific identifiers to generic S3/external storage names (#490) — update configuration files if using VAST-specific field names

Fixed

  • Make RequestExecutor transport-aware for Unix socket paths with spaces (#488)
  • Replace relative source tree links with GitHub URLs in docs

CI

  • Bump docker/setup-qemu-action from 3 to 4 (#428)
  • Bump docker/metadata-action from 5 to 6 (#426)
  • Bump docker/setup-buildx-action from 3 to 4 (#429)
  • Bump docker/build-push-action from 6 to 7 (#430)
  • Bump docker/login-action from 3 to 4 (#427)

Documentation

  • Comprehensive KV cache feature documentation, benchmarks, and config examples (#477)
  • VAST Data connection guide and integration examples (#486)
  • Sync Korean documentation with English counterparts
  • Split monolithic configuration.md into 6 smaller files

v1.3.0 - 2026-03-12

Added

  • Agent Communication Protocol (ACP) support with JSON-RPC 2.0 protocol layer and stdio transport (#414, #420)
  • ACP session management with protocol lifecycle, initialize/shutdown handshake (#415, #421)
  • ACP-to-LLM inference pipeline with streaming support (#416, #422)
  • ACP tool call reporting and permission delegation (#417, #423)
  • MCP-over-ACP bridge for MCP server tunneling (#418, #424)
  • ACP agent registry with metadata and configuration support (#419, #425)
  • ACP integration tests for protocol lifecycle and session management

Fixed

  • Resolve clippy field_reassign_with_default warnings in ACP integration tests

CI

  • Bump actions/upload-artifact from 6 to 7 (#398)

Documentation

  • ACP architecture documentation with MkDocs integration
  • ACP practical usage guide with IDE integration examples
  • KV cache integration plan for router-level caching strategies

v1.2.1 - 2026-03-07

Added

  • MLxcel backend type support for MLX-based model serving (#412, #413) — fully API-compatible with llama-server, reusing the same backend implementation for health checks, model discovery, and proxying

v1.2.0 - 2026-03-06

Added

  • Admin Statistics API with comprehensive request-level statistics collection and reporting (#409)
  • Endpoints: GET /admin/stats, GET /admin/stats/models, GET /admin/stats/backends, POST /admin/stats/reset
  • Time-windowed queries, token usage tracking, latency percentiles (p50, p95, p99)
  • Statistics persistence with configurable snapshot path, interval, and staleness checks (#410, #411)
  • Atomic writes, restore on startup, final snapshot on graceful shutdown

Documentation

  • Add admin stats and persistence to configuration guide
  • Add post-refactoring benchmark report for v1.1.0 (#407)

v1.1.1 - 2026-03-04

Added

  • Embeddable library crate (Phase 1) — use continuum-router as a Rust dependency (#394)
  • Type-safe config builders for programmatic library usage (#400)
  • Cargo feature flags for optional library dependencies (#399)
  • Persistent storage for runtime API keys (#405)
  • New LLM model metadata entries (#403)

Fixed

  • Fix Gemini-specific transforms incorrectly applied in Anthropic handler (#404)

v1.1.0 - 2026-03-01

Added

  • Embedded WebUI for configuration management and API key administration (#388)
  • Windows AF_UNIX socket support via socket2 crate (#390)
  • Nano Banana 2 (Gemini Image Generation) support

Fixed

  • Resolve compilation error in ClientAddr::is_unix for tuple variant matching
  • Resolve Windows AF_UNIX socket accept failure and config validation
  • Accept Windows absolute paths in Unix socket config validation (#393)
  • Resolve Windows compilation errors in Unix socket tests and transport parsing (#392)

v1.0.0 - 2026-02-19

Added

  • Continuum Router federation — router-to-router chaining as a new backend type (#385)
  • LM Studio as a dedicated backend type (#381)
  • Anthropic adaptive thinking effort parameter (output_config.effort) (#384)
  • Adaptive thinking and auto reasoning effort level across backends (#378)
  • Cohere/Jina-compatible rerank and sparse embedding endpoints (#374)
  • BGE-M3 and multilingual embedding model support (#373)
  • Claude Opus 4.6 model metadata
  • Qwen3-Coder-Next, Qwen3-VL-30B/8B model metadata

Changed

  • Handle SIGTERM for graceful shutdown on Unix systems (#370)
  • Reduce per-backend filter and model metadata log verbosity during model refresh (#371, #375)

CI

  • Replace Ubuntu 24.10 with 25.10 in deb build matrix (#376)

v0.36.1 - 2026-01-30

Fixed

  • Trigger immediate health check after sync_backends during hot reload (#368) — new backends now available within 1-2 seconds instead of up to 30 seconds
  • Sync health-check info and use URL-based updates during hot reload (#369) — new backends properly receive API key authentication
  • Accelerate health checks for recently added backends — 1-second check interval for 5 minutes after addition
  • Trigger model cache refresh when backends transition to healthy state with 5-second debounce

v0.36.0 - 2026-01-27

Added

  • Native Anthropic Messages API handler with endpoint routing (#355)
  • Anthropic to OpenAI request/response transformation (#356, #357)
  • Anthropic streaming response format (#358)
  • Direct Anthropic to Gemini request/response transformation (#359)
  • File_id source type and file resolution for Anthropic input (#360)
  • Claude Code compatibility for Anthropic handler (#365)
  • Tiered token counting for all backend types
  • Parallel file reference resolution for improved performance
  • Anthropic-version header format validation

Fixed

  • Require HTTPS for image and document URLs to prevent SSRF
  • Return generic error messages to clients instead of backend details
  • Use authenticated user_id from API key for file ownership checks
  • Use UUID v4 for secure message/tool ID generation
  • Place tool messages before user text in Anthropic-to-OpenAI conversion
  • Override stop_reason to tool_use when tool_use blocks are present
  • Apply max_completion_tokens conversion for OpenAI-routed Anthropic requests
  • Propagate file access denied and not found errors to client
  • Call current_config() once per request for consistent behavior

Refactored

  • Extract common SSE event type and data extraction logic
  • Add parse_bytes method to SseParser for proper UTF-8 handling
  • Remove unnecessary Arc wrapper in AnthropicFileResolver
  • Box FileResolutionResult::Resolved to reduce enum size

v0.35.0 - 2026-01-23

Added

  • Gemini 3 thoughtSignature support in function calling (#354)
  • PDF support for OpenAI and Anthropic file transformers (#340)
  • Text/plain support for AnthropicFileTransformer (#342)

Fixed

  • Add PDF support to DefaultTransformer and file resolution (#343)
  • Add tool message transformation to non-streaming Anthropic requests (#344)
  • Reject non-image files in DefaultTransformer with clear error message (#338)
  • Fix AI SDK incompatibility with Responses API streaming format (#335)

v0.34.0 - 2026-01-16

Added

  • Automatic quality parameter conversion between DALL-E and GPT Image models (#330)

Changed

  • Native Anthropic conversion for Responses API PDF file uploads (#332)

Fixed

  • Gemini streaming tool_calls compatibility fixes (#333) — missing index field, tool_choice format preservation, unnecessary transformation removal

v0.33.0 - 2026-01-13

Added

  • /v1/embeddings endpoint for embedding API support (#319)
  • Resolve local file_id references in Responses API requests (#326)
  • user_data and evals purpose values for Files API (#322)

Fixed

  • Use flat tool format for Responses API function tools (#324)
  • Improve Unix socket test stability for parallel execution (#328)

v0.32.0 - 2026-01-09

Added

  • Reasoning effort documentation and improved xhigh fallback logging (#317)

Fixed

  • Support implicit message type inference in Responses API InputItem (#316)

Refactored

  • Optimize InputItem deserializer and add invalid role test

v0.31.5 - 2026-01-09

Added

  • Responses API pass-through support for native OpenAI backends (#313) — smart routing based on backend type with direct forwarding to /v1/responses endpoint
  • OpenAI Responses API file input types (#311) — support for input_text, input_file, input_image content parts with SSRF validation

Fixed

  • Forward raw backend error responses in pass-through mode
  • Address security and performance issues in Responses API pass-through

v0.31.4 - 2026-01-07

Fixed

  • Use current_config() for hot reload support in proxy handlers (#310) — API key and configuration changes via hot reload now properly apply to new requests

v0.31.3 - 2026-01-06

Fixed

  • Add Anthropic transformations to Unix socket transport (#308) — Unix socket transport now applies the same request/response transformations as HTTP transport
  • Preserve stream parameter for non-streaming Anthropic requests (#306)

v0.31.2 - 2026-01-05

Added

  • Non-streaming support for Anthropic backend requests
  • Tool call and tool result transformation for Anthropic backend — enables multi-turn tool use conversations

v0.31.1 - 2026-01-04

Fixed

  • Non-streaming Anthropic requests failing with wrong authentication header (#301) — now correctly uses x-api-key header instead of Authorization: Bearer

v0.31.0 - 2026-01-04

Added

  • Unix socket server binding alongside TCP (#298) — supports unix: URI scheme, socket_mode configuration, auto-cleanup
  • Reasoning parameter support for Responses API (#296) with nested format and low/medium/high/xhigh effort levels
  • xhigh reasoning effort support for GPT-5.2 thinking models with auto-downgrade for unsupported models
  • Configurable health check endpoints per backend type (#293) — custom endpoint, fallback endpoints, method, body, accept_status, and headers

Changed

  • Comprehensive reasoning parameter normalization across backends (#294)

v0.30.0 - 2026-01-01

Added

  • Wildcard patterns and date suffix handling in model aliases (#286) — automatic date suffix normalization, * pattern matching (prefix, suffix, infix), zero-config date handling; a minimal matcher is sketched below
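
A minimal single-* matcher illustrating the prefix/suffix/infix cases (the router's actual matcher also performs date-suffix normalization):

```rust
/// Sketch: one `*` splits the pattern into a required prefix and suffix;
/// "gpt-4*", "*-preview", and "gpt-*-turbo" all reduce to the same check.
fn wildcard_matches(pattern: &str, id: &str) -> bool {
    match pattern.split_once('*') {
        Some((prefix, suffix)) => {
            id.len() >= prefix.len() + suffix.len() // halves must not overlap
                && id.starts_with(prefix)
                && id.ends_with(suffix)
        }
        None => pattern == id, // no wildcard: exact match
    }
}
```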

Fixed

  • Apply default URL for Anthropic backend when not specified (#288)
  • Replace owned_by placeholders with backend-type-specific values (#287)

Documentation

  • Translate wildcard pattern and date suffix handling documentation to Korean (#289)

v0.29.0 - 2026-01-01

Added

  • Accelerated health checks during backend warmup (#282) — 1s interval on HTTP 503, configurable via warmup_check_interval and max_warmup_duration
  • --model-metadata CLI option for specifying model metadata file path at runtime (#281)

Fixed

  • Replace OpenAI owned_by placeholder with 'openai' (#280)
  • Prevent race condition in Admin API concurrent backend creation (#278)
  • Fix missing processing steps in hot reload (#277)
  • Cloud backends now show available: true in /v1/models/{model_id} (#272)

v0.28.0 - 2025-12-31

Added

  • SSE streaming support for tool calls (#258)
  • llama.cpp tool calling auto-detection via /props endpoint (#263)
  • Extended /v1/models/{model_id} endpoint with rich metadata fields (#262)
  • Tool result message transformation for multi-turn conversations (#265)
  • Backend-specific owned_by placeholders for llamacpp, vllm, ollama, http (#267)

Changed

  • Improved --help output formatting with title header and project attribution (#269)

Fixed

  • Sync model metadata cache with ConfigManager (#270)

v0.27.0 - 2025-12-29

Added

  • Complete Unix socket support for model discovery and SSE streaming (#248, #252, #253, #254, #256)
  • SSE/streaming for Unix socket backends
  • Backend type auto-detection for Unix sockets
  • vLLM and llama.cpp model discovery via Unix sockets
  • Tool call transformation across all backends (#244, #245, #246) — tool definitions, tool_choice, and tool call responses for Anthropic, Gemini, and llama.cpp

v0.26.0 - 2025-12-27

Added

  • GET /v1/models/{model} endpoint for single model retrieval with real-time availability status (#236)

v0.25.0 - 2025-12-26

Added

  • CORS (Cross-Origin Resource Sharing) support (#234) — configurable origins, wildcard patterns, custom schemes (e.g., tauri://localhost), preflight cache
  • Unix Domain Socket backend support (#232) — unix:///path/to/socket scheme, lower latency than localhost TCP

v0.24.0 - 2025-12-26

Added

  • llama.cpp backend support for local LLM inference (#230)
  • Allow router to start without any backends configured (#226)

Changed

  • Enable hot reload for backend additions/removals from config (#229)

v0.23.1 - 2025-12-25

CI

  • Add Windows x86_64 build target to release workflow (#224)

v0.23.0 - 2025-12-23

Added

  • GLM 4.7 model support with thinking capabilities (#222)
  • GCP Service Account authentication support for Gemini (#208)
  • Distributed tracing with correlation ID propagation (#207) — W3C Trace Context with traceparent header
  • Thinking pattern metadata for models with implicit start tags (#218)
  • Model metadata for NVIDIA Nemotron 3 Nano, Qwen Image Layered, and Kakao Kanana-2 (#202)
  • ASCII diagram to image replacement system for MkDocs (#200)

Fixed

  • Prevent cache stampede with singleflight, stale-while-revalidate, and background refresh (#220)
  • Apply global_prompts changes via hot reload (#219)
  • Invalidate model cache when backend config changes (#206)

CI

  • Skip Rust tests in CI when only non-code files change (#204)
  • Bump actions/github-script from 7 to 8 (#210)
  • Bump apple-actions/import-codesign-certs from 3 to 6 (#212)
  • Bump actions/cache from 4 to 5 (#211)
  • Bump actions/checkout from 4 to 6 (#209)

v0.22.0 - 2025-12-19

Added

  • Docker support with pre-built binary images — Debian (~50MB) and Alpine (~10MB) with multi-arch support (#198)
  • Container health check CLI (--health-check) for orchestration (#198)
  • Docker Compose quick start configuration
  • Automated Docker image publishing to ghcr.io in release workflow
  • MkDocs documentation website with Material theme (#183)
  • Korean documentation translation (i18n) — complete localization of all 20 documentation files (#190)
  • Security policy with vulnerability reporting process (#191)
  • Dependency security auditing with cargo-deny and Dependabot (#192)

Changed

  • Integrate orphaned architecture documentation into MkDocs site (#186)
  • Rename documentation files to lowercase kebab-case for URL-friendly filenames

Fixed

  • Fix health check response validation logic bug (operator precedence)
  • Fix address parsing fallback silently hiding configuration errors
  • Fix IPv6 address formatting in health check

v0.21.0 - 2025-12-19

Added

  • Gemini 3 Flash Preview model support (#168)
  • Default authentication mode for API endpoints (#173) — permissive (default) or blocking mode
  • Backend error passthrough for 4xx responses (#177) — parse and forward original error messages from OpenAI, Anthropic, and Gemini

Fixed

  • Handle UTF-8 multi-byte character corruption in streaming responses (#179)
  • Strip response_format parameter for GPT Image models (#176)
  • Allow auto-discovery for all backends except Anthropic (#172)
  • Always return b64_json field for Gemini image generation responses (#181)

v0.20.0 - 2025-12-18

Added

  • Image variations support for Gemini (nano-banana) models (#165)
  • Image edit support for Gemini (nano-banana) models (#164)
  • Enhanced /v1/images/generations with streaming and GPT Image features (#161)
  • gpt-image-1.5 model support (#159)
  • /v1/images/variations endpoint (#155)
  • /v1/images/edits endpoint for image editing and inpainting (#156)
  • External Markdown file support for system prompts with REST API management (#146)
  • Automatic model discovery for backends without explicit model list (#142)
  • Solar Open 100B model

Security

  • API key redaction to prevent credential exposure in logs and error messages (#150)

Changed

  • Optimized release binary size from 20MB to 6MB (70% reduction) (#144)

Refactored

  • Split large files to keep each under 500 lines (#147, #148)

v0.19.0 - 2025-12-13

Added

  • Runtime Configuration Management API (#139)
  • Configuration query, modification, save/restore, and backend management APIs
  • Sensitive information masking, JSON Schema generation, configuration history with rollback (up to 50 entries)
  • Comprehensive Admin REST API reference documentation
  • 33 integration tests for configuration API endpoints

Security

  • Input validation with 1MB content limit and 32-level nesting depth
  • Audit logging for sensitive data exports with 30+ sensitive field patterns

v0.18.0 - 2025-12-13

Added

  • Per-API-key rate limiting (#137)
  • API key management and configuration system
  • Files API authentication and authorization (#131)
  • Hot reload for runtime configuration updates (#130)

Fixed

  • Add ConnectInfo extension for admin/metrics/files endpoints
  • Address security vulnerabilities in API key management

Refactored

  • Extract CLI and app utilities into modular structure (#132)
  • Split converter.rs into modular structure (#132)
  • Split large source files into modular components

v0.17.0 - 2025-12-12

Added

  • Anthropic backend file content transformation (#126)
  • Gemini backend file content transformation (#127)

Fixed

  • Streaming file uploads to prevent memory exhaustion (#128)

v0.16.0 - 2025-12-12

Added

  • OpenAI-compatible Files API endpoints (#111)
  • File resolution middleware for chat completions (#120)
  • OpenAI backend file handling strategy (#121, #122)
  • Persistent metadata storage for Files API (#125)
  • GPT-5.2 model support (#124)
  • Circuit breaker pattern for automatic backend failover
  • Admin endpoint authentication and audit logging
  • Configurable fallback models for unavailable model scenarios with cross-provider support

Fixed

  • Sanitize fallback error headers and metric labels
  • Use index-based lookup for fallback chain traversal
  • Reduce lock contention in FallbackService with snapshot pattern

v0.15.0 - 2025-12-05

Added

  • Nano Banana (Gemini Image Generation) API support (#102)
  • Split /v1/models endpoint — standard lightweight vs extended metadata response (#101)

Changed

  • Optimize LRU cache to use read lock for cache lookups (#105)

Fixed

  • Replace .expect() panics with proper error propagation in HttpClientFactory (#104)

Refactored

  • Extract streaming handler logic to dedicated StreamService (#106)
  • Eliminate retry logic code duplication in proxy.rs (#103)

v0.14.2 - 2025-12-05

Added

  • Log token usage (input/output tokens) on request completion (#92)

v0.14.1 - 2025-12-05

Fixed

  • Optimize Anthropic backend TTFT with connection pooling and HTTP/2 (#90)
  • Optimize Gemini backend TTFT with connection pooling and HTTP/2 (#88)
  • Apply base name fallback matching to aliases in model metadata lookup (#84)

v0.14.0 - 2025-12-04

Added

  • Router-wide global system prompt injection (#82)

CI

  • Replace deprecated actions-rs/toolchain with dtolnay/rust-toolchain
  • Add RUSTFLAGS for macOS ARM64 ring build
  • Switch to rustls-tls for musl cross-compilation support

v0.13.0 - 2025-12-04

Added

  • OpenAI /v1/responses API support with session management (#49)
  • True SSE streaming for /v1/responses API
  • Background cleanup task for expired sessions
  • Override /v1/models response fields via model-metadata.yaml (#75)

Security

  • SecretString for API key storage across all backends (#76)
  • Session access control and input validation for Responses API

Changed

  • Immediate mode for SseParser for reduced first-response latency

Refactored

  • String allocation optimizations and error handling standardization

v0.12.0 - 2025-12-04

Fixed

  • Handle exact hash matches in consistent hash binary search (sketched after this list) (#72)
  • Replace panics with Option returns and implement stats aggregation (#71)
  • Remove hardcoded auth requirement from /v1/models endpoint
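
A sketch of the exact-match handling named in the fix above, on a ring stored as a sorted Vec of (hash, backend index) pairs:

```rust
/// Sketch: binary_search returns Ok(i) on an exact hash hit and Err(i) for
/// the insertion point; both must map to a valid slot, wrapping past the end.
fn ring_lookup(ring: &[(u64, usize)], key_hash: u64) -> Option<usize> {
    if ring.is_empty() {
        return None;
    }
    let idx = match ring.binary_search_by_key(&key_hash, |&(h, _)| h) {
        Ok(i) => i,                     // exact match: use that node, don't skip it
        Err(i) if i == ring.len() => 0, // past the last point: wrap to the start
        Err(i) => i,
    };
    Some(ring[idx].1) // index of the backend owning this ring position
}
```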

Refactored

  • Reorganize OpenAI model metadata by family (#74)
  • Extract AnthropicStreamTransformer to dedicated module (#73)
  • Split backends mod.rs into separate modules (#69)
  • Extract embedded tests to separate files (#68)
  • Create HttpClientFactory for centralized HTTP client creation (#67)
  • Create UrlValidator module with SSRF prevention (#66)
  • Extract RequestExecutor to shared common module (#65)
  • Extract HeaderBuilder with auth strategies (#64)
  • Extract AtomicStatistics to shared common module

v0.11.0 - 2025-12-03

Added

  • Native Anthropic Claude API backend with extended thinking support
  • OpenAI to Claude reasoning parameter conversion
  • Flat reasoning_effort parameter for Anthropic
  • Claude 4, 4.1, 4.5 model metadata

Fixed

  • Improve health check and model fetching for Anthropic/Gemini backends
  • Accept-Encoding fixes for streaming — use identity header and disable compression

v0.10.0 - 2025-12-03

Added

  • Native Google Gemini API backend support
  • OpenAI Images API support for image generation
  • Authenticated health checks for OpenAI and API-key backends
  • Built-in OpenAI model metadata for /v1/models response
  • API key authentication for streaming requests
  • Configurable image generation timeout
  • Response_format validation for image generation API

Fixed

  • Convert max_tokens to max_completion_tokens for newer OpenAI models
  • Correct URL construction for all API endpoints
  • Request body size limits to prevent DoS attacks

Security

  • Remove sensitive data from debug logs

Refactored

  • Unify request retry logic with RequestType enum

v0.9.0 - 2025-12-02

Added

  • Enhanced rate limiting with token bucket algorithm (sketched after this list)
  • Comprehensive Prometheus metrics and monitoring (#10)
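
A textbook token bucket of the kind referenced above; refill is computed lazily from elapsed time at each check, so there is no background refill task to race with (the production limiter additionally keys buckets per API key and bounds their growth):

```rust
use std::time::Instant;

/// Sketch of a per-key token bucket. Callers would wrap it in a Mutex or
/// shard it; this illustrates only the refill-and-consume arithmetic.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Returns true if the request may proceed, consuming one token.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```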

Security

  • Prevent IP spoofing via X-Forwarded-For manipulation
  • Prevent header injection vulnerabilities
  • Eliminate race condition in token refill
  • Protect API keys with SHA-256 hashing
  • Prevent memory exhaustion via unbounded bucket growth
  • Comprehensive authentication for metrics endpoint
  • Cardinality limits and label sanitization to prevent metric explosion DoS

Fixed

  • Implement singleton pattern for metrics to prevent memory leaks
  • Improve error handling to prevent panic conditions
  • Resolve environment variable race condition in config test
  • Fix integration test failures in metrics

v0.8.0 - 2025-09-09

Added

  • Model ID alias support for metadata sharing (#27)

Fixed

  • Return empty list instead of 503 when all backends are unhealthy (#28)

v0.7.1 - 2025-09-08

Fixed

  • Improve config path validation for home directory and executable paths (#26)

v0.7.0 - 2025-09-07

Added

  • Rich metadata support for /v1/models endpoint (#23, #25)
  • Enhanced configuration management (#9, #22)
  • Advanced load balancing strategies (Weighted, Least-Latency, Consistent-Hash) with enhanced error handling (#21)

Fixed

  • Use streaming timeout configuration from config.yaml instead of hardcoded 25s limit

v0.6.0 - 2025-09-03

Fixed

  • Use timeout configuration from config.yaml instead of hardcoded values (#19)

Documentation

  • Comprehensive timeout configuration and model documentation updates

v0.5.0 - 2025-09-02

Added

  • Optional retry configuration with sensible defaults
  • Comprehensive integration tests and performance optimizations
  • Complete service layer implementation
  • Middleware architecture and enhanced backend abstraction

Fixed

  • Handle streaming requests without model field gracefully
  • Resolve floating-point precision and timing issues in tests
  • Resolve test failures and deadlocks in object pool and SSE parser
  • Resolve initial health check race condition

Refactored

  • Split oversized modules into layered architecture
  • Extract complex types into type aliases for better readability

v0.4.0 - 2025-08-25

Added

  • Model-based routing with health monitoring

Fixed

  • Improve health check integration and SSE parsing

v0.3.0 - 2025-08-25

Added

  • SSE streaming support for real-time chat completions (#5)
  • Model aggregation from multiple endpoints (#4)

v0.2.0 - 2025-08-25

Added

  • Multiple backends support with round-robin load balancing (#1)

v0.1.0 - 2025-08-24

Added

  • Initial release with OpenAI-compatible endpoints and proxy functionality