Changelog

All notable changes to Continuum Router are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

v1.10.2 - 2026-06-15

Added

  • Kimi K2.7-Code and GLM-5.2 model metadata (#775). Kimi K2.7-Code is a 1T-parameter MoE (32B active) with a MoonViT vision encoder and 256K context that runs only in thinking mode and returns reasoning in the native reasoning_content field, so it carries no <think> marker config. GLM-5.2 is the GLM-5 line coding flagship with a 1M context window, 131K max output, and two thinking-effort levels (High and Max); it carries the standard <think>/</think> marker config like the rest of the GLM family. GLM-5.2 standalone API pricing was not published at launch, so its input/output rates are estimated from the GLM-5/GLM-5.1 tier.

Fixed

  • Normalize the vLLM/OpenAI-compatible reasoning field to the canonical reasoning_content on /v1/chat/completions (#776, closes #774). Newer vLLM renamed its reasoning output field from reasoning_content to reasoning (streaming delta.reasoning, non-streaming message.reasoning), and the router relayed it unchanged, so clients reading reasoning_content silently dropped all reasoning text from self-hosted vLLM reasoning models. The rename now runs on every OpenAI-compatible relay path (the streaming default and thinking transformers, the unix-socket relay, the mid-stream fallback relay, and the non-streaming proxy body). It applies only when reasoning_content is absent, so an upstream already using the canonical name is never overwritten, and it is scoped away from the Gemini and Anthropic handlers and the Responses API.
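
As a rough illustration of the rename rule in the entry above, here is a minimal sketch. It assumes a flat string map as a stand-in for the JSON message or delta object; the Fields alias and the normalize_reasoning_field name are hypothetical and not taken from the router's source.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON `message` or `delta` object.
type Fields = BTreeMap<String, String>;

/// Rename `reasoning` to `reasoning_content`, but only when the canonical
/// key is absent, so an upstream that already uses it is never overwritten.
fn normalize_reasoning_field(fields: &mut Fields) {
    if fields.contains_key("reasoning_content") {
        // Canonical name already present: leave the object untouched.
        // (What the router does with a leftover `reasoning` key in this
        // case is not stated in the changelog; this sketch keeps it.)
        return;
    }
    if let Some(text) = fields.remove("reasoning") {
        fields.insert("reasoning_content".to_string(), text);
    }
}
```

Applied to a delta that carries only reasoning, the sketch moves the text under reasoning_content; a delta that already carries reasoning_content passes through unchanged.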

Documentation

  • Note in the reasoning-effort architecture reference (English and Korean) that the router normalizes the upstream vLLM reasoning field to reasoning_content.

v1.10.1 - 2026-06-15

Added

  • Per-API-key and per-user usage statistics REST API: four admin endpoints GET /admin/stats/api-keys, GET /admin/stats/api-keys/{id}, GET /admin/stats/users, and GET /admin/stats/users/{user_id}, mirroring the existing /admin/stats/models shape under the admin auth router (#772, closes #770). The StatsCollector gains per_api_key and per_user dimensions recorded from the same spawned task that updates the Prometheus llm_tokens_total counter, so the in-memory dimensions no longer depend on the metrics feature and are attributed even for failed and zero-token requests. The api_key_id is the derived, non-reversible id resolved once off the request hot path; unauthenticated requests bucket under "anonymous", and ids beyond the 1000-per-dimension cardinality cap fold into an "unknown" overflow bucket so usage is still counted in aggregate. Each GET /admin/stats/.../{id} returns 404 when the id has no recorded usage, and the optional window query param is echoed back but does not filter, since the aggregates are all-time atomic counters like /admin/stats/models. The snapshot/persist format adds both dimensions as #[serde(default)], so snapshots written before this change still load without a format-version bump.
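
The bucketing rules in the entry above (the "anonymous" bucket, the 1000-per-dimension cap, and the "unknown" overflow bucket) can be sketched roughly as follows. The UsageDimension type, its field, and the choice to count "anonymous" toward the cap are assumptions made for illustration; the real StatsCollector uses atomic counters and may differ.

```rust
use std::collections::HashMap;

const MAX_IDS_PER_DIMENSION: usize = 1000; // cap stated in the changelog entry
const ANONYMOUS: &str = "anonymous";
const OVERFLOW: &str = "unknown";

/// Hypothetical per-dimension request counter.
#[derive(Default)]
struct UsageDimension {
    requests: HashMap<String, u64>,
}

impl UsageDimension {
    /// Count one request. `None` means the request was unauthenticated.
    fn record(&mut self, id: Option<&str>) {
        let key = id.unwrap_or(ANONYMOUS);
        let has_room = self.requests.len() < MAX_IDS_PER_DIMENSION;
        // Known ids keep their own bucket; new ids beyond the cap fold
        // into the overflow bucket so usage is still counted in aggregate.
        let bucket = if self.requests.contains_key(key) || has_room {
            key
        } else {
            OVERFLOW
        };
        *self.requests.entry(bucket.to_string()).or_insert(0) += 1;
    }
}
```

Folding late ids into one bucket keeps memory bounded while the aggregate total stays correct, which matches the intent the entry describes.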

Changed

  • Ignore root-level .sh scripts via .gitignore so local helper scripts are not accidentally committed.

Documentation

  • Document the admin API key management endpoints and the per-API-key/per-user usage statistics in the Admin REST API reference for both English and Korean (#773, closes #770). A new "API Key Management APIs" section covers the eight /admin/api-keys endpoints (create, list, get, update, delete, rotate, enable, disable) with request/response schemas reflecting the sk-***abcd masking and the full-value-once-on-create behavior, the ApiKeyConfig fields, the permissive vs blocking api_keys.mode semantics, and runtime-key persistence via persistence_file with hot-reload. The Statistics section gains the four new stats endpoints, the window echo, the anonymous and unknown buckets, the per-dimension cardinality cap, and the api_key_id-to-issued-key linkage.

v1.10.0 - 2026-06-12

Added

  • Dynamic model enumeration for Codex (ChatGPT OAuth) backends (#753, closes #752). The Codex backend has no standard /v1/models; the router instead queries the plan-gated GET <base>/models?client_version=<ver> endpoint with the loaded OAuth token during model discovery, keeps only user-facing entries (visibility: list and an available_in_plans that is empty or matches the account's chatgpt_plan_type claim), and uses the result to populate /v1/models and routing. On any failure (network error, non-2xx, empty list) the configured models: list remains the fallback, and the behavior applies only to Codex OAuth backends.
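
The visibility and plan filter described above might look roughly like this. The field names visibility and available_in_plans come from the entry; the CodexModel struct, the function names, and the decision to fall back when the filtered list is empty are illustrative assumptions.

```rust
/// Hypothetical shape of one enumerated Codex model entry.
struct CodexModel {
    id: String,
    visibility: String,
    available_in_plans: Vec<String>,
}

/// Keep only user-facing entries: listed, and either unrestricted or
/// available in the account's plan.
fn is_user_facing(model: &CodexModel, plan: &str) -> bool {
    model.visibility == "list"
        && (model.available_in_plans.is_empty()
            || model.available_in_plans.iter().any(|p| p.as_str() == plan))
}

/// `enumerated` is `None` when the live query failed (network error or
/// non-2xx). The configured list is the fallback in that case and when
/// nothing usable comes back.
fn discover_models(
    enumerated: Option<Vec<CodexModel>>,
    plan: &str,
    configured: &[String],
) -> Vec<String> {
    let live: Vec<String> = enumerated
        .unwrap_or_default()
        .into_iter()
        .filter(|m| is_user_facing(m, plan))
        .map(|m| m.id)
        .collect();
    if live.is_empty() {
        configured.to_vec()
    } else {
        live
    }
}
```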

Changed

  • Extract a shared parse_responses_body in src/services/responses/sse.rs and a shared build_openai_chat_request_core in src/http/streaming/handler.rs, eliminating the duplicated /responses SSE-aggregation parse path and the duplicated streaming request-construction logic that existed across the Codex, passthrough, and Anthropic handler variants (#766, closes #762). No behavior change; the consolidation removes drift risk where a fix to one copy would not propagate to the others.

Fixed

  • Route Codex (ChatGPT OAuth) chat completions to the backend's /responses endpoint (#755, closes #754). build_responses_url appended /v1/responses to the .../backend-api/codex root, and the upstream edge rejects that path with a 403 HTML page that the router surfaced as an opaque authentication error despite a valid OAuth token; the Codex root now routes through the shared URL composer that strips the internal /v1. With the path fixed, Codex rejects a bare-string input with HTTP 400 "Input must be a list", so sanitize_codex_responses_request coerces a single-message string input into a one-element item list.
  • Accept an RFC3339 expires_at in the OAuth token store (#756, closes #751). The field was a plain u64, so a token store that wrote the field as a datetime string failed to deserialize, the backend silently lost its OAuth strategy, and requests went out unauthenticated, surfacing as an opaque 403 that hid the real cause. expires_at now deserializes leniently (integer or float epoch seconds, a numeric string, or an RFC3339 datetime with offset, all normalized to epoch seconds) and rejects anything else with an error naming the accepted formats; the canonical on-disk shape remains a JSON number, so existing stores and the save/load roundtrip are unchanged.
  • Strip local-engine-only fields (chat_template_kwargs, thinking_budget_tokens, enable_thinking, preserve_thinking, top_k, min_p, repeat_penalty) from /v1/chat/completions requests bound for cloud OpenAI, which rejects unknown top-level keys with HTTP 400 "Unknown parameter" (#760, closes #758). The strip mirrors the existing cloud Gemini behavior through a shared field_filter module gated on api.openai.com, runs at every cloud-OpenAI chat send site (non-streaming, streaming, and per-hop in the fallback loops), never touches extra_body or reasoning_effort, and is a no-op for local OpenAI-compatible engines, preserving the backend-passthrough contract.
  • Respect the configured models: selection for Codex dynamic enumeration and stamp the clean owned_by on enumerated models (#765, closes #763). When live enumeration succeeded, the operator-configured models: allowlist was bypassed, so /v1/models exposed every enumerated Codex model; the post-filter now applies uniformly, with a non-empty list intersecting the enumerated set down to the selected subset and an empty list exposing the full set. Enumerated Codex entries also carried the raw backend name as owned_by; the owner is now resolved from the backend type (openai), matching every other backend.
  • Return a valid non-streaming response for Codex (ChatGPT OAuth) backends on /v1/chat/completions with stream:false (#764, closes #761). Codex's /responses endpoint only accepts stream:true + store:false; sending stream:false produced HTTP 400 "Stream must be set to true". sanitize_codex_responses_request now forces stream:true and store:false unconditionally. A second fix makes SSE detection in PassthroughService::execute_request robust to Codex responses that carry an SSE body without Content-Type: text/event-stream: a looks_like_sse sniffer checks the first non-empty line for data:/event:/: prefixes, and a failed JSON parse retries SSE aggregation once before surfacing the original parse error.
  • Reconstruct the assistant message from response.output_text.delta events for Codex store:false responses (#768, closes #767). When Codex runs with store:false the terminal response.completed event carries an empty output array and the assistant text arrives only as incremental response.output_text.delta events; the SSE aggregation now accumulates those deltas and synthesizes a message item when the completed event carries no assistant text, so non-streaming clients (including the title/summary utility path) receive the text instead of an empty response.
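
The looks_like_sse check mentioned in the stream:false fix above can be sketched as below. Only the rule itself (inspect the first non-empty line for a data:, event:, or : prefix) comes from the entry; the signature and the whitespace handling are assumptions.

```rust
/// Guess whether a response body is an SSE stream even when the upstream
/// did not send `Content-Type: text/event-stream`.
fn looks_like_sse(body: &str) -> bool {
    body.lines()
        .map(str::trim_start)
        // Skip blank lines until the first line with content.
        .find(|line| !line.is_empty())
        .map_or(false, |line| {
            // `data:` and `event:` are SSE fields; a leading `:` is an
            // SSE comment line such as a keep-alive.
            line.starts_with("data:") || line.starts_with("event:") || line.starts_with(':')
        })
}
```

A JSON error body starts with `{`, so it fails the check and continues down the normal JSON parse path.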

Documentation

  • Document the post-v1.9.1 Codex and transport changes across the English and Korean manuals: the Codex Chat-to-Responses request handling and lenient expires_at parsing in backends.md, the cloud-OpenAI field strip in backend-passthrough.md, and item_reference resolution on the Responses API in api.md, and convert the remaining Korean pages to the polite -습니다 register so every page reads in one voice.

v1.9.1 - 2026-06-11

Added

  • GET /version endpoint that returns { "version": "<CARGO_PKG_VERSION>" }, registered unconditionally on the base router (not behind the appproxy or any other Cargo feature) so it is present in the standard release binary, and a version field on the existing GET /health response (#750, closes #749). Both endpoints stay outside the API-auth boundary, matching /health, so a downstream consumer can probe the running router version for feature-gating instead of failing open against a 404.

v1.9.0 - 2026-06-10

Added

  • AppProxy worker mode behind an opt-in appproxy Cargo feature, letting Continuum Router run as a Backend.AI AppProxy inference worker driven by an AppProxy coordinator (epic #709). The feature is deliberately left out of full, so default builds are unaffected.
  • Foundation: the typed AppProxyWorkerConfig section (coordinator URL, shared api_secret/jwt_secret, redis_url, wildcard frontend parameters, heartbeat/reconcile durations, and an events toggle, with both bearer secrets redacted in Debug and ${ENV_VAR} references resolved through the same path as backends[].api_key), the SerializableCircuit/RouteInfo wire types and ProxyProtocol/AppMode/FrontendMode enums (snake_case and kebab-case tolerant, unknown fields ignored), and the module scaffold (#716).
  • Coordinator REST client CoordinatorClient with register, heartbeat, deregister, list_circuits, and get_circuit, each carrying X-BackendAI-Token and a fresh per-call X-BackendAI-RequestID, and an error type that separates retryable (connection, timeout) from fatal (HTTP 4xx) failures (#717).
  • Circuit-to-backend translation and reconcile: circuit_to_backends builds one BackendConfig per replica (named appproxy-<circuit_id>-r<route_key>, traffic-ratio mapped to a 1..=1000 weight that never drops a route to 0, vLLM detection from runtime_variant), and apply_circuits injects the translated backends through the existing hot-reload config_sender, namespaced by the appproxy- prefix so statically configured and admin-API backends are preserved (#718).
  • Worker lifecycle service and a /status endpoint: run_worker registers with backoff, performs an initial circuit pull, then runs a heartbeat loop (kept under the coordinator's 30s LOST timeout) and a pull-reconcile loop (the always-on backstop for missed events), auto-discovering each circuit's model from a replica's GET /v1/models and deregistering on shutdown. The shared in-memory AppProxyRegistry is indexed by subdomain and circuit id (#719).
  • Host/subdomain ingress resolver that turns a manager-issued endpoint subdomain into a concrete circuit and pins the request model so the existing selection path serves that circuit's replicas, with HS256 circuit-bearer verification that checks the decoded id against the circuit id and rejects an alg:none downgrade, an optional aggregation_hosts field for cross-circuit model aggregation, and a pure fall-through when no wildcard_domain is configured (#720).
  • Redis Pub/Sub circuit-event overlay that gives legacy-mode coordinators sub-second circuit updates: a subscriber loop with exponential-backoff reconnect, a base64 + msgpack envelope codec, handlers for circuit_created / circuit_removed / circuit_route_updated, and the ack envelope that prevents the coordinator's E10001 "Proxy worker not responding" error (#721).
  • Claude Fable 5 (claude-fable-5) and Mythos 5 (claude-mythos-5) model support (#747). Both are 1M-context, 128K-max-output models priced at $10/$50 per MTok; Mythos 5 is the same underlying model with safety classifiers lifted, shipped only through the limited Project Glasswing release. A new is_mythos_class helper routes both ids through the Anthropic capability gates: adaptive thinking is required (legacy budget_tokens is rejected with HTTP 400 and normalized to adaptive), temperature/top_p/top_k are dropped, the max effort level is supported (xhigh maps to max), and mid-conversation system messages are preserved. Both reject an explicit thinking.type == "disabled", so explicit_thinking_for_model now returns an Option and the router omits the thinking parameter entirely instead of forwarding a value that would 400. opus_supports_max_effort is renamed to supports_max_effort because the max effort level is no longer Opus-only. The same handling applies to the OpenAI Responses API conversion path.
  • Gemma 4 QAT model metadata for the five quantization-aware-training checkpoints (E2B, E4B, 12B Unified, 26B-A4B MoE, 31B dense) with load-bearing -it-qat aliases and resolution/drift-guard tests (#723, closes #722).
  • Gemma 4 12B Unified model metadata (#705).
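
For the circuit-to-backend translation entry above, the naming scheme and the "never drops to 0" weight rule can be illustrated with a small sketch. The appproxy-<circuit_id>-r<route_key> format and the 1..=1000 range come from the entry. Treating the traffic ratio as a fraction where 1.0 maps to 1000 is purely an assumption, as are the function names.

```rust
/// Backend name for one replica, following the format in the entry.
fn backend_name(circuit_id: &str, route_key: &str) -> String {
    format!("appproxy-{}-r{}", circuit_id, route_key)
}

/// Map a traffic ratio to a weight in 1..=1000.
/// ASSUMPTION: the ratio is a fraction where 1.0 means full weight.
fn ratio_to_weight(traffic_ratio: f64) -> u32 {
    if !traffic_ratio.is_finite() {
        return 1; // keep the route alive rather than dropping it
    }
    let scaled = (traffic_ratio * 1000.0).round();
    // Clamping the lower bound to 1 means a tiny ratio never becomes 0.
    scaled.clamp(1.0, 1000.0) as u32
}

/// Backends carrying the prefix are owned by the AppProxy reconcile;
/// statically configured and admin-API backends are left alone.
fn is_appproxy_managed(name: &str) -> bool {
    name.starts_with("appproxy-")
}
```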

Changed

  • Log request-body extractor rejections at warn, so a malformed or oversized body that Axum rejects before the handler runs is visible in the logs instead of failing silently (#707).

Fixed

  • Stop the retry loop from hammering the same upstream on HTTP 429 by distinguishing transient rate limits from non-transient quota/credit exhaustion and by honoring the upstream Retry-After hint (#742, closes #740). Previously every 429 was retried up to max_attempts (default 3) with a fixed exponential backoff that ignored the provider's Retry-After, so for a model served by a single upstream the router re-hit the same exhausted endpoint, amplifying load and adding latency before an inevitable failure. The RouterError::RateLimited variant now carries a retryable flag: non-transient 429s (OpenAI insufficient_quota / billing_hard_limit_reached, and clear credit-depletion language such as "prepayment credits are depleted") are classified as non-retryable so the router fails fast after a single call and passes the provider status and body through, while transient signals (bare RESOURCE_EXHAUSTED/RPM throttling, rate_limit_exceeded, rate_limit_error) stay retryable. The classifier is deliberately narrow: the over-broad "billing" and "exceeded your current quota" markers that Google reuses verbatim for transient throttling were removed so a recoverable Google 429 is no longer flipped to fail-fast. A retried 429 uses the upstream Retry-After for its backoff (capped to max_delay) and fails fast without sleeping when the requested interval would exceed the remaining total-timeout budget. The hint is preserved end to end (parsed from Google's RetryInfo.retryDelay and the integer-seconds Retry-After header, then reflected in the client-facing Retry-After header). The budget probe now uses saturating_add and the parsed hint is clamped to 24 hours (MAX_RETRY_AFTER_SECS), closing a remotely triggerable panic where a hostile upstream's near-u64::MAX Retry-After overflowed the Duration addition and aborted the request task.
  • Persist the accumulated /v1/responses response on the streaming conversion paths (Anthropic, Chat-Completions/Gemini fallback) so a follow-up request that references a streamed output item via {"type":"item_reference","id":"item_..."} (the default behavior of the OpenAI and Vercel AI SDKs) resolves instead of returning HTTP 400, even when step 1 used store:true (#746, closes #745). The completed response is stored before the first response.completed event reaches the client; error-terminated streams that never emit response.completed are not stored, matching the non-streaming error paths, and passthrough streaming is unchanged because the upstream owns storage there.
  • Resolve item_reference input items before strategy dispatch so /v1/responses no longer returns HTTP 400 for Anthropic/Claude backends on a multi-step tool round-trip that submits an item_reference instead of an inline item (#743, refs #741). References are rewritten to inline FunctionCall/Message/FunctionCallOutput items (de-duplicated by call_id, first-wins), build_context_for_user reconstructs stored function-call output items as proper tool_use/tool_result pairs, the OpenAI/Azure passthrough path still forwards references unchanged, an unresolvable reference returns a descriptive 400 naming the id, and a 256-item cap (MAX_ITEM_REFERENCES) bounds the per-request session-store scan.
  • Propagate server.workers to the Tokio runtime (#736, refs #734). The value was documented and shipped in config.yaml.example but had no effect, because main used an argument-less #[tokio::main] and the runtime always ran with num_cpus::get() worker threads. main is now synchronous: it peeks server.workers from the config file, builds a correctly sized multi-thread runtime through the existing RuntimeConfig::build_runtime path, and runs the async body on it, falling back to the CPU count when the value is unset or 0.
  • Emit conformant function_call output items and argument events on /v1/responses streaming for non-passthrough providers (Anthropic and chat-completions-backed routes), preserving text output payloads and tracking interleaved parallel tool-call arguments by upstream index (#725).
  • Reuse the shared ApiKeyStore in the Files API routes instead of constructing a separate store, so a runtime-managed API key is recognized consistently across the Files API and the rest of the router (#706).
  • AppProxy: preserve sibling circuits on single-circuit events (#737, closes #731). A worker serving two or more circuits previously wiped every unaffected sibling's appproxy-* backends on any single-circuit event (leaving them 404/502 until the next pull-reconcile, up to 15s), because RegistryEntry carried no route info and the rebuilt set held only the delta circuit. The full circuit is now cached on each RegistryEntry and unchanged siblings are rebuilt from it, so only the delta circuit's backends change.
  • AppProxy: reach the fallback chain from wildcard subdomain ingress (#738, closes #735). A registered circuit whose replicas are all down is no longer a dead end; after per-circuit authorization it is pinned to its canonical model and handed to the normal pipeline, where FallbackService takes over (the "deployment went down, traffic goes to a cross-provider model" behavior operators expect). The fall-through is scoped to registered-but-down circuits, a truly unknown subdomain stays a 404, and the open-to-public / bearer-token / IP-allow-list gates still run first.
  • AppProxy: preserve event-known models during periodic reconcile (#739). When the Redis event overlay learned a circuit's model before any successful pull probe, a reconcile could evict the registered-but-down registry entry and break scoped fallback; reconcile now reuses the shared registry's known model before probing replicas.
  • AppProxy: align WorkerRegisterResponse deserialization with the Backend.AI coordinator's actual response shape, which carries slots as an array plus available_slots as the count (#726).
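
The Retry-After handling in the HTTP 429 entry above combines three rules: clamp the hint to 24 hours, cap the sleep to max_delay, and fail fast when the requested interval would exceed the remaining total-timeout budget. A minimal sketch under those rules follows; MAX_RETRY_AFTER_SECS is named in the entry, while the function name, its parameters, and the exact order of the checks are assumptions.

```rust
use std::time::Duration;

const MAX_RETRY_AFTER_SECS: u64 = 24 * 60 * 60; // 24-hour clamp from the entry

/// Returns `Some(sleep)` to retry after sleeping, or `None` to fail fast.
fn backoff_for_429(
    retry_after_secs: Option<u64>,
    computed_backoff: Duration,
    max_delay: Duration,
    elapsed: Duration,
    total_timeout: Duration,
) -> Option<Duration> {
    match retry_after_secs {
        Some(secs) => {
            // Clamp a hostile near-u64::MAX hint before building a Duration.
            let requested = Duration::from_secs(secs.min(MAX_RETRY_AFTER_SECS));
            // saturating_add cannot panic on overflow, unlike `+`.
            if elapsed.saturating_add(requested) > total_timeout {
                return None; // waiting would blow the budget: fail fast
            }
            Some(requested.min(max_delay))
        }
        // No hint: fall back to the locally computed exponential backoff.
        None => Some(computed_backoff.min(max_delay)),
    }
}
```

With a 30-second total timeout, a hint of 60 seconds fails fast without sleeping, while a hint of 20 seconds is honored but capped to max_delay.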

Documentation

  • Rewrite the Zensical user documentation as a current-state manual: drop development-log narration, the roadmap, and "coming soon" entries; correct configuration-reference drift against the actual config structs (nonexistent sections and keys, retry field names, the admin auth shape, the environment-variable tables, and the config discovery order); document previously missing shipped behavior (seven CLI flags, the auth login subcommand, Windows AF_UNIX support and SSE over Unix sockets, and the Windows/musl and .deb release artifacts); and bring the Korean docs to parity with English.
  • Add the AppProxy worker mode design document (#708).
  • Condense the README "Recent Updates" list to one concise line per release.

Dependencies

  • Remove the validator derive dependency and its transitive proc-macro-error2, clearing the RUSTSEC-2026-0173 advisory that previously needed a temporary cargo-deny ignore while validator had no safe upgrade path (#733, #732).
  • Update Rust package versions (#732).

v1.8.2 - 2026-06-02

Fixed

  • Stop forwarding the client Accept-Encoding header on the /v1/responses path (#702). When a client sent Accept-Encoding: gzip, deflate, br, the responses-path header filter omitted accept-encoding from its block list and forwarded it to the upstream backend, which then negotiated gzip and returned compressed bytes. Because reqwest disables automatic decompression once any Accept-Encoding header is set manually (the explicit .header("Accept-Encoding", "identity") call only appended a second value rather than replacing the forwarded one), the SSE transform received raw gzip bytes, parsed them as text, and dropped the leading response.created, output_item.added, content_part.added, and output_text.delta events, leaving only tail fragments with empty item_id and text. "accept-encoding" is now in FILTERED_HEADERS for both the primary convert path (src/http/handlers/responses.rs) and the responses-native passthrough path (src/proxy/responses_only.rs), restoring parity with the chat-completions proxy (src/proxy/backend.rs) so the upstream only ever receives Accept-Encoding: identity.
  • Stop double-wrapping Responses SSE lines so the non-GPT /v1/responses streaming conversion emits single-layer OpenAI-compatible SSE records on the converted Anthropic and Chat-Completions paths instead of nested ones (#701).
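
The Accept-Encoding fix above comes down to dropping the client's header before adding the router's own, instead of appending a second value. The sketch below uses plain name/value pairs rather than a real HTTP client. Only "accept-encoding" is confirmed as a FILTERED_HEADERS member by the entry; the other two names and the function name are illustrative assumptions.

```rust
/// Headers never forwarded upstream. Only "accept-encoding" is taken from
/// the changelog; "host" and "content-length" are illustrative guesses.
const FILTERED_HEADERS: &[&str] = &["host", "content-length", "accept-encoding"];

/// Build the upstream header list from the client's headers.
fn build_upstream_headers(client: &[(String, String)]) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = client
        .iter()
        // Header names are case-insensitive, so compare without case.
        .filter(|(name, _)| !FILTERED_HEADERS.iter().any(|h| h.eq_ignore_ascii_case(name)))
        .cloned()
        .collect();
    // With the client's value filtered out, this is the only
    // Accept-Encoding the upstream ever sees.
    out.push(("Accept-Encoding".to_string(), "identity".to_string()));
    out
}
```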

Dependencies

  • Bump uuid 1.23.1 → 1.23.2, redis 1.2.1 → 1.2.2, socket2 0.6.3 → 0.6.4, and serial_test 3.4.0 → 3.5.0 (#699).

Tests

  • Hardening regression coverage for the production StreamService conversion processors, asserting single-layer SSE output on the converted Anthropic and Chat-Completions paths (#701).
  • Integration regression for the /v1/responses streaming path with an Anthropic backend and a gzip-requesting client, asserting the upstream request receives only Accept-Encoding: identity and the transformed Responses SSE stream retains the full event sequence with populated text and item ids (#702).

v1.8.1 - 2026-05-29

Added

  • Claude Opus 4.8 recognition with a claude_family_version parser that replaces the hardcoded opus-4-7/opus-4-6 substring gates (#693, part of #687). The four Anthropic capability predicates now compare a parsed (major, minor) version: uses_adaptive_thinking_api ≥ (4,6), model_requires_adaptive_thinking / model_forbids_sampling_params ≥ (4,7), and opus_supports_max_effort = Opus and ≥ (4,6). The parser treats the first integer token as the major and the next version-like token (1 to 2 digits, value < 100) as the minor, so an 8-digit date suffix like 20250514 yields minor 0 and is never mistaken for a version, and new minor releases are recognized without per-version edits. Adds the claude-opus-4-8 metadata entry (1M context, 128K output, $5/$25 pricing, adaptive thinking, Jan 2026 cutoff) and registers claude-opus-4-8 / claude-opus-4-8-latest in the built-in supported models and config samples. Behavior for 4.5/4.6/4.7 and Sonnet variants is preserved.
  • Anthropic fast mode behind a per-backend anthropic_fast_mode opt-in (default off) (#694, part of #687). is_fast_mode_eligible returns true only for Opus 4.6/4.7/4.8 and later Opus minors; merge_beta_header comma-joins and de-duplicates beta tokens while preserving any client-supplied anthropic-beta. On the native /anthropic/v1/messages path, resolve_fast_mode_beta injects the merged fast-mode-2026-02-01 beta header only when the request is speed: "fast", the model is eligible, the backend is native Anthropic (never Bedrock), and the opt-in is enabled; when fast mode does not apply, speed is stripped from the outgoing body so it cannot trigger a spurious upstream 400. The OpenAI-compatible path forwards speed: "fast" and injects the beta header only for eligible, opted-in, native Anthropic targets. The Anthropic -> OpenAI and Anthropic -> Google fallback parameter mappings remove speed so a fast-mode request that falls back to a non-Anthropic backend does not leak the native-only field. usage.speed is preserved on the response.
  • Mid-conversation system messages for Claude Opus 4.8+ (#695, part of #687). A role:"system" entry inside the messages array (which earlier Claude families reject with HTTP 400) is now accepted, gated on a new supports_mid_conversation_system(model_id) predicate that reuses claude_family_version and matches family version ≥ (4,8). The native handler round-trips the entry unchanged to a native Anthropic backend; the cross-provider transforms map the System role onto the OpenAI system role and preserve it as user-role text for Gemini and Responses. The OpenAI-compatible transform emits mid-conversation system/developer messages (after the first user turn) as in-array role:"system" entries for supporting models, while leading system messages still fill the top-level system field. Non-supporting models (Opus 4.7 and below, all Sonnet/Haiku, Bedrock-prefixed ids, non-Claude ids) keep the historical flattening into the single top-level system.
  • Refusal stop_details and the refusal stop reason propagated through the full Anthropic response pipeline (#696, part of #687). map_anthropic_finish_reason maps "refusal" to "content_filter"; the non-streaming transform and the streaming message_delta handler attach the stop_details object to the choice when stop_reason is "refusal", and omit the key (rather than forwarding a null) when upstream sends an explicit null.
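
The version parsing rule from the claude_family_version entry above can be sketched as follows. The function name and the major/minor rule come from the entry. Splitting on '-' only, reading "next token" as the token right after the major, and the version_at_least helper are assumptions; the real predicates apply further conditions, such as the model family.

```rust
/// Parse a token made only of ASCII digits.
fn parse_int(token: &str) -> Option<u32> {
    if !token.is_empty() && token.bytes().all(|b| b.is_ascii_digit()) {
        token.parse().ok()
    } else {
        None
    }
}

/// First integer token is the major; the following token is the minor only
/// if it is 1 to 2 digits, so a date suffix like 20250514 yields minor 0.
fn claude_family_version(model_id: &str) -> Option<(u32, u32)> {
    let mut tokens = model_id.split('-');
    let major = tokens.by_ref().find_map(parse_int)?;
    let minor = tokens
        .next()
        .filter(|t| t.len() <= 2)
        .and_then(parse_int)
        .unwrap_or(0);
    Some((major, minor))
}

/// Hypothetical helper: tuple comparison is lexicographic, so (4, 8) >= (4, 6).
fn version_at_least(model_id: &str, min: (u32, u32)) -> bool {
    claude_family_version(model_id).map_or(false, |v| v >= min)
}
```

Because the gate is a numeric comparison rather than a substring match, a later minor release passes the same checks without a code change.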

Fixed

  • Accept an input_image content part that references a Files API upload via file_id instead of an inline image_url on POST /v1/responses (#686, refs #681). Because the parent enums are #[serde(untagged)], a {"type":"input_image","file_id":"file-..."} part previously failed deserialization with a generic untagged-enum error and Axum returned HTTP 422. image_url is now optional with an added file_id, mirroring input_file; a shared resolve_local_file_to_data_url helper resolves a local file_id to an inline base64 image_url data URL through the same metadata, ownership, size, load, and base64 sequence, honoring ownership and the 10MB size limit. The OpenAI/Anthropic/Gemini converters handle the optional image_url, emitting the image when resolved and warning + skipping an unresolved file_id. validate_request walks message content and rejects an input_image with neither image_url nor file_id with a clear 400 before file resolution.
  • Harden the Claude Opus 4.8 routing gates so fast-mode speed is only forwarded when the transport has confirmed native-Anthropic opt-in and beta-header injection, non-Opus Claude families keep mid-conversation system messages flattened, and OpenAI-compatible Anthropic responses preserve usage.speed (#698, refs #687).

Documentation

  • Document Claude Opus 4.8 support in English and Korean (#697, closes #692, part of #687): add claude-opus-4-8-* to the adaptive-thinking model list and the sampling-params-deprecated warning in reasoning-effort.md; add the Claude Opus 4.8 model detail, an Anthropic Fast Mode section, a Mid-Conversation System Messages section, and the refusal stop_reason -> content_filter mapping to backends.md; and add a changelog entry covering model recognition, fast mode, mid-conversation system messages, and refusal stop_details.
  • Document input_image file_id support in api.md (English and Korean), noting that exactly one of image_url or file_id is required and that file_id is resolved to an inline base64 data URL before reaching the backend under the same ownership and 10MB size limit as the input_file path (#686, refs #681).

Tests

  • claude_family_version and supports_mid_conversation_system boundary tests (4.7 false, 4.8 true, Sonnet/Haiku/older false, Bedrock-prefixed and cross-region/ARN ids false), fast-mode eligibility, beta-header merge/dedup, speed/usage.speed (de)serialization, resolve_fast_mode_beta gating with client-beta merge, and speed non-leak versus preservation across fallback providers (#693, #694, #695).
  • Refusal coverage: non-streaming and streaming refusal with and without stop_details, the explicit-null omission path, and regression tests for end_turn/max_tokens/stop_sequence/tool_use (#696).
  • input_image file_id-only and image_url-only deserialization, the full failing request payload as a regression, FileResolver resolution honoring ownership and the size limit, converter output across all three backends, and the neither-field validation case (#686).

v1.8.0 - 2026-05-28

Added

  • Per-API-key backend access control via an optional allowed_backends allow-list on client API keys (#677, closes #674). When the list is non-empty, requests authenticated with that key may only route to the named backends; an empty or absent list keeps the existing unrestricted behavior. The field is integrated end-to-end: config file and hot-reload, the runtime ApiKey and AuthContext, the backend-selection chokepoint and the Responses / Anthropic selection paths, cross-provider fallback, the Admin REST API (create/update/get/list), runtime-key persistence, and the models-listing endpoints.
  • select_backend_with_retry and the Responses (StreamService), Anthropic-native, count_tokens, and image handlers filter candidates by the key's allow-list. When the model exists but the allow-list rejects every candidate, the request is rejected with a new RouterError::Forbidden variant mapped to 403 with error_type = "permission_error" (non-retryable), distinct from the 401 AuthError.
  • /v1/models, /v1/models/extended, and /anthropic/v1/models are filtered to models served by at least one allowed backend when a restricted key is authenticated; GET /v1/models/{model} returns 404 for a model the key cannot reach. Unauthenticated or unrestricted callers see the full list.
  • A new api_optional_auth_middleware is layered in permissive mode. It validates a presented bearer token on a best-effort basis and attaches AuthContext without ever rejecting, so per-key restrictions apply to authenticated callers while anonymous and invalid-token callers pass through unrestricted. The existing blocking-mode api_auth_middleware is unchanged; the two are never layered together.
  • Config validation warns (does not hard-fail) when a key's allowed_backends references an unknown backend name, so a backend rename does not brick the router before operators update the keys.
  • fallback.mid_stream_enabled config field (default true) so operators can keep cheap pre-stream backend re-selection on the initial connection while turning off the per-stream mid-stream buffering (#680, closes #676). Previously fallback.enabled was all-or-nothing: on meant both pre-stream and mid-stream fallback (with a per-stream StreamAccumulator buffering roughly 100 to 200 KB), off meant no fallback at all. Memory-constrained or high-concurrency hosts now have a middle ground.
  • The streaming dispatch becomes a three-way decision factored into a pure decide_streaming_fallback_dispatch helper. With fallback.enabled true and a chain configured, mid_stream_enabled = true keeps the buffering path (handle_streaming_with_mid_stream_fallback), mid_stream_enabled = false routes to the revived handle_streaming_with_pre_stream_fallback (no StreamAccumulator or MidStreamFallbackContext allocation; mid-stream failures surface as a normal stream error), and otherwise the standard no-fallback path runs.
  • The previously dead handle_streaming_with_pre_stream_fallback and advance_to_next_fallback are now live; the #[allow(dead_code)] markers are removed and the per-key allow-list is threaded through fallback re-selection so the newly-live path does not bypass per-API-key access control.
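The three-way streaming decision described above can be sketched as a pure function. This is a minimal illustration, not the router's code: the `StreamingDispatch` enum, its variant names, and the `decide_dispatch` signature are assumptions; only the decision rule comes from the entry.

```rust
/// Which streaming path a request takes. Variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StreamingDispatch {
    /// Buffering path with mid-stream fallback.
    MidStreamFallback,
    /// Pre-stream backend re-selection only; no per-stream buffering.
    PreStreamFallback,
    /// Standard path with no fallback at all.
    NoFallback,
}

/// Pure decision over the three inputs named in the entry:
/// `fallback.enabled`, `fallback.mid_stream_enabled`, and whether a
/// fallback chain is configured for the requested model.
pub fn decide_dispatch(
    fallback_enabled: bool,
    mid_stream_enabled: bool,
    has_chain: bool,
) -> StreamingDispatch {
    match (fallback_enabled && has_chain, mid_stream_enabled) {
        (true, true) => StreamingDispatch::MidStreamFallback,
        (true, false) => StreamingDispatch::PreStreamFallback,
        (false, _) => StreamingDispatch::NoFallback,
    }
}
```

Keeping the decision free of I/O is what makes the full matrix cheap to unit-test.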

Changed

  • Removed the streaming-local copies of transform_payload_for_openai and its requires_max_completion_tokens helper from src/http/streaming/handler.rs; both call sites now resolve to the canonical implementations in crate::proxy::utils (#679, closes #660). The two copies were byte-for-byte identical, creating a drift risk where a change to one would not be mirrored to the other. The streaming-local duplicate unit tests are removed since the canonical tests in src/proxy/utils.rs already cover the same contract.

Fixed

  • Enforce the per-API-key backend allow-list in the default mid-stream fallback handler when it re-selects a backend after a mid-stream failure (#683). With fallback.mid_stream_enabled = true (the default), a restricted key whose fallback chain mapped to a disallowed backend was transparently switched to it on a mid-stream failure, an access-control bypass that the new mid_stream_enabled work in PR #680 had only closed on the pre-stream path. handle_streaming_with_mid_stream_fallback now takes an owned allowed_backends: Option<Vec<String>> (moved into its spawned streaming task), and a new resolve_allowed_backend_name_for_model helper wraps resolve_backend_name_for_model and applies the same allowed_backends.filter(|l| !l.is_empty()) + exact-name membership semantics used everywhere else in the module (try_get_healthy_backend_for_model, get_backend_for_model_streaming, get_healthy_backend_for_streaming). All three fallback re-selection sites resolve through it; a disallowed or unresolvable candidate returns None, so the existing warn + continue arm skips it and the chain index advances. When every remaining candidate is filtered out the loop terminates via the existing chain-exhausted paths and surfaces an error to the client. An empty or None allow-list is byte-for-byte the prior unrestricted behavior.
  • Resolve an allowed fallback backend before rebuilding the fallback payload in mid-stream fallback, so a disallowed chain entry can no longer mutate current_payload before being skipped (#684).
  • Anthropic-native x-api-key callers now supply the same allowed_backends policy as Authorization: Bearer-authenticated callers when no AuthContext is present, covering Messages, count_tokens, and models listing (#684). The previously documented limitation that the per-key allow-list was unenforced on the native Anthropic surface is now closed.
  • Close a HIGH-severity authorization bypass on POST /v1/responses/compact found in the post-merge security audit of #677. The endpoint was the one client-facing model-routing handler that never read AuthContext from request extensions, so a key scoped to backend set A could reach a disallowed passthrough backend B (OpenAI / Azure) through compaction whenever B served the requested model. compact_response now mirrors create_response exactly: it derives the per-key allow-list via allow_list_from_auth, filters the model's candidate backends against it before any health-check or passthrough, and returns a deterministic 403 permission_error when the model exists but the key cannot route to any backend serving it. The filter precedes backend forwarding, so the rejection happens without contacting an upstream.
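The allow-list semantics used across these fixes (an empty or absent list means unrestricted; otherwise exact-name membership) can be sketched as follows. The `resolve_backend_name` stand-in, the model and backend names, and the function signatures are hypothetical; only the `filter(|l| !l.is_empty())` plus exact-match rule is taken from the entries above.

```rust
/// Hypothetical stand-in for the unfiltered resolver: maps a model to the
/// backend that serves it.
fn resolve_backend_name(model: &str) -> Option<String> {
    match model {
        "model-a" => Some("backend-a".to_string()),
        "model-b" => Some("backend-b".to_string()),
        _ => None,
    }
}

/// Resolve a backend for `model`, returning `None` when the model is
/// unresolvable or the resolved backend is not on the key's allow-list.
pub fn resolve_allowed_backend_name(
    model: &str,
    allowed_backends: Option<&[String]>,
) -> Option<String> {
    let name = resolve_backend_name(model)?;
    // An empty list is treated exactly like `None`: no restriction.
    match allowed_backends.filter(|l| !l.is_empty()) {
        // Restricted key: require an exact-name match.
        Some(list) if !list.iter().any(|b| b == &name) => None,
        _ => Some(name),
    }
}
```

A caller walking a fallback chain would skip any candidate for which this returns `None` and advance to the next entry.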

Documentation

  • Clarify in config.yaml.example, src/services/streaming/mid_stream_config.rs rustdoc, and docs/en/configuration/advanced.md that mid_stream_fallback.enabled does not disable mid-stream buffering or reduce memory: regardless of its value the StreamAccumulator is still constructed and still buffers up to roughly 100 KB per stream, and mid-stream fallback still activates on backend failure (#678, closes #675). The flag only selects continuation versus restart mode for the fallback request. The real kill-switch for the buffering and memory is fallback.enabled: false or omitting the model from fallback.fallback_chains. The Korean docs (docs/ko/configuration/advanced.md) had no corresponding Mid-Stream Fallback section so no Korean change is made for the clarification, but #680 separately added a translated "미드스트림 버퍼링 비활성화" subsection covering the new toggle.
  • Document the per-API-key allowed_backends allow-list in the security docs (English and Korean) including the now-closed x-api-key Anthropic Messages limitation (#677, #684).

Tests

  • Deterministic unit tests for resolve_allowed_backend_name_for_model: an allow-list that excludes the resolved backend returns None, one that includes it returns Some, an empty list returns Some, a None allow-list matches the unfiltered resolver, and an unresolvable model returns None (#683).
  • Integration tests in tests/per_key_backend_access_test.rs driving /v1/responses/compact behind blocking auth: a restricted key requesting a model served only by a disallowed backend gets 403 permission_error (the security case), and a restricted key requesting an allowed backend's model passes the filter (#677).
  • Unit test for the new decide_streaming_fallback_dispatch helper covering the three-way decision matrix between fallback.enabled, mid_stream_enabled, and chain presence (#680).

v1.7.1 - 2026-05-28

Fixed

  • Anthropic web_search_20250305 server-tool emulation no longer masks provider failures as empty result sets. A missing API key, HTTP 401/403/429, timeout, or parse error now surfaces to the client as a web_search_tool_result_error content block with a mapped error_code ("too_many_requests" for HTTP 429, "unavailable" for all other failures). A genuinely empty-but-successful search continues to emit an empty content: [] array, so the two outcomes remain distinguishable. Both non-streaming and streaming SSE code paths are covered, with an explicit SSE event-ordering assertion guarding the streaming error path. Serper response-shape drift (a body that parses cleanly but omits the organic key) is now detected and logged with a provider-tagged warn! that records only the observed top-level keys, never the user query text. (#671, #672, #673)
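The failure-to-error_code mapping described above is small enough to sketch directly. The `SearchFailure` enum and the function name are assumptions made for illustration; the two error_code strings and the rule (429 maps to "too_many_requests", everything else to "unavailable") come from the entry.

```rust
/// Illustrative failure kinds for a web-search provider call.
#[derive(Debug)]
pub enum SearchFailure {
    MissingApiKey,
    HttpStatus(u16),
    Timeout,
    ParseError,
}

/// Map a provider failure to the `error_code` carried by the
/// `web_search_tool_result_error` content block.
pub fn map_error_code(failure: &SearchFailure) -> &'static str {
    match failure {
        SearchFailure::HttpStatus(429) => "too_many_requests",
        // Missing key, 401/403, timeouts, and parse errors all collapse
        // into a single code.
        _ => "unavailable",
    }
}
```

A successful search with zero hits would bypass this mapping entirely and emit an empty content array, which is what keeps the two outcomes distinguishable.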

Dependencies

  • Bump lru 0.16 → 0.18, reqwest 0.13.3 → 0.13.4, and rusqlite 0.39 → 0.40 (libsqlite3-sys 0.37 → 0.38, unpinned now that CI runs Rust 1.95), plus a transitive lockfile refresh (aws-lc-rs 1.16.3 → 1.17.0, h2 0.4.13 → 0.4.14, http 1.4.0 → 1.4.1, and hyper/axum/reqwest knock-on revisions); cargo audit reports zero vulnerabilities across the dependency graph (#669, #670).

v1.7.0 - 2026-05-26

Added

  • AWS Bedrock Claude backend Phase 1 (bedrock-mantle) over a Bearer token (#616, closes #613)
  • New type: bedrock backend with serde aliases (aws-bedrock, bedrock-anthropic, AwsBedrock, ...). is_commercial() returns true and owned_by() returns Some("anthropic") so OpenAI-shaped clients see the expected model lineage.
  • endpoint_type: mantle (default) speaks the native Anthropic Messages API at the region-templated https://bedrock-mantle.{region}.api.aws, routes to /anthropic/v1/messages, uses Authorization: Bearer, and omits anthropic-version (Bedrock returns HTTP 400 if present). An explicit url: field overrides the template for proxies and tests; empty or uppercase regions are rejected at load time.
  • model_ids.rs recognizes plain (anthropic.<family>), geographic (us./eu./jp./au.), global (global.anthropic.<family>), and full-ARN identifiers, forwarded unchanged with no automatic alias mapping because the geo prefix carries real billing and residency consequences.
  • The existing OpenAI/Anthropic body transforms and the Anthropic SSE stream transformer are reused unchanged, so per-model quirks (Opus 4.7 sampling-param ban, adaptive thinking) apply identically to Bedrock without duplication. endpoint_type: runtime is reserved here and implemented in Phase 2 below.
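As a rough sketch of the mantle URL templating and model-ID recognition described above: the function names, the `ModelIdKind` enum, and the error strings below are hypothetical, and the real model_ids.rs may apply stricter checks. The URL template, the region rejection rule, and the four identifier shapes come from the entries.

```rust
/// Build the mantle base URL from the region template, rejecting empty or
/// uppercase regions as the entry describes.
pub fn mantle_base_url(region: &str) -> Result<String, String> {
    if region.is_empty() {
        return Err("region must not be empty".to_string());
    }
    if region.chars().any(|c| c.is_ascii_uppercase()) {
        return Err(format!("region must be lowercase: {region}"));
    }
    Ok(format!("https://bedrock-mantle.{region}.api.aws"))
}

/// The four identifier shapes named in the entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelIdKind {
    Plain,
    Geographic,
    Global,
    Arn,
}

const GEO_PREFIXES: [&str; 4] = ["us.", "eu.", "jp.", "au."];

/// Classify a model identifier without rewriting it; the ID is forwarded
/// unchanged because the prefix carries billing and residency meaning.
pub fn classify_model_id(id: &str) -> Option<ModelIdKind> {
    if id.starts_with("arn:") {
        return Some(ModelIdKind::Arn);
    }
    if id.starts_with("global.anthropic.") {
        return Some(ModelIdKind::Global);
    }
    let is_geo = GEO_PREFIXES.iter().any(|p| {
        id.strip_prefix(*p)
            .map_or(false, |rest| rest.starts_with("anthropic."))
    });
    if is_geo {
        return Some(ModelIdKind::Geographic);
    }
    if id.starts_with("anthropic.") {
        return Some(ModelIdKind::Plain);
    }
    None
}
```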

  • Request-path rate limiting is now enforced, plus a Redis storage backend (#635, #632, closes #626)

  • state.rs previously dropped the MiddlewareLayer returned by initialize_rate_limiting, so the rate_limiting.* config was a silent no-op. The layer now flows through ServiceHandles → ContinuumRouterBuilder::build_router → Router::layer and attaches to the assembled Axum app, so configured budgets actually return 429.
  • All five rate_limiting dimensions are now optional: per_client, per_backend, and global join the already-optional per_api_key/per_model. Operators may omit any dimension and load with it disabled; the Default impl retains all three so existing deployments are unaffected (#632).
  • New redis-cache-gated rate_limit_v2::redis_backend runs a token-bucket and a sliding-window Lua script, each as a single atomic EVAL, reusing the shared create_redis_pool helper with keys like cr:rl:per_client:10.0.0.1. On any Redis failure (pool unavailable, timeout, Lua error) the backend reports BackendUnavailable and the caller falls through to the in-process token bucket, degrading to per-replica enforcement rather than dropping requests.
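The fall-through behaviour described above (shared backend first, in-process token bucket when the shared backend is unavailable) can be sketched like this. All names here (`TokenBucket`, `SharedDecision`, `enforce`) are illustrative assumptions, and the clock is passed in explicitly so the example stays deterministic; the router's own types and timing source may differ.

```rust
/// A minimal in-process token bucket driven by an explicit clock (seconds).
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    pub fn new(capacity: u32, refill_per_sec: f64, now_secs: f64) -> Self {
        let capacity = f64::from(capacity);
        TokenBucket { capacity, refill_per_sec, tokens: capacity, last_refill_secs: now_secs }
    }

    /// Refill for the elapsed time, then try to take one token.
    pub fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = self.last_refill_secs.max(now_secs);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Outcome reported by the shared (for example Redis-backed) limiter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SharedDecision {
    Allowed,
    Limited,
    BackendUnavailable,
}

/// Honour the shared decision when there is one; otherwise degrade to
/// per-replica enforcement rather than dropping or blindly admitting.
pub fn enforce(shared: SharedDecision, local: &mut TokenBucket, now_secs: f64) -> bool {
    match shared {
        SharedDecision::Allowed => true,
        SharedDecision::Limited => false,
        SharedDecision::BackendUnavailable => local.try_acquire(now_secs),
    }
}
```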

  • Cross-provider fallback is now wired into request dispatch and hot-reload (#631, #637, #665)

  • The completed src/core/fallback/ module (~4,900 lines, 36 unit tests) was never called from the request path, so configured fallback.fallback_chains were a silent no-op. FallbackService now runs for chat_completions (non-web-search), completions, embeddings, rerank, sparse_embeddings, and image generation, with an execute_with_optional_fallback wrapper (no overhead when fallback is unconfigured), X-Fallback-* response headers, and a From<RouterError> to TriggerReason mapping that satisfies the executor bound (#631).
  • fallback.fallback_chains and fallback.fallback_policy changes now apply at runtime through the hot-reload subscriber via FallbackService::update_config; toggling fallback.enabled remains restart-only and is documented as such in config.yaml.example (#637, #665).

  • POST /v1/models/refresh force-refresh endpoint for interactive desktop use (#593, #664)

  • Clears the ModelCache (all_models key) and synchronously re-aggregates from all configured backends before responding, returning the same {"object":"list","data":[...]} shape as GET /v1/models. Desktop clients (e.g. backend.ai-go "Refresh models" button) can use the response immediately without a second round-trip.
  • Rate-limited per verified API key, with anonymous or invalid-token callers sharing one global anonymous bucket: 3 requests per 5-second burst window, 12 per minute. Callers that exceed the limit receive 429 Too Many Requests. The limit is intentionally tighter than the regular list endpoint because each call triggers an upstream fetch from every configured backend.
  • Gated by a new model_aggregation.allow_force_refresh: bool config field (default true). Setting it to false makes the endpoint return 403 Forbidden, suitable for hardened deployments where clients must rely on TTL-based expiry.
  • Each refresh logs the verified API key ID when available, or anonymous otherwise, at INFO level for audit correlation.
  • config.yaml.example extended with desktop-embedded guidance: cache_ttl: 10, soft_ttl_ratio: 0.5, and allow_force_refresh: true for a backend.ai-go style embedded proxy.
  • New force_refresh(state) helper on ModelAggregationService encapsulates the clear-then-aggregate flow. allow_force_refresh() accessor exposes the config flag to handlers.
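One way to picture the force-refresh budget (3 requests per 5-second burst window, 12 per minute, one shared bucket for anonymous callers) is a small sliding-window limiter. This is a sketch under assumptions: the changelog does not say which algorithm the router uses, and `RefreshLimiter` and `bucket_key` are hypothetical names.

```rust
use std::collections::VecDeque;

/// Choose the bucket: a verified key id, or one shared anonymous bucket so
/// spoofed headers cannot mint fresh budgets.
pub fn bucket_key<'a>(verified_key_id: Option<&'a str>) -> &'a str {
    verified_key_id.unwrap_or("anonymous")
}

/// Sliding-window limiter over request timestamps in whole seconds.
pub struct RefreshLimiter {
    hits: VecDeque<u64>,
}

impl RefreshLimiter {
    const BURST_LIMIT: usize = 3;
    const BURST_WINDOW_SECS: u64 = 5;
    const MINUTE_LIMIT: usize = 12;

    pub fn new() -> Self {
        RefreshLimiter { hits: VecDeque::new() }
    }

    /// Returns true when the call is admitted; false maps to HTTP 429.
    pub fn allow(&mut self, now: u64) -> bool {
        // Forget hits that have aged out of the one-minute window.
        while let Some(&oldest) = self.hits.front() {
            if now.saturating_sub(oldest) >= 60 {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        let in_burst = self
            .hits
            .iter()
            .filter(|&&t| now.saturating_sub(t) < Self::BURST_WINDOW_SECS)
            .count();
        if in_burst >= Self::BURST_LIMIT || self.hits.len() >= Self::MINUTE_LIMIT {
            return false;
        }
        self.hits.push_back(now);
        true
    }
}
```

A real deployment would keep one such limiter per bucket key, for instance in a map keyed by the result of `bucket_key`.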

  • AWS Bedrock Claude backend Phase 2: bedrock-runtime with SigV4 + AWS binary event-stream (#614)

  • New endpoint_type: runtime value on type: bedrock backends targets https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/invoke[-with-response-stream]. The router signs each request with AWS Signature V4 (service: "bedrock"), wraps the OpenAI → Anthropic body with "anthropic_version": "bedrock-2023-05-31", strips the top-level "model" field (Bedrock takes the model ID from the URL path), and percent-encodes the model identifier into the path so versioned foundation IDs (anthropic.claude-3-5-sonnet-20240620-v1:0) and full ARNs round-trip cleanly.
  • Streaming responses arrive in the AWS application/vnd.amazon.eventstream binary frame format. A new bedrock::event_stream::EventStreamDecoder reassembles frames that span multiple TCP reads, base64-decodes each chunk payload, and emits synthetic event: <type>\ndata: <json>\n\n SSE bytes for the existing AnthropicStreamTransformer to translate into OpenAI-shape SSE. Exception frames (ThrottlingException, ValidationException, ...) surface as synthetic event: error SSE chunks instead of being silently dropped.
  • AWS credentials resolve in this order: inline auth.aws.access_key_id + auth.aws.secret_access_key (+ optional session_token), then a named profile via auth.aws.profile, then the standard AWS chain (env vars, shared config, IMDS, IRSA / EKS pod identity, ECS task role). The resolver is fronted by aws_credential_types::provider::SharedCredentialsProvider so temporary credentials refresh transparently between requests.
  • New BackendAuthType::Sigv4 variant on BackendAuthConfig, plus an AwsAuthConfig sub-block under auth.aws. Both are wired through Debug redaction so static credentials never leak into logs. BackendAuthType accepts sigv4, aws_sigv4, and aws-sigv4 as YAML spellings.
  • Health check for runtime probes POST /model/{probe_model}/invoke with a single-token body; HTTP 2xx, 400, 401, 403, and 429 all count as healthy because they prove the AWS surface is reachable and the operator can address auth/billing issues separately.
  • All AWS SDK crates (aws-sigv4, aws-smithy-eventstream, aws-credential-types, aws-config) sit behind a new optional bedrock-sigv4 Cargo feature. Default builds do not pull them in; configuring endpoint_type: runtime without the feature returns a clear error pointing at the rebuild flag instead of failing at the AWS edge. The Phase 1 mantle path is unaffected and works with or without the feature.
  • Proxy header policy in src/proxy/backend.rs and src/http/handlers/anthropic/handler.rs splits on endpoint_type: Bedrock-mantle keeps Authorization: Bearer, Bedrock-runtime suppresses the static-Bearer injection because the Backend trait implementation signs each request with SigV4.
  • English and Korean docs in docs/{en,ko}/configuration/backends.md extended in place with the new endpoint_type: runtime configuration, build requirement, credential chain, IAM policy snippet, geo/global profile behaviour, and a streaming-pipeline overview.
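Two of the runtime details above lend themselves to a short sketch: building the invoke URL with a percent-encoded model ID, and classifying health-probe status codes. The function names are hypothetical, and the encoder below simply escapes everything outside the RFC 3986 unreserved set, which is an assumption about the encoding set rather than a statement of what the router does.

```rust
/// Percent-encode one path segment, leaving only RFC 3986 unreserved
/// characters untouched (so ':' and '/' in model IDs and ARNs are escaped).
pub fn encode_path_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for b in segment.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the bedrock-runtime invoke URL for a region and model ID.
pub fn invoke_url(region: &str, model_id: &str, streaming: bool) -> String {
    let action = if streaming { "invoke-with-response-stream" } else { "invoke" };
    format!(
        "https://bedrock-runtime.{region}.amazonaws.com/model/{}/{action}",
        encode_path_segment(model_id)
    )
}

/// Health probe rule from the entry: 2xx, 400, 401, 403, and 429 all prove
/// the AWS surface is reachable, so they count as healthy.
pub fn runtime_probe_is_healthy(status: u16) -> bool {
    (200..300).contains(&status) || matches!(status, 400 | 401 | 403 | 429)
}
```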

  • EXAONE 4.0 (vLLM) registration with a request-gated hybrid-model thinking transform (#640, refs #639)

  • EXAONE 4.0 (e.g. EXAONE-4.0-32B-FP8-RNGD) is a hybrid reasoning model: in reasoning mode it streams chain-of-thought inline in content ended by a lone </think>; in non-reasoning mode it emits a plain answer with no </think>.
  • The assume_reasoning_first (unterminated_start) transform is now gated on the request actually enabling thinking (chat_template_kwargs.enable_thinking or top-level enable_thinking; conservative default false) at both the HTTP and Unix-socket streaming decision points, so non-reasoning mode no longer emits the whole answer as reasoning_content with empty content. Standard-pattern models keyed off a real <think> marker are unaffected.
  • Registers exaone-4.0-32b with the unterminated_start config; the served -RNGD name resolves via the hardware-suffix peel below.

  • NPU/accelerator hardware-variant suffix normalization in model-id matching (#662)

  • Adds a HARDWARE/ACCELERATOR category (rngd, warboy, atom, atommax, rebel) to is_recognized_format_token() so FuriosaAI (RNGD/WARBOY) and Rebellions (ATOM/ATOMMAX/REBEL) serving-target suffixes normalize to the canonical base metadata entry through the existing layered peel chain, without a per-model alias. layered_format_strip() lowercases before peeling, so runtime-emitted upper-case names resolve correctly.
  • Exact-id and exact-alias phases run before the peel, so any model legitimately registered as *-atom or *-rebel wins via exact match first; a grep of shipped model-metadata.yaml confirms zero current collisions. EXAONE-4.0-32B-FP8-RNGD now normalizes to exaone-4.0-32b by peel (-rngd then -fp8).
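The exact-match-then-peel order described above can be sketched as follows. This is a simplified illustration: the token tables are a small subset (the `fp8` entry stands in for whatever format tokens the router already recognizes), and `peel_suffixes` and `resolve_model` are hypothetical names, not the router's `layered_format_strip()`.

```rust
/// Hardware/accelerator tokens named in the entry.
const HARDWARE_TOKENS: [&str; 5] = ["rngd", "warboy", "atom", "atommax", "rebel"];
/// Assumption: a one-element stand-in for the existing format tokens.
const FORMAT_TOKENS: [&str; 1] = ["fp8"];

fn is_recognized_token(token: &str) -> bool {
    HARDWARE_TOKENS.iter().any(|t| *t == token) || FORMAT_TOKENS.iter().any(|t| *t == token)
}

/// Lowercase, then peel recognized trailing `-token` suffixes one layer
/// at a time until the last segment is not a recognized token.
pub fn peel_suffixes(model_id: &str) -> String {
    let mut id = model_id.to_ascii_lowercase();
    while let Some(pos) = id.rfind('-') {
        if is_recognized_token(&id[pos + 1..]) {
            id.truncate(pos);
        } else {
            break;
        }
    }
    id
}

/// Exact match wins first, so a model registered as `*-atom` is never
/// peeled; only unmatched IDs fall through to the suffix peel.
pub fn resolve_model(model_id: &str, registered: &[&str]) -> Option<String> {
    let lower = model_id.to_ascii_lowercase();
    if registered.iter().any(|r| *r == lower) {
        return Some(lower);
    }
    let peeled = peel_suffixes(&lower);
    if registered.iter().any(|r| *r == peeled) {
        Some(peeled)
    } else {
        None
    }
}
```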

Changed

  • Narrow the fallback handler's LM Studio compatibility shim so only / and /v1/models return 200; all other unmatched routes now return 404. The JSON error body shape is preserved unchanged, so any consumer already reading the body continues to work (#628).

Fixed

  • Strip seven non-OpenAI top-level fields (chat_template_kwargs, thinking_budget_tokens, enable_thinking, preserve_thinking, top_k, min_p, repeat_penalty) before forwarding to Gemini's /v1beta/openai/chat/completions endpoint, which returns HTTP 400 INVALID_ARGUMENT for unknown keys; the extra_body escape hatch is untouched and reasoning_effort stays (Google maps it to thinking_level). Also extend the 3.5-flash thinking-disable and is-thinking matchers, since 3-flash did not substring-match 3.5-flash (#642).
  • Wire request-stats recording into every Anthropic handler code path (native HTTP/Unix, Bedrock mantle/runtime, OpenAI-compatible, Responses API) for both streaming and non-streaming, adding an AnthropicStreamUsageTracker that accumulates input/output tokens from raw passthrough SSE (#627, #634).
  • Use the configured timeouts.request.streaming.chunk_interval instead of a hardcoded 60s for mid-stream inactivity, and emit bounded keep-alives so a silent backend now advances to the next fallback model via StreamOutcome::Failed rather than emitting keep-alive comments forever (#633).
  • Accept partial model_overrides.<model>.streaming/standard blocks via new StreamingTimeoutOverride/StandardTimeoutOverride structs whose Option<String> fields merge over the base config, fixing YAML parse failures on the --generate-config output path (#630).
  • Gate admin/metrics/metrics-persistence/webui imports and functions behind their Cargo features, and add a #[cfg(not(feature = "metrics"))] no-op metrics stub mirroring the public surface used by always-compiled callers, so feature-reduced builds compile cleanly (#629, #666, closes #636).
  • Harden force-refresh rate limiting so anonymous and invalid-token callers share one global bucket, preventing spoofed Authorization/X-Forwarded-For/X-Real-IP headers from bypassing the budget.
  • Metrics history query limiting, UTF-8-safe metric label truncation, and Bedrock runtime routing through the typed SigV4 implementation (follow-up to #608, #609, #613, #614).
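The partial-override merge for model_overrides timeouts can be illustrated with a two-field sketch. Only `chunk_interval` is a field name taken from this changelog; `first_byte`, the struct layout, and the `merge_over` method are assumptions for illustration.

```rust
/// Fully-populated base timeouts (durations kept as strings, as in YAML).
#[derive(Debug, Clone, PartialEq)]
pub struct StreamingTimeouts {
    pub first_byte: String,
    pub chunk_interval: String,
}

/// Partial override: every field is optional, so a block that sets only
/// one value still parses.
#[derive(Debug, Clone, Default)]
pub struct StreamingTimeoutOverride {
    pub first_byte: Option<String>,
    pub chunk_interval: Option<String>,
}

impl StreamingTimeoutOverride {
    /// Fields present in the override win; missing fields inherit the base.
    pub fn merge_over(&self, base: &StreamingTimeouts) -> StreamingTimeouts {
        StreamingTimeouts {
            first_byte: self
                .first_byte
                .clone()
                .unwrap_or_else(|| base.first_byte.clone()),
            chunk_interval: self
                .chunk_interval
                .clone()
                .unwrap_or_else(|| base.chunk_interval.clone()),
        }
    }
}
```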

Documentation

  • Document the Bedrock backend (region selection, geographic vs global inference profiles, model ID format, credential chain, IAM policy snippet, and streaming pipeline) in docs/{en,ko}/configuration/backends.md; add a Force-Refresh Models section to docs/en/api.md; and extend config.yaml.example with desktop-embedded model-aggregation guidance and fallback hot-reload annotations.

Tests

  • Negative and positive case coverage for transform_payload_for_openai (#661).
  • Rate-limit middleware hot-reload tests with documented bucket-reset behaviour, plus router_wiring_tests that build a real ContinuumRouter and assert 429 fires when the burst is exhausted (#635, #638, #667).
  • Verify web_search injection interacts correctly with the passthrough contract (#663).
  • MLxcel streaming passthrough integration test (#659).
  • Bedrock unit and integration coverage: serde aliases, URL templating, header policy, model-ID parsing (geo/global/ARN), runtime SigV4, and event-stream frame decoding driven against a wiremock server (#616, #614).

Dependencies

  • Bump tokio 1.52.1 → 1.52.3, tower-http 0.6.8 → 0.6.11, dashmap 6.1.0 → 6.2.1, serde_json 1.0.149 → 1.0.150, aws-config 1.8.16 → 1.8.17, aws-sigv4 1.4.3 → 1.4.4, and aws-smithy-types 1.4.7 → 1.4.8 (#619, #658).

v1.6.3 - 2026-05-12

Added

  • Per-API-key LLM token usage metrics (#608, #610)
  • New Prometheus llm_tokens_total{api_key_id, model, backend, kind} counter that records actual prompt and completion token consumption per API key, model, backend, and token kind. The hot-path counter's label set is intentionally minimal — extra dimensions live on the companion info-metric below.
  • Companion api_key_info{api_key_id, ...} info-metric exposes a configurable allowlist of per-API-key annotation labels (e.g. email, team, environment) so dashboards can group/filter the token counter via standard PromQL * on(api_key_id) group_left(...) joins without bloating the hot-path counter's label set.
  • derive_api_key_id returns either the configured id (when the auth layer matched the request) or a SHA-256 first-12-hex prefix k_<hex> of the raw bearer token. The raw key is never used as a label. A dedicated ApiKeyCardinalityTracker (default cap: 1000 unique key IDs) prevents label-cardinality explosion.
  • ApiKeyConfig and the in-memory ApiKey gain an annotations: HashMap<String, String> field. MetricsConfig gains annotation_labels: Vec<String> — the allowlist that materializes as labels on api_key_info. Reserved canonical annotation keys are documented (email, uuid, owner, team, environment); operators may add custom keys.
  • Streaming and non-streaming paths record at the existing usage parse site through a new StreamObservabilityContext field on StreamTransformConfig, threaded through handle_anthropic_streaming / handle_gemini_streaming / handle_successful_backend_response so OpenAI-compat / Anthropic / Gemini / thinking-pattern streaming response builders all emit the counter without duplicate parsing. The router already injects stream_options.include_usage=true for OpenAI-compat backends, so streaming metrics work uniformly regardless of client opt-in.
  • api_key_info is initialized once at startup from metrics.annotation_labels; label names are frozen at registration (Prometheus does not allow renaming labels). Annotation values hot-reload through the existing config-watch path via ApiKeyStore::refresh_info_metric, called from load_from_config, add_key, and remove_key_by_id so admin operations stay in sync.
  • All label values flow through CardinalityManager / sanitize_label_value. Annotation values use a slightly less strict sanitize_annotation_value that preserves @, +, and : so emails and namespaced identifiers round-trip cleanly.
  • Persistent local metrics log backed by SQLite with configurable retention (#609, #611)
  • New MetricsStore async trait + bundled rusqlite v1 implementation under src/metrics/persistence/ with WAL mode, prepared-statement cache, and PRAGMA user_version schema versioning (store.rs / sqlite.rs / snapshot.rs / snapshot_task.rs). Histograms and summaries are fanned into row-per-sample form.
  • Counters and gauges are NEVER restored on startup — the persistent log is a separate read path so the live /metrics endpoint keeps Prometheus monotonic-counter semantics. Historical samples are read through a new GET /admin/metrics/history?metric=...&from=...&to=... surface (src/admin_metrics_history.rs) that returns 404 when persistence is disabled at runtime and 503 when the feature is not compiled in. No PromQL in v1.
  • Hot-reload pipeline in src/server/serve.rs translates config changes into PersistenceCommand::{SetSnapshotInterval, SetRetentionDays, SetCompaction} messages, atomically rebuilding the ticker and prune cutoff without dropping in-flight snapshots.
  • Compaction schedule honors a minute hour * * * cron subset to avoid pulling in a full cron crate for what is effectively a daily timer.
  • Defaults to enabled: true; switch off via metrics.persistence.enabled: false. The redb and duckdb variants are reserved keywords in the YAML schema and return NotImplemented at startup until they get implementations.
  • Disk usage: measured ~119 bytes/sample on a synthetic 100-series × 10-snapshot workload (see tests/metrics_persistence_test::disk_usage_smoke_check_under_synthetic_load). Documented formula in docs/en/persistent-metrics.md and config.yaml.example.
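The `minute hour * * *` cron subset used for the compaction schedule is simple enough to parse by hand, which is the point of avoiding a full cron crate. The sketch below is an assumption about how such a parser could look: `parse_daily_cron` is a hypothetical name, and whether the real parser accepts wildcards in the first two fields is not stated in the entry.

```rust
/// Parse a `minute hour * * *` expression into `(minute, hour)`.
/// Anything outside this daily-timer subset is rejected.
pub fn parse_daily_cron(expr: &str) -> Result<(u8, u8), String> {
    let fields: Vec<&str> = expr.split_whitespace().collect();
    if fields.len() != 5 {
        return Err(format!("expected 5 fields, got {}", fields.len()));
    }
    // Day-of-month, month, and day-of-week must all be wildcards.
    if fields[2..].iter().any(|f| *f != "*") {
        return Err("only `minute hour * * *` is supported".to_string());
    }
    let minute: u8 = fields[0]
        .parse()
        .map_err(|_| format!("invalid minute: {}", fields[0]))?;
    let hour: u8 = fields[1]
        .parse()
        .map_err(|_| format!("invalid hour: {}", fields[1]))?;
    if minute > 59 || hour > 23 {
        return Err("minute must be 0-59 and hour 0-23".to_string());
    }
    Ok((minute, hour))
}
```

For example, `30 3 * * *` would schedule compaction at 03:30 each day.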

Fixed

  • Coerce token-usage label values to &str in with_label_values so the release build no longer fails type inference. Mixing &String label variables with a &str literal ("prompt" / "completion") made the compiler pick &[&String] and reject the literal — regression introduced in #610.

Documentation

  • Korean translations for the two metrics features (#612)
  • docs/ko/metrics.md: new ### API 키별 LLM 토큰 사용량 section covering llm_tokens_total, api_key_id derivation, annotation_labels allowlist, api_key_info info-metric, PromQL examples, Grafana panel, and verification steps.
  • docs/ko/persistent-metrics.md: new page translating docs/en/persistent-metrics.md (SQLite-backed snapshot semantics, configuration fields, disk-usage formula, /admin/metrics/history surface, schema layout, operational notes).
  • docs/ko/admin-api.md: insert ## 지속 메트릭 로그 API section between Stats and Response Cache, plus a matching TOC entry.
  • zensical.ko.toml: add 지속 메트릭 로그 nav entry under 운영 so the page is reachable from the Korean sidebar.
  • New docs/en/metrics.md ### Per-API-Key LLM Token Usage section covering metric definition, api_key_id derivation rules, annotation config schema, cardinality and hot-reload semantics, example PromQL (tokens-per-email, top-10 keys, per-team rollup), a Grafana panel example, and verification steps. config.yaml.example gains a documented metrics.annotation_labels block and an annotations: example under each API-key entry.

Tests

  • Per-API-key token-usage unit coverage: derive_api_key_id priority (configured id wins; otherwise hash; otherwise anonymous), determinism, hash format ^k_[0-9a-f]{12}$, annotation-label normalization, info-gauge one-time-init, refresh atomicity, cardinality bounds, email-preserving annotation sanitizer; streaming-transformer write-through; middleware annotation-snapshot exposure; integration coverage in tests/metrics_integration_test.rs (4-label counter with both kinds, anonymous fallback, hash regex). (#610)
  • Persistent-metrics SQLite store unit coverage (insert, query by time range, retention deletion, idempotent open, unknown-kind round-trip) and integration coverage in tests/metrics_persistence_test.rs (snapshot task lands rows in SQLite, retention prunes only old samples, retention hot-reload preserves in-flight snapshots, disk-usage smoke check). (#611)

v1.6.2 - 2026-05-10

Fixed

  • /v1/responses and /v1/chat/completions now accept the OpenAI reasoning-API developer role (#603, #605, #606)
  • Add MessageRole::Developer with a serde lowercase rename so "developer" deserializes as a first-class variant. The previous failure surfaced as a misleading did not match any variant of untagged enum ResponseInput rather than naming the unknown role; the implicit-message deserialization error now names the offending role string and lists the valid roles.
  • Per-backend translation: pass through as developer for OpenAI-compatible servers; merge into the Anthropic top-level system parameter (concatenated with \n\n when both system and developer text are present, fixing a pre-existing overwrite bug); merge into Gemini system_instruction; map to system for Ollama (older builds reject developer).
  • Chat Completions → Responses converter recognizes developer as instruction-bearing: the first occurrence becomes top-level instructions; subsequent occurrences remain as input items with their original role preserved on the wire.
  • Treat developer and system equivalently in cross-cutting string-based recognition sites: prefix-cache key extraction, cross-provider fallback translation, the OpenAI-to-Anthropic transform's system-content extraction, the global-prompt injector's existing-system-message lookup, and the smart-routing classifier / LLM prompt builder.
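For the Anthropic translation described above, the merge of system and developer text into one top-level system value might look like the sketch below. The `Message` struct and `split_system_prompt` are hypothetical, and joining in order of appearance is an assumption; the `\n\n` separator and the rule that neither role overwrites the other come from the entry.

```rust
/// Minimal chat message for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Pull `system` and `developer` messages out of the list and join their
/// text with a blank line, returning the merged prompt and the remaining
/// conversation messages.
pub fn split_system_prompt(messages: &[Message]) -> (Option<String>, Vec<Message>) {
    let mut parts: Vec<&str> = Vec::new();
    let mut rest: Vec<Message> = Vec::new();
    for m in messages {
        match m.role.as_str() {
            // Both roles are instruction-bearing; concatenate, never overwrite.
            "system" | "developer" => parts.push(m.content.as_str()),
            _ => rest.push(m.clone()),
        }
    }
    let system = if parts.is_empty() {
        None
    } else {
        Some(parts.join("\n\n"))
    };
    (system, rest)
}
```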

Documentation

  • Migrate the docs site from MkDocs to Zensical and restore brand styling (#602)
  • Remove mkdocs.yml and mkdocs.ko.yml in favor of native zensical.toml and zensical.ko.toml, both rooted under the [project] namespace per Zensical's TOML schema; per-extension options live inside [project.markdown_extensions] as a dict (Zensical's config loader ignores any separate mdx_configs table).
  • Replace docs/en/shared and docs/ko/shared symlinks with rsync -a --delete docs/shared/ docs/{en,ko}/shared/ invoked before each build, since Zensical does not follow symlinks for asset directories.
  • Register the lablup brand color via Zensical's documented primary = "custom" mechanism plus a [data-md-color-scheme="default"][data-md-color-primary="custom"] selector in docs/shared/stylesheets/extra.css that defines the orange CSS variables.
  • Mermaid is registered as a pymdownx.superfences custom fence rather than relying on the now-incompatible mermaid2 plugin; favicon falls back to logo.png when missing.
  • Restore Zensical render output for icons, diagrams, and brand color (#604)
  • Re-enable pymdownx.emoji with the zensical.extensions.emoji twemoji index/generator (replaces the removed materialx) so :material-*: icon syntax stops rendering as literal text.
  • Reimplement the <!-- diagram: PATH --> ... <!-- /diagram --> ASCII-replacement as a Python-Markdown extension (docs/hooks/diagram_extension.py); the prior MkDocs on_page_content hook does not run because Zensical exposes no MkDocs hook lifecycle. Add docs/__init__.py and prefix builds with PYTHONPATH=. so the extension is importable from Zensical's console-script entry point.
  • Set --md-primary-bg-color on the custom palette and override .md-header / .md-tabs so the orange brand band paints on top of Zensical's modern layout.
  • Move the nav table above the first [project.X] sub-table in both TOMLs so it stops being silently parsed under [[project.extra.social]] (alphabetical fallback was producing an unsorted top menu and wrong prev/next footer neighbors).

Tests

  • Regression coverage for system/developer concatenation in the Anthropic transform on both streaming and non-streaming paths, plus per-backend converter mapping for the developer role across all five backends and the Chat Completions → Responses converter's developer-then-system ordering (#605, #606).

Dependencies

  • Bump redis 1.2.0 → 1.2.1 (#598).

v1.6.1 - 2026-05-07

Fixed

  • Claude Opus 4.7 (claude-opus-4-7) now routes correctly through the Anthropic backend (#599, #600, #601)
  • Extended the adaptive thinking API gate (uses_adaptive_thinking_api) to include 4.7-series model IDs. Claude Opus 4.7 requires thinking.type == "adaptive" + output_config.effort; sending the legacy budget_tokens shape produces HTTP 400.
  • Added model_requires_adaptive_thinking and model_forbids_sampling_params predicates for 4.7-series request-shape rules: explicit manual thinking is normalized to adaptive thinking and temperature, top_p, and top_k are dropped unconditionally before forwarding.
  • Extended opus_supports_max_effort to include Opus 4.7 so xhigh reasoning effort maps to output_config.effort = "max" on Opus 4.7.
  • Added claude-opus-4-7 and claude-opus-4-7-latest to the built-in supported-models list and to model-metadata.yaml; the speculative claude-sonnet-4-7 entry is intentionally not advertised until Anthropic publishes it (defensive request-shape matching is retained for user-supplied configurations).

Documentation

  • Update reasoning-effort docs (EN + KO) and backends.md to cover the Claude 4.7 family adaptive-thinking requirement and unconditional sampling-parameter deprecation (#600).

Tests

  • Responses API regression coverage for Opus 4.7 adaptive thinking and unconditional sampling-parameter stripping; both transform paths (Chat Completions and Responses) for the 4.7 family with negative regression on Opus 4.6 / Sonnet 4.6 / Haiku 4.5 / Haiku 3.5 (#600, #601).

v1.6.0 - 2026-05-04

Added

  • ChatGPT subscription / Codex backend authentication via OAuth device flow (#551, #592)
  • continuum-router auth login --backend <name> runs the OpenAI Codex three-step headless device-code flow: POST /api/accounts/deviceauth/usercode to mint a one-time user_code, POST /api/accounts/deviceauth/token polling, and a PKCE exchange at /oauth/token. Standards-compliant RFC 8628 device flow remains available for any future provider that implements it; the new OpenAICodexDeviceFlowClient is selected automatically for provider: openai.
  • Tokens are wrapped in SecretString, written to the configured token_store with mode 0600 on Unix using an O_CREAT|O_EXCL open + atomic rename; a random tempfile suffix prevents concurrent saves from colliding, and a partial write is unlinked on failure so secret material does not linger on disk.
  • Access-token expiry is parsed from the JWT exp claim (with a 1-hour fallback for non-JWT tokens) and clamped to a useful minimum so a degenerate expires_in from the provider cannot trigger a refresh storm.
  • Proactive refresh fires 60 s before expiry, single-flighted with a tokio::sync::Mutex. A 401 from the upstream backend triggers exactly one forced refresh and a single retry; the previous refresh token is preserved race-free when the provider omits refresh_token from a refresh response.
  • The strategy reports an identity_fingerprint() (backend name, client_id, token_store) so that hot-reload rebuilds the strategy when any of those rotate, instead of silently keeping the prior in-memory state.
  • The CLI strips C0/C1 control characters from verification_uri_complete and user_code before printing, so a hostile provider response cannot inject ANSI escapes that rewrite the terminal.
  • Every device-flow and runtime request to auth.openai.com / chatgpt.com/backend-api/codex carries originator: codex_cli_rs (configurable via auth.oauth.originator) and a codex_cli_rs/<version> User-Agent (configurable via auth.oauth.user_agent), matching the official Codex CLI so Cloudflare admits the traffic instead of returning a 403 JS challenge.
  • auth.type: oauth is accepted in YAML alongside the legacy o_auth snake_case rendering. client_id and scope default to the public Codex CLI values; only token_store is required for the ChatGPT-subscription case.
  • Anthropic Messages and Chat Completions surfaces both transparently route to the ChatGPT Codex backend (#592)
  • Any backend whose auth.type is oauth and whose provider uses the Codex flow (currently openai) is forced through the Responses API for every request, regardless of per-model responses_only metadata. chatgpt.com/backend-api/codex exposes /responses only (no /chat/completions), so chat-shaped models (e.g. gpt-5.5, alias-mapped claude-haiku-4-5) and unknown model IDs all dispatch through /v1/responses → …/backend-api/codex/responses. Non-OAuth OpenAI backends continue to honor the per-model responses_only flag.
  • New core::url_utils::compose_backend_url centralizes backend URL composition for the three OpenAI-compatible roots (/v1, /openai, /backend-api/codex). Replaces ad-hoc ends_with("/v1") || ends_with("/openai") checks across proxy/backend.rs, http/handlers/responses.rs, http/streaming/handler.rs, services/responses/stream_service.rs, and the Anthropic handler so the /backend-api/codex rule applies uniformly.
  • The proxy hot path (proxy/backend.rs, proxy/responses_only.rs, proxy/image_gen.rs, proxy/image_edit.rs) now flows through a backend-name-keyed AuthStrategyRegistry exposed on AppState via src/proxy/oauth_helper.rs. The helper looks up the strategy, calls refresh_if_needed() before sending, replaces the static-bearer header with one derived from the strategy, and force-refreshes + retries once on a 401. Static api_key auth continues to work unchanged when no strategy is registered.
  • The Anthropic-compatible handler (src/http/handlers/anthropic/handler.rs) consults the same registry. Client-supplied Authorization: sk-ant-… and x-api-key headers are dropped when the backend has an OAuth strategy, instead of being forwarded to OpenAI as the bearer.
  • Model fetcher detects OAuth-authed backends and falls back to the configured models list rather than probing /v1/models, since chatgpt.com/backend-api/codex does not expose a models endpoint.
  • Codex-compatible Responses API extensions (#536, #537)
  • POST /v1/responses/compact endpoint for context compaction — passthrough to OpenAI / Azure OpenAI native /v1/responses/compact; other backend types return 501.
  • store field on ResponsesRequest (defaults to true) controls upstream session persistence; Codex sends store: false for ephemeral requests.
  • output_text content part type alongside input_text so converters can differentiate assistant vs. user content in input items. All converters (OpenAI, Anthropic, Gemini) handle the new variant.
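
The control-character stripping that the CLI applies to verification_uri_complete and user_code can be approximated as below. This is a sketch under assumptions: the function name `strip_control_chars` is hypothetical, and it relies on `char::is_control`, which matches the Unicode `Cc` category (the C0 range, DEL, and the C1 range).

```rust
/// Removes C0/C1 control characters (and DEL) so that provider-supplied
/// text cannot smuggle terminal escape sequences into CLI output.
/// Hypothetical helper; the router's actual function is not shown here.
pub fn strip_control_chars(input: &str) -> String {
    input.chars().filter(|c| !c.is_control()).collect()
}
```

Only the control bytes are removed. The printable remainder of an escape sequence (for example `[31m`) stays in the output as inert text.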

Documentation

  • Sync Codex / Responses-API extensions across the root CHANGELOG.md and the Korean docs (ko/configuration/backends.md, ko/configuration/advanced.md, ko/api.md, ko/architecture.md); resolve all zensical build warnings on both EN and KO builds and preserve unicode in toc anchor slugs via pymdownx.slugs.slugify (#596).
  • Clean up AI-slop patterns across English and Korean mkdocs sources — replace em dashes in prose, remove filler/slop words, rewrite trailing participial clauses and inflated verbs, collapse colon+bullet AI-style intros, and replace closing summary slop with concrete next-action links (#597).

CI/CD

  • Bump apple-actions/import-codesign-certs from 6 to 7 (#590).

Dependencies

  • Bump tokio 1.51.0 → 1.52.1, axum 0.8.8 → 0.8.9, reqwest 0.13.2 → 0.13.3, clap 4.6.0 → 4.6.1, fastrand 2.4.0 → 2.4.1, uuid 1.23.0 → 1.23.1, rand 0.10.0 → 0.10.1, and lru 0.16.3 → 0.16.4 (#595).

v1.5.6 - 2026-04-29

Fixed

  • /v1/chat/completions returned HTTP 502 responses_parse_failed for responses_only reasoning models (gpt-5.4-pro, gpt-5.5-pro). OpenAI's /v1/responses payload for these models contains output items shaped like { "id": "rs_...", "type": "reasoning", "summary": [] }, but OutputItem::Reasoning required content and status, so serde rejected the payload with missing field 'content'. The Anthropic Messages surface bypassed the strict variant on a different conversion path, masking the bug until directly tested. content and status are now optional on OutputItem::Reasoning; reasoning items are dropped before reaching Chat Completions clients (per existing project policy), so body shape is irrelevant beyond successful deserialization. (#594)

Changed

  • Realign gemini-3.1-pro-preview as the canonical metadata id for the Gemini 3.1 Pro family in model-metadata.yaml, with gemini-3.1-pro (and existing -latest / -customtools forms) demoted to aliases. Matches what generativelanguage.googleapis.com actually serves today — the canonical gemini-3.1-pro form returns 404 from upstream — and avoids implying GA availability that does not exist yet. The metadata cache still resolves both forms to the same entry. Note: alias-to-canonical rewriting on the upstream-bound payload is out of scope for this release; clients calling with the gemini-3.1-pro alias will still hit upstream 404 until that work lands. (#594)
  • Sample config.yaml registers the newly-available pro / 5.5 family models so the responses_only dispatch path can be exercised end-to-end against real upstreams (gpt-5.4-pro, gpt-5.2-pro, gpt-5.5, gpt-5.5-pro, claude-opus-4-7, gemini-3.1-pro, gemini-3.1-pro-preview); duplicate claude-haiku-4-5 entry removed.

v1.5.5 - 2026-04-27

Added

  • Transparent Responses-API routing for OpenAI Pro models (epic #581)
  • New responses_only: true capability flag in model-metadata.yaml and the built-in OpenAI registry marks gpt-5.2-pro, gpt-5.4-pro, and gpt-5.5-pro as served only on /v1/responses upstream (#574, #582)
  • /v1/chat/completions requests for responses_only models are dispatched to the upstream /v1/responses endpoint and translated back into a strict-mode chat.completion (or chat.completion.chunk for streaming) envelope, transparent to the client. Stream usage is gated by stream_options.include_usage, and per-model latency / success counters are recorded for the responses_only path (#578, #584)
  • /anthropic/v1/messages requests for responses_only models are converted to the Responses API shape, dispatched to /v1/responses, and translated back into Anthropic Messages JSON (or the Anthropic SSE event sequence for streaming) — tool-call round-trips, web-search emulation, and Unix-socket transports all branch on the flag (#575, #577, #583, #585, #586)
  • Anthropic Messages <-> Responses request transformer covers system → instructions, tools, tool_choice (including disable_parallel_tool_use → parallel_tool_calls: false), max_tokens → max_output_tokens, reasoning effort derivation, and multi-turn tool round-trips; the response transformer preserves thinking/text/tool_use ordering and stop-reason fidelity (#575, #583)
  • SSE streaming bridge (AnthropicResponsesStreamTranslator) maps Responses API events to Anthropic Messages events while preserving Anthropic's strict event-ordering invariants (single message_start, paired content_block_start/content_block_stop, terminal message_stop); handles mid-stream error / response.failed / response.cancelled, response.incomplete → stop_reason: max_tokens, deferred input tokens, and graceful early-close synthesis (#576, #585)
  • Only OpenAI and Azure OpenAI backends serve /v1/responses; pairing a responses_only model with another backend type produces a 400 invalid_request_error before any upstream call (rejection fires on both /v1/chat/completions and /anthropic/v1/messages surfaces) (#577, #589)
  • The first dispatch per (backend, model) pair logs at info level so operators can confirm Responses-API routing without enabling debug logs
  • Anthropic Messages → Responses requests explicitly send store: false to avoid upstream side-effects (#589)
  • 22 deterministic, in-process integration tests covering the {Anthropic, Chat} × {gpt-5.4-pro, gpt-5.2-pro} × {non-streaming, streaming} × {plain, tool-call, reasoning} matrix, mid-stream backend-failure negatives on both surfaces, and an upstream byte-fragmentation regression guard (#579, #588)
  • Documented in docs/en/configuration/advanced.md (Responses-API-only Models section split into Models-marked-out-of-the-box, Marking-a-new-model, Dispatch-behavior, and Backend-type-constraint subsections), docs/en/architecture.md (Responses-API Routing data-flow diagram), and the docs/en/api.md Chat Completions and Anthropic Messages surface notes with a Transparent-Responses-API-routing subsection (#580, #587)
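
The backend-type constraint described above (only OpenAI and Azure OpenAI serve /v1/responses; other pairings are rejected with a 400 before any upstream call) can be sketched as a small decision function. The `BackendType` and `Dispatch` enums and the `plan_dispatch` name are illustrative assumptions, not the router's types.

```rust
/// Hypothetical subset of backend types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendType {
    OpenAI,
    AzureOpenAI,
    Anthropic,
    Gemini,
    Vllm,
}

/// Outcome of the pre-dispatch check.
#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// Model is not responses_only: use the surface's normal upstream path.
    Native,
    /// Send to the upstream /v1/responses endpoint and translate back.
    Responses,
    /// Reject with 400 invalid_request_error before any upstream call.
    Reject400,
}

pub fn plan_dispatch(responses_only: bool, backend: BackendType) -> Dispatch {
    if !responses_only {
        return Dispatch::Native;
    }
    match backend {
        BackendType::OpenAI | BackendType::AzureOpenAI => Dispatch::Responses,
        _ => Dispatch::Reject400,
    }
}
```

Keeping the check ahead of the upstream call means a misconfigured pairing fails fast with a client error rather than an upstream 404.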

Fixed

  • Chat Completions responses-only routing now rejects configurations that contain only incompatible backends before upstream dispatch, and chooses a compatible OpenAI/Azure Responses backend when one is available (#589)
  • Chat assistant tool_calls[] are preserved as Responses function_call input items for stateless tool-result turns over /v1/chat/completions (#589)

v1.5.4 - 2026-04-25

Changed

  • Refresh model-metadata.yaml for late-April 2026 frontier model releases (#572, #573)
  • Add GPT-5.5 ($5/$30 per 1M, 1M context, knowledge cutoff 2025-12, omnimodal, leads Terminal-Bench 2.0 at 82.7%) and GPT-5.5 Pro ($30/$180 per 1M, Responses API only, deep reasoning) — released 2026-04-23
  • Add DeepSeek V4 Pro (1.6T total / 49B active MoE, 1M context, 384K max output, three reasoning effort modes) and DeepSeek V4 Flash (284B total / 13B active MoE, 1M context, 384K max output) with deepseek-chat and deepseek-reasoner retained as deprecated aliases per official API docs — released 2026-04-24
  • Add gpt-image-2 (token-billed instead of per-image: text $5/$30, image $8/$30 per 1M tokens; 1K/2K/4K resolution tiers; ~99% text accuracy in any language; built-in reasoning before generation; context-aware multi-turn editing; gpt-image-2-latest alias) — released 2026-04-21
  • Add Claude Opus 4.7 ($5/$25 per 1M, 1M context, 128K max output, knowledge cutoff 2026-01, high-resolution image support up to 2576px / 3.75MP, new tokenizer with ~1.0–1.35× token usage vs prior models, new xhigh effort level) — released 2026-04-16
  • Promote Gemini 3.1 series from preview to GA, retaining -preview suffix as alias for fallback compatibility (#573)
  • gemini-3.1-pro-preview → gemini-3.1-pro (with gemini-3.1-pro-preview, gemini-3.1-pro-preview-customtools, and gemini-3.1-pro-latest aliases)
  • gemini-3.1-flash-image-preview → gemini-3.1-flash-image (with gemini-3.1-flash-image-preview, nano-banana-2, and gemini-3.1-flash-image-latest aliases)
  • gemini-3.1-flash-lite-preview → gemini-3.1-flash-lite (with gemini-3.1-flash-lite-preview and gemini-3.1-flash-lite-latest aliases)
  • Updated gemini-3-flash-preview deprecation note to point to the new GA gemini-3.1-pro id

v1.5.3 - 2026-04-23

Added

  • HuggingFace repo-prefix stripping as a new matching phase (phase 5) in src/models/pattern_matching.rs (#555)
  • try_strip_hf_repo_prefix() validates a vendor/repo (or org/team/repo) prefix against a MAX_PREFIX_SEGMENTS = 3 bound, rejects empty segments (/repo, vendor/, vendor//repo), and rejects any ASCII whitespace before returning the residual
  • Phase 5 re-enters phases 1-4 on the stripped residual with a structurally-enforced recursion depth of exactly 1 (the re-entry call clears the allow_prefix_strip gate), so prefix stripping composes with the existing layered suffix peel in a single lookup — the motivating case unsloth/Qwen3.6-35B-A3B-GGUF now resolves to qwen3.6-35b-a3b without any hand-registered alias
  • Phase 5 runs before the wildcard phase; the blast-radius audit confirmed no *-bearing alias in model-metadata.yaml contains /, so the ordering change is behavior-neutral for existing routing
  • Phase numbering in tracing output realigned to match the documented phase chain (previous code emitted phase = 7 for the namespace fallback while comments called it phase 6)
  • 12 new unit tests covering standard HF form, composition with suffix peel, case-sensitive vendor, registered-alias precedence, unresolvable residual, three-segment form, segment-cap rejection, no-slash input, whitespace rejection, empty segments, re-entry bounding, and alias-phase precedence
  • 9 new integration tests in tests/format_suffix_normalization_test.rs exercising the full RouterConfig / BackendConfig public API through phase 5
  • Pipeline doc updated in docs/en/configuration/advanced.md (and Korean counterpart) with a new "HuggingFace repo-prefix stripping (phase 5)" section covering the composition semantics, security bounds, and out-of-scope list (hyphen prefixes, HF API discovery)
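
The validation rules listed for try_strip_hf_repo_prefix() can be illustrated with the sketch below. It is not the function from src/models/pattern_matching.rs: the name `strip_hf_repo_prefix_sketch` is hypothetical, and it assumes the MAX_PREFIX_SEGMENTS bound counts every slash-separated segment, including the final repo name.

```rust
const MAX_PREFIX_SEGMENTS: usize = 3;

/// Returns the residual after the last '/' when the input looks like
/// `vendor/repo` or `org/team/repo`; returns None otherwise.
/// Assumption: the segment cap applies to the total segment count.
pub fn strip_hf_repo_prefix_sketch(model_id: &str) -> Option<&str> {
    // No slash means there is no prefix to strip.
    if !model_id.contains('/') {
        return None;
    }
    // Reject any ASCII whitespace anywhere in the ID.
    if model_id.chars().any(|c| c.is_ascii_whitespace()) {
        return None;
    }
    let segments: Vec<&str> = model_id.split('/').collect();
    // Reject deep paths such as provider/deep/nested/model.
    if segments.len() > MAX_PREFIX_SEGMENTS {
        return None;
    }
    // Reject empty segments: "/repo", "vendor/", "vendor//repo".
    if segments.iter().any(|s| s.is_empty()) {
        return None;
    }
    segments.last().copied()
}
```

In the motivating case, the residual `Qwen3.6-35B-A3B-GGUF` would then re-enter phases 1-4, where the suffix peel handles the remaining `-GGUF` token.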

Changed

  • Replaced the previous phase-6 namespace fallback with the new phase-5 HuggingFace prefix-strip layer. The previous phase was case-sensitive and did not compose with suffix peel; the new phase applies stricter input validation (segment cap, empty-segment rejection, whitespace rejection) but composes with phase 4's case-insensitive peel through the bounded re-entry. Pathological inputs above MAX_PREFIX_SEGMENTS (3) — such as provider/deep/nested/model — are now rejected by phase 5 rather than silently matched via recursive rsplit_once fallback (#555)
  • Aliases currently classified as vendor-prefix in the #560 audit (e.g., Qwen/Qwen3.6-35B-A3B, MiniMaxAI/MiniMax-M2.5) are now peel-coverable-adjacent post-#555: phase 2 still wins on the explicit alias, but phase 5 + phase 4 together reach the same metadata. Retroactive removal is deferred to a follow-up audit per #555 design section 7

Fixed

  • POST /anthropic/v1/messages now works when the selected backend is configured with a unix:// URL (#567)
  • Native Anthropic backends and OpenAI-compatible backends both work over Unix sockets, for both non-streaming and streaming requests
  • Socket paths containing spaces (e.g. macOS ~/Library/Application Support/...) are handled correctly
  • Auth header selection (x-api-key for Anthropic backends, Authorization: Bearer for OpenAI-compatible backends) is correct on the Unix socket path
  • anthropic-version header is added automatically for Anthropic backends on the Unix socket path, matching the HTTP path behavior

v1.5.2 - 2026-04-21

Added

  • Regression tests locking down the transport-layer passthrough contract for llama.cpp and MLxcel backends (#562)
  • New tests/llamacpp_passthrough_test.rs and tests/mlxcel_passthrough_test.rs covering all four passthrough call sites: direct backend execute_chat_completion, factory-built backend (BackendFactory -> LlamaCppBackend), proxy/backend.rs HTTP handler, and the streaming handler
  • New test_mlxcel_factory_backend_passthrough_nonstandard_fields asserts that BackendFactory -> LlamaCppBackend::execute_chat_completion preserves non-standard fields byte-for-byte at transport time
  • Anthropic input test (tests/anthropic_input_test.rs) extended with explicit passthrough coverage
  • docs/en/architecture/backend-passthrough.md and its Korean counterpart docs/ko/architecture/backend-passthrough.md documenting the passthrough contract, the four guarded call sites, and the list of router-side transforms that run before transport (global_prompts, transform_payload_for_openai for o1/o3/gpt-5*, web_search injection) (#562, #563)
  • docs/reports/alias-audit-2026-04.md classifying every alias in model-metadata.yaml into peel-redundant, peel-redundant-but-kept, and peel-independent categories, with an "aliases vs peel" policy section added to docs/en/configuration/advanced.md (and the Korean counterpart) explaining when to prefer each mechanism (#560)

Changed

  • Narrowed the passthrough contract from an implied "byte-equivalent" global guarantee to a transport-layer scope — the router may still run global_prompts injection, o1/o3/gpt-5* payload transforms, and web_search tool injection before transport, but no provider-specific rewriting happens at the transport boundary (#563)
  • Comment-only clarifications in src/http/streaming/handler.rs, src/infrastructure/backends/factory/backend_factory.rs, src/infrastructure/backends/llamacpp/backend.rs, and src/proxy/backend.rs
  • Audited model-metadata.yaml aliases for peel-normalization redundancy: removed aliases that differ from the canonical ID only by suffixes already handled by the layered peel (-4bit, -q4_k_m, -fp8, -gguf, -mlx, -awq, etc.), while preserving aliases that encode canonical flavor variants (-qat, -instruct) or disambiguate parameter counts (#557)
  • New tests/alias_audit_helper.rs and tests/format_suffix_normalization_test.rs enforce the peel-vs-alias boundary going forward

CI

  • Target Ubuntu 26.04 LTS (Resolute) instead of 25.10 (Questing) in the Debian build workflow
  • Fall back to createdAt when release publishedAt is null in debian/update-changelog.sh to prevent changelog regression when the latest release is still in draft

v1.5.1 - 2026-04-20

Added

  • Built-in web_search tool for self-hosted LLM backends (#553)
  • Router-level tool transparently injected into chat completion requests for vLLM, Ollama, llama.cpp, MLxcel, LM Studio, Continuum Router, and Generic backends
  • Pluggable SearchProvider trait under src/services/search/ with SerperProvider implementation; Exa and Brave scaffolded behind the same trait
  • Configurable inject_policy (auto/always/never) with per-backend overrides; commercial backends (OpenAI, Azure, Gemini, Anthropic) left untouched so their native web_search continues to flow through unchanged
  • Bounded non-streaming tool-execution loop parses web_search tool calls, executes the provider, appends tool-role results, and re-invokes the backend up to max_tool_iterations rounds
  • New BackendTypeConfig::is_self_hosted / is_commercial helpers covered by unit tests enforcing the commercial/self-hosted partition invariant
  • API keys redacted in Debug output and never logged; hot-reload friendly WebSearchConfig with ${ENV} substitution
  • Prometheus counters for tool calls, injections, and iteration-cap hits under src/metrics/web_search
  • Layered quantization and format suffix normalization for model metadata lookup (#549)
  • New layered_format_strip() in src/models/pattern_matching.rs iteratively peels allowlisted quantization/format/flavor tokens from the right side of a model ID, retrying exact-id/alias/date-suffix matches after each peel
  • Token categories: BIT_WIDTH, GGUF_QUANT, FP_FORMAT, INT_FORMAT, LIBRARY, IMATRIX, UNSLOTH, CONTAINER, FLAVOR (all case-insensitive)
  • Parameter-count suffixes preserved: -Nbit stripped as quantization; -Nb, -aNb, -eNb, -0.6b kept as parameter counts
  • Canonical base IDs ending in allowlisted flavors (e.g. gemma-3-12b-qat) win via exact-id match before peel runs
  • Normalization pipeline wired into find_matching_config, BackendConfig::get_model_metadata, RouterConfig::get_model_metadata, RouterConfig::get_thinking_pattern_config, resolve_model_tier (routing), and get_model_profile (admin)
  • Model metadata for GLM 5.1, Qwen 3.6, and MiniMax M2.7 (#548)
  • Teams release notification posted to Microsoft Teams via Power Automate webhook after build and Docker jobs
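
The layered suffix peel can be sketched as a bounded loop that retries the lookup after each peeled token. This is an approximation under stated assumptions: `layered_peel`, `is_peelable`, and the five-entry `PEELABLE` list are hypothetical stand-ins for layered_format_strip() and its token categories, and the lookup is reduced to a case-insensitive exact match. The two caps are the ones listed under Fixed in this release.

```rust
const MAX_MODEL_ID_LEN: usize = 256;
const MAX_PEEL_ITERATIONS: usize = 8;

/// Hypothetical subset of the allowlisted format/quantization tokens.
const PEELABLE: &[&str] = &["gguf", "mlx", "awq", "fp8", "q4_k_m"];

/// True for allowlisted tokens and for `Nbit` (quantization width).
/// `Nb` (parameter count, e.g. "7b") is deliberately not peelable.
fn is_peelable(token: &str) -> bool {
    if PEELABLE.iter().any(|t| t.eq_ignore_ascii_case(token)) {
        return true;
    }
    let lower = token.to_ascii_lowercase();
    match lower.strip_suffix("bit") {
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

/// Looks up `model_id` in `known`, peeling one right-hand token per round.
/// At most MAX_PEEL_ITERATIONS peels are followed by a lookup.
pub fn layered_peel<'a>(model_id: &str, known: &[&'a str]) -> Option<&'a str> {
    if model_id.len() > MAX_MODEL_ID_LEN {
        return None;
    }
    let mut current = model_id;
    for _ in 0..=MAX_PEEL_ITERATIONS {
        // The exact-id check runs first, so canonical IDs win before any peel.
        if let Some(hit) = known.iter().copied().find(|k| k.eq_ignore_ascii_case(current)) {
            return Some(hit);
        }
        match current.rsplit_once('-') {
            Some((rest, token)) if is_peelable(token) => current = rest,
            _ => return None,
        }
    }
    None
}
```

Because the exact match is attempted before each peel, a canonical ID such as `gemma-3-12b-qat` resolves directly, and `gemma-3-12b-qat-4bit` resolves after one peel.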

Changed

  • Migrate documentation toolchain from MkDocs + Material for MkDocs to Zensical — reads mkdocs.yml natively and bundles required extensions

Fixed

  • Security: Cap layered peel phase with MAX_MODEL_ID_LEN=256 and MAX_PEEL_ITERATIONS=8 to eliminate DoS via pathological model IDs (previously O(n²) allocation on inputs like -4bit-4bit-4bit-...)
  • Security: Enforce 256-char model field length at /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/embeddings/sparse (parity with existing /v1/responses check)
  • Consolidate 7-phase metadata matching pipeline into a single implementation (find_matching_config_slice) with thin adapters at each call site, eliminating drift between BackendConfig, Config::get_model_metadata, Config::get_thinking_pattern_config, and find_matching_config
  • Replace cfg.to_ascii_lowercase() == peel with str::eq_ignore_ascii_case on the hot path (~4000 fewer per-request String allocations)
  • Pin Pygments <2.20 to fix MkDocs build failure (superseded by Zensical migration)

CI

  • Bump softprops/action-gh-release from 2 to 3 (#544)
  • Bump actions/github-script from 8 to 9 (#545)
  • Bump actions/upload-pages-artifact from 4 to 5 (#554)

Documentation

  • Document suffix-order ambiguity (-qat-4bit vs -4bit-qat) and internal peel phase bounds in docs/en/configuration/advanced.md
  • Add pattern_matching.rs to Model Aggregation Service module listing in docs/en/architecture.md with cross-reference to suffix normalization section
  • New docs/en/web-search.md feature documentation; config.yaml.example extended with web_search section

v1.5.0 - 2026-04-11

Added

  • Smart routing system with model tier & capability profile registry (#525, #531)
  • Rule-based request classifier & smart routing policy engine (#526, #532)
  • Load-aware dynamic tier adjustment (#527, #533)
  • LLM-based request classifier with hybrid mode (#528, #534)
  • Smart routing observability, admin API & documentation (#529, #535)
  • Codex-compatible Responses API extensions (#536, #537)

Changed

  • Upgrade core dependencies — axum 0.8, sha2 0.11, rand 0.10 (#523)
  • Add Gemma 4 model family metadata (#538)

Fixed

  • Complete smart routing integration gaps
  • Increase DefaultTransformer PDF size limit from 20MB to 32MB (#542)

CI

  • Bump actions/deploy-pages from 4 to 5 (#521)

Dependencies

  • Bump the minor-and-patch dependency group with 4 updates (#539)

Documentation

  • Add Codex-compatible Responses API gap analysis report

v1.4.5 - 2026-03-27

Fixed

  • Return 400 error when file references are used without file service configured (#519)

Changed

  • Add GLM-5-Turbo model metadata (#516)

Documentation

  • Fix Korean anti-AI-slop violations in ko/ documentation
  • Fix slop word and transition word in api.md

v1.4.4 - 2026-03-18

Fixed

  • Fix Anthropic thinking failing for high/xhigh reasoning effort — budget_tokens (32768) exceeded default max_tokens (16384), causing API rejection (#514)
  • Auto-adjust max_tokens to budget_tokens + 4096 when thinking is enabled and budget exceeds max
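
The auto-adjustment amounts to simple arithmetic, sketched here. The function name `adjust_max_tokens` and the constant name are hypothetical, and the strict "exceeds" comparison follows the wording of the entry above.

```rust
/// Headroom added on top of the thinking budget, per the entry above.
const THINKING_HEADROOM: u32 = 4096;

/// Returns the max_tokens value to send upstream.
/// `budget_tokens` is None when thinking is disabled.
pub fn adjust_max_tokens(max_tokens: u32, budget_tokens: Option<u32>) -> u32 {
    match budget_tokens {
        Some(budget) if budget > max_tokens => budget.saturating_add(THINKING_HEADROOM),
        _ => max_tokens,
    }
}
```

With the values from the bug report, a budget of 32768 against a default max_tokens of 16384 yields 32768 + 4096 = 36864.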

Changed

  • Add GPT-5.4 model family: gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano with 1M context window (#515)
  • Update Gemini 3 series: add gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview; mark gemini-3-pro-preview as deprecated
  • Recognize Gemini 3 Flash and 3.1 Flash-Lite as thinking models for include_thoughts auto-injection
  • Update Claude 4.6 models: context window to 1M (GA), fix Sonnet 4.6 max_output to 64K, correct knowledge cutoffs
  • Update config examples and documentation with latest model names across 8 files

v1.4.3 - 2026-03-18

Fixed

  • Fix Gemini thinking models (2.5 Pro, 3 Pro, etc.) not returning reasoning_content in streaming responses through the router (#513)
  • Replaced transform_payload_for_gemini() with transform_request_gemini() across all three Gemini streaming paths to ensure include_thoughts: true auto-injection

v1.4.2 - 2026-03-17

Changed

  • Change mid-stream fallback default to enabled for improved streaming reliability (#504)
  • Breaking: Mid-stream fallback is now enabled by default; set mid_stream_fallback.enabled: false to restore previous behavior

Documentation

  • Add failover latency tuning guide for optimizing fallback behavior

v1.4.1 - 2026-03-17

Added

  • Mid-stream fallback for streaming inference (#497) — when a backend fails mid-stream during SSE streaming, the router transparently retries with a fallback backend

Changed

  • Decouple pre-stream fallback from mid-stream fallback (#500) — each can now be independently enabled/disabled
  • Bump dependency versions to latest major releases

Fixed

  • Fix streaming config changes not detected in hot reload system (#503)
  • Fix mid-stream connection errors leaking to client during fallback (#502)
  • Remove unused config crate dependency

CI

  • Bump dorny/paths-filter from 3 to 4 (#493)
  • Bump actions/create-github-app-token from 2 to 3 (#494)

v1.4.0 - 2026-03-14

Added

  • Prefix-aware routing: PrefixAwareHash selection strategy with Consistent Hash with Bounded Loads (CHWBL) (#455, #457, #461)
  • Response caching: SHA256-based cache key computation with streaming response buffering and post-completion caching (#456, #459, #462)
  • Multi-tier CacheStore: in-memory backend (#466), Redis/Valkey backend with connection pooling (#467), and S3-backed tiered L1/L2 cache (#483)
  • KV cache index: shared data structure (#470), KV event consumer for vLLM backend streams (#471), prefix overlap scoring integrated into backend selection (#473), configuration/metrics/admin endpoints (#474)
  • Tiered KV cache with storage-tier awareness (GPU hot / external warm) (#484)
  • Disaggregated prefill/decode orchestration with external KV tensor transfer (#485)
  • Anthropic cache_control breakpoint auto-injection (#460)
  • Multimodal embedding support for Gemini Embedding 2 (#492)
  • Shared cache configuration and operational metrics (#468)
  • 30 new models added to model-metadata.yaml (#472)
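
For readers unfamiliar with Consistent Hash with Bounded Loads, the general technique can be sketched as follows. This is a textbook-style illustration, not the PrefixAwareHash implementation: the `Backend` struct, the single ring point per backend, the use of the standard library's `DefaultHasher`, and the `load_factor` parameter are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical backend record with its current in-flight request count.
pub struct Backend {
    pub name: &'static str,
    pub in_flight: usize,
}

/// Per-backend capacity: ceil(load_factor * (total_load + 1) / n).
fn capacity(backends: &[Backend], load_factor: f64) -> usize {
    let total: usize = backends.iter().map(|b| b.in_flight).sum();
    ((total + 1) as f64 * load_factor / backends.len() as f64).ceil() as usize
}

/// Walks the hash ring clockwise from the prefix's position and returns
/// the first backend that is still below the bounded-load capacity.
pub fn pick_backend(prefix: &str, backends: &[Backend], load_factor: f64) -> Option<&'static str> {
    if backends.is_empty() {
        return None;
    }
    let cap = capacity(backends, load_factor);
    // One ring point per backend, sorted by hash (real rings use many).
    let mut ring: Vec<(u64, usize)> = backends
        .iter()
        .enumerate()
        .map(|(i, b)| (hash_of(&b.name), i))
        .collect();
    ring.sort();
    let key = hash_of(&prefix);
    let start = ring.partition_point(|(h, _)| *h < key);
    (0..ring.len())
        .map(|step| &backends[ring[(start + step) % ring.len()].1])
        .find(|b| b.in_flight < cap)
        .map(|b| b.name)
}
```

Requests sharing a prefix land on the same backend while it has headroom, which favours prefix-cache reuse; once that backend reaches the capacity bound, the walk spills over to the next one on the ring.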

Changed

  • Rename VAST-specific identifiers to generic S3/external storage names (#490) — update configuration files if using VAST-specific field names

Fixed

  • Make RequestExecutor transport-aware for Unix socket paths with spaces (#488)
  • Replace relative source tree links with GitHub URLs in docs

CI

  • Bump docker/setup-qemu-action from 3 to 4 (#428)
  • Bump docker/metadata-action from 5 to 6 (#426)
  • Bump docker/setup-buildx-action from 3 to 4 (#429)
  • Bump docker/build-push-action from 6 to 7 (#430)
  • Bump docker/login-action from 3 to 4 (#427)

Documentation

  • Comprehensive KV cache feature documentation, benchmarks, and config examples (#477)
  • VAST Data connection guide and integration examples (#486)
  • Sync Korean documentation with English counterparts
  • Split monolithic configuration.md into 6 smaller files

v1.3.0 - 2026-03-12

Added

  • Agent Communication Protocol (ACP) support with JSON-RPC 2.0 protocol layer and stdio transport (#414, #420)
  • ACP session management with protocol lifecycle, initialize/shutdown handshake (#415, #421)
  • ACP-to-LLM inference pipeline with streaming support (#416, #422)
  • ACP tool call reporting and permission delegation (#417, #423)
  • MCP-over-ACP bridge for MCP server tunneling (#418, #424)
  • ACP agent registry with metadata and configuration support (#419, #425)
  • ACP integration tests for protocol lifecycle and session management

Fixed

  • Resolve clippy field_reassign_with_default warnings in ACP integration tests

CI

  • Bump actions/upload-artifact from 6 to 7 (#398)

Documentation

  • ACP architecture documentation with MkDocs integration
  • ACP practical usage guide with IDE integration examples
  • KV cache integration plan for router-level caching strategies

v1.2.1 - 2026-03-07

Added

  • MLxcel backend type support for MLX-based model serving (#412, #413) — fully API-compatible with llama-server, reusing the same backend implementation for health checks, model discovery, and proxying

v1.2.0 - 2026-03-06

Added

  • Admin Statistics API with comprehensive request-level statistics collection and reporting (#409)
  • Endpoints: GET /admin/stats, GET /admin/stats/models, GET /admin/stats/backends, POST /admin/stats/reset
  • Time-windowed queries, token usage tracking, latency percentiles (p50, p95, p99)
  • Statistics persistence with configurable snapshot path, interval, and staleness checks (#410, #411)
  • Atomic writes, restore on startup, final snapshot on graceful shutdown
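
As an illustration of the latency percentiles mentioned above, the sketch below computes p50/p95/p99 with the nearest-rank method. The changelog does not say which method the Statistics API uses, so the method, the `percentile` name, and the integer-percent parameter are assumptions.

```rust
/// Nearest-rank percentile over an ascending-sorted slice of latencies.
/// `p` is a whole-number percent in 0..=100. Returns None for empty
/// input or an out-of-range percent.
pub fn percentile(sorted_ms: &[u64], p: u32) -> Option<u64> {
    if sorted_ms.is_empty() || p > 100 {
        return None;
    }
    // rank = ceil(p * n / 100), computed in integers to avoid float drift.
    let rank = (p as usize * sorted_ms.len() + 99) / 100;
    Some(sorted_ms[rank.max(1) - 1])
}
```

Integer arithmetic keeps the rank exact, so the same samples always produce the same percentile regardless of floating-point rounding.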

Documentation

  • Add admin stats and persistence to configuration guide
  • Add post-refactoring benchmark report for v1.1.0 (#407)

v1.1.1 - 2026-03-04

Added

  • Embeddable library crate (Phase 1) — use continuum-router as a Rust dependency (#394)
  • Type-safe config builders for programmatic library usage (#400)
  • Cargo feature flags for optional library dependencies (#399)
  • Persistent storage for runtime API keys (#405)
  • New LLM model metadata entries (#403)

Fixed

  • Fix Gemini-specific transforms incorrectly applied in Anthropic handler (#404)

v1.1.0 - 2026-03-01

Added

  • Embedded WebUI for configuration management and API key administration (#388)
  • Windows AF_UNIX socket support via socket2 crate (#390)
  • Nano Banana 2 (Gemini Image Generation) support

Fixed

  • Resolve compilation error in ClientAddr::is_unix for tuple variant matching
  • Resolve Windows AF_UNIX socket accept failure and config validation
  • Accept Windows absolute paths in Unix socket config validation (#393)
  • Resolve Windows compilation errors in Unix socket tests and transport parsing (#392)

v1.0.0 - 2026-02-19

Added

  • Continuum Router federation — router-to-router chaining as a new backend type (#385)
  • LM Studio as a dedicated backend type (#381)
  • Anthropic adaptive thinking effort parameter (output_config.effort) (#384)
  • Adaptive thinking and auto reasoning effort level across backends (#378)
  • Cohere/Jina-compatible rerank and sparse embedding endpoints (#374)
  • BGE-M3 and multilingual embedding model support (#373)
  • Claude Opus 4.6 model metadata
  • Qwen3-Coder-Next, Qwen3-VL-30B/8B model metadata

Changed

  • Handle SIGTERM for graceful shutdown on Unix systems (#370)
  • Reduce per-backend filter and model metadata log verbosity during model refresh (#371, #375)

CI

  • Replace Ubuntu 24.10 with 25.10 in deb build matrix (#376)

v0.36.1 - 2026-01-30

Fixed

  • Trigger immediate health check after sync_backends during hot reload (#368) — new backends now available within 1-2 seconds instead of up to 30 seconds
  • Sync health_check_info and use URL-based updates during hot reload (#369) — new backends properly receive API key authentication
  • Accelerate health checks for recently added backends — 1-second check interval for 5 minutes after addition
  • Trigger model cache refresh when backends transition to healthy state with 5-second debounce
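
The accelerated health-check schedule can be pictured as a function of how long ago a backend was added. This is a sketch: the function and constant names are hypothetical, and the 30-second steady-state interval is an assumption inferred from the "up to 30 seconds" figure above.

```rust
use std::time::Duration;

/// Fast interval used for recently added backends.
const FAST_INTERVAL: Duration = Duration::from_secs(1);
/// How long the fast interval applies after a backend is added.
const FAST_WINDOW: Duration = Duration::from_secs(5 * 60);
/// Assumed steady-state interval (not stated explicitly in this entry).
const NORMAL_INTERVAL: Duration = Duration::from_secs(30);

/// Picks the health-check interval for a backend added `since_added` ago.
pub fn health_check_interval(since_added: Duration) -> Duration {
    if since_added < FAST_WINDOW {
        FAST_INTERVAL
    } else {
        NORMAL_INTERVAL
    }
}
```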

v0.36.0 - 2026-01-27

Added

  • Native Anthropic Messages API handler with endpoint routing (#355)
  • Anthropic to OpenAI request/response transformation (#356, #357)
  • Anthropic streaming response format (#358)
  • Direct Anthropic to Gemini request/response transformation (#359)
  • file_id source type and file resolution for Anthropic input (#360)
  • Claude Code compatibility for Anthropic handler (#365)
  • Tiered token counting for all backend types
  • Parallel file reference resolution for improved performance
  • anthropic-version header format validation

Fixed

  • Require HTTPS for image and document URLs to prevent SSRF
  • Return generic error messages to clients instead of backend details
  • Use authenticated user_id from API key for file ownership checks
  • Use UUID v4 for secure message/tool ID generation
  • Place tool messages before user text in Anthropic-to-OpenAI conversion
  • Override stop_reason to tool_use when tool_use blocks are present
  • Apply max_completion_tokens conversion for OpenAI-routed Anthropic requests
  • Propagate file access denied and not found errors to client
  • Call current_config() once per request for consistent behavior

Refactored

  • Extract common SSE event type and data extraction logic
  • Add parse_bytes method to SseParser for proper UTF-8 handling
  • Remove unnecessary Arc wrapper in AnthropicFileResolver
  • Box FileResolutionResult::Resolved to reduce enum size

v0.35.0 - 2026-01-23

Added

  • Gemini 3 thoughtSignature support in function calling (#354)
  • PDF support for OpenAI and Anthropic file transformers (#340)
  • Text/plain support for AnthropicFileTransformer (#342)

Fixed

  • Add PDF support to DefaultTransformer and file resolution (#343)
  • Add tool message transformation to non-streaming Anthropic requests (#344)
  • Reject non-image files in DefaultTransformer with clear error message (#338)
  • Fix AI SDK incompatibility with Responses API streaming format (#335)

v0.34.0 - 2026-01-16

Added

  • Automatic quality parameter conversion between DALL-E and GPT Image models (#330)

Changed

  • Native Anthropic conversion for Responses API PDF file uploads (#332)

Fixed

  • Gemini streaming tool_calls compatibility fixes (#333) — missing index field, tool_choice format preservation, unnecessary transformation removal

v0.33.0 - 2026-01-13

Added

  • /v1/embeddings endpoint for embedding API support (#319)
  • Resolve local file_id references in Responses API requests (#326)
  • user_data and evals purpose values for Files API (#322)

Fixed

  • Use flat tool format for Responses API function tools (#324)
  • Improve Unix socket test stability for parallel execution (#328)

v0.32.0 - 2026-01-09

Added

  • Reasoning effort documentation and improved xhigh fallback logging (#317)

Fixed

  • Support implicit message type inference in Responses API InputItem (#316)

Refactored

  • Optimize InputItem deserializer and add invalid role test

v0.31.5 - 2026-01-09

Added

  • Responses API pass-through support for native OpenAI backends (#313) — smart routing based on backend type with direct forwarding to /v1/responses endpoint
  • OpenAI Responses API file input types (#311) — support for input_text, input_file, input_image content parts with SSRF validation

Fixed

  • Forward raw backend error responses in pass-through mode
  • Address security and performance issues in Responses API pass-through

v0.31.4 - 2026-01-07

Fixed

  • Use current_config() for hot reload support in proxy handlers (#310) — API key and configuration changes via hot reload now properly apply to new requests

v0.31.3 - 2026-01-06

Fixed

  • Add Anthropic transformations to Unix socket transport (#308) — Unix socket transport now applies the same request/response transformations as HTTP transport
  • Preserve stream parameter for non-streaming Anthropic requests (#306)

v0.31.2 - 2026-01-05

Added

  • Non-streaming support for Anthropic backend requests
  • Tool call and tool result transformation for Anthropic backend — enables multi-turn tool use conversations

v0.31.1 - 2026-01-04

Fixed

  • Non-streaming Anthropic requests failing with wrong authentication header (#301) — now correctly uses x-api-key header instead of Authorization: Bearer

v0.31.0 - 2026-01-04

Added

  • Unix socket server binding alongside TCP (#298) — supports unix: URI scheme, socket_mode configuration, auto-cleanup
  • Reasoning parameter support for Responses API (#296) with nested format and low/medium/high/xhigh effort levels
  • xhigh reasoning effort support for GPT-5.2 thinking models with auto-downgrade for unsupported models
  • Configurable health check endpoints per backend type (#293) — custom endpoint, fallback endpoints, method, body, accept_status, and headers

Changed

  • Comprehensive reasoning parameter normalization across backends (#294)

v0.30.0 - 2026-01-01

Added

  • Wildcard patterns and date suffix handling in model aliases (#286) — automatic date suffix normalization, * pattern matching (prefix, suffix, infix), zero-config date handling
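Note: the prefix, suffix, and infix forms of `*` matching can be illustrated with a minimal sketch. It assumes a pattern contains at most one `*`; the function name `wildcard_match` and the model names in the usage note are hypothetical and are not taken from the router's code. Date suffix normalization is not covered here.

```rust
/// Matches `name` against a pattern containing at most one `*`.
/// Hypothetical helper for illustration only; not the router's implementation.
pub fn wildcard_match(pattern: &str, name: &str) -> bool {
    match pattern.split_once('*') {
        // No wildcard: require an exact match.
        None => pattern == name,
        // One wildcard: the text before `*` must be a prefix and the text
        // after it a suffix, and the two must not overlap inside `name`.
        Some((prefix, suffix)) => {
            name.len() >= prefix.len() + suffix.len()
                && name.starts_with(prefix)
                && name.ends_with(suffix)
        }
    }
}
```

Under these assumptions, `wildcard_match("claude-*", "claude-opus")` covers the prefix form, `wildcard_match("*-preview", "gemini-3-flash-preview")` the suffix form, and `wildcard_match("gpt-*-mini", "gpt-5-mini")` the infix form. The length check keeps an infix pattern from matching when prefix and suffix would have to share characters.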

Fixed

  • Apply default URL for Anthropic backend when not specified (#288)
  • Replace owned_by placeholders with backend-type-specific values (#287)

Documentation

  • Translate wildcard pattern and date suffix handling documentation to Korean (#289)

v0.29.0 - 2026-01-01

Added

  • Accelerated health checks during backend warmup (#282) — 1s interval on HTTP 503, configurable via warmup_check_interval and max_warmup_duration
  • --model-metadata CLI option for specifying model metadata file path at runtime (#281)

Fixed

  • Replace OpenAI owned_by placeholder with 'openai' (#280)
  • Prevent race condition in Admin API concurrent backend creation (#278)
  • Fix missing processing steps in hot reload (#277)
  • Cloud backends now show available: true in /v1/models/{model_id} (#272)

v0.28.0 - 2025-12-31

Added

  • SSE streaming support for tool calls (#258)
  • llama.cpp tool calling auto-detection via /props endpoint (#263)
  • Extended /v1/models/{model_id} endpoint with rich metadata fields (#262)
  • Tool result message transformation for multi-turn conversations (#265)
  • Backend-specific owned_by placeholders for llamacpp, vllm, ollama, http (#267)

Changed

  • Improve --help output formatting with title header and project attribution (#269)

Fixed

  • Sync model metadata cache with ConfigManager (#270)

v0.27.0 - 2025-12-29

Added

  • Complete Unix socket support for model discovery and SSE streaming (#248, #252, #253, #254, #256)
  • SSE/streaming for Unix socket backends
  • Backend type auto-detection for Unix sockets
  • vLLM and llama.cpp model discovery via Unix sockets
  • Tool call transformation across all backends (#244, #245, #246) — tool definitions, tool_choice, and tool call responses for Anthropic, Gemini, and llama.cpp

v0.26.0 - 2025-12-27

Added

  • GET /v1/models/{model} endpoint for single model retrieval with real-time availability status (#236)

v0.25.0 - 2025-12-26

Added

  • CORS (Cross-Origin Resource Sharing) support (#234) — configurable origins, wildcard patterns, custom schemes (e.g., tauri://localhost), preflight cache
  • Unix Domain Socket backend support (#232) — unix:///path/to/socket scheme, lower latency than localhost TCP

v0.24.0 - 2025-12-26

Added

  • llama.cpp backend support for local LLM inference (#230)
  • Allow router to start without any backends configured (#226)

Changed

  • Enable hot reload for backend additions/removals from config (#229)

v0.23.1 - 2025-12-25

CI

  • Add Windows x86_64 build target to release workflow (#224)

v0.23.0 - 2025-12-23

Added

  • GLM 4.7 model support with thinking capabilities (#222)
  • GCP Service Account authentication support for Gemini (#208)
  • Distributed tracing with correlation ID propagation (#207) — W3C Trace Context with traceparent header
  • Thinking pattern metadata for models with implicit start tags (#218)
  • Model metadata for NVIDIA Nemotron 3 Nano, Qwen Image Layered, and Kakao Kanana-2 (#202)
  • ASCII diagram to image replacement system for MkDocs (#200)

Fixed

  • Prevent cache stampede with singleflight, stale-while-revalidate, and background refresh (#220)
  • Apply global_prompts changes via hot reload (#219)
  • Invalidate model cache when backend config changes (#206)

CI

  • Skip Rust tests in CI when only non-code files change (#204)
  • Bump actions/github-script from 7 to 8 (#210)
  • Bump apple-actions/import-codesign-certs from 3 to 6 (#212)
  • Bump actions/cache from 4 to 5 (#211)
  • Bump actions/checkout from 4 to 6 (#209)

v0.22.0 - 2025-12-19

Added

  • Docker support with pre-built binary images — Debian (~50MB) and Alpine (~10MB) with multi-arch support (#198)
  • Container health check CLI (--health-check) for orchestration (#198)
  • Docker Compose quick start configuration
  • Automated Docker image publishing to ghcr.io in release workflow
  • MkDocs documentation website with Material theme (#183)
  • Korean documentation translation (i18n) — complete localization of all 20 documentation files (#190)
  • Security policy with vulnerability reporting process (#191)
  • Dependency security auditing with cargo-deny and Dependabot (#192)

Changed

  • Integrate orphaned architecture documentation into MkDocs site (#186)
  • Rename documentation files to lowercase kebab-case for URL-friendly filenames

Fixed

  • Fix health check response validation logic bug (operator precedence)
  • Fix address parsing fallback silently hiding configuration errors
  • Fix IPv6 address formatting in health check

v0.21.0 - 2025-12-19

Added

  • Gemini 3 Flash Preview model support (#168)
  • Default authentication mode for API endpoints (#173) — permissive (default) or blocking mode
  • Backend error passthrough for 4xx responses (#177) — parse and forward original error messages from OpenAI, Anthropic, and Gemini

Fixed

  • Prevent corruption of multi-byte UTF-8 characters in streaming responses (#179)
  • Strip response_format parameter for GPT Image models (#176)
  • Allow auto-discovery for all backends except Anthropic (#172)
  • Always return b64_json field for Gemini image generation responses (#181)

v0.20.0 - 2025-12-18

Added

  • Image variations support for Gemini (nano-banana) models (#165)
  • Image edit support for Gemini (nano-banana) models (#164)
  • Enhanced /v1/images/generations with streaming and GPT Image features (#161)
  • gpt-image-1.5 model support (#159)
  • /v1/images/variations endpoint (#155)
  • /v1/images/edits endpoint for image editing and inpainting (#156)
  • External Markdown file support for system prompts with REST API management (#146)
  • Automatic model discovery for backends without explicit model list (#142)
  • Solar Open 100B model

Security

  • API key redaction to prevent credential exposure in logs and error messages (#150)

Changed

  • Reduce release binary size from 20MB to 6MB (70% reduction) (#144)

Refactored

  • Split large files to keep each under 500 lines (#147, #148)

v0.19.0 - 2025-12-13

Added

  • Runtime Configuration Management API (#139)
  • Configuration query, modification, save/restore, and backend management APIs
  • Sensitive information masking, JSON Schema generation, configuration history with rollback (up to 50 entries)
  • Comprehensive Admin REST API reference documentation
  • 33 integration tests for configuration API endpoints

Security

  • Input validation with 1MB content limit and 32-level nesting depth
  • Audit logging for sensitive data exports with 30+ sensitive field patterns

v0.18.0 - 2025-12-13

Added

  • Per-API-key rate limiting (#137)
  • API key management and configuration system
  • Files API authentication and authorization (#131)
  • Hot reload for runtime configuration updates (#130)

Fixed

  • Add ConnectInfo extension for admin/metrics/files endpoints
  • Address security vulnerabilities in API key management

Refactored

  • Extract CLI and app utilities into modular structure (#132)
  • Split converter.rs into modular structure (#132)
  • Split large source files into modular components

v0.17.0 - 2025-12-12

Added

  • Anthropic backend file content transformation (#126)
  • Gemini backend file content transformation (#127)

Fixed

  • Stream file uploads to prevent memory exhaustion (#128)

v0.16.0 - 2025-12-12

Added

  • OpenAI-compatible Files API endpoints (#111)
  • File resolution middleware for chat completions (#120)
  • OpenAI backend file handling strategy (#121, #122)
  • Persistent metadata storage for Files API (#125)
  • GPT-5.2 model support (#124)
  • Circuit breaker pattern for automatic backend failover
  • Admin endpoint authentication and audit logging
  • Configurable fallback models for unavailable model scenarios with cross-provider support

Fixed

  • Sanitize fallback error headers and metric labels
  • Use index-based lookup for fallback chain traversal
  • Reduce lock contention in FallbackService with snapshot pattern

v0.15.0 - 2025-12-05

Added

  • Nano Banana (Gemini Image Generation) API support (#102)
  • Split /v1/models endpoint — standard lightweight vs extended metadata response (#101)

Changed

  • Optimize LRU cache to use read lock for cache lookups (#105)

Fixed

  • Replace .expect() panics with proper error propagation in HttpClientFactory (#104)

Refactored

  • Extract streaming handler logic to dedicated StreamService (#106)
  • Eliminate retry logic code duplication in proxy.rs (#103)

v0.14.2 - 2025-12-05

Added

  • Log token usage (input/output tokens) on request completion (#92)

v0.14.1 - 2025-12-05

Fixed

  • Optimize Anthropic backend TTFT with connection pooling and HTTP/2 (#90)
  • Optimize Gemini backend TTFT with connection pooling and HTTP/2 (#88)
  • Apply base name fallback matching to aliases in model metadata lookup (#84)

v0.14.0 - 2025-12-04

Added

  • Router-wide global system prompt injection (#82)

CI

  • Replace deprecated actions-rs/toolchain with dtolnay/rust-toolchain
  • Add RUSTFLAGS for macOS ARM64 ring build
  • Switch to rustls-tls for musl cross-compilation support

v0.13.0 - 2025-12-04

Added

  • OpenAI /v1/responses API support with session management (#49)
  • True SSE streaming for /v1/responses API
  • Background cleanup task for expired sessions
  • Override /v1/models response fields via model-metadata.yaml (#75)

Security

  • SecretString for API key storage across all backends (#76)
  • Session access control and input validation for Responses API

Changed

  • Immediate mode for SseParser for reduced first-response latency

Refactored

  • String allocation optimizations and error handling standardization

v0.12.0 - 2025-12-04

Fixed

  • Handle exact hash matches in consistent hash binary search (#72)
  • Replace panics with Option returns and implement stats aggregation (#71)
  • Remove hardcoded auth requirement from /v1/models endpoint
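Note: in a ring lookup built on binary search, a key can hash to exactly a node's position or fall between two nodes, and both outcomes must resolve to a node. The sketch below shows this with Rust's `binary_search_by_key`, which returns `Ok` for an exact hit and `Err` with the insertion point otherwise. `HashRing` and its layout are assumptions for illustration and do not reproduce the code changed in #72.

```rust
/// A hash ring stored as (position, backend) pairs sorted by position.
/// Hypothetical type for illustration; not the router's data structure.
pub struct HashRing {
    nodes: Vec<(u64, String)>,
}

impl HashRing {
    pub fn new(mut nodes: Vec<(u64, String)>) -> Self {
        // Binary search requires the positions to be sorted.
        nodes.sort_by_key(|(pos, _)| *pos);
        Self { nodes }
    }

    /// Returns the backend owning `key_hash`: the first node at or after
    /// that position, wrapping around to the first node past the end.
    pub fn lookup(&self, key_hash: u64) -> Option<&str> {
        if self.nodes.is_empty() {
            return None;
        }
        let idx = match self.nodes.binary_search_by_key(&key_hash, |(pos, _)| *pos) {
            // Exact match: a node sits exactly on the key's hash.
            Ok(i) => i,
            // No match: `i` is the insertion point, which may equal len().
            Err(i) => i % self.nodes.len(),
        };
        Some(self.nodes[idx].1.as_str())
    }
}
```

With nodes at positions 100, 200, and 300, a key hashing to 200 resolves to the node at 200, a key at 150 resolves to the same node, and a key at 301 wraps around to the node at 100.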

Refactored

  • Reorganize OpenAI model metadata by family (#74)
  • Extract AnthropicStreamTransformer to dedicated module (#73)
  • Split backends mod.rs into separate modules (#69)
  • Extract embedded tests to separate files (#68)
  • Create HttpClientFactory for centralized HTTP client creation (#67)
  • Create UrlValidator module with SSRF prevention (#66)
  • Extract RequestExecutor to shared common module (#65)
  • Extract HeaderBuilder with auth strategies (#64)
  • Extract AtomicStatistics to shared common module

v0.11.0 - 2025-12-03

Added

  • Native Anthropic Claude API backend with extended thinking support
  • OpenAI to Claude reasoning parameter conversion
  • Flat reasoning_effort parameter for Anthropic
  • Claude 4, 4.1, 4.5 model metadata

Fixed

  • Improve health check and model fetching for Anthropic/Gemini backends
  • Fix Accept-Encoding handling for streaming: send Accept-Encoding: identity and disable compression

v0.10.0 - 2025-12-03

Added

  • Native Google Gemini API backend support
  • OpenAI Images API support for image generation
  • Authenticated health checks for OpenAI and API-key backends
  • Built-in OpenAI model metadata for /v1/models response
  • API key authentication for streaming requests
  • Configurable image generation timeout
  • response_format validation for image generation API

Fixed

  • Convert max_tokens to max_completion_tokens for newer OpenAI models
  • Correct URL construction for all API endpoints
  • Add request body size limits to prevent DoS attacks

Security

  • Remove sensitive data from debug logs

Refactored

  • Unify request retry logic with RequestType enum

v0.9.0 - 2025-12-02

Added

  • Enhanced rate limiting with token bucket algorithm
  • Comprehensive Prometheus metrics and monitoring (#10)
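Note: the token bucket algorithm named above can be sketched in a few lines. A bucket holds up to `capacity` tokens, refills at a fixed rate, and each request spends one token. The type, its field names, and the one-token cost are assumptions for illustration; the router's actual limiter, its configuration keys, and its concurrency handling are not shown in this changelog.

```rust
use std::time::Instant;

/// Minimal single-threaded token bucket. Hypothetical type for illustration.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Creates a full bucket. `now` is passed in so behavior is deterministic.
    pub fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Refills according to elapsed time, then tries to spend one token.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            // Cap at capacity so an idle bucket cannot grow without bound.
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A bucket with capacity 2 and a refill rate of 1 token per second admits two back-to-back requests, rejects the third, and admits one more after a second has passed. A production limiter would also need synchronization so that refill and spend happen atomically across threads.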

Security

  • Prevent IP spoofing via X-Forwarded-For manipulation
  • Prevent header injection vulnerabilities
  • Eliminate race condition in token refill
  • Protect API keys with SHA-256 hashing
  • Prevent memory exhaustion via unbounded bucket growth
  • Comprehensive authentication for metrics endpoint
  • Cardinality limits and label sanitization to prevent metric explosion DoS

Fixed

  • Implement singleton pattern for metrics to prevent memory leaks
  • Improve error handling to prevent panic conditions
  • Resolve environment variable race condition in config test
  • Fix integration test failures in metrics

v0.8.0 - 2025-09-09

Added

  • Model ID alias support for metadata sharing (#27)

Fixed

  • Return empty list instead of 503 when all backends are unhealthy (#28)

v0.7.1 - 2025-09-08

Fixed

  • Improve config path validation for home directory and executable paths (#26)

v0.7.0 - 2025-09-07

Added

  • Rich metadata support for /v1/models endpoint (#23, #25)
  • Enhanced configuration management (#9, #22)
  • Advanced load balancing strategies (Weighted, Least-Latency, Consistent-Hash) with enhanced error handling (#21)

Fixed

  • Use streaming timeout configuration from config.yaml instead of hardcoded 25s limit

v0.6.0 - 2025-09-03

Fixed

  • Use timeout configuration from config.yaml instead of hardcoded values (#19)

Documentation

  • Comprehensive timeout configuration and model documentation updates

v0.5.0 - 2025-09-02

Added

  • Optional retry configuration with sensible defaults
  • Comprehensive integration tests and performance optimizations
  • Complete service layer implementation
  • Middleware architecture and enhanced backend abstraction

Fixed

  • Handle streaming requests without model field gracefully
  • Resolve floating-point precision and timing issues in tests
  • Resolve test failures and deadlocks in object pool and SSE parser
  • Resolve initial health check race condition

Refactored

  • Split oversized modules into layered architecture
  • Extract complex types into type aliases for better readability

v0.4.0 - 2025-08-25

Added

  • Model-based routing with health monitoring

Fixed

  • Improve health check integration and SSE parsing

v0.3.0 - 2025-08-25

Added

  • SSE streaming support for real-time chat completions (#5)
  • Model aggregation from multiple endpoints (#4)

v0.2.0 - 2025-08-25

Added

  • Multiple backends support with round-robin load balancing (#1)

v0.1.0 - 2025-08-24

Added

  • Initial release with OpenAI-compatible endpoints and proxy functionality