Changelog¶
All notable changes to Continuum Router are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
v1.10.2 - 2026-06-15¶
Added¶
- Kimi K2.7-Code and GLM-5.2 model metadata (#775). Kimi K2.7-Code is a 1T-parameter MoE (32B active) with a MoonViT vision encoder and 256K context that runs only in thinking mode and returns reasoning in the native
reasoning_contentfield, so it carries no<think>marker config. GLM-5.2 is the GLM-5 line coding flagship with a 1M context window, 131K max output, and two thinking-effort levels (High and Max); it carries the standard<think>/</think>marker config like the rest of the GLM family. GLM-5.2 standalone API pricing was not published at launch, so its input/output rates are estimated from the GLM-5/GLM-5.1 tier.
Fixed¶
- Normalize the vLLM/OpenAI-compatible
reasoningfield to the canonicalreasoning_contenton/v1/chat/completions(#776, closes #774). Newer vLLM renamed its reasoning output field fromreasoning_contenttoreasoning(streamingdelta.reasoning, non-streamingmessage.reasoning), and the router relayed it unchanged, so clients readingreasoning_contentsilently dropped all reasoning text from self-hosted vLLM reasoning models. The rename runs across every OpenAI-compatible relay path (the streaming default and thinking transformers, the unix-socket relay, the mid-stream fallback relay, and the non-streaming proxy body), only whenreasoning_contentis absent so an upstream already using the canonical name is never overwritten, and is scoped away from the Gemini and Anthropic handlers and the Responses API.
Documentation¶
- Note in the reasoning-effort architecture reference (English and Korean) that the router normalizes the upstream vLLM
reasoningfield toreasoning_content.
v1.10.1 - 2026-06-15¶
Added¶
- Per-API-key and per-user usage statistics REST API: four admin endpoints
GET /admin/stats/api-keys,GET /admin/stats/api-keys/{id},GET /admin/stats/users, andGET /admin/stats/users/{user_id}, mirroring the existing/admin/stats/modelsshape under the admin auth router (#772, closes #770). TheStatsCollectorgainsper_api_keyandper_userdimensions recorded from the same spawned task that updates the Prometheusllm_tokens_totalcounter, so the in-memory dimensions no longer depend on themetricsfeature and are attributed even for failed and zero-token requests. Theapi_key_idis the derived, non-reversible id resolved once off the request hot path; unauthenticated requests bucket under"anonymous", and ids beyond the 1000-per-dimension cardinality cap fold into an"unknown"overflow bucket so usage is still counted in aggregate. EachGET /admin/stats/.../{id}returns404when the id has no recorded usage, and the optionalwindowquery param is echoed back but does not filter, since the aggregates are all-time atomic counters like/admin/stats/models. The snapshot/persist format adds both dimensions as#[serde(default)], so snapshots written before this change still load without a format-version bump.
Changed¶
- Ignore root-level
.shscripts via.gitignoreso local helper scripts are not accidentally committed.
Documentation¶
- Document the admin API key management endpoints and the per-API-key/per-user usage statistics in the Admin REST API reference for both English and Korean (#773, closes #770). A new "API Key Management APIs" section covers the eight
/admin/api-keysendpoints (create, list, get, update, delete, rotate, enable, disable) with request/response schemas reflecting thesk-***abcdmasking and the full-value-once-on-create behavior, theApiKeyConfigfields, thepermissivevsblockingapi_keys.modesemantics, and runtime-key persistence viapersistence_filewith hot-reload. The Statistics section gains the four new stats endpoints, thewindowecho, theanonymousandunknownbuckets, the per-dimension cardinality cap, and theapi_key_id-to-issued-key linkage.
v1.10.0 - 2026-06-12¶
Added¶
- Dynamic model enumeration for Codex (ChatGPT OAuth) backends (#753, closes #752). The Codex backend has no standard
/v1/models; the router instead queries the plan-gatedGET <base>/models?client_version=<ver>endpoint with the loaded OAuth token during model discovery, keeps only user-facing entries (visibility: listand anavailable_in_plansthat is empty or matches the account'schatgpt_plan_typeclaim), and uses the result to populate/v1/modelsand routing. On any failure (network error, non-2xx, empty list) the configuredmodels:list remains the fallback, and the behavior applies only to Codex OAuth backends.
Changed¶
- Extract a shared
parse_responses_bodyinsrc/services/responses/sse.rsand a sharedbuild_openai_chat_request_coreinsrc/http/streaming/handler.rs, eliminating the duplicated/responsesSSE-aggregation parse path and the duplicated streaming request-construction logic that existed across the Codex, passthrough, and Anthropic handler variants (#766, closes #762). No behavior change; the consolidation removes drift risk where a fix to one copy would not propagate to the others.
Fixed¶
- Route Codex (ChatGPT OAuth) chat completions to the backend's
/responsesendpoint (#755, closes #754).build_responses_urlappended/v1/responsesto the.../backend-api/codexroot, and the upstream edge rejects that path with a 403 HTML page that the router surfaced as an opaque authentication error despite a valid OAuth token; the Codex root now routes through the shared URL composer that strips the internal/v1. With the path fixed, Codex rejects a bare-stringinputwith HTTP 400 "Input must be a list", sosanitize_codex_responses_requestcoerces a single-message stringinputinto a one-element item list. - Accept an RFC3339
expires_atin the OAuth token store (#756, closes #751). The field was a plainu64, so a token store that wrote the field as a datetime string failed to deserialize, the backend silently lost its OAuth strategy, and requests went out unauthenticated, surfacing as an opaque 403 that hid the real cause.expires_atnow deserializes leniently (integer or float epoch seconds, a numeric string, or an RFC3339 datetime with offset, all normalized to epoch seconds) and rejects anything else with an error naming the accepted formats; the canonical on-disk shape remains a JSON number, so existing stores and the save/load roundtrip are unchanged. - Strip local-engine-only fields (
chat_template_kwargs,thinking_budget_tokens,enable_thinking,preserve_thinking,top_k,min_p,repeat_penalty) from/v1/chat/completionsrequests bound for cloud OpenAI, which rejects unknown top-level keys with HTTP 400 "Unknown parameter" (#760, closes #758). The strip mirrors the existing cloud Gemini behavior through a sharedfield_filtermodule gated onapi.openai.com, runs at every cloud-OpenAI chat send site (non-streaming, streaming, and per-hop in the fallback loops), never touchesextra_bodyorreasoning_effort, and is a no-op for local OpenAI-compatible engines, preserving the backend-passthrough contract. - Respect the configured
models:selection for Codex dynamic enumeration and stamp the cleanowned_byon enumerated models (#765, closes #763). When live enumeration succeeded, the operator-configuredmodels:allowlist was bypassed, so/v1/modelsexposed every enumerated Codex model; the post-filter now applies uniformly, with a non-empty list intersecting the enumerated set down to the selected subset and an empty list exposing the full set. Enumerated Codex entries also carried the raw backend name asowned_by; the owner is now resolved from the backend type (openai), matching every other backend. - Return a valid non-streaming response for Codex (ChatGPT OAuth) backends on
/v1/chat/completionswithstream:false(#764, closes #761). Codex's/responsesendpoint only acceptsstream:true+store:false; sendingstream:falseproduced HTTP 400 "Stream must be set to true".sanitize_codex_responses_requestnow forcesstream:trueandstore:falseunconditionally. A second fix makes SSE detection inPassthroughService::execute_requestrobust to Codex responses that carry an SSE body withoutContent-Type: text/event-stream: alooks_like_ssesniffer checks the first non-empty line fordata:/event:/:prefixes, and a failed JSON parse retries SSE aggregation once before surfacing the original parse error. - Reconstruct the assistant message from
response.output_text.deltaevents for Codexstore:falseresponses (#768, closes #767). When Codex runs withstore:falsethe terminalresponse.completedevent carries an emptyoutputarray and the assistant text arrives only as incrementalresponse.output_text.deltaevents; the SSE aggregation now accumulates those deltas and synthesizes a message item when the completed event carries no assistant text, so non-streaming clients (including the title/summary utility path) receive the text instead of an empty response.
Documentation¶
- Document the post-v1.9.1 Codex and transport changes across the English and Korean manuals: the Codex Chat-to-Responses request handling and lenient
expires_atparsing inbackends.md, the cloud-OpenAI field strip inbackend-passthrough.md, anditem_referenceresolution on the Responses API inapi.md, and convert the remaining Korean pages to the polite-습니다register so every page reads in one voice.
v1.9.1 - 2026-06-11¶
Added¶
GET /versionendpoint that returns{ "version": "<CARGO_PKG_VERSION>" }, registered unconditionally on the base router (not behind theappproxyor any other Cargo feature) so it is present in the standard release binary, and aversionfield on the existingGET /healthresponse (#750, closes #749). Both endpoints stay outside the API-auth boundary, matching/health, so a downstream consumer can probe the running router version for feature-gating instead of failing open against a 404.
v1.9.0 - 2026-06-10¶
Added¶
- AppProxy worker mode behind an opt-in
appproxyCargo feature, letting Continuum Router run as a Backend.AI AppProxy inference worker driven by an AppProxy coordinator (epic #709). The feature is deliberately left out offull, so default builds are unaffected. - Foundation: the typed
AppProxyWorkerConfigsection (coordinator URL, sharedapi_secret/jwt_secret,redis_url, wildcard frontend parameters, heartbeat/reconcile durations, and an events toggle, with both bearer secrets redacted inDebugand${ENV_VAR}references resolved through the same path asbackends[].api_key), theSerializableCircuit/RouteInfowire types andProxyProtocol/AppMode/FrontendModeenums (snake_case and kebab-case tolerant, unknown fields ignored), and the module scaffold (#716). - Coordinator REST client
CoordinatorClientwithregister,heartbeat,deregister,list_circuits, andget_circuit, each carryingX-BackendAI-Tokenand a fresh per-callX-BackendAI-RequestID, and an error type that separates retryable (connection, timeout) from fatal (HTTP 4xx) failures (#717). - Circuit-to-backend translation and reconcile:
circuit_to_backendsbuilds oneBackendConfigper replica (namedappproxy-<circuit_id>-r<route_key>, traffic-ratio mapped to a1..=1000weight that never drops a route to 0, vLLM detection fromruntime_variant), andapply_circuitsinjects the translated backends through the existing hot-reloadconfig_sender, namespaced by theappproxy-prefix so statically configured and admin-API backends are preserved (#718). - Worker lifecycle service and a
/statusendpoint:run_workerregisters with backoff, performs an initial circuit pull, then runs a heartbeat loop (kept under the coordinator's 30s LOST timeout) and a pull-reconcile loop (the always-on backstop for missed events), auto-discovering each circuit's model from a replica'sGET /v1/modelsand deregistering on shutdown. The shared in-memoryAppProxyRegistryis indexed by subdomain and circuit id (#719). - Host/subdomain ingress resolver that turns a manager-issued endpoint subdomain into a concrete circuit and pins the request model so the existing selection path serves that circuit's replicas, with HS256 circuit-bearer verification that checks the decoded
idagainst the circuit id and rejects analg:nonedowngrade, an optionalaggregation_hostsfield for cross-circuit model aggregation, and a pure fall-through when nowildcard_domainis configured (#720). - Redis Pub/Sub circuit-event overlay that gives legacy-mode coordinators sub-second circuit updates: a subscriber loop with exponential-backoff reconnect, a base64 + msgpack envelope codec, handlers for
circuit_created/circuit_removed/circuit_route_updated, and the ack envelope that prevents the coordinator's E10001 "Proxy worker not responding" error (#721). - Claude Fable 5 (
claude-fable-5) and Mythos 5 (claude-mythos-5) model support (#747). Both are 1M-context, 128K-max-output models priced at \(10/\)50 per MTok; Mythos 5 is the same underlying model with safety classifiers lifted, shipped only through the limited Project Glasswing release. A newis_mythos_classhelper routes both ids through the Anthropic capability gates: adaptive thinking is required (legacybudget_tokensis rejected with HTTP 400 and normalized to adaptive),temperature/top_p/top_kare dropped, themaxeffort level is supported (xhighmaps tomax), and mid-conversation system messages are preserved. Both reject an explicitthinking.type == "disabled", soexplicit_thinking_for_modelnow returns anOptionand the router omits the thinking parameter entirely instead of forwarding a value that would 400.opus_supports_max_effortis renamed tosupports_max_effortbecause themaxeffort level is no longer Opus-only. The same handling applies to the OpenAI Responses API conversion path. - Gemma 4 QAT model metadata for the five quantization-aware-training checkpoints (E2B, E4B, 12B Unified, 26B-A4B MoE, 31B dense) with load-bearing
-it-qataliases and resolution/drift-guard tests (#723, closes #722). - Gemma 4 12B Unified model metadata (#705).
Changed¶
- Log request-body extractor rejections at
warn, so a malformed or oversized body that Axum rejects before the handler runs is visible in the logs instead of failing silently (#707).
Fixed¶
- Stop the retry loop from hammering the same upstream on HTTP 429 by distinguishing transient rate limits from non-transient quota/credit exhaustion and by honoring the upstream
Retry-Afterhint (#742, closes #740). Previously every 429 was retried up tomax_attempts(default 3) with a fixed exponential backoff that ignored the provider'sRetry-After, so for a model served by a single upstream the router re-hit the same exhausted endpoint, amplifying load and adding latency before an inevitable failure. TheRouterError::RateLimitedvariant now carries aretryableflag: non-transient 429s (OpenAIinsufficient_quota/billing_hard_limit_reached, and clear credit-depletion language such as "prepayment credits are depleted") are classified as non-retryable so the router fails fast after a single call and passes the provider status and body through, while transient signals (bareRESOURCE_EXHAUSTED/RPM throttling,rate_limit_exceeded,rate_limit_error) stay retryable. The classifier is deliberately narrow: the over-broad "billing" and "exceeded your current quota" markers that Google reuses verbatim for transient throttling were removed so a recoverable Google 429 is no longer flipped to fail-fast. A retried 429 uses the upstreamRetry-Afterfor its backoff (capped tomax_delay) and fails fast without sleeping when the requested interval would exceed the remaining total-timeout budget. The hint is preserved end to end (parsed from Google'sRetryInfo.retryDelayand the integer-secondsRetry-Afterheader, then reflected in the client-facingRetry-Afterheader). The budget probe now usessaturating_addand the parsed hint is clamped to 24 hours (MAX_RETRY_AFTER_SECS), closing a remotely triggerable panic where a hostile upstream's near-u64::MAXRetry-Afteroverflowed theDurationaddition and aborted the request task. - Persist the accumulated
/v1/responsesresponse on the streaming conversion paths (Anthropic, Chat-Completions/Gemini fallback) so a follow-up request that references a streamed output item via{"type":"item_reference","id":"item_..."}(the default behavior of the OpenAI and Vercel AI SDKs) resolves instead of returning HTTP 400, even when step 1 usedstore:true(#746, closes #745). The completed response is stored before the firstresponse.completedevent reaches the client; error-terminated streams that never emitresponse.completedare not stored, matching the non-streaming error paths, and passthrough streaming is unchanged because the upstream owns storage there. - Resolve
item_referenceinput items before strategy dispatch so/v1/responsesno longer returns HTTP 400 for Anthropic/Claude backends on a multi-step tool round-trip that submits anitem_referenceinstead of an inline item (#743, refs #741). References are rewritten to inlineFunctionCall/Message/FunctionCallOutputitems (de-duplicated bycall_id, first-wins),build_context_for_userreconstructs stored function-call output items as proper tool_use/tool_result pairs, the OpenAI/Azure passthrough path still forwards references unchanged, an unresolvable reference returns a descriptive400naming the id, and a 256-item cap (MAX_ITEM_REFERENCES) bounds the per-request session-store scan. - Propagate
server.workersto the Tokio runtime (#736, refs #734). The value was documented and shipped inconfig.yaml.examplebut had no effect, becausemainused an argument-less#[tokio::main]and the runtime always ran withnum_cpus::get()worker threads.mainis now synchronous: it peeksserver.workersfrom the config file, builds a correctly sized multi-thread runtime through the existingRuntimeConfig::build_runtimepath, and runs the async body on it, falling back to the CPU count when the value is unset or 0. - Emit conformant
function_calloutput items and argument events on/v1/responsesstreaming for non-passthrough providers (Anthropic and chat-completions-backed routes), preserving text output payloads and tracking interleaved parallel tool-call arguments by upstream index (#725). - Reuse the shared
ApiKeyStorein the Files API routes instead of constructing a separate store, so a runtime-managed API key is recognized consistently across the Files API and the rest of the router (#706). - AppProxy: preserve sibling circuits on single-circuit events (#737, closes #731). A worker serving two or more circuits previously wiped every unaffected sibling's
appproxy-*backends on any single-circuit event (leaving them 404/502 until the next pull-reconcile, up to 15s), becauseRegistryEntrycarried no route info and the rebuilt set held only the delta circuit. The full circuit is now cached on eachRegistryEntryand unchanged siblings are rebuilt from it, so only the delta circuit's backends change. - AppProxy: reach the fallback chain from wildcard subdomain ingress (#738, closes #735). A registered circuit whose replicas are all down is no longer a dead end; after per-circuit authorization it is pinned to its canonical model and handed to the normal pipeline, where
FallbackServicetakes over (the "deployment went down, traffic goes to a cross-provider model" behavior operators expect). The fall-through is scoped to registered-but-down circuits, a truly unknown subdomain stays a 404, and the open-to-public / bearer-token / IP-allow-list gates still run first. - AppProxy: preserve event-known models during periodic reconcile (#739). When the Redis event overlay learned a circuit's model before any successful pull probe, a reconcile could evict the registered-but-down registry entry and break scoped fallback; reconcile now reuses the shared registry's known model before probing replicas.
- AppProxy: align
WorkerRegisterResponsedeserialization with the Backend.AI coordinator's actual response shape, which carriesslotsas an array plusavailable_slotsas the count (#726).
Documentation¶
- Rewrite the Zensical user documentation as a current-state manual: drop development-log narration, the roadmap, and "coming soon" entries; correct configuration-reference drift against the actual config structs (nonexistent sections and keys, retry field names, the admin auth shape, the environment-variable tables, and the config discovery order); document previously missing shipped behavior (seven CLI flags, the auth login subcommand, Windows AF_UNIX support and SSE over Unix sockets, and the Windows/musl and .deb release artifacts); and bring the Korean docs to parity with English.
- Add the AppProxy worker mode design document (#708).
- Condense the README "Recent Updates" list to one concise line per release.
Dependencies¶
- Remove the
validatorderive dependency and its transitiveproc-macro-error2, clearing the RUSTSEC-2026-0173 advisory that previously needed a temporarycargo-denyignore whilevalidatorhad no safe upgrade path (#733, #732). - Update Rust package versions (#732).
v1.8.2 - 2026-06-02¶
Fixed¶
- Stop forwarding the client
Accept-Encodingheader on the/v1/responsespath (#702). When a client sentAccept-Encoding: gzip, deflate, br, the responses-path header filter omittedaccept-encodingfrom its block list and forwarded it to the upstream backend, which then negotiated gzip and returned compressed bytes. Because reqwest disables automatic decompression once anyAccept-Encodingheader is set manually (the explicit.header("Accept-Encoding", "identity")call only appended a second value rather than replacing the forwarded one), the SSE transform received raw gzip bytes, parsed them as text, and dropped the leadingresponse.created,output_item.added,content_part.added, andoutput_text.deltaevents, leaving only tail fragments with emptyitem_idandtext."accept-encoding"is now inFILTERED_HEADERSfor both the primary convert path (src/http/handlers/responses.rs) and the responses-native passthrough path (src/proxy/responses_only.rs), restoring parity with the chat-completions proxy (src/proxy/backend.rs) so the upstream only ever receivesAccept-Encoding: identity. - Stop double-wrapping Responses SSE lines so the non-GPT
/v1/responsesstreaming conversion emits single-layer OpenAI-compatible SSE records on the converted Anthropic and Chat-Completions paths instead of nested ones (#701).
Dependencies¶
- Bump
uuid1.23.1 → 1.23.2,redis1.2.1 → 1.2.2,socket20.6.3 → 0.6.4, andserial_test3.4.0 → 3.5.0 (#699).
Tests¶
- Hardening regression coverage for the production StreamService conversion processors, asserting single-layer SSE output on the converted Anthropic and Chat-Completions paths (#701).
- Integration regression for the
/v1/responsesstreaming path with an Anthropic backend and a gzip-requesting client, asserting the upstream request receives onlyAccept-Encoding: identityand the transformed Responses SSE stream retains the full event sequence with populated text and item ids (#702).
v1.8.1 - 2026-05-29¶
Added¶
- Claude Opus 4.8 recognition with a
claude_family_versionparser that replaces the hardcodedopus-4-7/opus-4-6substring gates (#693, part of #687). The four Anthropic capability predicates now compare a parsed(major, minor)version:uses_adaptive_thinking_api≥ (4,6),model_requires_adaptive_thinking/model_forbids_sampling_params≥ (4,7), andopus_supports_max_effort= Opus and ≥ (4,6). The parser treats the first integer token as the major and the next version-like token (1 to 2 digits, value < 100) as the minor, so an 8-digit date suffix like20250514yields minor 0 and is never mistaken for a version, and new minor releases are recognized without per-version edits. Adds theclaude-opus-4-8metadata entry (1M context, 128K output, \(5/\)25 pricing, adaptive thinking, Jan 2026 cutoff) and registersclaude-opus-4-8/claude-opus-4-8-latestin the built-in supported models and config samples. Behavior for 4.5/4.6/4.7 and Sonnet variants is preserved. - Anthropic fast mode behind a per-backend
anthropic_fast_modeopt-in (default off) (#694, part of #687).is_fast_mode_eligiblereturns true only for Opus 4.6/4.7/4.8 and later Opus minors;merge_beta_headercomma-joins and de-duplicates beta tokens while preserving any client-suppliedanthropic-beta. On the native/anthropic/v1/messagespath,resolve_fast_mode_betainjects the mergedfast-mode-2026-02-01beta header only when the request isspeed: "fast", the model is eligible, the backend is native Anthropic (never Bedrock), and the opt-in is enabled; when fast mode does not apply,speedis stripped from the outgoing body so it cannot trigger a spurious upstream 400. The OpenAI-compatible path forwardsspeed: "fast"and injects the beta header only for eligible, opted-in, native Anthropic targets. TheAnthropic -> OpenAIandAnthropic -> Googlefallback parameter mappings removespeedso a fast-mode request that falls back to a non-Anthropic backend does not leak the native-only field.usage.speedis preserved on the response. - Mid-conversation system messages for Claude Opus 4.8+ (#695, part of #687). A
role:"system"entry inside themessagesarray (which earlier Claude families reject with HTTP 400) is now accepted, gated on a newsupports_mid_conversation_system(model_id)predicate that reusesclaude_family_versionand matches family version ≥ (4,8). The native handler round-trips the entry unchanged to a native Anthropic backend; the cross-provider transforms map theSystemrole onto the OpenAIsystemrole and preserve it as user-role text for Gemini and Responses. The OpenAI-compatible transform emits mid-conversationsystem/developermessages (after the first user turn) as in-arrayrole:"system"entries for supporting models, while leading system messages still fill the top-levelsystemfield. Non-supporting models (Opus 4.7 and below, all Sonnet/Haiku, Bedrock-prefixed ids, non-Claude ids) keep the historical flattening into the single top-levelsystem. - Refusal
stop_detailsand therefusalstop reason propagated through the full Anthropic response pipeline (#696, part of #687).map_anthropic_finish_reasonmaps"refusal"to"content_filter"; the non-streaming transform and the streamingmessage_deltahandler attach thestop_detailsobject to the choice whenstop_reasonis"refusal", and omit the key (rather than forwarding a null) when upstream sends an explicit null.
Fixed¶
- Accept an
input_imagecontent part that references a Files API upload viafile_idinstead of an inlineimage_urlonPOST /v1/responses(#686, refs #681). Because the parent enums are#[serde(untagged)], a{"type":"input_image","file_id":"file-..."}part previously failed deserialization with a generic untagged-enum error and Axum returned HTTP 422.image_urlis now optional with an addedfile_id, mirroringinput_file; a sharedresolve_local_file_to_data_urlhelper resolves a localfile_idto an inline base64 image_url data URL through the same metadata, ownership, size, load, and base64 sequence, honoring ownership and the 10MB size limit. The OpenAI/Anthropic/Gemini converters handle the optionalimage_url, emitting the image when resolved and warning + skipping an unresolvedfile_id.validate_requestwalks message content and rejects aninput_imagewith neitherimage_urlnorfile_idwith a clear 400 before file resolution. - Harden the Claude Opus 4.8 routing gates so fast-mode speed is only forwarded when the transport has confirmed native-Anthropic opt-in and beta-header injection, non-Opus Claude families keep mid-conversation system messages flattened, and OpenAI-compatible Anthropic responses preserve
usage.speed(#698, refs #687).
Documentation¶
- Document Claude Opus 4.8 support in English and Korean (#697, closes #692, part of #687): add
claude-opus-4-8-*to the adaptive-thinking model list and the sampling-params-deprecated warning inreasoning-effort.md; add the Claude Opus 4.8 model detail, an Anthropic Fast Mode section, a Mid-Conversation System Messages section, and the refusalstop_reason->content_filtermapping tobackends.md; and add a changelog entry covering model recognition, fast mode, mid-conversation system messages, and refusalstop_details. - Document
input_imagefile_idsupport inapi.md(English and Korean), noting that exactly one ofimage_urlorfile_idis required and thatfile_idis resolved to an inline base64 data URL before reaching the backend under the same ownership and 10MB size limit as theinput_filepath (#686, refs #681).
Tests¶
claude_family_versionandsupports_mid_conversation_systemboundary tests (4.7 false, 4.8 true, Sonnet/Haiku/older false, Bedrock-prefixed and cross-region/ARN ids false), fast-mode eligibility, beta-header merge/dedup,speed/usage.speed(de)serialization,resolve_fast_mode_betagating with client-beta merge, andspeednon-leak versus preservation across fallback providers (#693, #694, #695).- Refusal coverage: non-streaming and streaming refusal with and without
stop_details, the explicit-null omission path, and regression tests forend_turn/max_tokens/stop_sequence/tool_use(#696). input_imagefile_id-only andimage_url-only deserialization, the full failing request payload as a regression, FileResolver resolution honoring ownership and the size limit, converter output across all three backends, and the neither-field validation case (#686).
v1.8.0 - 2026-05-28¶
Added¶
- Per-API-key backend access control via an optional
allowed_backendsallow-list on client API keys (#677, closes #674). When the list is non-empty, requests authenticated with that key may only route to the named backends; an empty or absent list keeps the existing unrestricted behavior. The field is integrated end-to-end: config file and hot-reload, the runtimeApiKeyandAuthContext, the backend-selection chokepoint and the Responses / Anthropic selection paths, cross-provider fallback, the Admin REST API (create/update/get/list), runtime-key persistence, and the models-listing endpoints. select_backend_with_retryand the Responses (StreamService), Anthropic-native,count_tokens, and image handlers filter candidates by the key's allow-list. When the model exists but the allow-list rejects every candidate, the request is rejected with a newRouterError::Forbiddenvariant mapped to403witherror_type = "permission_error"(non-retryable), distinct from the401 AuthError./v1/models,/v1/models/extended, and/anthropic/v1/modelsare filtered to models served by at least one allowed backend when a restricted key is authenticated;GET /v1/models/{model}returns404for a model the key cannot reach. Unauthenticated or unrestricted callers see the full list.- A new
api_optional_auth_middlewareis layered in permissive mode. It validates a presented bearer token on a best-effort basis and attachesAuthContextwithout ever rejecting, so per-key restrictions apply to authenticated callers while anonymous and invalid-token callers pass through unrestricted. The existing blocking-modeapi_auth_middlewareis unchanged; the two are never layered together. - Config validation warns (does not hard-fail) when a key's
allowed_backendsreferences an unknown backend name, so a backend rename does not brick the router before operators update the keys. fallback.mid_stream_enabledconfig field (defaulttrue) so operators can keep cheap pre-stream backend re-selection on the initial connection while turning off the per-stream mid-stream buffering (#680, closes #676). Previouslyfallback.enabledwas all-or-nothing: on meant both pre-stream and mid-stream fallback (with a per-streamStreamAccumulatorbuffering roughly 100 to 200 KB), off meant no fallback at all. Memory-constrained or high-concurrency hosts now have a middle ground.- The streaming dispatch becomes a three-way decision factored into a pure
decide_streaming_fallback_dispatchhelper. Withfallback.enabledtrue and a chain configured,mid_stream_enabled = truekeeps the buffering path (handle_streaming_with_mid_stream_fallback),mid_stream_enabled = falseroutes to the revivedhandle_streaming_with_pre_stream_fallback(noStreamAccumulatororMidStreamFallbackContextallocation; mid-stream failures surface as a normal stream error), and otherwise the standard no-fallback path runs. - The previously dead
handle_streaming_with_pre_stream_fallbackandadvance_to_next_fallbackare now live; the#[allow(dead_code)]markers are removed and the per-key allow-list is threaded through fallback re-selection so the newly-live path does not bypass per-API-key access control.
Changed¶
- Removed the streaming-local copies of
transform_payload_for_openaiand itsrequires_max_completion_tokenshelper fromsrc/http/streaming/handler.rs; both call sites now resolve to the canonical implementations incrate::proxy::utils(#679, closes #660). The two copies were byte-for-byte identical, creating a drift risk where a change to one would not be mirrored to the other. The streaming-local duplicate unit tests are removed since the canonical tests insrc/proxy/utils.rsalready cover the same contract.
Fixed¶
- Enforce the per-API-key backend allow-list in the default mid-stream fallback handler when it re-selects a backend after a mid-stream failure (#683). With
fallback.mid_stream_enabled = true(the default), a restricted key whose fallback chain mapped to a disallowed backend was transparently switched to it on a mid-stream failure, an access-control bypass that the newmid_stream_enabledwork in PR #680 had only closed on the pre-stream path.handle_streaming_with_mid_stream_fallbacknow takes an ownedallowed_backends: Option<Vec<String>>(moved into its spawned streaming task), and a newresolve_allowed_backend_name_for_modelhelper wrapsresolve_backend_name_for_modeland applies the sameallowed_backends.filter(|l| !l.is_empty())+ exact-name membership semantics used everywhere else in the module (try_get_healthy_backend_for_model,get_backend_for_model_streaming,get_healthy_backend_for_streaming). All three fallback re-selection sites resolve through it; a disallowed or unresolvable candidate returnsNone, so the existingwarn + continuearm skips it and the chain index advances. When every remaining candidate is filtered out the loop terminates via the existing chain-exhausted paths and surfaces an error to the client. An empty orNoneallow-list is byte-for-byte the prior unrestricted behavior. - Resolve an allowed fallback backend before rebuilding the fallback payload in mid-stream fallback, so a disallowed chain entry can no longer mutate
current_payloadbefore being skipped (#684). - Anthropic-native
x-api-keycallers now supply the sameallowed_backendspolicy asAuthorization: Bearer-authenticated callers when noAuthContextis present, covering Messages,count_tokens, and models listing (#684). The previously documented limitation that the per-key allow-list was unenforced on the native Anthropic surface is now closed. - Close a HIGH-severity authorization-bypass on
POST /v1/responses/compactfound in the post-merge security audit of #677. The endpoint was the one client-facing model-routing handler that never readAuthContextfrom request extensions, so a key scoped to backend set A could reach a disallowed passthrough backend B (OpenAI / Azure) through compaction whenever B served the requested model.compact_responsenow mirrorscreate_responseexactly: it derives the per-key allow-list viaallow_list_from_auth, filters the model's candidate backends against it before any health-check or passthrough, and returns a deterministic403 permission_errorwhen the model exists but the key cannot route to any backend serving it. The filter precedes backend forwarding, so the rejection happens without contacting an upstream.
Documentation¶
- Clarify in
config.yaml.example,src/services/streaming/mid_stream_config.rsrustdoc, anddocs/en/configuration/advanced.mdthatmid_stream_fallback.enableddoes not disable mid-stream buffering or reduce memory: regardless of its value theStreamAccumulatoris still constructed and still buffers up to roughly 100 KB per stream, and mid-stream fallback still activates on backend failure (#678, closes #675). The flag only selects continuation versus restart mode for the fallback request. The real kill-switch for the buffering and memory isfallback.enabled: falseor omitting the model fromfallback.fallback_chains. The Korean docs (docs/ko/configuration/advanced.md) had no corresponding Mid-Stream Fallback section so no Korean change is made for the clarification, but #680 separately added a translated "미드스트림 버퍼링 비활성화" subsection covering the new toggle. - Document the per-API-key
allowed_backendsallow-list in the security docs (English and Korean) including the now-closedx-api-keyAnthropic Messages limitation (#677, #684).
Tests¶
- Deterministic unit tests for
resolve_allowed_backend_name_for_model: allow-list excludes the resolved backend returnsNone, includes it returnsSome, empty list returnsSome,Nonematches the unfiltered resolver, unresolvable model returnsNone(#683). - Integration tests in
tests/per_key_backend_access_test.rsdriving/v1/responses/compactbehind blocking auth: a restricted key requesting a model served only by a disallowed backend gets403 permission_error(the security case), and a restricted key requesting an allowed backend's model passes the filter (#677). - Unit test for the new
decide_streaming_fallback_dispatchhelper covering the three-way decision matrix betweenfallback.enabled,mid_stream_enabled, and chain presence (#680).
v1.7.1 - 2026-05-28¶
Fixed¶
- Anthropic
web_search_20250305server-tool emulation no longer masks provider failures as empty result sets. A missing API key, HTTP 401/403/429, timeout, or parse error now surfaces to the client as aweb_search_tool_result_errorcontent block with a mappederror_code("too_many_requests"for HTTP 429,"unavailable"for all other failures). A genuinely empty-but-successful search continues to emit an emptycontent: []array, so the two outcomes remain distinguishable. Both non-streaming and streaming SSE code paths are covered, with an explicit SSE event-ordering assertion guarding the streaming error path. Serper response-shape drift (a body that parses cleanly but omits theorganickey) is now detected and logged with a provider-taggedwarn!that records only the observed top-level keys, never the user query text. (#671, #672, #673)
Dependencies¶
- Bump
lru0.16 → 0.18,reqwest0.13.3 → 0.13.4, andrusqlite0.39 → 0.40 (libsqlite3-sys0.37 → 0.38, unpinned now that CI runs Rust 1.95), plus a transitive lockfile refresh (aws-lc-rs1.16.3 → 1.17.0,h20.4.13 → 0.4.14,http1.4.0 → 1.4.1, andhyper/axum/reqwestknock-on revisions);cargo auditreports zero vulnerabilities across the dependency graph (#669, #670).
v1.7.0 - 2026-05-26¶
Added¶
- AWS Bedrock Claude backend Phase 1 (bedrock-mantle) over a Bearer token (#616, closes #613)
- New
type: bedrockbackend with serde aliases (aws-bedrock,bedrock-anthropic,AwsBedrock, ...).is_commercial()returns true andowned_by()returnsSome("anthropic")so OpenAI-shaped clients see the expected model lineage. endpoint_type: mantle(default) speaks the native Anthropic Messages API at the region-templatedhttps://bedrock-mantle.{region}.api.aws, routes to/anthropic/v1/messages, usesAuthorization: Bearer, and omitsanthropic-version(Bedrock returns HTTP 400 if present). An expliciturl:field overrides the template for proxies and tests; empty or uppercase regions are rejected at load time.model_ids.rsrecognizes plain (anthropic.<family>), geographic (us./eu./jp./au.), global (global.anthropic.<family>), and full-ARN identifiers, forwarded unchanged with no automatic alias mapping because the geo prefix carries real billing and residency consequences.-
The existing OpenAI/Anthropic body transforms and the Anthropic SSE stream transformer are reused unchanged, so per-model quirks (Opus 4.7 sampling-param ban, adaptive thinking) apply identically to Bedrock without duplication.
endpoint_type: runtimeis reserved here and implemented in Phase 2 below. -
Request-path rate limiting is now enforced, plus a Redis storage backend (#635, #632, closes #626)
state.rspreviously dropped theMiddlewareLayerreturned byinitialize_rate_limiting, so therate_limiting.*config was a silent no-op. The layer now flows throughServiceHandles→ContinuumRouterBuilder::build_router→Router::layerand attaches to the assembled Axum app, so configured budgets actually return429.- All five
rate_limitingdimensions are now optional:per_client,per_backend, andglobaljoin the already-optionalper_api_key/per_model. Operators may omit any dimension and load with it disabled; theDefaultimpl retains all three so existing deployments are unaffected (#632). -
New
redis-cache-gatedrate_limit_v2::redis_backendruns a token-bucket and a sliding-window Lua script, each as a single atomicEVAL, reusing the sharedcreate_redis_poolhelper with keys likecr:rl:per_client:10.0.0.1. On any Redis failure (pool unavailable, timeout, Lua error) the backend reportsBackendUnavailableand the caller falls through to the in-process token bucket, degrading to per-replica enforcement rather than dropping requests. -
Cross-provider fallback is now wired into request dispatch and hot-reload (#631, #637, #665)
- The completed
src/core/fallback/module (~4,900 lines, 36 unit tests) was never called from the request path, so configuredfallback.fallback_chainswere a silent no-op.FallbackServicenow runs forchat_completions(non-web-search),completions,embeddings,rerank,sparse_embeddings, and image generation, with anexecute_with_optional_fallbackwrapper (no overhead when fallback is unconfigured),X-Fallback-*response headers, and aFrom<RouterError>toTriggerReasonmapping that satisfies the executor bound (#631). -
fallback.fallback_chainsandfallback.fallback_policychanges now apply at runtime through the hot-reload subscriber viaFallbackService::update_config; togglingfallback.enabledremains restart-only and is documented as such inconfig.yaml.example(#637, #665). -
POST /v1/models/refreshforce-refresh endpoint for interactive desktop use (#593, #664) - Clears the
ModelCache(all_modelskey) and synchronously re-aggregates from all configured backends before responding, returning the same{"object":"list","data":[...]}shape asGET /v1/models. Desktop clients (e.g. backend.ai-go "Refresh models" button) can use the response immediately without a second round-trip. - Rate-limited per verified API key, with anonymous or invalid-token callers sharing one global anonymous bucket: 3 requests per 5-second burst window, 12 per minute. Callers that exceed the limit receive
429 Too Many Requests. The limit is intentionally tighter than the regular list endpoint because each call triggers an upstream fetch from every configured backend. - Gated by a new
model_aggregation.allow_force_refresh: boolconfig field (defaulttrue). Setting it tofalsemakes the endpoint return403 Forbidden, suitable for hardened deployments where clients must rely on TTL-based expiry. - Each refresh logs the verified API key ID when available, or
anonymousotherwise, atINFOlevel for audit correlation. config.yaml.exampleextended with desktop-embedded guidance:cache_ttl: 10,soft_ttl_ratio: 0.5, andallow_force_refresh: truefor backends.ai-go style embedded proxy.-
New
force_refresh(state)helper onModelAggregationServiceencapsulates the clear-then-aggregate flow.allow_force_refresh()accessor exposes the config flag to handlers. -
AWS Bedrock Claude backend Phase 2:
bedrock-runtimewith SigV4 + AWS binary event-stream (#614) - New
endpoint_type: runtimevalue ontype: bedrockbackends targetshttps://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/invoke[-with-response-stream]. The router signs each request with AWS Signature V4 (service: "bedrock"), wraps the OpenAI → Anthropic body with"anthropic_version": "bedrock-2023-05-31", strips the top-level"model"field (Bedrock takes the model ID from the URL path), and percent-encodes the model identifier into the path so versioned foundation IDs (anthropic.claude-3-5-sonnet-20240620-v1:0) and full ARNs round-trip cleanly. - Streaming responses arrive in the AWS
application/vnd.amazon.eventstreambinary frame format. A newbedrock::event_stream::EventStreamDecoderreassembles frames that span multiple TCP reads, base64-decodes eachchunkpayload, and emits syntheticevent: <type>\ndata: <json>\n\nSSE bytes for the existingAnthropicStreamTransformerto translate into OpenAI-shape SSE. Exception frames (ThrottlingException,ValidationException, ...) surface as syntheticevent: errorSSE chunks instead of being silently dropped. - AWS credentials resolve in this order: inline
auth.aws.access_key_id+auth.aws.secret_access_key(+ optionalsession_token), then a named profile viaauth.aws.profile, then the standard AWS chain (env vars, shared config, IMDS, IRSA / EKS pod identity, ECS task role). The resolver is fronted byaws_credential_types::provider::SharedCredentialsProviderso temporary credentials refresh transparently between requests. - New
BackendAuthType::Sigv4variant onBackendAuthConfig, plus anAwsAuthConfigsub-block underauth.aws. Both are wired throughDebugredaction so static credentials never leak into logs.BackendAuthTypeacceptssigv4,aws_sigv4, andaws-sigv4as YAML spellings. - Health check for runtime probes
POST /model/{probe_model}/invokewith a single-token body; HTTP 2xx, 400, 401, 403, and 429 all count as healthy because they prove the AWS surface is reachable and the operator can address auth/billing issues separately. - All AWS SDK crates (
aws-sigv4,aws-smithy-eventstream,aws-credential-types,aws-config) sit behind a new optionalbedrock-sigv4Cargo feature. Default builds do not pull them in; configuringendpoint_type: runtimewithout the feature returns a clear error pointing at the rebuild flag instead of failing at the AWS edge. The Phase 1 mantle path is unaffected and works with or without the feature. - Proxy header policy in
src/proxy/backend.rsandsrc/http/handlers/anthropic/handler.rssplits onendpoint_type: Bedrock-mantle keepsAuthorization: Bearer, Bedrock-runtime suppresses the static-Bearer injection because the Backend trait implementation signs each request with SigV4. -
English and Korean docs in
docs/{en,ko}/configuration/backends.mdextended in place with the newendpoint_type: runtimeconfiguration, build requirement, credential chain, IAM policy snippet, geo/global profile behaviour, and a streaming-pipeline overview. -
EXAONE 4.0 (vLLM) registration with a request-gated hybrid-model thinking transform (#640, refs #639)
- EXAONE 4.0 (e.g.
EXAONE-4.0-32B-FP8-RNGD) is a hybrid reasoning model: in reasoning mode it streams chain-of-thought inline incontentended by a lone</think>; in non-reasoning mode it emits a plain answer with no</think>. - The
assume_reasoning_first(unterminated_start) transform is now gated on the request actually enabling thinking (chat_template_kwargs.enable_thinkingor top-levelenable_thinking; conservative default false) at both the HTTP and Unix-socket streaming decision points, so non-reasoning mode no longer emits the whole answer asreasoning_contentwith emptycontent. Standard-pattern models keyed off a real<think>marker are unaffected. -
Registers
exaone-4.0-32bwith theunterminated_startconfig; the served-RNGDname resolves via the hardware-suffix peel below. -
NPU/accelerator hardware-variant suffix normalization in model-id matching (#662)
- Adds a HARDWARE/ACCELERATOR category (
rngd,warboy,atom,atommax,rebel) tois_recognized_format_token()so FuriosaAI (RNGD/WARBOY) and Rebellions (ATOM/ATOMMAX/REBEL) serving-target suffixes normalize to the canonical base metadata entry through the existing layered peel chain, without a per-model alias.layered_format_strip()lowercases before peeling, so runtime-emitted upper-case names resolve correctly. - Exact-id and exact-alias phases run before the peel, so any model legitimately registered as
*-atomor*-rebelwins via exact match first; a grep of shippedmodel-metadata.yamlconfirms zero current collisions.EXAONE-4.0-32B-FP8-RNGDnow normalizes toexaone-4.0-32bby peel (-rngdthen-fp8).
Changed¶
- Narrow the fallback handler's LM Studio compatibility shim so only
/and/v1/modelsreturn200; all other unmatched routes now return404. The JSON error body shape is preserved unchanged, so any consumer already reading the body continues to work (#628).
Fixed¶
- Strip seven non-OpenAI top-level fields (
chat_template_kwargs,thinking_budget_tokens,enable_thinking,preserve_thinking,top_k,min_p,repeat_penalty) before forwarding to Gemini's/v1beta/openai/chat/completionsendpoint, which returns HTTP 400INVALID_ARGUMENTfor unknown keys; theextra_bodyescape hatch is untouched andreasoning_effortstays (Google maps it tothinking_level). Also extend the3.5-flashthinking-disable and is-thinking matchers, since3-flashdid not substring-match3.5-flash(#642). - Wire request-stats recording into every Anthropic handler code path (native HTTP/Unix, Bedrock mantle/runtime, OpenAI-compatible, Responses API) for both streaming and non-streaming, adding an
AnthropicStreamUsageTrackerthat accumulates input/output tokens from raw passthrough SSE (#627, #634). - Use the configured
timeouts.request.streaming.chunk_intervalinstead of a hardcoded 60s for mid-stream inactivity, and emit bounded keep-alives so a silent backend now advances to the next fallback model viaStreamOutcome::Failedrather than emitting keep-alive comments forever (#633). - Accept partial
model_overrides.<model>.streaming/standardblocks via newStreamingTimeoutOverride/StandardTimeoutOverridestructs whoseOption<String>fields merge over the base config, fixing YAML parse failures on the--generate-configoutput path (#630). - Gate
admin/metrics/metrics-persistence/webuiimports and functions behind their Cargo features, and add a#[cfg(not(feature = "metrics"))]no-op metrics stub mirroring the public surface used by always-compiled callers, so feature-reduced builds compile cleanly (#629, #666, closes #636). - Harden force-refresh rate limiting so anonymous and invalid-token callers share one global bucket, preventing spoofed
Authorization/X-Forwarded-For/X-Real-IPheaders from bypassing the budget. - Metrics history query limiting, UTF-8-safe metric label truncation, and Bedrock runtime routing through the typed SigV4 implementation (follow-up to #608, #609, #613, #614).
Documentation¶
- Document the Bedrock backend (region selection, geographic vs global inference profiles, model ID format, credential chain, IAM policy snippet, and streaming pipeline) in
docs/{en,ko}/configuration/backends.md; add a Force-Refresh Models section todocs/en/api.md; and extendconfig.yaml.examplewith desktop-embedded model-aggregation guidance and fallback hot-reload annotations.
Tests¶
- Negative and positive case coverage for
transform_payload_for_openai(#661). - Rate-limit middleware hot-reload tests with documented bucket-reset behaviour, plus
router_wiring_teststhat build a realContinuumRouterand assert429fires when the burst is exhausted (#635, #638, #667). - Verify
web_searchinjection interacts correctly with the passthrough contract (#663). - MLxcel streaming passthrough integration test (#659).
- Bedrock unit and integration coverage: serde aliases, URL templating, header policy, model-ID parsing (geo/global/ARN), runtime SigV4, and event-stream frame decoding driven against a wiremock server (#616, #614).
Dependencies¶
- Bump tokio 1.52.1 → 1.52.3, tower-http 0.6.8 → 0.6.11, dashmap 6.1.0 → 6.2.1, serde_json 1.0.149 → 1.0.150, aws-config 1.8.16 → 1.8.17, aws-sigv4 1.4.3 → 1.4.4, and aws-smithy-types 1.4.7 → 1.4.8 (#619, #658).
v1.6.3 - 2026-05-12¶
Added¶
- Per-API-key LLM token usage metrics (#608, #610)
- New Prometheus
llm_tokens_total{api_key_id, model, backend, kind}counter that records actual prompt and completion token consumption per API key, model, backend, and token kind. The hot-path counter's label set is intentionally minimal — extra dimensions live on the companion info-metric below. - Companion
api_key_info{api_key_id, ...}info-metric exposes a configurable allowlist of per-API-key annotation labels (e.g.email,team,environment) so dashboards can group/filter the token counter via standard PromQL* on(api_key_id) group_left(...)joins without bloating the hot-path counter's label set. derive_api_key_idreturns either the configuredid(when the auth layer matched the request) or a SHA-256 first-12-hex prefixk_<hex>of the raw bearer token. The raw key is never used as a label. A dedicatedApiKeyCardinalityTracker(default cap: 1000 unique key IDs) prevents label-cardinality explosion.ApiKeyConfigand the in-memoryApiKeygain anannotations: HashMap<String, String>field.MetricsConfiggainsannotation_labels: Vec<String>— the allowlist that materializes as labels onapi_key_info. Reserved canonical annotation keys are documented (email,uuid,owner,team,environment); operators may add custom keys.- Streaming and non-streaming paths record at the existing
usageparse site through a newStreamObservabilityContextfield onStreamTransformConfig, threaded throughhandle_anthropic_streaming/handle_gemini_streaming/handle_successful_backend_responseso OpenAI-compat / Anthropic / Gemini / thinking-pattern streaming response builders all emit the counter without duplicate parsing. The router already injectsstream_options.include_usage=truefor OpenAI-compat backends, so streaming metrics work uniformly regardless of client opt-in. api_key_infois initialized once at startup frommetrics.annotation_labels; label names are frozen at registration (Prometheus does not allow renaming labels). Annotation values hot-reload through the existing config-watch path viaApiKeyStore::refresh_info_metric, called fromload_from_config,add_key, andremove_key_by_idso admin operations stay in sync.- All label values flow through
CardinalityManager/sanitize_label_value. Annotation values use a slightly less strictsanitize_annotation_valuethat preserves@,+, and:so emails and namespaced identifiers round-trip cleanly. - Persistent local metrics log backed by SQLite with configurable retention (#609, #611)
- New
MetricsStoreasync trait + bundledrusqlitev1 implementation undersrc/metrics/persistence/with WAL mode, prepared-statement cache, andPRAGMA user_versionschema versioning (store.rs/sqlite.rs/snapshot.rs/snapshot_task.rs). Histograms and summaries are fanned into row-per-sample form. - Counters and gauges are NEVER restored on startup — the persistent log is a separate read path so the live
/metricsendpoint keeps Prometheus monotonic-counter semantics. Historical samples are read through a newGET /admin/metrics/history?metric=...&from=...&to=...surface (src/admin_metrics_history.rs) that returns 404 when persistence is disabled at runtime and 503 when the feature is not compiled in. No PromQL in v1. - Hot-reload pipeline in
src/server/serve.rstranslates config changes intoPersistenceCommand::{SetSnapshotInterval, SetRetentionDays, SetCompaction}messages, atomically rebuilding the ticker and prune cutoff without dropping in-flight snapshots. - Compaction schedule honors a
minute hour * * *cron subset to avoid pulling in a full cron crate for what is effectively a daily timer. - Defaults to
enabled: true; switch off viametrics.persistence.enabled: false. Theredbandduckdbvariants are reserved keywords in the YAML schema and returnNotImplementedat startup until they get implementations. - Disk usage: measured ~119 bytes/sample on a synthetic 100-series × 10-snapshot workload (see
tests/metrics_persistence_test::disk_usage_smoke_check_under_synthetic_load). Documented formula indocs/en/persistent-metrics.mdandconfig.yaml.example.
Fixed¶
- Coerce token-usage label values to
&strinwith_label_valuesso the release build no longer fails type inference. Mixing&Stringlabel variables with a&strliteral ("prompt"/"completion") made the compiler pick&[&String]and reject the literal — regression introduced in #610.
Documentation¶
- Korean translations for the two metrics features (#612)
docs/ko/metrics.md: new### API 키별 LLM 토큰 사용량section coveringllm_tokens_total,api_key_idderivation,annotation_labelsallowlist,api_key_infoinfo-metric, PromQL examples, Grafana panel, and verification steps.docs/ko/persistent-metrics.md: new page translatingdocs/en/persistent-metrics.md(SQLite-backed snapshot semantics, configuration fields, disk-usage formula,/admin/metrics/historysurface, schema layout, operational notes).docs/ko/admin-api.md: insert## 지속 메트릭 로그 APIsection between Stats and Response Cache, plus a matching TOC entry.zensical.ko.toml: add지속 메트릭 로그nav entry under 운영 so the page is reachable from the Korean sidebar.- New
docs/en/metrics.md### Per-API-Key LLM Token Usagesection covering metric definition,api_key_idderivation rules, annotation config schema, cardinality and hot-reload semantics, example PromQL (tokens-per-email, top-10 keys, per-team rollup), a Grafana panel example, and verification steps.config.yaml.examplegains a documentedmetrics.annotation_labelsblock and anannotations:example under each API-key entry.
Tests¶
- Per-API-key token-usage unit coverage:
derive_api_key_idpriority (configured id wins; otherwise hash; otherwise anonymous), determinism, hash format^k_[0-9a-f]{12}$, annotation-label normalization, info-gauge one-time-init, refresh atomicity, cardinality bounds, email-preserving annotation sanitizer; streaming-transformer write-through; middleware annotation-snapshot exposure; integration coverage intests/metrics_integration_test.rs(4-label counter with both kinds, anonymous fallback, hash regex). (#610) - Persistent-metrics SQLite store unit coverage (insert, query by time range, retention deletion, idempotent open, unknown-kind round-trip) and integration coverage in
tests/metrics_persistence_test.rs(snapshot task lands rows in SQLite, retention prunes only old samples, retention hot-reload preserves in-flight snapshots, disk-usage smoke check). (#611)
v1.6.2 - 2026-05-10¶
Fixed¶
/v1/responsesand/v1/chat/completionsnow accept the OpenAI reasoning-APIdeveloperrole (#603, #605, #606)- Add
MessageRole::Developerwith a serde lowercase rename so"developer"deserializes as a first-class variant. The previous failure surfaced as a misleadingdid not match any variant of untagged enum ResponseInputrather than naming the unknown role; the implicit-message deserialization error now names the offending role string and lists the valid roles. - Per-backend translation: pass through as
developerfor OpenAI-compatible servers; merge into the Anthropic top-levelsystemparameter (concatenated with\n\nwhen both system and developer text are present, fixing a pre-existing overwrite bug); merge into Geminisystem_instruction; map tosystemfor Ollama (older builds rejectdeveloper). - Chat Completions → Responses converter recognizes
developeras instruction-bearing: the first occurrence becomes top-levelinstructions; subsequent occurrences remain as input items with their original role preserved on the wire. - Treat
developerandsystemequivalently in cross-cutting string-based recognition sites: prefix-cache key extraction, cross-provider fallback translation, the OpenAI-to-Anthropic transform's system-content extraction, the global-prompt injector's existing-system-message lookup, and the smart-routing classifier / LLM prompt builder.
Documentation¶
- Migrate the docs site from MkDocs to Zensical and restore brand styling (#602)
- Remove
mkdocs.ymlandmkdocs.ko.ymlin favor of nativezensical.tomlandzensical.ko.toml, both rooted under the[project]namespace per Zensical's TOML schema; per-extension options live inside[project.markdown_extensions]as a dict (Zensical's config loader ignores any separatemdx_configstable). - Replace
docs/en/sharedanddocs/ko/sharedsymlinks withrsync -a --delete docs/shared/ docs/{en,ko}/shared/invoked before each build, since Zensical does not follow symlinks for asset directories. - Register the lablup brand color via Zensical's documented
primary = "custom"mechanism plus a[data-md-color-scheme="default"][data-md-color-primary="custom"]selector indocs/shared/stylesheets/extra.cssthat defines the orange CSS variables. - Mermaid is registered as a
pymdownx.superfencescustom fence rather than relying on the now-incompatiblemermaid2plugin; favicon falls back tologo.pngwhen missing. - Restore Zensical render output for icons, diagrams, and brand color (#604)
- Re-enable
pymdownx.emojiwith thezensical.extensions.emojitwemoji index/generator (replaces the removedmaterialx) so:material-*:icon syntax stops rendering as literal text. - Reimplement the
<!-- diagram: PATH --> ... <!-- /diagram -->ASCII-replacement as a Python-Markdown extension (docs/hooks/diagram_extension.py); the prior MkDocson_page_contenthook does not run because Zensical exposes no MkDocs hook lifecycle. Adddocs/__init__.pyand prefix builds withPYTHONPATH=.so the extension is importable from Zensical's console-script entry point. - Set
--md-primary-bg-coloron the custom palette and override.md-header/.md-tabsso the orange brand band paints on top of Zensical's modern layout. - Move the
navtable above the first[project.X]sub-table in both TOMLs so it stops being silently parsed under[[project.extra.social]](alphabetical fallback was producing an unsorted top menu and wrong prev/next footer neighbors).
Tests¶
- Regression coverage for system/developer concatenation in the Anthropic transform on both streaming and non-streaming paths, plus per-backend converter mapping for the
developerrole across all five backends and the Chat Completions → Responses converter's developer-then-system ordering (#605, #606).
Dependencies¶
- Bump
redis1.2.0 → 1.2.1 (#598).
v1.6.1 - 2026-05-07¶
Fixed¶
- Claude Opus 4.7 (
claude-opus-4-7) now routes correctly through the Anthropic backend (#599, #600, #601) - Extended the adaptive thinking API gate (
uses_adaptive_thinking_api) to include 4.7-series model IDs. Claude Opus 4.7 requiresthinking.type == "adaptive"+output_config.effort; sending the legacybudget_tokensshape produces HTTP 400. - Added
model_requires_adaptive_thinkingandmodel_forbids_sampling_paramspredicates for 4.7-series request-shape rules: explicit manual thinking is normalized to adaptive thinking andtemperature,top_p, andtop_kare dropped unconditionally before forwarding. - Extended
opus_supports_max_effortto include Opus 4.7 soxhighreasoning effort maps tooutput_config.effort = "max"on Opus 4.7. - Added
claude-opus-4-7andclaude-opus-4-7-latestto the built-in supported-models list and tomodel-metadata.yaml; the speculativeclaude-sonnet-4-7entry is intentionally not advertised until Anthropic publishes it (defensive request-shape matching is retained for user-supplied configurations).
Documentation¶
- Update reasoning-effort docs (EN + KO) and
backends.mdto cover the Claude 4.7 family adaptive-thinking requirement and unconditional sampling-parameter deprecation (#600).
Tests¶
- Responses API regression coverage for Opus 4.7 adaptive thinking and unconditional sampling-parameter stripping; both transform paths (Chat Completions and Responses) for the 4.7 family with negative regression on Opus 4.6 / Sonnet 4.6 / Haiku 4.5 / Haiku 3.5 (#600, #601).
v1.6.0 - 2026-05-04¶
Added¶
- ChatGPT subscription / Codex backend authentication via OAuth device flow (#551, #592)
continuum-router auth login --backend <name>runs the OpenAI Codex three-step headless device-code flow:POST /api/accounts/deviceauth/usercodeto mint a one-timeuser_code,POST /api/accounts/deviceauth/tokenpolling, and a PKCE exchange at/oauth/token. Standards-compliant RFC 8628 device flow remains available for any future provider that implements it; the newOpenAICodexDeviceFlowClientis selected automatically forprovider: openai.- Tokens are wrapped in
SecretString, written to the configuredtoken_storewith mode0600on Unix using anO_CREAT|O_EXCLopen + atomic rename; a random tempfile suffix prevents concurrent saves from colliding, and a partial write is unlinked on failure so secret material does not linger on disk. - Access-token expiry is parsed from the JWT
expclaim (with a 1-hour fallback for non-JWT tokens) and clamped to a useful minimum so a degenerateexpires_infrom the provider cannot trigger a refresh storm. - Proactive refresh fires 60 s before expiry, single-flighted with a
tokio::sync::Mutex. A401from the upstream backend triggers exactly one forced refresh and a single retry; the previous refresh token is preserved race-free when the provider omitsrefresh_tokenfrom a refresh response. - The strategy reports an
identity_fingerprint()(backend name,client_id,token_store) so that hot-reload rebuilds the strategy when any of those rotate, instead of silently keeping the prior in-memory state. - The CLI strips C0/C1 control characters from
verification_uri_completeanduser_codebefore printing, so a hostile provider response cannot inject ANSI escapes that rewrite the terminal. - Every device-flow and runtime request to
auth.openai.com/chatgpt.com/backend-api/codexcarriesoriginator: codex_cli_rs(configurable viaauth.oauth.originator) and acodex_cli_rs/<version>User-Agent(configurable viaauth.oauth.user_agent), matching the official Codex CLI so Cloudflare admits the traffic instead of returning a 403 JS challenge. auth.type: oauthis accepted in YAML alongside the legacyo_authsnake_case rendering.client_idandscopedefault to the public Codex CLI values; onlytoken_storeis required for the ChatGPT-subscription case.- Anthropic Messages and Chat Completions surfaces both transparently route to the ChatGPT Codex backend (#592)
- Any backend whose
auth.typeisoauthand whose provider uses the Codex flow (currentlyopenai) is forced through the Responses API for every request, regardless of per-modelresponses_onlymetadata.chatgpt.com/backend-api/codexexposes/responsesonly — no/chat/completions— so chat-shaped models (e.g.gpt-5.5, alias-mappedclaude-haiku-4-5) and unknown model IDs all dispatch through/v1/responses→…/backend-api/codex/responses. Non-OAuth OpenAI backends continue to honor the per-modelresponses_onlyflag. - New
core::url_utils::compose_backend_urlcentralizes backend URL composition for the three OpenAI-compatible roots (/v1,/openai,/backend-api/codex). Replaces ad-hocends_with("/v1") || ends_with("/openai")checks acrossproxy/backend.rs,http/handlers/responses.rs,http/streaming/handler.rs,services/responses/stream_service.rs, and the Anthropic handler so the/backend-api/codexrule applies uniformly. - The proxy hot path (
proxy/backend.rs,proxy/responses_only.rs,proxy/image_gen.rs,proxy/image_edit.rs) now flows through a backend-name-keyedAuthStrategyRegistryexposed onAppStateviasrc/proxy/oauth_helper.rs. The helper looks up the strategy, callsrefresh_if_needed()before sending, replaces the static-bearer header with one derived from the strategy, and force-refreshes + retries once on a 401. Staticapi_keyauth continues to work unchanged when no strategy is registered. - The Anthropic-compatible handler (
src/http/handlers/anthropic/handler.rs) consults the same registry. Client-suppliedAuthorization: sk-ant-…andx-api-keyheaders are dropped when the backend has an OAuth strategy, instead of being forwarded to OpenAI as the bearer. - Model fetcher detects OAuth-authed backends and falls back to the configured
modelslist rather than probing/v1/models, sincechatgpt.com/backend-api/codexdoes not expose a models endpoint. - Codex-compatible Responses API extensions (#536, #537)
POST /v1/responses/compactendpoint for context compaction — passthrough to OpenAI / Azure OpenAI native/v1/responses/compact; other backend types return501.storefield onResponsesRequest(defaults totrue) controls upstream session persistence; Codex sendsstore: falsefor ephemeral requests.output_textcontent part type alongsideinput_textso converters can differentiate assistant vs. user content in input items. All converters (OpenAI, Anthropic, Gemini) handle the new variant.
Documentation¶
- Sync Codex / Responses-API extensions across the root
CHANGELOG.mdand the Korean docs (ko/configuration/backends.md,ko/configuration/advanced.md,ko/api.md,ko/architecture.md); resolve all zensical build warnings on both EN and KO builds and preserve unicode in toc anchor slugs viapymdownx.slugs.slugify(#596). - Clean up AI-slop patterns across English and Korean mkdocs sources — replace em dashes in prose, remove filler/slop words, rewrite trailing participial clauses and inflated verbs, collapse colon+bullet AI-style intros, and replace closing summary slop with concrete next-action links (#597).
CI/CD¶
- Bump
apple-actions/import-codesign-certsfrom 6 to 7 (#590).
Dependencies¶
- Bump
tokio1.51.0 → 1.52.1,axum0.8.8 → 0.8.9,reqwest0.13.2 → 0.13.3,clap4.6.0 → 4.6.1,fastrand2.4.0 → 2.4.1,uuid1.23.0 → 1.23.1,rand0.10.0 → 0.10.1, andlru0.16.3 → 0.16.4 (#595).
v1.5.6 - 2026-04-29¶
Fixed¶
/v1/chat/completionsreturned HTTP 502responses_parse_failedforresponses_onlyreasoning models (gpt-5.4-pro, gpt-5.5-pro). OpenAI's/v1/responsespayload for these models contains output items shaped like{ "id": "rs_...", "type": "reasoning", "summary": [] }, butOutputItem::Reasoningrequiredcontentandstatus, so serde rejected the payload withmissing field 'content'. The Anthropic Messages surface bypassed the strict variant on a different conversion path, masking the bug until directly tested.contentandstatusare now optional onOutputItem::Reasoning; reasoning items are dropped before reaching Chat Completions clients (per existing project policy), so body shape is irrelevant beyond successful deserialization. (#594)
Changed¶
- Realign
gemini-3.1-pro-previewas the canonical metadata id for the Gemini 3.1 Pro family inmodel-metadata.yaml, withgemini-3.1-pro(and existing-latest/-customtoolsforms) demoted to aliases. Matches whatgenerativelanguage.googleapis.comactually serves today — the canonicalgemini-3.1-proform returns 404 from upstream — and avoids implying GA availability that does not exist yet. The metadata cache still resolves both forms to the same entry. Note: alias-to-canonical rewriting on the upstream-bound payload is out of scope for this release; clients calling with thegemini-3.1-proalias will still hit upstream 404 until that work lands. (#594) - Sample
config.yamlregisters the newly-available pro / 5.5 family models so theresponses_onlydispatch path can be exercised end-to-end against real upstreams (gpt-5.4-pro,gpt-5.2-pro,gpt-5.5,gpt-5.5-pro,claude-opus-4-7,gemini-3.1-pro,gemini-3.1-pro-preview); duplicateclaude-haiku-4-5entry removed.
v1.5.5 - 2026-04-27¶
Added¶
- Transparent Responses-API routing for OpenAI Pro models (epic #581)
- New
responses_only: truecapability flag inmodel-metadata.yamland the built-in OpenAI registry marksgpt-5.2-pro,gpt-5.4-pro, andgpt-5.5-proas served only on/v1/responsesupstream (#574, #582) /v1/chat/completionsrequests forresponses_onlymodels are dispatched to the upstream/v1/responsesendpoint and translated back into a strict-modechat.completion(orchat.completion.chunkfor streaming) envelope, transparent to the client. Streamusageis gated bystream_options.include_usage, and per-model latency / success counters are recorded for the responses_only path (#578, #584)/anthropic/v1/messagesrequests forresponses_onlymodels are converted to the Responses API shape, dispatched to/v1/responses, and translated back into Anthropic Messages JSON (or the Anthropic SSE event sequence for streaming) — tool-call round-trips, web-search emulation, and Unix-socket transports all branch on the flag (#575, #577, #583, #585, #586)- Anthropic Messages <-> Responses request transformer covers system → instructions, tools, tool_choice (including
disable_parallel_tool_use→parallel_tool_calls: false),max_tokens→max_output_tokens, reasoning effort derivation, and multi-turn tool round-trips; the response transformer preserves thinking/text/tool_use ordering and stop-reason fidelity (#575, #583) - SSE streaming bridge (
AnthropicResponsesStreamTranslator) maps Responses API events to Anthropic Messages events while preserving Anthropic's strict event-ordering invariants (singlemessage_start, pairedcontent_block_start/content_block_stop, terminalmessage_stop); handles mid-streamerror/response.failed/response.cancelled,response.incomplete→stop_reason: max_tokens, deferred input tokens, and graceful early-close synthesis (#576, #585) - Only OpenAI and Azure OpenAI backends serve
/v1/responses; pairing aresponses_onlymodel with another backend type produces a400 invalid_request_errorbefore any upstream call (rejection fires on both/v1/chat/completionsand/anthropic/v1/messagessurfaces) (#577, #589) - The first dispatch per
(backend, model)pair logs atinfolevel so operators can confirm Responses-API routing without enabling debug logs - Anthropic Messages → Responses requests explicitly send
store: falseto avoid upstream side-effects (#589) - 22 deterministic, in-process integration tests covering the {Anthropic, Chat} × {gpt-5.4-pro, gpt-5.2-pro} × {non-streaming, streaming} × {plain, tool-call, reasoning} matrix, mid-stream backend-failure negatives on both surfaces, and an upstream byte-fragmentation regression guard (#579, #588)
- Documented in
docs/en/configuration/advanced.md(Responses-API-only Models section split into Models-marked-out-of-the-box, Marking-a-new-model, Dispatch-behavior, and Backend-type-constraint subsections),docs/en/architecture.md(Responses-API Routing data-flow diagram), and thedocs/en/api.mdChat Completions and Anthropic Messages surface notes with a Transparent-Responses-API-routing subsection (#580, #587)
Fixed¶
- Chat Completions responses-only routing now rejects incompatible-only backend configs before upstream dispatch and chooses a compatible OpenAI/Azure Responses backend when available (#589)
- Chat assistant
tool_calls[]are preserved as Responsesfunction_callinput items for stateless tool-result turns over/v1/chat/completions(#589)
v1.5.4 - 2026-04-25¶
Changed¶
- Refresh
model-metadata.yamlfor late-April 2026 frontier model releases (#572, #573) - Add GPT-5.5 ($5/$30 per 1M, 1M context, knowledge cutoff 2025-12, omnimodal, leads Terminal-Bench 2.0 at 82.7%) and GPT-5.5 Pro ($30/$180 per 1M, Responses API only, deep reasoning) — released 2026-04-23
- Add DeepSeek V4 Pro (1.6T total / 49B active MoE, 1M context, 384K max output, three reasoning effort modes) and DeepSeek V4 Flash (284B total / 13B active MoE, 1M context, 384K max output) with
deepseek-chatanddeepseek-reasonerretained as deprecated aliases per official API docs — released 2026-04-24 - Add
gpt-image-2(token-billed instead of per-image: text $5/$30, image $8/$30 per 1M tokens; 1K/2K/4K resolution tiers; ~99% text accuracy in any language; built-in reasoning before generation; context-aware multi-turn editing;gpt-image-2-latestalias) — released 2026-04-21 - Add Claude Opus 4.7 ($5/$25 per 1M, 1M context, 128K max output, knowledge cutoff 2026-01, high-resolution image support up to 2576px / 3.75MP, new tokenizer with ~1.0–1.35× token usage vs prior models, new
xhigheffort level) — released 2026-04-16 - Promote Gemini 3.1 series from preview to GA, retaining
-previewsuffix as alias for fallback compatibility (#573) gemini-3.1-pro-preview→gemini-3.1-pro(withgemini-3.1-pro-preview,gemini-3.1-pro-preview-customtools, andgemini-3.1-pro-latestaliases)gemini-3.1-flash-image-preview→gemini-3.1-flash-image(withgemini-3.1-flash-image-preview,nano-banana-2, andgemini-3.1-flash-image-latestaliases)gemini-3.1-flash-lite-preview→gemini-3.1-flash-lite(withgemini-3.1-flash-lite-previewandgemini-3.1-flash-lite-latestaliases)- Updated
gemini-3-flash-previewdeprecation note to point to the new GAgemini-3.1-proid
v1.5.3 - 2026-04-23¶
Added¶
- HuggingFace repo-prefix stripping as a new matching phase (phase 5) in
src/models/pattern_matching.rs(#555) try_strip_hf_repo_prefix()validates avendor/repo(ororg/team/repo) prefix against aMAX_PREFIX_SEGMENTS = 3bound, rejects empty segments (/repo,vendor/,vendor//repo), and rejects any ASCII whitespace before returning the residual- Phase 5 re-enters phases 1-4 on the stripped residual with a structurally-enforced recursion depth of exactly 1 (the re-entry call clears the
allow_prefix_stripgate), so prefix stripping composes with the existing layered suffix peel in a single lookup — the motivating caseunsloth/Qwen3.6-35B-A3B-GGUFnow resolves toqwen3.6-35b-a3bwithout any hand-registered alias - Phase 5 runs before the wildcard phase; the blast-radius audit confirmed no
*-bearing alias inmodel-metadata.yamlcontains/, so the ordering change is behavior-neutral for existing routing - Phase numbering in tracing output realigned to match the documented phase chain (previous code emitted
phase = 7for the namespace fallback while comments called it phase 6) - 12 new unit tests covering standard HF form, composition with suffix peel, case-sensitive vendor, registered-alias precedence, unresolvable residual, three-segment form, segment-cap rejection, no-slash input, whitespace rejection, empty segments, re-entry bounding, and alias-phase precedence
- 9 new integration tests in
tests/format_suffix_normalization_test.rsexercising the fullRouterConfig/BackendConfigpublic API through phase 5 - Pipeline doc updated in
docs/en/configuration/advanced.md(and Korean counterpart) with a new "HuggingFace repo-prefix stripping (phase 5)" section covering the composition semantics, security bounds, and out-of-scope list (hyphen prefixes, HF API discovery)
Changed¶
- Replaced the previous phase-6 namespace fallback with the new phase-5 HuggingFace prefix-strip layer. The previous phase was case-sensitive and did not compose with suffix peel; the new phase applies stricter input validation (segment cap, empty-segment rejection, whitespace rejection) but composes with phase 4's case-insensitive peel through the bounded re-entry. Pathological inputs above
MAX_PREFIX_SEGMENTS(3) — such asprovider/deep/nested/model— are now rejected by phase 5 rather than silently matched via recursiversplit_oncefallback (#555) - Aliases currently classified as
vendor-prefixin the #560 audit (e.g.,Qwen/Qwen3.6-35B-A3B,MiniMaxAI/MiniMax-M2.5) are now peel-coverable-adjacent post-#555: phase 2 still wins on the explicit alias, but phase 5 + phase 4 together reach the same metadata. Retroactive removal is deferred to a follow-up audit per #555 design section 7
Fixed¶
POST /anthropic/v1/messagesnow works when the selected backend is configured with aunix://URL (#567)- Native Anthropic backends and OpenAI-compatible backends both work over Unix sockets, for both non-streaming and streaming requests
- Socket paths containing spaces (e.g. macOS
~/Library/Application Support/...) are handled correctly - Auth header selection (
x-api-keyfor Anthropic backends,Authorization: Bearerfor OpenAI-compatible backends) is correct on the Unix socket path anthropic-versionheader is added automatically for Anthropic backends on the Unix socket path, matching the HTTP path behavior
v1.5.2 - 2026-04-21¶
Added¶
- Regression tests locking down the transport-layer passthrough contract for llama.cpp and MLxcel backends (#562)
- New
tests/llamacpp_passthrough_test.rsandtests/mlxcel_passthrough_test.rscovering all four passthrough call sites: direct backendexecute_chat_completion, factory-built backend (BackendFactory -> LlamaCppBackend),proxy/backend.rsHTTP handler, and the streaming handler - New
test_mlxcel_factory_backend_passthrough_nonstandard_fieldsasserts thatBackendFactory -> LlamaCppBackend::execute_chat_completionpreserves non-standard fields byte-for-byte at transport time - Anthropic input test (
tests/anthropic_input_test.rs) extended with explicit passthrough coverage docs/en/architecture/backend-passthrough.mdand its Korean counterpartdocs/ko/architecture/backend-passthrough.mddocumenting the passthrough contract, the four guarded call sites, and the list of router-side transforms that run before transport (global_prompts,transform_payload_for_openaifor o1/o3/gpt-5*,web_searchinjection) (#562, #563)docs/reports/alias-audit-2026-04.mdclassifying every alias inmodel-metadata.yamlinto peel-redundant, peel-redundant-but-kept, and peel-independent categories, with an "aliases vs peel" policy section added todocs/en/configuration/advanced.md(and the Korean counterpart) explaining when to prefer each mechanism (#560)
Changed¶
- Narrowed the passthrough contract from an implied "byte-equivalent" global guarantee to a transport-layer scope — the router may still run
global_promptsinjection, o1/o3/gpt-5* payload transforms, andweb_searchtool injection before transport, but no provider-specific rewriting happens at the transport boundary (#563) - Comment-only clarifications in
src/http/streaming/handler.rs,src/infrastructure/backends/factory/backend_factory.rs,src/infrastructure/backends/llamacpp/backend.rs, andsrc/proxy/backend.rs - Audited
model-metadata.yamlaliases for peel-normalization redundancy: removed aliases that differ from the canonical ID only by suffixes already handled by the layered peel (-4bit,-q4_k_m,-fp8,-gguf,-mlx,-awq, etc.), while preserving aliases that encode canonical flavor variants (-qat,-instruct) or disambiguate parameter counts (#557) - New
tests/alias_audit_helper.rsandtests/format_suffix_normalization_test.rsenforce the peel-vs-alias boundary going forward
CI¶
- Target Ubuntu 26.04 LTS (Resolute) instead of 25.10 (Questing) in the Debian build workflow
- Fall back to
createdAtwhen releasepublishedAtis null indebian/update-changelog.shto prevent changelog regression when the latest release is still in draft
v1.5.1 - 2026-04-20¶
Added¶
- Built-in
web_searchtool for self-hosted LLM backends (#553) - Router-level tool transparently injected into chat completion requests for vLLM, Ollama, llama.cpp, MLxcel, LM Studio, Continuum Router, and Generic backends
- Pluggable
SearchProvidertrait undersrc/services/search/withSerperProviderimplementation; Exa and Brave scaffolded behind the same trait - Configurable
inject_policy(auto/always/never) with per-backend overrides; commercial backends (OpenAI, Azure, Gemini, Anthropic) left untouched so their nativeweb_searchcontinues to flow through unchanged - Bounded non-streaming tool-execution loop parses
web_searchtool calls, executes the provider, appends tool-role results, and re-invokes the backend up tomax_tool_iterationsrounds - New
BackendTypeConfig::is_self_hosted/is_commercialhelpers covered by unit tests enforcing the commercial/self-hosted partition invariant - API keys redacted in Debug output and never logged; hot-reload friendly
WebSearchConfigwith${ENV}substitution - Prometheus counters for tool calls, injections, and iteration-cap hits under
src/metrics/web_search - Layered quantization and format suffix normalization for model metadata lookup (#549)
- New
layered_format_strip()insrc/models/pattern_matching.rsiteratively peels allowlisted quantization/format/flavor tokens from the right side of a model ID, retrying exact-id/alias/date-suffix matches after each peel - Token categories:
BIT_WIDTH,GGUF_QUANT,FP_FORMAT,INT_FORMAT,LIBRARY,IMATRIX,UNSLOTH,CONTAINER,FLAVOR(all case-insensitive) - Parameter-count suffixes preserved:
-Nbitstripped as quantization;-Nb,-aNb,-eNb,-0.6bkept as parameter counts - Canonical base IDs ending in allowlisted flavors (e.g.
gemma-3-12b-qat) win via exact-id match before peel runs - Normalization pipeline wired into
find_matching_config,BackendConfig::get_model_metadata,RouterConfig::get_model_metadata,RouterConfig::get_thinking_pattern_config,resolve_model_tier(routing), andget_model_profile(admin) - Model metadata for GLM 5.1, Qwen 3.6, and MiniMax M2.7 (#548)
- Teams release notification posted to Microsoft Teams via Power Automate webhook after build and Docker jobs
Changed¶
- Migrate documentation toolchain from MkDocs + Material for MkDocs to Zensical — reads
mkdocs.ymlnatively and bundles required extensions
Fixed¶
- Security: Cap layered peel phase with
MAX_MODEL_ID_LEN=256andMAX_PEEL_ITERATIONS=8to eliminate DoS via pathological model IDs (previously O(n²) allocation on inputs like-4bit-4bit-4bit-...) - Security: Enforce 256-char model field length at
/v1/chat/completions,/v1/completions,/v1/embeddings, and/v1/embeddings/sparse(parity with existing/v1/responsescheck) - Consolidate 7-phase metadata matching pipeline into a single implementation (
find_matching_config_slice) with thin adapters at each call site, eliminating drift betweenBackendConfig,Config::get_model_metadata,Config::get_thinking_pattern_config, andfind_matching_config - Replace
cfg.to_ascii_lowercase() == peelwithstr::eq_ignore_ascii_caseon the hot path (~4000 fewer per-request String allocations) - Pin Pygments <2.20 to fix MkDocs build failure (superseded by Zensical migration)
CI¶
- Bump softprops/action-gh-release from 2 to 3 (#544)
- Bump actions/github-script from 8 to 9 (#545)
- Bump actions/upload-pages-artifact from 4 to 5 (#554)
Documentation¶
- Document suffix-order ambiguity (
-qat-4bitvs-4bit-qat) and internal peel phase bounds indocs/en/configuration/advanced.md - Add
pattern_matching.rsto Model Aggregation Service module listing indocs/en/architecture.mdwith cross-reference to suffix normalization section - New
docs/en/web-search.mdfeature documentation;config.yaml.exampleextended withweb_searchsection
v1.5.0 - 2026-04-11¶
Added¶
- Smart routing system with model tier & capability profile registry (#525, #531)
- Rule-based request classifier & smart routing policy engine (#526, #532)
- Load-aware dynamic tier adjustment (#527, #533)
- LLM-based request classifier with hybrid mode (#528, #534)
- Smart routing observability, admin API & documentation (#529, #535)
- Codex-compatible Responses API extensions (#536, #537)
Changed¶
- Upgrade core dependencies — axum 0.8, sha2 0.11, rand 0.10 (#523)
- Add Gemma 4 model family metadata (#538)
Fixed¶
- Complete smart routing integration gaps
- Increase DefaultTransformer PDF size limit from 20MB to 32MB (#542)
CI¶
- Bump actions/deploy-pages from 4 to 5 (#521)
Dependencies¶
- Bump the minor-and-patch dependency group with 4 updates (#539)
Documentation¶
- Add Codex-compatible Responses API gap analysis report
v1.4.5 - 2026-03-27¶
Fixed¶
- Return 400 error when file references are used without file service configured (#519)
Changed¶
- Add GLM-5-Turbo model metadata (#516)
Documentation¶
- Fix Korean anti-AI-slop violations in ko/ documentation
- Fix slop word and transition word in api.md
v1.4.4 - 2026-03-18¶
Fixed¶
- Fix Anthropic thinking failing for high/xhigh reasoning effort —
budget_tokens(32768) exceeded defaultmax_tokens(16384), causing API rejection (#514) - Auto-adjust
max_tokenstobudget_tokens + 4096when thinking is enabled and budget exceeds max
Changed¶
- Add GPT-5.4 model family:
gpt-5.4,gpt-5.4-pro,gpt-5.4-mini,gpt-5.4-nanowith 1M context window (#515) - Update Gemini 3 series: add
gemini-3.1-pro-preview,gemini-3-flash-preview,gemini-3.1-flash-lite-preview; markgemini-3-pro-previewas deprecated - Recognize Gemini 3 Flash and 3.1 Flash-Lite as thinking models for
include_thoughtsauto-injection - Update Claude 4.6 models: context window to 1M (GA), fix Sonnet 4.6 max_output to 64K, correct knowledge cutoffs
- Update config examples and documentation with latest model names across 8 files
v1.4.3 - 2026-03-18¶
Fixed¶
- Fix Gemini thinking models (2.5 Pro, 3 Pro, etc.) not returning
reasoning_contentin streaming responses through the router (#513) - Replaced
transform_payload_for_gemini()withtransform_request_gemini()across all three Gemini streaming paths to ensureinclude_thoughts: trueauto-injection
v1.4.2 - 2026-03-17¶
Changed¶
- Change mid-stream fallback default to enabled for improved streaming reliability (#504)
- Breaking: Mid-stream fallback is now enabled by default; set
mid_stream_fallback.enabled: falseto restore previous behavior
Documentation¶
- Add failover latency tuning guide for optimizing fallback behavior
v1.4.1 - 2026-03-17¶
Added¶
- Mid-stream fallback for streaming inference (#497) — when a backend fails mid-stream during SSE streaming, the router transparently retries with a fallback backend
Changed¶
- Decouple pre-stream fallback from mid-stream fallback (#500) — each can now be independently enabled/disabled
- Bump dependency versions to latest major releases
Fixed¶
- Fix streaming config changes not detected in hot reload system (#503)
- Fix mid-stream connection errors leaking to client during fallback (#502)
- Remove unused config crate dependency
CI¶
v1.4.0 - 2026-03-14¶
Added¶
- Prefix-aware routing: PrefixAwareHash selection strategy with Consistent Hash with Bounded Loads (CHWBL) (#455, #457, #461)
- Response caching: SHA256-based cache key computation with streaming response buffering and post-completion caching (#456, #459, #462)
- Multi-tier CacheStore: in-memory backend (#466), Redis/Valkey backend with connection pooling (#467), and S3-backed tiered L1/L2 cache (#483)
- KV cache index: shared data structure (#470), KV event consumer for vLLM backend streams (#471), prefix overlap scoring integrated into backend selection (#473), configuration/metrics/admin endpoints (#474)
- Tiered KV cache with storage-tier awareness (GPU hot / external warm) (#484)
- Disaggregated prefill/decode orchestration with external KV tensor transfer (#485)
- Anthropic cache_control breakpoint auto-injection (#460)
- Multimodal embedding support for Gemini Embedding 2 (#492)
- Shared cache configuration and operational metrics (#468)
- 30 new models added to model-metadata.yaml (#472)
Changed¶
- Rename VAST-specific identifiers to generic S3/external storage names (#490) — update configuration files if using VAST-specific field names
Fixed¶
- Make RequestExecutor transport-aware for Unix socket paths with spaces (#488)
- Replace relative source tree links with GitHub URLs in docs
CI¶
- Bump docker/setup-qemu-action from 3 to 4 (#428)
- Bump docker/metadata-action from 5 to 6 (#426)
- Bump docker/setup-buildx-action from 3 to 4 (#429)
- Bump docker/build-push-action from 6 to 7 (#430)
- Bump docker/login-action from 3 to 4 (#427)
Documentation¶
- Comprehensive KV cache feature documentation, benchmarks, and config examples (#477)
- VAST Data connection guide and integration examples (#486)
- Sync Korean documentation with English counterparts
- Split monolithic configuration.md into 6 smaller files
v1.3.0 - 2026-03-12¶
Added¶
- Agent Communication Protocol (ACP) support with JSON-RPC 2.0 protocol layer and stdio transport (#414, #420)
- ACP session management with protocol lifecycle, initialize/shutdown handshake (#415, #421)
- ACP-to-LLM inference pipeline with streaming support (#416, #422)
- ACP tool call reporting and permission delegation (#417, #423)
- MCP-over-ACP bridge for MCP server tunneling (#418, #424)
- ACP agent registry with metadata and configuration support (#419, #425)
- ACP integration tests for protocol lifecycle and session management
Fixed¶
- Resolve clippy
field_reassign_with_defaultwarnings in ACP integration tests
CI¶
- Bump actions/upload-artifact from 6 to 7 (#398)
Documentation¶
- ACP architecture documentation with MkDocs integration
- ACP practical usage guide with IDE integration examples
- KV cache integration plan for router-level caching strategies
v1.2.1 - 2026-03-07¶
Added¶
- MLxcel backend type support for MLX-based model serving (#412, #413) — fully API-compatible with llama-server, reusing the same backend implementation for health checks, model discovery, and proxying
v1.2.0 - 2026-03-06¶
Added¶
- Admin Statistics API with comprehensive request-level statistics collection and reporting (#409)
- Endpoints:
GET /admin/stats,GET /admin/stats/models,GET /admin/stats/backends,POST /admin/stats/reset - Time-windowed queries, token usage tracking, latency percentiles (p50, p95, p99)
- Statistics persistence with configurable snapshot path, interval, and staleness checks (#410, #411)
- Atomic writes, restore on startup, final snapshot on graceful shutdown
Documentation¶
- Add admin stats and persistence to configuration guide
- Add post-refactoring benchmark report for v1.1.0 (#407)
v1.1.1 - 2026-03-04¶
Added¶
- Embeddable library crate (Phase 1) — use continuum-router as a Rust dependency (#394)
- Type-safe config builders for programmatic library usage (#400)
- Cargo feature flags for optional library dependencies (#399)
- Persistent storage for runtime API keys (#405)
- New LLM model metadata entries (#403)
Fixed¶
- Fix Gemini-specific transforms incorrectly applied in Anthropic handler (#404)
v1.1.0 - 2026-03-01¶
Added¶
- Embedded WebUI for configuration management and API key administration (#388)
- Windows AF_UNIX socket support via socket2 crate (#390)
- Nano Banana 2 (Gemini Image Generation) support
Fixed¶
- Resolve compilation error in
ClientAddr::is_unixfor tuple variant matching - Resolve Windows AF_UNIX socket accept failure and config validation
- Accept Windows absolute paths in Unix socket config validation (#393)
- Resolve Windows compilation errors in Unix socket tests and transport parsing (#392)
v1.0.0 - 2026-02-19¶
Added¶
- Continuum Router federation — router-to-router chaining as a new backend type (#385)
- LM Studio as a dedicated backend type (#381)
- Anthropic adaptive thinking effort parameter (
output_config.effort) (#384) - Adaptive thinking and auto reasoning effort level across backends (#378)
- Cohere/Jina-compatible rerank and sparse embedding endpoints (#374)
- BGE-M3 and multilingual embedding model support (#373)
- Claude Opus 4.6 model metadata
- Qwen3-Coder-Next, Qwen3-VL-30B/8B model metadata
Changed¶
- Handle SIGTERM for graceful shutdown on Unix systems (#370)
- Reduce per-backend filter and model metadata log verbosity during model refresh (#371, #375)
CI¶
- Replace Ubuntu 24.10 with 25.10 in deb build matrix (#376)
v0.36.1 - 2026-01-30¶
Fixed¶
- Trigger immediate health check after sync_backends during hot reload (#368) — new backends now available within 1-2 seconds instead of up to 30 seconds
- Sync health_check_info and use URL-based updates during hot reload (#369) — new backends properly receive API key authentication
- Accelerate health checks for recently added backends — 1-second check interval for 5 minutes after addition
- Trigger model cache refresh when backends transition to healthy state with 5-second debounce
v0.36.0 - 2026-01-27¶
Added¶
- Native Anthropic Messages API handler with endpoint routing (#355)
- Anthropic to OpenAI request/response transformation (#356, #357)
- Anthropic streaming response format (#358)
- Direct Anthropic to Gemini request/response transformation (#359)
- File_id source type and file resolution for Anthropic input (#360)
- Claude Code compatibility for Anthropic handler (#365)
- Tiered token counting for all backend types
- Parallel file reference resolution for improved performance
- Anthropic-version header format validation
Fixed¶
- Require HTTPS for image and document URLs to prevent SSRF
- Return generic error messages to clients instead of backend details
- Use authenticated user_id from API key for file ownership checks
- Use UUID v4 for secure message/tool ID generation
- Place tool messages before user text in Anthropic-to-OpenAI conversion
- Override stop_reason to tool_use when tool_use blocks are present
- Apply max_completion_tokens conversion for OpenAI-routed Anthropic requests
- Propagate file access denied and not found errors to client
- Call current_config() once per request for consistent behavior
Refactored¶
- Extract common SSE event type and data extraction logic
- Add parse_bytes method to SseParser for proper UTF-8 handling
- Remove unnecessary Arc wrapper in AnthropicFileResolver
- Box FileResolutionResult::Resolved to reduce enum size
v0.35.0 - 2026-01-23¶
Added¶
- Gemini 3 thoughtSignature support in function calling (#354)
- PDF support for OpenAI and Anthropic file transformers (#340)
- Text/plain support for AnthropicFileTransformer (#342)
Fixed¶
- Add PDF support to DefaultTransformer and file resolution (#343)
- Add tool message transformation to non-streaming Anthropic requests (#344)
- Reject non-image files in DefaultTransformer with clear error message (#338)
- Fix AI SDK incompatibility with Responses API streaming format (#335)
v0.34.0 - 2026-01-16¶
Added¶
- Automatic quality parameter conversion between DALL-E and GPT Image models (#330)
Changed¶
- Native Anthropic conversion for Responses API PDF file uploads (#332)
Fixed¶
- Gemini streaming tool_calls compatibility fixes (#333) — missing index field, tool_choice format preservation, unnecessary transformation removal
v0.33.0 - 2026-01-13¶
Added¶
/v1/embeddingsendpoint for embedding API support (#319)- Resolve local file_id references in Responses API requests (#326)
user_dataandevalspurpose values for Files API (#322)
Fixed¶
- Use flat tool format for Responses API function tools (#324)
- Improve Unix socket test stability for parallel execution (#328)
v0.32.0 - 2026-01-09¶
Added¶
- Reasoning effort documentation and improved xhigh fallback logging (#317)
Fixed¶
- Support implicit message type inference in Responses API InputItem (#316)
Refactored¶
- Optimize InputItem deserializer and add invalid role test
v0.31.5 - 2026-01-09¶
Added¶
- Responses API pass-through support for native OpenAI backends (#313) — smart routing based on backend type with direct forwarding to
/v1/responsesendpoint - OpenAI Responses API file input types (#311) — support for
input_text,input_file,input_imagecontent parts with SSRF validation
Fixed¶
- Forward raw backend error responses in pass-through mode
- Address security and performance issues in Responses API pass-through
v0.31.4 - 2026-01-07¶
Fixed¶
- Use current_config() for hot reload support in proxy handlers (#310) — API key and configuration changes via hot reload now properly apply to new requests
v0.31.3 - 2026-01-06¶
Fixed¶
- Add Anthropic transformations to Unix socket transport (#308) — Unix socket transport now applies the same request/response transformations as HTTP transport
- Preserve stream parameter for non-streaming Anthropic requests (#306)
v0.31.2 - 2026-01-05¶
Added¶
- Non-streaming support for Anthropic backend requests
- Tool call and tool result transformation for Anthropic backend — enables multi-turn tool use conversations
v0.31.1 - 2026-01-04¶
Fixed¶
- Non-streaming Anthropic requests failing with wrong authentication header (#301) — now correctly uses
x-api-keyheader instead ofAuthorization: Bearer
v0.31.0 - 2026-01-04¶
Added¶
- Unix socket server binding alongside TCP (#298) — supports
unix:URI scheme, socket_mode configuration, auto-cleanup - Reasoning parameter support for Responses API (#296) with nested format and low/medium/high/xhigh effort levels
- xhigh reasoning effort support for GPT-5.2 thinking models with auto-downgrade for unsupported models
- Configurable health check endpoints per backend type (#293) — custom endpoint, fallback endpoints, method, body, accept_status, and headers
Changed¶
- Comprehensive reasoning parameter normalization across backends (#294)
v0.30.0 - 2026-01-01¶
Added¶
- Wildcard patterns and date suffix handling in model aliases (#286) — automatic date suffix normalization,
*pattern matching (prefix, suffix, infix), zero-config date handling
Fixed¶
- Apply default URL for Anthropic backend when not specified (#288)
- Replace owned_by placeholders with backend-type-specific values (#287)
Documentation¶
- Translate wildcard pattern and date suffix handling documentation to Korean (#289)
v0.29.0 - 2026-01-01¶
Added¶
- Accelerated health checks during backend warmup (#282) — 1s interval on HTTP 503, configurable via
warmup_check_intervalandmax_warmup_duration --model-metadataCLI option for specifying model metadata file path at runtime (#281)
Fixed¶
- Replace OpenAI owned_by placeholder with 'openai' (#280)
- Prevent race condition in Admin API concurrent backend creation (#278)
- Fix missing processing steps in hot reload (#277)
- Cloud backends now show
available: truein/v1/models/{model_id}(#272)
v0.28.0 - 2025-12-31¶
Added¶
- SSE streaming support for tool calls (#258)
- llama.cpp tool calling auto-detection via
/propsendpoint (#263) - Extended
/v1/models/{model_id}endpoint with rich metadata fields (#262) - Tool result message transformation for multi-turn conversations (#265)
- Backend-specific owned_by placeholders for llamacpp, vllm, ollama, http (#267)
Changed¶
- Improved
--helpoutput formatting with title header and project attribution (#269)
Fixed¶
- Sync model metadata cache with ConfigManager (#270)
v0.27.0 - 2025-12-29¶
Added¶
- Complete Unix socket support for model discovery and SSE streaming (#248, #252, #253, #254, #256)
- SSE/streaming for Unix socket backends
- Backend type auto-detection for Unix sockets
- vLLM and llama.cpp model discovery via Unix sockets
- Tool call transformation across all backends (#244, #245, #246) — tool definitions, tool_choice, and tool call responses for Anthropic, Gemini, and llama.cpp
v0.26.0 - 2025-12-27¶
Added¶
- GET
/v1/models/{model}endpoint for single model retrieval with real-time availability status (#236)
v0.25.0 - 2025-12-26¶
Added¶
- CORS (Cross-Origin Resource Sharing) support (#234) — configurable origins, wildcard patterns, custom schemes (e.g.,
tauri://localhost), preflight cache - Unix Domain Socket backend support (#232) —
unix:///path/to/socketscheme, lower latency than localhost TCP
v0.24.0 - 2025-12-26¶
Added¶
- llama.cpp backend support for local LLM inference (#230)
- Allow router to start without any backends configured (#226)
Changed¶
- Enable hot reload for backend additions/removals from config (#229)
v0.23.1 - 2025-12-25¶
CI¶
- Add Windows x86_64 build target to release workflow (#224)
v0.23.0 - 2025-12-23¶
Added¶
- GLM 4.7 model support with thinking capabilities (#222)
- GCP Service Account authentication support for Gemini (#208)
- Distributed tracing with correlation ID propagation (#207) — W3C Trace Context with traceparent header
- Thinking pattern metadata for models with implicit start tags (#218)
- Model metadata for NVIDIA Nemotron 3 Nano, Qwen Image Layered, and Kakao Kanana-2 (#202)
- ASCII diagram to image replacement system for MkDocs (#200)
Fixed¶
- Prevent cache stampede with singleflight, stale-while-revalidate, and background refresh (#220)
- Apply global_prompts changes via hot reload (#219)
- Invalidate model cache when backend config changes (#206)
CI¶
- Skip Rust tests in CI when only non-code files change (#204)
- Bump actions/github-script from 7 to 8 (#210)
- Bump apple-actions/import-codesign-certs from 3 to 6 (#212)
- Bump actions/cache from 4 to 5 (#211)
- Bump actions/checkout from 4 to 6 (#209)
v0.22.0 - 2025-12-19¶
Added¶
- Docker support with pre-built binary images — Debian (~50MB) and Alpine (~10MB) with multi-arch support (#198)
- Container health check CLI (
--health-check) for orchestration (#198) - Docker Compose quick start configuration
- Automated Docker image publishing to ghcr.io in release workflow
- MkDocs documentation website with Material theme (#183)
- Korean documentation translation (i18n) — complete localization of all 20 documentation files (#190)
- Security policy with vulnerability reporting process (#191)
- Dependency security auditing with cargo-deny and Dependabot (#192)
Changed¶
- Integrate orphaned architecture documentation into MkDocs site (#186)
- Rename documentation files to lowercase kebab-case for URL-friendly filenames
Fixed¶
- Fix health check response validation logic bug (operator precedence)
- Fix address parsing fallback silently hiding configuration errors
- Fix IPv6 address formatting in health check
v0.21.0 - 2025-12-19¶
Added¶
- Gemini 3 Flash Preview model support (#168)
- Default authentication mode for API endpoints (#173) —
permissive(default) orblockingmode - Backend error passthrough for 4xx responses (#177) — parse and forward original error messages from OpenAI, Anthropic, and Gemini
Fixed¶
- Handle UTF-8 multi-byte character corruption in streaming responses (#179)
- Strip response_format parameter for GPT Image models (#176)
- Allow auto-discovery for all backends except Anthropic (#172)
- Always return b64_json field for Gemini image generation responses (#181)
v0.20.0 - 2025-12-18¶
Added¶
- Image variations support for Gemini (nano-banana) models (#165)
- Image edit support for Gemini (nano-banana) models (#164)
- Enhanced
/v1/images/generationswith streaming and GPT Image features (#161) - gpt-image-1.5 model support (#159)
/v1/images/variationsendpoint (#155)/v1/images/editsendpoint for image editing and inpainting (#156)- External Markdown file support for system prompts with REST API management (#146)
- Automatic model discovery for backends without explicit model list (#142)
- Solar Open 100B model
Security¶
- API key redaction to prevent credential exposure in logs and error messages (#150)
Changed¶
- Optimized release binary size from 20MB to 6MB (70% reduction) (#144)
Refactored¶
v0.19.0 - 2025-12-13¶
Added¶
- Runtime Configuration Management API (#139)
- Configuration query, modification, save/restore, and backend management APIs
- Sensitive information masking, JSON Schema generation, configuration history with rollback (up to 50 entries)
- Comprehensive Admin REST API reference documentation
- 33 integration tests for configuration API endpoints
Security¶
- Input validation with 1MB content limit and 32-level nesting depth
- Audit logging for sensitive data exports with 30+ sensitive field patterns
v0.18.0 - 2025-12-13¶
Added¶
- Per-API-key rate limiting (#137)
- API key management and configuration system
- Files API authentication and authorization (#131)
- Hot reload for runtime configuration updates (#130)
Fixed¶
- Add ConnectInfo extension for admin/metrics/files endpoints
- Address security vulnerabilities in API key management
Refactored¶
- Extract CLI and app utilities into modular structure (#132)
- Split converter.rs into modular structure (#132)
- Split large source files into modular components
v0.17.0 - 2025-12-12¶
Added¶
- Anthropic backend file content transformation (#126)
- Gemini backend file content transformation (#127)
Fixed¶
- Streaming file uploads to prevent memory exhaustion (#128)
v0.16.0 - 2025-12-12¶
Added¶
- OpenAI-compatible Files API endpoints (#111)
- File resolution middleware for chat completions (#120)
- OpenAI backend file handling strategy (#121, #122)
- Persistent metadata storage for Files API (#125)
- GPT-5.2 model support (#124)
- Circuit breaker pattern for automatic backend failover
- Admin endpoint authentication and audit logging
- Configurable fallback models for unavailable model scenarios with cross-provider support
Fixed¶
- Sanitize fallback error headers and metric labels
- Use index-based lookup for fallback chain traversal
- Reduce lock contention in FallbackService with snapshot pattern
v0.15.0 - 2025-12-05¶
Added¶
- Nano Banana (Gemini Image Generation) API support (#102)
- Split
/v1/modelsendpoint — standard lightweight vs extended metadata response (#101)
Changed¶
- Optimize LRU cache to use read lock for cache lookups (#105)
Fixed¶
- Replace
.expect()panics with proper error propagation in HttpClientFactory (#104)
Refactored¶
- Extract streaming handler logic to dedicated StreamService (#106)
- Eliminate retry logic code duplication in proxy.rs (#103)
v0.14.2 - 2025-12-05¶
Added¶
- Log token usage (input/output tokens) on request completion (#92)
v0.14.1 - 2025-12-05¶
Fixed¶
- Optimize Anthropic backend TTFT with connection pooling and HTTP/2 (#90)
- Optimize Gemini backend TTFT with connection pooling and HTTP/2 (#88)
- Apply base name fallback matching to aliases in model metadata lookup (#84)
v0.14.0 - 2025-12-04¶
Added¶
- Router-wide global system prompt injection (#82)
CI¶
- Replace deprecated actions-rs/toolchain with dtolnay/rust-toolchain
- Add RUSTFLAGS for macOS ARM64 ring build
- Switch to rustls-tls for musl cross-compilation support
v0.13.0 - 2025-12-04¶
Added¶
- OpenAI
/v1/responsesAPI support with session management (#49) - True SSE streaming for
/v1/responsesAPI - Background cleanup task for expired sessions
- Override
/v1/modelsresponse fields via model-metadata.yaml (#75)
Security¶
- SecretString for API key storage across all backends (#76)
- Session access control and input validation for Responses API
Changed¶
- Immediate mode for SseParser for reduced first-response latency
Refactored¶
- String allocation optimizations and error handling standardization
v0.12.0 - 2025-12-04¶
Fixed¶
- Handle exact hash matches in consistent hash binary search (#72)
- Replace panics with Option returns and implement stats aggregation (#71)
- Remove hardcoded auth requirement from
/v1/modelsendpoint
Refactored¶
- Reorganize OpenAI model metadata by family (#74)
- Extract AnthropicStreamTransformer to dedicated module (#73)
- Split backends mod.rs into separate modules (#69)
- Extract embedded tests to separate files (#68)
- Create HttpClientFactory for centralized HTTP client creation (#67)
- Create UrlValidator module with SSRF prevention (#66)
- Extract RequestExecutor to shared common module (#65)
- Extract HeaderBuilder with auth strategies (#64)
- Extract AtomicStatistics to shared common module
v0.11.0 - 2025-12-03¶
Added¶
- Native Anthropic Claude API backend with extended thinking support
- OpenAI to Claude reasoning parameter conversion
- Flat reasoning_effort parameter for Anthropic
- Claude 4, 4.1, 4.5 model metadata
Fixed¶
- Improve health check and model fetching for Anthropic/Gemini backends
- Accept-Encoding fixes for streaming — use
identityheader and disable compression
v0.10.0 - 2025-12-03¶
Added¶
- Native Google Gemini API backend support
- OpenAI Images API support for image generation
- Authenticated health checks for OpenAI and API-key backends
- Built-in OpenAI model metadata for
/v1/modelsresponse - API key authentication for streaming requests
- Configurable image generation timeout
- Response_format validation for image generation API
Fixed¶
- Convert max_tokens to max_completion_tokens for newer OpenAI models
- Correct URL construction for all API endpoints
- Request body size limits to prevent DoS attacks
Security¶
- Remove sensitive data from debug logs
Refactored¶
- Unify request retry logic with RequestType enum
v0.9.0 - 2025-12-02¶
Added¶
- Enhanced rate limiting with token bucket algorithm
- Comprehensive Prometheus metrics and monitoring (#10)
Security¶
- Prevent IP spoofing via X-Forwarded-For manipulation
- Prevent header injection vulnerabilities
- Eliminate race condition in token refill
- Protect API keys with SHA-256 hashing
- Prevent memory exhaustion via unbounded bucket growth
- Comprehensive authentication for metrics endpoint
- Cardinality limits and label sanitization to prevent metric explosion DoS
Fixed¶
- Implement singleton pattern for metrics to prevent memory leaks
- Improve error handling to prevent panic conditions
- Resolve environment variable race condition in config test
- Fix integration test failures in metrics
v0.8.0 - 2025-09-09¶
Added¶
- Model ID alias support for metadata sharing (#27)
Fixed¶
- Return empty list instead of 503 when all backends are unhealthy (#28)
v0.7.1 - 2025-09-08¶
Fixed¶
- Improve config path validation for home directory and executable paths (#26)
v0.7.0 - 2025-09-07¶
Added¶
- Rich metadata support for
/v1/modelsendpoint (#23, #25) - Enhanced configuration management (#9, #22)
- Advanced load balancing strategies (Weighted, Least-Latency, Consistent-Hash) with enhanced error handling (#21)
Fixed¶
- Use streaming timeout configuration from config.yaml instead of hardcoded 25s limit
v0.6.0 - 2025-09-03¶
Fixed¶
- Use timeout configuration from config.yaml instead of hardcoded values (#19)
Documentation¶
- Comprehensive timeout configuration and model documentation updates
v0.5.0 - 2025-09-02¶
Added¶
- Optional retry configuration with sensible defaults
- Comprehensive integration tests and performance optimizations
- Complete service layer implementation
- Middleware architecture and enhanced backend abstraction
Fixed¶
- Handle streaming requests without model field gracefully
- Resolve floating-point precision and timing issues in tests
- Resolve test failures and deadlocks in object pool and SSE parser
- Resolve initial health check race condition
Refactored¶
- Split oversized modules into layered architecture
- Extract complex types into type aliases for better readability
v0.4.0 - 2025-08-25¶
Added¶
- Model-based routing with health monitoring
Fixed¶
- Improve health check integration and SSE parsing
v0.3.0 - 2025-08-25¶
Added¶
- SSE streaming support for real-time chat completions (#5)
- Model aggregation from multiple endpoints (#4)
v0.2.0 - 2025-08-25¶
Added¶
- Multiple backends support with round-robin load balancing (#1)
v0.1.0 - 2025-08-24¶
Added¶
- Initial release with OpenAI-compatible endpoints and proxy functionality