AppProxy Worker Mode
This document specifies how Continuum Router operates as a Backend.AI AppProxy inference worker — a data-plane node that an AppProxy coordinator controls, aggregating many LLM serving containers behind a single OpenAI-compatible address.
It is the canonical reference for the worker mode, written to be precise enough that any engineer (or agent) can audit or extend each component without re-deriving the protocol.
1. Motivation
Backend.AI's AppProxy fronts model services with a transparent L4/L7
proxy worker: one circuit maps to one frontend slot (a port or a wildcard
subdomain) and forwards bytes to the serving container(s) behind it, load
balancing replicas by traffic_ratio.
Continuum Router can take over that inference data plane and add L7, LLM-aware behaviour on top:
- model-name routing and cross-endpoint aggregation behind one /v1 surface;
- protocol translation (OpenAI ↔ Anthropic ↔ Gemini), smart routing, prefix/KV-cache-aware routing, disaggregated prefill/decode;
- fallback chains, circuit breaking, retries, response caching, Files API.
In worker mode, an AppProxy coordinator drives Continuum Router's backend set at runtime exactly the way it drives a stock worker (registration, heartbeat, circuit assignment), while Continuum Router realises each circuit as an LLM-aware, health-checked, weighted backend pool.
2. Background: AppProxy architecture
AppProxy has three parts (Backend.AI src/ai/backend/appproxy/):
- Coordinator — the control plane. An aiohttp REST server backed by PostgreSQL (the source of truth for workers, circuits, endpoints, tokens). It schedules circuits onto workers and pushes routing changes out.
- Worker — the data plane. Registers with the coordinator, heartbeats, and proxies traffic for the circuits assigned to it.
- Common — shared types, the event bus, and config.
Key entities
| Entity | Meaning |
|---|---|
| Worker | A proxy node, identified by a unique authority (shared across HA replicas via a nodes counter). Has a frontend_mode (wildcard/port), a protocol (http/h2/tcp/…), a hostname, an api_port, and a slot space (port_range or wildcard_domain). status ∈ ALIVE/LOST/TERMINATED. |
| Endpoint | An inference deployment (model service); id == DeploymentID. 1:1 with a circuit. Carries optional health_check_config. |
| Circuit | The central routing object pushed to workers. Binds a frontend slot (a port or a subdomain) to a list of backend targets (route_info). app_mode ∈ interactive/inference. For inference it carries endpoint_id and runtime_variant. |
| RouteInfo | One backend target inside a circuit: kernel_host, kernel_port, protocol, traffic_ratio, session_id, route_id. |
| Slot | A unit of frontend capacity (one port in a range, or one subdomain). The coordinator allocates slots; the worker honours them. |
Transport (how coordinator and worker communicate)
There are three distinct channels:
- Worker → Coordinator: HTTP REST. Registration, heartbeat, deregistration, and the initial circuit pull. Authenticated with a shared X-BackendAI-Token: <api_secret> header.
- Coordinator → Worker: Redis Pub/Sub (legacy mode). Circuit create/route-update/remove events are broadcast on channel events_all-appproxy, with a worker ack on create.
- Coordinator → Traefik: etcd (Traefik mode). The coordinator writes Traefik dynamic config to etcd and Traefik proxies; the worker is not signalled per-circuit in this mode.
The mode is a coordinator-global setting (proxy_coordinator.enable_traefik).
This distinction drives one of our design decisions (see §4 and §5.5).
3. Conceptual mapping
The inference path maps almost 1:1 onto Continuum Router's existing model:
| AppProxy | Continuum Router |
|---|---|
| Worker (authority, frontend_mode, slot space) | the router instance, registered as a worker |
| Endpoint (inference model service) | a model (the set of backends serving it) |
| Circuit (app_mode=inference, route_info[]) | a model → Vec<BackendConfig> mapping |
| RouteInfo {kernel_host, kernel_port, traffic_ratio} | BackendConfig {url: http://host:port, weight ∝ ratio, models: [model]} |
| Slot (subdomain/port) | the ingress addressing key (see §5.2) |
| RoutePool weighted-random + health | WeightedRoundRobin + HealthChecker + CircuitBreaker |
An AppProxy inference circuit is "one model's N replicas, weighted by
traffic_ratio." That is exactly a Continuum Router backend group whose members
share models = [<model>] and carry per-replica weight. The translation is
therefore mechanical, and the data plane (selection, health, breaker, fallback)
is reused unchanged.
4. Design decisions
Four decisions shape the integration:
- Native module, not an external adapter. The integration lives inside Continuum Router behind a Cargo feature (appproxy). It injects circuits through the existing hot-reload config_sender channel, so the data plane is not modified.
- Continuum-router-only; the coordinator is unchanged. Continuum Router conforms to the existing wildcard inference-worker protocol. Model names are obtained by auto-discovering each replica's /v1/models.
- Wildcard ingress that honours the slot. The router registers a wildcard slot space and resolves each request to a circuit by HTTP Host (subdomain), with model-name aggregation available on a catch-all host. The coordinator-allocated slot is the ingress address, not a vestige (see §5.2).
- Both transports. A pull-based reconcile baseline (works in any coordinator mode) plus a Redis Pub/Sub event overlay (legacy-mode support + low latency). Pull is also the backstop for missed events.
5. Architecture
5.1 Component overview
wildcard DNS: *.models.example.com ──► continuum-router host
│
external client │ single socket (e.g. :443)
POST https://ep-abc.models.example.com/v1/chat/completions
│ Host: ep-abc.models.example.com
▼
┌────────────────────────────────────────────────────────────────┐
│ continuum-router (one worker, frontend_mode = wildcard) │
│ │
│ appproxy module (feature = "appproxy") │
│ ├── coordinator client (REST: register/heartbeat/pull) │
│ ├── worker service (lifecycle loops) ── circuit registry │
│ ├── reconcile (circuit → BackendConfig → config_sender) ──┐ │
│ ├── events (Redis Pub/Sub subscribe + ack) │ │
│ └── ingress middleware (Host subdomain → model) ──┐ │ │
│ ▼ ▼ │
│ existing pipeline: model router → backend pool ◄── hot reload │
│ (health, circuit breaker, fallback) │
└────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
kernel1:port kernel2:port (LLM serving containers = backends)
The only new code is the appproxy module and one ingress middleware. Circuit
state becomes backend state through the existing hot-reload machinery; request
routing reuses the existing model router.
5.2 Registration and the slot model
The router registers as a wildcard inference worker. "Single address" and "honour the slot" are not in conflict: the wildcard domain is the single address, and each circuit's subdomain is a virtual address into the same socket.
Registration advertises a slot space:
frontend_mode = wildcard
wildcard_domain = ".models.example.com"
wildcard_traffic_port = 443 # the router's /v1 socket
hostname = <router host>
available_slots = -1 # wildcard → unbounded; never runs out
accepted_traffics = [inference]
The coordinator then allocates one subdomain per inference circuit inside that
domain, and its existing Circuit.get_endpoint_url() produces
https://ep-abc.models.example.com/ — the endpoint-URL contract the manager
hands to users is preserved, with no coordinator change. The operator
configures wildcard DNS (*.models.example.com → router) once.
PORT mode (a port per circuit) would require the router to open and close listening sockets dynamically and is not supported (see §10). Wildcard + TLS is the norm for externally served inference.
5.3 Circuit → backend translation
Each inference circuit is translated to one BackendConfig per RouteInfo
replica and applied through the existing runtime-mutation path:
for each circuit assigned to this authority:
    model = discover_model(circuit)   # from a replica's /v1/models, keyed by endpoint_id
    for each route in circuit.route_info:
        BackendConfig {
            name: "appproxy-<circuit_id>-r<route_id>",
            backend_type: Generic,    # OpenAI-compatible; Vllm if known
            url: "http://{route.kernel_host}:{route.kernel_port}",
            weight: weight_from(route.traffic_ratio),
            models: [model],          # what find_backends_for_model matches
            ..Default
        }
The apply step reuses the admin API's exact pattern
(src/admin_config/backend_api.rs):
let _guard = config_modification_lock().write().await; // serialise with admin API
let cfg = state.current_config(); // re-read under lock
let mut new_cfg = (*cfg).clone();
// Pseudocode step: reconcile new_cfg.backends so that the set of appproxy-*
// backends equals the desired set derived from the current circuits.
state.config_sender.send(Arc::new(new_cfg)); // drives hot reload
HotReloadService then diffs old vs new, adds new backends, gracefully drains
removed ones, syncs the health checker, and invalidates the model cache. No
backend-pool code is touched. Backends owned by this module are namespaced
with an appproxy- prefix so reconcile only ever adds/removes its own entries
and never disturbs statically configured backends.
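The namespaced reconcile step can be sketched as follows. This is a minimal illustration, not the router's code: the Backend struct is a hypothetical, trimmed-down stand-in for BackendConfig, and the helper names backend and reconcile_backends are assumptions made for the example.

```rust
/// Hypothetical stand-in for the router's BackendConfig (only the fields needed here).
#[derive(Clone, Debug, PartialEq)]
pub struct Backend {
    pub name: String,
    pub url: String,
    pub weight: u32,
}

/// Prefix that marks backends owned by the appproxy module.
pub const OWNED_PREFIX: &str = "appproxy-";

/// Small constructor used to keep the example readable.
pub fn backend(name: &str, url: &str, weight: u32) -> Backend {
    Backend { name: name.to_string(), url: url.to_string(), weight }
}

/// Returns a new backend list in which the appproxy-owned entries equal
/// `desired`, while statically configured entries are kept untouched and
/// in their original order.
pub fn reconcile_backends(current: &[Backend], desired: &[Backend]) -> Vec<Backend> {
    // Keep everything that is NOT owned by this module.
    let mut next: Vec<Backend> = current
        .iter()
        .filter(|b| !b.name.starts_with(OWNED_PREFIX))
        .cloned()
        .collect();
    // Append the desired set; entries without the prefix are ignored so the
    // module can never inject a backend outside its own namespace.
    next.extend(
        desired
            .iter()
            .filter(|b| b.name.starts_with(OWNED_PREFIX))
            .cloned(),
    );
    next
}
```

Because the owned set is rebuilt from the desired set on every pass, applying the same desired set twice yields the same list, which is what makes the periodic pull safe to repeat.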
Two existing characteristics make this a good fit:
- Runtime config changes are in-memory only (never written to disk). The coordinator is the source of truth; the router re-syncs on restart via the initial pull. This is the desired behaviour, not a limitation.
- The typed backend pool is not hot-reloaded, which is irrelevant here: serving containers are generic OpenAI-compatible HTTP backends routed through the URL-based pool.
5.4 Ingress resolution (Host/subdomain → model)
A new Axum middleware resolves the target circuit/model from the request:
- Read the Host header; strip the configured wildcard_domain suffix to get the subdomain.
- Look the subdomain up in the in-memory circuit registry (owned by the worker service, updated on reconcile/events) → circuit → canonical model.
- Insert an IngressTarget { circuit_id, model } request extension.
- For non-public inference circuits (open_to_public == false), verify the Authorization: Bearer <jwt> (HS256 with jwt_secret; the decoded id must equal the circuit id), matching AppProxy worker auth.
Handlers prefer the injected model over the body model field at the single
existing read site (src/proxy/handlers.rs, the payload.get("model") block)
and in its siblings; everything from select_backend_with_retry downward is
unchanged, because selection resolves the model name against each backend's
models list.
Requests to the bare wildcard domain (or a configured aggregation host) skip subdomain scoping and use normal model-name routing across all circuits — this is the cross-endpoint aggregation surface.
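The Host classification described above might look like the sketch below. It is an illustration under stated assumptions: HostScope and classify_host are hypothetical names, wildcard_domain is assumed to carry its leading dot (as in the registration example), and IPv6 literal hosts are not handled.

```rust
/// Hypothetical result of classifying a request's Host header.
#[derive(Debug, PartialEq)]
pub enum HostScope {
    /// Scoped to one circuit; carries the subdomain to look up in the registry.
    Circuit(String),
    /// Wildcard apex or a configured aggregation host: normal model-name routing.
    Aggregation,
    /// Host lies outside the wildcard domain.
    Unknown,
}

pub fn classify_host(
    host_header: &str,
    wildcard_domain: &str,
    aggregation_hosts: &[&str],
) -> HostScope {
    // Drop an optional ":port" suffix (only when the part after ':' is numeric).
    let host = host_header.rsplit_once(':').map_or(host_header, |(h, p)| {
        if p.chars().all(|c| c.is_ascii_digit()) { h } else { host_header }
    });
    // Host names are case-insensitive; also tolerate a trailing root dot.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let domain = wildcard_domain.to_ascii_lowercase();
    let apex = domain.trim_start_matches('.');

    if host == apex || aggregation_hosts.iter().any(|h| h.eq_ignore_ascii_case(&host)) {
        return HostScope::Aggregation;
    }
    match host.strip_suffix(domain.as_str()) {
        Some(sub) if !sub.is_empty() => HostScope::Circuit(sub.to_string()),
        _ => HostScope::Unknown,
    }
}
```

A Circuit result is only a candidate key: the registry lookup and the per-circuit auth check still decide whether the request is served.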
Fallback participation (scoped fallback)
A registered circuit does participate in fallback.fallback_chains. When a
request resolves to a circuit whose replicas are all down (its route_info is
empty, so it has no live backend), the ingress middleware still pins the
request to the circuit's canonical model and passes it to the normal pipeline.
select_backend_with_retry then finds no backend for that model and
FallbackService takes over, so a chain keyed on the circuit's model (e.g.
vllm-real-poc → gpt-4o-mini) is reached — the "deployment went down, traffic
goes to OpenAI" behaviour. Per-circuit auth (open_to_public, bearer token,
allowed_client_ips) is enforced before this fall-through, so the fallback
path is never an unauthenticated bypass.
The fall-through is scoped: it applies only to a registered circuit. A
request to an unknown subdomain (no circuit in the registry) is still a
404 endpoint_not_found. An unknown subdomain is a circuit identifier, not a
model name, so it does not enter the model-registry / fallback path.
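The scoping rule can be condensed into a small decision function. This is a sketch only: CircuitEntry, IngressDecision and decide are hypothetical names, the registry is modelled as a plain HashMap keyed by subdomain, and per-circuit auth (which runs before this step) is left out.

```rust
use std::collections::HashMap;

/// Hypothetical registry entry: canonical model plus the number of live replicas.
pub struct CircuitEntry {
    pub model: String,
    pub live_routes: usize,
}

#[derive(Debug, PartialEq)]
pub enum IngressDecision {
    /// Registered circuit: pin the canonical model and continue down the
    /// normal pipeline. With zero live routes, backend selection finds
    /// nothing for the model and the fallback chain is expected to take over.
    PinModel { model: String, fallback_expected: bool },
    /// No circuit registered for this subdomain: 404 endpoint_not_found.
    EndpointNotFound,
}

pub fn decide(registry: &HashMap<String, CircuitEntry>, subdomain: &str) -> IngressDecision {
    match registry.get(subdomain) {
        Some(entry) => IngressDecision::PinModel {
            model: entry.model.clone(),
            fallback_expected: entry.live_routes == 0,
        },
        // An unknown subdomain is a circuit identifier, not a model name,
        // so it never reaches the model registry or the fallback path.
        None => IngressDecision::EndpointNotFound,
    }
}
```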
5.5 Update transport
The worker keeps its circuit set current through two cooperating mechanisms:
- Pull reconcile (baseline, always on). After registering, the worker GETs /api/worker/{id}/circuits and reconciles; it then repeats on a timer (reconcile_interval). This alone is fully correct in Traefik mode (where the coordinator writes etcd and never signals workers) and is the backstop for any missed event.
- Redis Pub/Sub overlay (legacy mode + low latency). The worker subscribes to events_all-appproxy and applies create/route-update/remove deltas within ~1 s, acking creates. This is required in legacy mode: the coordinator blocks up to 15 s on the worker's ack during circuit creation (initialize_legacy_circuit) and raises E10001 Proxy worker not responding on timeout. Route updates and removals are fire-and-forget.
Because the four circuit events are all broadcast (Pub/Sub), the worker needs
only SUBSCRIBE (3 inbound) + PUBLISH (1 ack). Redis Streams /
consumer-groups are not required for the circuit lifecycle.
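How the two mechanisms cooperate can be shown with an in-memory registry: events apply deltas, and each pulled snapshot replaces the whole state, which repairs anything an event missed. The types below are hypothetical simplifications (a circuit is reduced to an id and a list of route addresses, and a removal carries ids rather than full circuit JSON).

```rust
use std::collections::BTreeMap;

/// Simplified circuit: id plus "host:port" route targets (illustrative only).
#[derive(Clone, Debug, PartialEq)]
pub struct Circuit {
    pub id: String,
    pub routes: Vec<String>,
}

/// The three inbound event kinds, reduced to what the registry needs.
pub enum Delta {
    Created(Vec<Circuit>),
    Removed(Vec<String>),
    RouteUpdated { circuit_id: String, routes: Vec<String> },
}

#[derive(Default)]
pub struct Registry {
    by_id: BTreeMap<String, Circuit>,
}

impl Registry {
    /// Pull path: the snapshot is authoritative and replaces all local state.
    pub fn apply_snapshot(&mut self, circuits: Vec<Circuit>) {
        self.by_id = circuits.into_iter().map(|c| (c.id.clone(), c)).collect();
    }

    /// Event path: apply one low-latency delta on top of the current state.
    pub fn apply_delta(&mut self, delta: Delta) {
        match delta {
            Delta::Created(circuits) => {
                for c in circuits {
                    self.by_id.insert(c.id.clone(), c);
                }
            }
            Delta::Removed(ids) => {
                for id in ids {
                    self.by_id.remove(&id);
                }
            }
            Delta::RouteUpdated { circuit_id, routes } => {
                // Updates for circuits we do not know are dropped; the next
                // snapshot brings the circuit in with its current routes.
                if let Some(c) = self.by_id.get_mut(&circuit_id) {
                    c.routes = routes;
                }
            }
        }
    }

    /// Circuit ids in sorted order.
    pub fn ids(&self) -> Vec<&str> {
        self.by_id.keys().map(String::as_str).collect()
    }
}
```

In this sketch a lost removal event leaves a stale circuit only until the next reconcile_interval tick, when apply_snapshot drops it.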
6. Wire protocol reference
6.1 Coordinator REST API (worker scope)
Base URL = coordinator_url. Every request carries:
X-BackendAI-Token: <api_secret>
X-BackendAI-RequestID: <uuid4>
| Method & path | Purpose | Notes |
|---|---|---|
| PUT /api/worker | register / upsert (idempotent by authority) | returns {id, slots, …}; HA: re-register increments nodes |
| PATCH /api/worker/{id} | heartbeat | body-less; every heartbeat_period (default 10 s); coordinator timeout 30 s |
| DELETE /api/worker/{id} | deregister | decrements nodes; last node → LOST |
| GET /api/worker/{id}/circuits | full circuit snapshot | {circuits: [SerializableCircuit, …]} |
| GET /api/circuit/{id} | one circuit | |
| DELETE /api/circuit/{id} | remove a circuit | |
Registration request body (WorkerRequestModel), wildcard mode:
{
"authority": "continuum-router-1",
"frontend_mode": "wildcard",
"protocol": "http",
"hostname": "router.example.com",
"tls_listen": false,
"tls_advertised": true,
"api_port": 8080,
"accepted_traffics": ["inference"],
"filtered_apps_only": false,
"app_filters": [],
"traefik_last_used_marker_path": null,
"wildcard_domain": ".models.example.com",
"wildcard_traffic_port": 443
}
The response includes the assigned id (worker UUID, cached for subsequent
calls) and the computed slots.
6.2 Circuit and route data models
SerializableCircuit (the JSON shape returned by the REST snapshot and embedded
in events):
| Field | Type | Notes |
|---|---|---|
| id | UUID | |
| app | string | "" for inference |
| protocol | enum | http/grpc/h2/tcp/preopen/vnc/rdp |
| worker | UUID | hosting worker |
| app_mode | enum | interactive/inference |
| frontend_mode | enum | wildcard/port |
| port | int? | set iff frontend_mode == port |
| subdomain | string? | set iff frontend_mode == wildcard |
| endpoint_id | UUID? | inference only |
| runtime_variant | string? | inference only |
| open_to_public | bool | skip auth when true |
| allowed_client_ips | string? | comma-separated CIDRs |
| route_info | RouteInfo[] | the backend targets |
| session_ids | UUID[] | |
| envs | object | |
| created_at / updated_at | datetime | ISO-8601 |
RouteInfo:
| Field | Type | Notes |
|---|---|---|
| route_id | UUID? | a different route_id on the same host:port means a kernel swap |
| session_id | UUID | required |
| session_name | string? | |
| kernel_host | string? | None → localhost |
| kernel_port | int | 1–65535 |
| protocol | enum | |
| traffic_ratio | float | default 1.0 → maps to backend weight |
Rust serde notes:
- Accept both kebab-case and snake_case input aliases (e.g. route-id and route_id); emit snake_case.
- extra = "ignore" semantics: tolerate unknown fields (#[serde(default)] / ignore unknown) so coordinator additions never break parsing.
6.3 Redis event envelope
All four circuit events are broadcast as a JSON object PUBLISHed to
events_all-appproxy:
{
"name": "<event_name>",
"source": "<agent-id>",
"args": "<base64(msgpack(args_tuple))>",
"metadata": "{\"request_id\":null,\"user\":null}"
}
- args is base64 of a msgpack array. For these events the array elements are strings only: no msgpack ext types, no UUID/datetime/enum encoding at the msgpack layer (those are pre-encoded inside the inner JSON). A Rust impl needs only: JSON object → base64-decode args → msgpack array-of-strings → JSON for each element.
- metadata is a JSON string with exactly request_id and user (additional keys make the coordinator's parser raise). Emit {"request_id":null,"user":null} or echo the inbound request_id.
- source for worker-emitted events is "appproxy-worker". It is not used for routing; the worker filters inbound events on target_worker_authority.
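The inbound args path (base64 → msgpack array of strings) is small enough to sketch with the standard library alone. The function names are hypothetical, and the decoder deliberately accepts only what these events use: a msgpack fixarray (up to 15 elements) whose elements are all str values. A real implementation would more likely rely on established base64 and msgpack crates; parsing the envelope JSON and the inner circuit JSON is out of scope here.

```rust
/// Decodes standard, padded base64; returns None on malformed input.
pub fn base64_decode(input: &str) -> Option<Vec<u8>> {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let bytes = input.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(bytes.len() / 4 * 3);
    let mut padded = false;
    for chunk in bytes.chunks(4) {
        if padded {
            return None; // padding is only valid in the final group
        }
        let (mut acc, mut pad) = (0u32, 0usize);
        for &c in chunk {
            acc <<= 6;
            if c == b'=' {
                pad += 1;
            } else {
                if pad > 0 {
                    return None; // data after padding
                }
                acc |= ALPHABET.iter().position(|&a| a == c)? as u32;
            }
        }
        if pad > 2 {
            return None;
        }
        out.push((acc >> 16) as u8);
        if pad < 2 { out.push((acc >> 8) as u8); }
        if pad < 1 { out.push(acc as u8); }
        padded = pad > 0;
    }
    Some(out)
}

/// Reads a big-endian unsigned integer of `width` bytes starting at `at`.
fn read_be(buf: &[u8], at: usize, width: usize) -> Option<usize> {
    let bytes = buf.get(at..at.checked_add(width)?)?;
    Some(bytes.iter().fold(0usize, |acc, &b| (acc << 8) | b as usize))
}

/// Decodes a msgpack fixarray whose elements are all strings
/// (fixstr, str8, str16 or str32); anything else yields None.
pub fn decode_str_array(buf: &[u8]) -> Option<Vec<String>> {
    let head = *buf.first()?;
    if !(0x90..=0x9f).contains(&head) {
        return None;
    }
    let count = (head & 0x0f) as usize;
    let mut pos = 1usize;
    let mut items = Vec::with_capacity(count);
    for _ in 0..count {
        let tag = *buf.get(pos)?;
        pos += 1;
        let len = match tag {
            0xa0..=0xbf => (tag & 0x1f) as usize,
            0xd9 => { let n = read_be(buf, pos, 1)?; pos += 1; n }
            0xda => { let n = read_be(buf, pos, 2)?; pos += 2; n }
            0xdb => { let n = read_be(buf, pos, 4)?; pos += 4; n }
            _ => return None, // not a string element
        };
        let end = pos.checked_add(len)?;
        items.push(String::from_utf8(buf.get(pos..end)?.to_vec()).ok()?);
        pos = end;
    }
    // Reject trailing bytes so a truncated or concatenated payload is not accepted.
    if pos == buf.len() { Some(items) } else { None }
}

/// Full inbound path for the envelope's `args` field.
pub fn decode_args(args_b64: &str) -> Option<Vec<String>> {
    decode_str_array(&base64_decode(args_b64)?)
}
```

The first decoded element is the authority, which the worker can compare against its own before parsing the remaining JSON strings.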
Event payloads:
| name | Direction | args tuple |
|---|---|---|
| appproxy_circuit_created_event | inbound | (authority, circuits_json) where circuits_json = JSON array of SerializableCircuit |
| appproxy_circuit_removed_event | inbound | (authority, circuits_json) |
| appproxy_circuit_route_updated_event | inbound | (authority, circuit_json, routes_json) (single circuit + RouteInfo[]) |
| appproxy_worker_circuit_added_event | outbound (ack) | (authority, circuits_json); echo the inbound circuits_json verbatim |
Worked ack example (authority = "worker01", circuits_json = "[]"):
msgpack(["worker01","[]"]) = 92 a8 worker01 a2 5b 5d → base64
kqh3b3JrZXIwMaJbXQ==, PUBLISHed to events_all-appproxy with
name = appproxy_worker_circuit_added_event, source = appproxy-worker.
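The worked example can be reproduced with a few lines of standard-library Rust. The sketch below uses hypothetical function names and covers only the outbound args field (msgpack array of strings → base64); building the surrounding JSON envelope and the Redis PUBLISH are omitted.

```rust
/// Encodes up to 15 strings as a msgpack fixarray of str values.
/// Returns None when the input does not fit the formats used here.
pub fn encode_str_array(items: &[&str]) -> Option<Vec<u8>> {
    if items.len() > 15 {
        return None; // fixarray holds at most 15 elements
    }
    let mut out = vec![0x90 | items.len() as u8];
    for s in items {
        let n = s.len();
        match n {
            0..=31 => out.push(0xa0 | n as u8), // fixstr
            32..=0xff => out.extend_from_slice(&[0xd9, n as u8]), // str8
            0x100..=0xffff => {
                out.push(0xda); // str16
                out.extend_from_slice(&(n as u16).to_be_bytes());
            }
            _ => {
                out.push(0xdb); // str32
                out.extend_from_slice(&u32::try_from(n).ok()?.to_be_bytes());
            }
        }
        out.extend_from_slice(s.as_bytes());
    }
    Some(out)
}

/// Encodes bytes as standard, padded base64.
pub fn base64_encode(data: &[u8]) -> String {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Missing bytes in the last group are treated as zero, then padded.
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = ((chunk[0] as u32) << 16) | ((b1 as u32) << 8) | (b2 as u32);
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { ALPHABET[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Builds the `args` value for an ack: (authority, circuits_json).
pub fn ack_args(authority: &str, circuits_json: &str) -> Option<String> {
    Some(base64_encode(&encode_str_array(&[authority, circuits_json])?))
}
```

With authority "worker01" and circuits_json "[]" this yields the 13 bytes and the base64 string shown in the worked example; since circuits_json is echoed verbatim, real acks will usually take the str16 or str32 branch.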
The Redis DB index for the event bus is the deployment's "stream" role DB and
must be configured (redis_url / DB selector); confirm it against the
coordinator's Redis profile.
7. Configuration
A new optional section is added to the router config, gated by the appproxy
feature:
appproxy:
enabled: true
coordinator_url: "http://coordinator:10200"
api_secret: "${APPPROXY_API_SECRET}" # X-BackendAI-Token
jwt_secret: "${APPPROXY_JWT_SECRET}" # HS256 circuit/bearer verification
redis_url: "redis://valkey:6379/4" # event bus DB (stream role)
authority: "continuum-router-1"
hostname: "router.example.com"
frontend_mode: "wildcard"
wildcard_domain: ".models.example.com"
aggregation_hosts: [] # extra Hosts that skip subdomain scoping
wildcard_traffic_port: 443
tls_advertised: true
heartbeat_period: "10s"
reconcile_interval: "15s"
events_enabled: true # Redis Pub/Sub overlay on/off
Secrets support ${ENV_VAR} interpolation, consistent with backends[].api_key.
aggregation_hosts is optional and empty by default. The bare wildcard apex
(wildcard_domain without its leading dot, e.g. models.example.com) is always
an aggregation surface implicitly; list any additional vanity or aggregation
hostnames here. A request whose Host matches one of these (or the apex) skips
per-circuit subdomain scoping and uses normal model-name routing across all
circuits (§5.4).
8. Module layout
src/appproxy/ # feature = "appproxy"
├── mod.rs # run_worker(cfg, state, shutdown_rx) entry; re-exports
├── config.rs # AppProxyWorkerConfig (also re-exported via core::config)
├── types.rs # SerializableCircuit, RouteInfo, enums (serde, dual aliases)
├── client.rs # coordinator REST client (X-BackendAI-Token)
├── worker.rs # lifecycle: register → pull → heartbeat → reconcile; circuit registry
├── reconcile.rs # circuit set → Vec<BackendConfig> → config_sender (under the lock)
├── events.rs # Redis Pub/Sub subscribe + ack; msgpack/base64/JSON envelope codec
├── ingress.rs # Host subdomain → IngressTarget middleware; handler override helper
└── jwt.rs # HS256 verify for non-public circuits
Wiring points (all behind #[cfg(feature = "appproxy")]):
- Cargo.toml: appproxy = ["dep:redis", "dep:deadpool-redis", "dep:jsonwebtoken"] (Streams not needed). Leave it out of full so it stays opt-in.
- src/lib.rs: pub mod appproxy;.
- src/core/config/models/config.rs: pub appproxy: Option<AppProxyWorkerConfig>.
- src/server/mod.rs::build_router: register the worker /status route and insert the ingress middleware just outside the rate-limit layer (so the resolved model is visible to the rate limiter).
- src/server/serve.rs: after the hot-reload block, spawn appproxy::run_worker(cfg, state.clone(), shutdown_rx.clone()) when cfg.appproxy.enabled.
9. Security
- Coordinator auth. All REST calls send X-BackendAI-Token: <api_secret>. Keep the secret in env/secret storage; never log it.
- Data-plane auth. Non-public inference circuits require an Authorization: Bearer <jwt> whose decoded id equals the circuit id (HS256, jwt_secret). Public circuits (open_to_public == true) skip it. Use jsonwebtoken with Validation::new(Algorithm::HS256), never the unverified payload decode that exists elsewhere in the tree.
- Client IP allow-list. Honour allowed_client_ips (comma-separated CIDRs) when present on a circuit.
- Shared secrets. api_secret and jwt_secret must be identical across the whole AppProxy cluster (coordinator + workers).
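The client IP allow-list check can be sketched with std::net alone. The function names are hypothetical, and two behaviours are assumptions rather than documented AppProxy semantics: a missing or blank allowed_client_ips means "no restriction", and a malformed entry is skipped (it never matches) instead of rejecting the whole list.

```rust
use std::net::IpAddr;

/// Returns Some(true/false) for a parseable CIDR (or bare address, treated as
/// a full-length prefix) and None when the entry is malformed.
fn cidr_contains(cidr: &str, ip: IpAddr) -> Option<bool> {
    let cidr = cidr.trim();
    let (net, prefix) = match cidr.split_once('/') {
        Some((n, p)) => (n.parse::<IpAddr>().ok()?, Some(p.parse::<u32>().ok()?)),
        None => (cidr.parse::<IpAddr>().ok()?, None),
    };
    match (net, ip) {
        (IpAddr::V4(n), IpAddr::V4(a)) => {
            let p = prefix.unwrap_or(32);
            if p > 32 {
                return None;
            }
            // A /0 prefix matches everything; avoid shifting by the full width.
            let mask = if p == 0 { 0 } else { u32::MAX << (32 - p) };
            Some((u32::from(n) & mask) == (u32::from(a) & mask))
        }
        (IpAddr::V6(n), IpAddr::V6(a)) => {
            let p = prefix.unwrap_or(128);
            if p > 128 {
                return None;
            }
            let mask = if p == 0 { 0 } else { u128::MAX << (128 - p) };
            Some((u128::from(n) & mask) == (u128::from(a) & mask))
        }
        // IPv4 entry vs IPv6 client (or the reverse) never matches.
        _ => Some(false),
    }
}

/// Applies a circuit's comma-separated allow-list to a client address.
pub fn ip_allowed(allowed_client_ips: Option<&str>, ip: IpAddr) -> bool {
    let entries: Vec<&str> = allowed_client_ips
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    if entries.is_empty() {
        return true; // assumption: absent or blank list means no restriction
    }
    entries.iter().any(|c| cidr_contains(c, ip) == Some(true))
}
```

The address passed in should be the client address the router trusts (for example, after any proxy-header handling it already performs); that choice is outside this sketch.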
10. Limitations
- Model-name source. The served model name is auto-discovered from each replica's /v1/models; the coordinator does not supply a model name (the integration is continuum-router-only, per decision 2).
- PORT mode. Not supported. It would require dynamic per-port listeners; the router registers as a wildcard worker only.
- Interactive apps. Not served. The router registers accepted_traffics = [inference] only, so interactive circuits stay on stock workers.
- Health/load reporting. The AppProxy heartbeat is a bare keepalive; the router does not export per-circuit load or health metrics to the coordinator.