AppProxy Worker Mode

This document specifies how Continuum Router operates as a Backend.AI AppProxy inference worker — a data-plane node that an AppProxy coordinator controls, aggregating many LLM serving containers behind a single OpenAI-compatible address.

It is the canonical reference for the worker mode, written to be precise enough that any engineer (or agent) can audit or extend each component without re-deriving the protocol.

1. Motivation

Backend.AI's AppProxy fronts model services with a transparent L4/L7 proxy worker: one circuit maps to one frontend slot (a port or a wildcard subdomain) and forwards bytes to the serving container(s) behind it, load balancing replicas by traffic_ratio.

Continuum Router can take over that inference data plane and add L7, LLM-aware behaviour on top:

  • model-name routing and cross-endpoint aggregation behind one /v1 surface;
  • protocol translation (OpenAI ↔ Anthropic ↔ Gemini), smart routing, prefix/KV-cache-aware routing, disaggregated prefill/decode;
  • fallback chains, circuit breaking, retries, response caching, Files API.

In worker mode, an AppProxy coordinator drives Continuum Router's backend set at runtime exactly the way it drives a stock worker (registration, heartbeat, circuit assignment), while Continuum Router realises each circuit as an LLM-aware, health-checked, weighted backend pool.

2. Background: AppProxy architecture

AppProxy has three parts (Backend.AI src/ai/backend/appproxy/):

  • Coordinator — the control plane. An aiohttp REST server backed by PostgreSQL (the source of truth for workers, circuits, endpoints, tokens). It schedules circuits onto workers and pushes routing changes out.
  • Worker — the data plane. Registers with the coordinator, heartbeats, and proxies traffic for the circuits assigned to it.
  • Common — shared types, the event bus, and config.

Key entities

Entity | Meaning
Worker | A proxy node, identified by a unique authority (shared across HA replicas via a nodes counter). Has a frontend_mode (wildcard/port), a protocol (http/h2/tcp/…), a hostname, an api_port, and a slot space (port_range or wildcard_domain). status is one of ALIVE/LOST/TERMINATED.
Endpoint | An inference deployment (model service); id == DeploymentID. 1:1 with a circuit. Carries optional health_check_config.
Circuit | The central routing object pushed to workers. Binds a frontend slot (a port or a subdomain) to a list of backend targets (route_info). app_mode is one of interactive/inference. For inference it carries endpoint_id and runtime_variant.
RouteInfo | One backend target inside a circuit: kernel_host, kernel_port, protocol, traffic_ratio, session_id, route_id.
Slot | A unit of frontend capacity (one port in a range, or one subdomain). The coordinator allocates slots; the worker honours them.

Transport (how coordinator and worker communicate)

There are three distinct channels:

  1. Worker → Coordinator: HTTP REST. Registration, heartbeat, deregistration, and the initial circuit pull. Authenticated with a shared X-BackendAI-Token: <api_secret> header.
  2. Coordinator → Worker: Redis Pub/Sub (legacy mode). Circuit create/route-update/remove events are broadcast on channel events_all-appproxy, with a worker ack on create.
  3. Coordinator → Traefik: etcd (Traefik mode) — the coordinator writes Traefik dynamic config to etcd and Traefik proxies; the worker is not signalled per-circuit in this mode.

The mode is a coordinator-global setting (proxy_coordinator.enable_traefik). This distinction drives one of our design decisions (see §4 and §5.5).

3. Conceptual mapping

The inference path maps almost 1:1 onto Continuum Router's existing model:

AppProxy | Continuum Router
Worker (authority, frontend_mode, slot space) | the router instance, registered as a worker
Endpoint (inference model service) | a model (the set of backends serving it)
Circuit (app_mode=inference, route_info[]) | a model → Vec<BackendConfig> mapping
RouteInfo {kernel_host, kernel_port, traffic_ratio} | BackendConfig {url: http://host:port, weight ∝ ratio, models: [model]}
Slot (subdomain/port) | the ingress addressing key (see §5.2)
RoutePool weighted-random + health | WeightedRoundRobin + HealthChecker + CircuitBreaker

An AppProxy inference circuit is "one model's N replicas, weighted by traffic_ratio." That is exactly a Continuum Router backend group whose members share models = [<model>] and carry per-replica weight. The translation is therefore mechanical, and the data plane (selection, health, breaker, fallback) is reused unchanged.

4. Design decisions

Four decisions shape the integration:

  1. Native module, not an external adapter. The integration lives inside Continuum Router behind a Cargo feature (appproxy). It injects circuits through the existing hot-reload config_sender channel, so the data plane is not modified.
  2. Continuum-router-only; the coordinator is unchanged. Continuum Router conforms to the existing wildcard inference-worker protocol. Model names are obtained by auto-discovering each replica's /v1/models.
  3. Wildcard ingress that honours the slot. The router registers a wildcard slot space and resolves each request to a circuit by HTTP Host (subdomain), with model-name aggregation available on a catch-all host. The coordinator-allocated slot is the ingress address, not a vestige (see §5.2).
  4. Both transports. A pull-based reconcile baseline (works in any coordinator mode) plus a Redis Pub/Sub event overlay (legacy-mode support + low latency). Pull is also the backstop for missed events.

5. Architecture

5.1 Component overview

                wildcard DNS: *.models.example.com  ──►  continuum-router host
external client                   │   single socket (e.g. :443)
  POST https://ep-abc.models.example.com/v1/chat/completions
        │  Host: ep-abc.models.example.com
┌────────────────────────────────────────────────────────────────┐
│ continuum-router  (one worker, frontend_mode = wildcard)         │
│                                                                  │
│  appproxy module (feature = "appproxy")                          │
│   ├── coordinator client (REST: register/heartbeat/pull)         │
│   ├── worker service (lifecycle loops) ── circuit registry       │
│   ├── reconcile (circuit → BackendConfig → config_sender) ──┐    │
│   ├── events (Redis Pub/Sub subscribe + ack)                │    │
│   └── ingress middleware (Host subdomain → model) ──┐        │    │
│                                                     ▼        ▼    │
│  existing pipeline:  model router → backend pool ◄── hot reload   │
│                      (health, circuit breaker, fallback)         │
└────────────────────────────────────────────────────────────────┘
        │              │
        ▼              ▼
   kernel1:port    kernel2:port    (LLM serving containers = backends)

The only new code is the appproxy module and one ingress middleware. Circuit state becomes backend state through the existing hot-reload machinery; request routing reuses the existing model router.

5.2 Registration and the slot model

The router registers as a wildcard inference worker. "Single address" and "honour the slot" are not in conflict: the wildcard domain is the single address, and each circuit's subdomain is a virtual address into the same socket.

Registration advertises a slot space:

frontend_mode         = wildcard
wildcard_domain       = ".models.example.com"
wildcard_traffic_port = 443         # the router's /v1 socket
hostname              = <router host>
available_slots       = -1          # wildcard → unbounded; never runs out
accepted_traffics     = [inference]

The coordinator then allocates one subdomain per inference circuit inside that domain, and its existing Circuit.get_endpoint_url() produces https://ep-abc.models.example.com/ — the endpoint-URL contract the manager hands to users is preserved, with no coordinator change. The operator configures wildcard DNS (*.models.example.com → router) once.

PORT mode (a port per circuit) would require the router to open and close listening sockets dynamically and is not supported (see §10). Wildcard + TLS is the norm for externally served inference.

5.3 Circuit → backend translation

Each inference circuit is translated to one BackendConfig per RouteInfo replica and applied through the existing runtime-mutation path:

for each circuit assigned to this authority:
    model = discover_model(circuit)            # from a replica's /v1/models, keyed by endpoint_id
    for each route in circuit.route_info:
        BackendConfig {
            name:         "appproxy-<circuit_id>-r<route_id>",
            backend_type: Generic,             # OpenAI-compatible; Vllm if known
            url:          "http://{route.kernel_host}:{route.kernel_port}",
            weight:       weight_from(route.traffic_ratio),
            models:       [model],             # what find_backends_for_model matches
            ..Default::default()
        }
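
The pseudocode above leaves weight_from unspecified. A minimal sketch of one possible mapping follows. It assumes the backend weight is an unsigned integer and that a fixed scale of 100 is acceptable. Both the scale and the handling of non-positive ratios are illustrative assumptions, not behaviour confirmed by the router.

```rust
/// Hypothetical scale: a traffic_ratio of 1.0 becomes weight 100.
const WEIGHT_SCALE: f64 = 100.0;

/// Map an AppProxy `traffic_ratio` to an integer backend weight.
/// Assumption: NaN, infinite and non-positive ratios mean "no traffic" (0).
fn weight_from(traffic_ratio: f64) -> u32 {
    if !traffic_ratio.is_finite() || traffic_ratio <= 0.0 {
        return 0;
    }
    // Round to the nearest step, but never let a positive ratio collapse to 0.
    let scaled = (traffic_ratio * WEIGHT_SCALE).round();
    scaled.clamp(1.0, u32::MAX as f64) as u32
}
```

With this mapping two replicas at ratios 1.0 and 0.25 receive weights 100 and 25, so the 4:1 proportion requested by the coordinator is preserved.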

The apply step reuses the admin API's exact pattern (src/admin_config/backend_api.rs):

let _guard = config_modification_lock().write().await;   // serialise with admin API
let cfg    = state.current_config();                      // re-read under lock
let mut new_cfg = (*cfg).clone();
// reconcile new_cfg.backends so that the set of appproxy-* backends
// equals the desired set derived from the current circuits
state.config_sender.send(Arc::new(new_cfg));              // drives hot reload

HotReloadService then diffs old vs new, adds new backends, gracefully drains removed ones, syncs the health checker, and invalidates the model cache. No backend-pool code is touched. Backends owned by this module are namespaced with an appproxy- prefix so reconcile only ever adds/removes its own entries and never disturbs statically configured backends.
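
The namespacing rule can be illustrated with a small sketch. The Backend struct below is a trimmed, hypothetical stand-in for the real BackendConfig, and reconcile_backends is an illustrative name. The point is only that entries outside the appproxy- prefix pass through untouched.

```rust
/// Hypothetical, trimmed stand-in for the router's BackendConfig.
#[derive(Clone, Debug, PartialEq)]
struct Backend {
    name: String,
    url: String,
    weight: u32,
}

/// Prefix that marks backends owned by the appproxy module.
const OWNED_PREFIX: &str = "appproxy-";

/// Build the next backend list: static backends are kept as they are,
/// every module-owned backend is replaced by the desired set.
fn reconcile_backends(current: &[Backend], desired: &[Backend]) -> Vec<Backend> {
    let mut next: Vec<Backend> = current
        .iter()
        .filter(|b| !b.name.starts_with(OWNED_PREFIX))
        .cloned()
        .collect();
    // Accept only desired entries inside our namespace, so an upstream bug
    // cannot shadow a statically configured backend by reusing its name.
    next.extend(
        desired
            .iter()
            .filter(|b| b.name.starts_with(OWNED_PREFIX))
            .cloned(),
    );
    next
}
```

The result would be placed in new_cfg.backends before the config is sent. Working out which backends were added or removed is then left to the existing hot-reload diff.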

Two existing characteristics make this a good fit:

  • Runtime config changes are in-memory only (never written to disk). The coordinator is the source of truth; the router re-syncs on restart via the initial pull. This is the desired behaviour, not a limitation.
  • The typed backend pool is not hot-reloaded, which is irrelevant here: serving containers are generic OpenAI-compatible HTTP backends routed through the URL-based pool.

5.4 Ingress resolution (Host/subdomain → model)

A new Axum middleware resolves the target circuit/model from the request:

  1. Read the Host header; strip the configured wildcard_domain suffix to get the subdomain.
  2. Look the subdomain up in the in-memory circuit registry (owned by the worker service, updated on reconcile/events) → circuit → canonical model.
  3. Insert an IngressTarget { circuit_id, model } request extension.
  4. For non-public inference circuits (open_to_public == false), verify the Authorization: Bearer <jwt> (HS256 with jwt_secret; the decoded id must equal the circuit id), matching AppProxy worker auth.

Handlers prefer the injected model over the body model field at the single existing read site (src/proxy/handlers.rs, the payload.get("model") block) and in its siblings; everything from select_backend_with_retry downward is unchanged, because selection resolves the model name against each backend's models list.

Requests to the bare wildcard domain (or a configured aggregation host) skip subdomain scoping and use normal model-name routing across all circuits — this is the cross-endpoint aggregation surface.
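
A minimal sketch of the Host classification described above, using only the standard library. The names HostScope and classify_host are hypothetical. It assumes wildcard_domain is configured with a leading dot (as in the registration example), that a circuit subdomain is exactly one DNS label, and that Host never carries an IPv6 literal.

```rust
/// Outcome of inspecting the Host header (hypothetical type).
#[derive(Debug, PartialEq)]
enum HostScope {
    /// Bare apex or a configured aggregation host: normal model-name routing.
    Aggregation,
    /// A circuit subdomain, still to be looked up in the circuit registry.
    Subdomain(String),
    /// Host is outside the wildcard domain entirely.
    Foreign,
}

fn classify_host(host: &str, wildcard_domain: &str, aggregation_hosts: &[&str]) -> HostScope {
    // Hostnames are case-insensitive; drop an optional ":port" suffix.
    let lowered = host.trim().to_ascii_lowercase();
    let name = lowered.split(':').next().unwrap_or("");
    let domain = wildcard_domain.to_ascii_lowercase();
    let apex = domain.trim_start_matches('.');

    if name == apex || aggregation_hosts.iter().any(|h| h.eq_ignore_ascii_case(name)) {
        return HostScope::Aggregation;
    }
    match name.strip_suffix(domain.as_str()) {
        // Assumption: exactly one label sits in front of the wildcard domain.
        Some(label) if !label.is_empty() && !label.contains('.') => {
            HostScope::Subdomain(label.to_string())
        }
        _ => HostScope::Foreign,
    }
}
```

A Subdomain result is only a candidate. Whether it names a registered circuit is decided by the registry lookup in step 2.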

Fallback participation (scoped fallback)

A registered circuit does participate in fallback.fallback_chains. When a request resolves to a circuit whose replicas are all down (its route_info is empty, so it has no live backend), the ingress middleware still pins the request to the circuit's canonical model and passes it to the normal pipeline. select_backend_with_retry then finds no backend for that model and FallbackService takes over, so a chain keyed on the circuit's model (e.g. vllm-real-poc → gpt-4o-mini) is reached — the "deployment went down, traffic goes to OpenAI" behaviour. Per-circuit auth (open_to_public, bearer token, allowed_client_ips) is enforced before this fall-through, so the fallback path is never an unauthenticated bypass.

The fall-through is scoped: it applies only to a registered circuit. A request to an unknown subdomain (no circuit in the registry) is still a 404 endpoint_not_found. An unknown subdomain is a circuit identifier, not a model name, so it does not enter the model-registry / fallback path.
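
The scoping rule can be summarised in a few lines. The sketch below uses hypothetical types (CircuitEntry, Resolution) rather than the module's real ones. A registered circuit is always pinned to its model, even with no live routes, while an unknown subdomain is rejected before the model pipeline is reached.

```rust
use std::collections::HashMap;

/// Trimmed registry entry; field names are illustrative.
struct CircuitEntry {
    model: String,
    live_routes: usize, // number of entries in route_info
}

#[derive(Debug, PartialEq)]
enum Resolution {
    /// Pin the request to this model; with no live backend the normal
    /// pipeline may still reach a fallback chain keyed on the model.
    Pinned { model: String, has_live_backend: bool },
    /// Unknown subdomain: answer 404 endpoint_not_found, never fall back.
    EndpointNotFound,
}

fn resolve(registry: &HashMap<String, CircuitEntry>, subdomain: &str) -> Resolution {
    match registry.get(subdomain) {
        Some(c) => Resolution::Pinned {
            model: c.model.clone(),
            has_live_backend: c.live_routes > 0,
        },
        None => Resolution::EndpointNotFound,
    }
}
```

Per-circuit auth would run on the Pinned branch before the request is handed to the pipeline, which keeps the fallback path authenticated.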

5.5 Update transport

The worker keeps its circuit set current through two cooperating mechanisms:

  • Pull reconcile (baseline, always on). After registering, the worker GETs /api/worker/{id}/circuits and reconciles; it then repeats on a timer (reconcile_interval). This alone is fully correct in Traefik mode (where the coordinator writes etcd and never signals workers) and is the backstop for any missed event.
  • Redis Pub/Sub overlay (legacy mode + low latency). The worker subscribes to events_all-appproxy and applies create/route-update/remove deltas within ~1s, acking creates. This is required in legacy mode: the coordinator blocks up to 15 s on the worker's ack during circuit creation (initialize_legacy_circuit) and raises E10001 Proxy worker not responding on timeout. Route updates and removals are fire-and-forget.

Because the four circuit events are all broadcast (Pub/Sub), the worker needs only SUBSCRIBE (3 inbound) + PUBLISH (1 ack). Redis Streams / consumer-groups are not required for the circuit lifecycle.

6. Wire protocol reference

6.1 Coordinator REST API (worker scope)

Base URL = coordinator_url. Every request carries:

  • X-BackendAI-Token: <api_secret>
  • X-BackendAI-RequestID: <uuid4>

Method & path | Purpose | Notes
PUT /api/worker | register / upsert (idempotent by authority) | returns {id, slots, …}; HA: re-register increments nodes
PATCH /api/worker/{id} | heartbeat | body-less; every heartbeat_period (default 10 s); coordinator timeout 30 s
DELETE /api/worker/{id} | deregister | decrements nodes; last node → LOST
GET /api/worker/{id}/circuits | full circuit snapshot | {circuits: [SerializableCircuit, …]}
GET /api/circuit/{id} | one circuit |
DELETE /api/circuit/{id} | remove a circuit |

Registration request body (WorkerRequestModel), wildcard mode:

{
  "authority": "continuum-router-1",
  "frontend_mode": "wildcard",
  "protocol": "http",
  "hostname": "router.example.com",
  "tls_listen": false,
  "tls_advertised": true,
  "api_port": 8080,
  "accepted_traffics": ["inference"],
  "filtered_apps_only": false,
  "app_filters": [],
  "traefik_last_used_marker_path": null,
  "wildcard_domain": ".models.example.com",
  "wildcard_traffic_port": 443
}

The response includes the assigned id (worker UUID, cached for subsequent calls) and the computed slots.

6.2 Circuit and route data models

SerializableCircuit (the JSON shape returned by the REST snapshot and embedded in events):

Field | Type | Notes
id | UUID |
app | string | "" for inference
protocol | enum | http/grpc/h2/tcp/preopen/vnc/rdp
worker | UUID | hosting worker
app_mode | enum | interactive/inference
frontend_mode | enum | wildcard/port
port | int? | set iff frontend_mode == port
subdomain | string? | set iff frontend_mode == wildcard
endpoint_id | UUID? | inference only
runtime_variant | string? | inference only
open_to_public | bool | skip auth when true
allowed_client_ips | string? | comma-separated CIDRs
route_info | RouteInfo[] | the backend targets
session_ids | UUID[] |
envs | object |
created_at / updated_at | datetime | ISO-8601

RouteInfo:

Field | Type | Notes
route_id | UUID? | a different route_id on the same host:port means a kernel swap
session_id | UUID | required
session_name | string? |
kernel_host | string? | None is treated as localhost
kernel_port | int | 1–65535
protocol | enum |
traffic_ratio | float | default 1.0; maps to backend weight

Rust serde notes:

  • Accept both kebab-case and snake_case input aliases (e.g. route-id and route_id); emit snake_case.
  • extra = "ignore" semantics: tolerate unknown fields (#[serde(default)] / ignore unknown) so coordinator additions never break parsing.

6.3 Redis event envelope

All four circuit events are broadcast as a JSON object PUBLISHed to events_all-appproxy:

{
  "name": "<event_name>",
  "source": "<agent-id>",
  "args": "<base64(msgpack(args_tuple))>",
  "metadata": "{\"request_id\":null,\"user\":null}"
}

  • args is base64 of a msgpack array. For these events the array elements are strings only: no msgpack ext types and no UUID/datetime/enum encoding at the msgpack layer (those are pre-encoded inside the inner JSON). A Rust impl needs only: JSON object → base64-decode args → msgpack array-of-strings → JSON for each element.
  • metadata is a JSON string with exactly request_id and user (additional keys make the coordinator's parser raise). Emit {"request_id":null,"user":null} or echo the inbound request_id.
  • source for worker-emitted events is "appproxy-worker". It is not used for routing; the worker filters inbound events on target_worker_authority.
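
The decode path for args can be sketched with the standard library alone. The function names below are hypothetical, and a real implementation would more likely use established base64 and msgpack crates. The sketch accepts only the msgpack array and string headers these events are described as using, and ignores any trailing bytes.

```rust
/// Decode standard-alphabet base64; '=' padding is ignored. None on bad input.
fn b64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let (mut acc, mut bits) = (0u32, 0u32);
    for c in input.bytes().filter(|&c| c != b'=') {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            _ => return None,
        };
        acc = (acc << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Read an n-byte big-endian unsigned integer starting at `at`.
fn read_be(buf: &[u8], at: usize, n: usize) -> Option<usize> {
    let bytes = buf.get(at..at + n)?;
    Some(bytes.iter().fold(0usize, |acc, &b| (acc << 8) | b as usize))
}

/// Parse a msgpack array whose elements are all strings.
fn parse_str_array(buf: &[u8]) -> Option<Vec<String>> {
    // Array header: fixarray (0x90..=0x9f) or array 16 (0xdc).
    let (mut pos, len) = match *buf.first()? {
        b @ 0x90..=0x9f => (1, (b & 0x0f) as usize),
        0xdc => (3, read_be(buf, 1, 2)?),
        _ => return None,
    };
    let mut items = Vec::new();
    for _ in 0..len {
        let tag = *buf.get(pos)?;
        // String header: fixstr, str 8, str 16 or str 32.
        let (hdr, n) = match tag {
            0xa0..=0xbf => (1, (tag & 0x1f) as usize),
            0xd9 => (2, read_be(buf, pos + 1, 1)?),
            0xda => (3, read_be(buf, pos + 1, 2)?),
            0xdb => (5, read_be(buf, pos + 1, 4)?),
            _ => return None, // non-string element: not expected for these events
        };
        let start = pos + hdr;
        let end = start.checked_add(n)?;
        items.push(String::from_utf8(buf.get(start..end)?.to_vec()).ok()?);
        pos = end;
    }
    Some(items)
}

/// Envelope `args` field -> the tuple elements as strings.
fn decode_args(args_b64: &str) -> Option<Vec<String>> {
    parse_str_array(&b64_decode(args_b64)?)
}
```

The first element is the target authority, which the worker compares with its own before parsing the remaining elements as JSON.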

Event payloads:

name | Direction | args tuple
appproxy_circuit_created_event | inbound | (authority, circuits_json) where circuits_json = JSON array of SerializableCircuit
appproxy_circuit_removed_event | inbound | (authority, circuits_json)
appproxy_circuit_route_updated_event | inbound | (authority, circuit_json, routes_json) (single circuit + RouteInfo[])
appproxy_worker_circuit_added_event | outbound (ack) | (authority, circuits_json); echo the inbound circuits_json verbatim

Worked ack example (authority = "worker01", circuits_json = "[]"): msgpack(["worker01","[]"]) = 92 a8 worker01 a2 5b 5d → base64 kqh3b3JrZXIwMaJbXQ==, PUBLISHed to events_all-appproxy with name = appproxy_worker_circuit_added_event, source = appproxy-worker.
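
The worked example can be reproduced with a short, dependency-free encoder. This is a sketch for checking the byte layout, with hypothetical function names. A production worker would more plausibly rely on established msgpack and base64 crates.

```rust
/// Encode a msgpack array of strings (fixarray header, up to 15 elements).
fn msgpack_str_array(items: &[&str]) -> Option<Vec<u8>> {
    if items.len() > 15 {
        return None;
    }
    let mut out = vec![0x90 | items.len() as u8];
    for s in items {
        let n = s.len();
        match n {
            0..=31 => out.push(0xa0 | n as u8), // fixstr
            32..=255 => out.extend_from_slice(&[0xd9, n as u8]), // str 8
            256..=65535 => {
                out.push(0xda); // str 16
                out.extend_from_slice(&(n as u16).to_be_bytes());
            }
            _ if n <= u32::MAX as usize => {
                out.push(0xdb); // str 32
                out.extend_from_slice(&(n as u32).to_be_bytes());
            }
            _ => return None,
        }
        out.extend_from_slice(s.as_bytes());
    }
    Some(out)
}

/// Standard-alphabet base64 with '=' padding.
fn b64_encode(data: &[u8]) -> String {
    const T: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in data.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = ((b[0] as u32) << 16) | ((b[1] as u32) << 8) | (b[2] as u32);
        out.push(T[((n >> 18) & 63) as usize] as char);
        out.push(T[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { T[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { T[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Build the `args` field of the ack event from the two tuple elements.
fn ack_args(authority: &str, circuits_json: &str) -> Option<String> {
    Some(b64_encode(&msgpack_str_array(&[authority, circuits_json])?))
}
```

Calling ack_args("worker01", "[]") yields kqh3b3JrZXIwMaJbXQ==, matching the example above. The returned string goes into the envelope's args field unchanged.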

The Redis DB index for the event bus is the deployment's "stream" role DB and must be configured (redis_url / DB selector); confirm it against the coordinator's Redis profile.

7. Configuration

A new optional section is added to the router config, gated by the appproxy feature:

appproxy:
  enabled: true
  coordinator_url: "http://coordinator:10200"
  api_secret: "${APPPROXY_API_SECRET}"     # X-BackendAI-Token
  jwt_secret: "${APPPROXY_JWT_SECRET}"      # HS256 circuit/bearer verification
  redis_url: "redis://valkey:6379/4"        # event bus DB (stream role)
  authority: "continuum-router-1"
  hostname: "router.example.com"
  frontend_mode: "wildcard"
  wildcard_domain: ".models.example.com"
  aggregation_hosts: []                      # extra Hosts that skip subdomain scoping
  wildcard_traffic_port: 443
  tls_advertised: true
  heartbeat_period: "10s"
  reconcile_interval: "15s"
  events_enabled: true                       # Redis Pub/Sub overlay on/off

Secrets support ${ENV_VAR} interpolation, consistent with backends[].api_key.

aggregation_hosts is optional and empty by default. The bare wildcard apex (wildcard_domain without its leading dot, e.g. models.example.com) is always an aggregation surface implicitly; list any additional vanity or aggregation hostnames here. A request whose Host matches one of these (or the apex) skips per-circuit subdomain scoping and uses normal model-name routing across all circuits (§5.4).

8. Module layout

src/appproxy/                 # feature = "appproxy"
├── mod.rs                    # run_worker(cfg, state, shutdown_rx) entry; re-exports
├── config.rs                 # AppProxyWorkerConfig (also re-exported via core::config)
├── types.rs                  # SerializableCircuit, RouteInfo, enums (serde, dual aliases)
├── client.rs                 # coordinator REST client (X-BackendAI-Token)
├── worker.rs                 # lifecycle: register → pull → heartbeat → reconcile; circuit registry
├── reconcile.rs              # circuit set → Vec<BackendConfig> → config_sender (under the lock)
├── events.rs                 # Redis Pub/Sub subscribe + ack; msgpack/base64/JSON envelope codec
├── ingress.rs                # Host subdomain → IngressTarget middleware; handler override helper
└── jwt.rs                    # HS256 verify for non-public circuits

Wiring points (all behind #[cfg(feature = "appproxy")]):

  • Cargo.toml: appproxy = ["dep:redis", "dep:deadpool-redis", "dep:jsonwebtoken"] (Streams not needed). Leave out of full so it stays opt-in.
  • src/lib.rs: pub mod appproxy;.
  • src/core/config/models/config.rs: pub appproxy: Option<AppProxyWorkerConfig>.
  • src/server/mod.rs::build_router: register the worker /status route and insert the ingress middleware just outside the rate-limit layer (so the resolved model is visible to the rate limiter).
  • src/server/serve.rs: after the hot-reload block, spawn appproxy::run_worker(cfg, state.clone(), shutdown_rx.clone()) when cfg.appproxy.enabled.

9. Security

  • Coordinator auth. All REST calls send X-BackendAI-Token: <api_secret>. Keep the secret in env/secret storage; never log it.
  • Data-plane auth. Non-public inference circuits require an Authorization: Bearer <jwt> whose decoded id equals the circuit id (HS256, jwt_secret). Public circuits (open_to_public == true) skip it. Use jsonwebtoken with Validation::new(Algorithm::HS256) — never the unverified payload decode that exists elsewhere in the tree.
  • Client IP allow-list. Honour allowed_client_ips (comma-separated CIDRs) when present on a circuit.
  • Shared secrets. api_secret and jwt_secret must be identical across the whole AppProxy cluster (coordinator + workers).
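
The client IP allow-list check can be sketched with std::net alone. The function names are hypothetical. Treating an absent or empty list as "no restriction" and rejecting malformed entries are assumptions, and how the client address is obtained (socket peer versus a forwarded header) is left out.

```rust
use std::net::IpAddr;

/// True if `ip` falls inside `cidr` (e.g. "10.0.0.0/8").
/// A bare address is treated as a single-host entry; malformed entries never match.
fn cidr_contains(cidr: &str, ip: IpAddr) -> bool {
    let cidr = cidr.trim();
    let (net, prefix) = match cidr.split_once('/') {
        Some((n, p)) => match p.parse::<u32>() {
            Ok(p) => (n, Some(p)),
            Err(_) => return false,
        },
        None => (cidr, None),
    };
    match (net.parse::<IpAddr>(), ip) {
        (Ok(IpAddr::V4(n)), IpAddr::V4(a)) => {
            let p = prefix.unwrap_or(32);
            if p > 32 {
                return false;
            }
            let mask = if p == 0 { 0 } else { u32::MAX << (32 - p) };
            (u32::from(n) & mask) == (u32::from(a) & mask)
        }
        (Ok(IpAddr::V6(n)), IpAddr::V6(a)) => {
            let p = prefix.unwrap_or(128);
            if p > 128 {
                return false;
            }
            let mask = if p == 0 { 0 } else { u128::MAX << (128 - p) };
            (u128::from(n) & mask) == (u128::from(a) & mask)
        }
        _ => false, // unparsable network or address-family mismatch
    }
}

/// Apply a circuit's `allowed_client_ips` (comma-separated CIDRs).
/// Assumption: None or an empty string means no restriction.
fn client_allowed(allowed_client_ips: Option<&str>, ip: IpAddr) -> bool {
    match allowed_client_ips.map(str::trim) {
        None | Some("") => true,
        Some(list) => list.split(',').any(|c| cidr_contains(c, ip)),
    }
}
```

For a circuit with allowed_client_ips = "192.168.0.0/16, 10.0.0.0/8", a client at 10.1.2.3 is admitted and one at 172.16.0.1 is refused.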

10. Limitations

  • Model-name source. The served model name is auto-discovered from each replica's /v1/models; the coordinator does not supply a model name (the integration is continuum-router-only, per decision 2).
  • PORT mode. Not supported. It would require dynamic per-port listeners; the router registers as a wildcard worker only.
  • Interactive apps. Not served. The router registers accepted_traffics = [inference] only, so interactive circuits stay on stock workers.
  • Health/load reporting. The AppProxy heartbeat is a bare keepalive; the router does not export per-circuit load or health metrics to the coordinator.