Architecture Guide

This document provides a comprehensive overview of Continuum Router's architecture, design decisions, and extension points.

Overview

Continuum Router is designed as a high-performance, production-ready LLM API router using a clean 4-layer architecture that provides clear separation of concerns, testability, and maintainability. The architecture follows Domain-Driven Design principles and dependency inversion to create a robust, extensible system.

Architecture Goals

  1. Separation of Concerns: Each layer has a single, well-defined responsibility
  2. Dependency Inversion: Higher layers depend on abstractions, not concrete implementations
  3. Testability: Each component can be unit tested in isolation
  4. Extensibility: New features can be added without modifying existing code
  5. Performance: Minimal overhead while maintaining clean architecture
  6. Reliability: Fail-fast design with comprehensive error handling

4-Layer Architecture

Layer Descriptions

1. HTTP Layer (src/http/)

Responsibility: Handle HTTP requests, responses, and web-specific concerns

Components

  • Routes (routes.rs): Define HTTP endpoints and route handling
  • Middleware (middleware/): Cross-cutting concerns (auth, logging, metrics, rate limiting)
  • DTOs (dto/): Data Transfer Objects for HTTP serialization/deserialization
  • Streaming (streaming/): Server-Sent Events (SSE) handling

Key Files

src/http/
├── mod.rs              # HTTP layer exports
├── routes.rs           # Route definitions and handlers
├── dto.rs              # Request/Response DTOs
├── handlers/           # Request handlers
│   ├── mod.rs
│   └── responses.rs    # Responses API handlers
├── middleware/         # HTTP middleware components
│   ├── mod.rs
│   ├── auth.rs         # API key authentication middleware
│   ├── admin_auth.rs   # Admin API authentication middleware
│   ├── files_auth.rs   # Files API authentication middleware
│   ├── admin_audit.rs  # Admin operations audit logging
│   ├── cors.rs         # CORS (Cross-Origin Resource Sharing) middleware
│   ├── logging.rs      # Request/response logging
│   ├── metrics.rs      # Metrics collection
│   ├── metrics_auth.rs # Metrics endpoint authentication
│   ├── model_extractor.rs # Model extraction from requests
│   ├── prometheus.rs   # Prometheus metrics integration
│   ├── rate_limit.rs   # Rate limiting middleware (legacy)
│   └── rate_limit_v2/  # Enhanced rate limiting (modular)
│       ├── mod.rs      # Module exports
│       ├── middleware.rs # Rate limiting middleware
│       ├── store.rs    # Rate limit storage and tracking
│       └── token_bucket.rs # Token bucket algorithm
└── streaming/          # SSE streaming handlers
    ├── mod.rs
    └── handler.rs      # Streaming response handling

Middleware Components

The HTTP layer includes several middleware components that provide cross-cutting concerns:

  • auth.rs: API key authentication for main endpoints (/v1/chat/completions, /v1/models, etc.)

    • Validates API keys from Authorization: Bearer <key> header
    • Supports multiple API keys configured in config.yaml
    • Returns 401 Unauthorized for invalid/missing keys
  • admin_auth.rs: Separate authentication for admin endpoints (/admin/*)

    • Uses dedicated admin API keys distinct from user API keys
    • Protects sensitive operations (config reload, circuit breaker control, health management)
    • Configurable via admin.api_keys in configuration
  • files_auth.rs: Authentication middleware for Files API (/v1/files/*)

    • Validates API keys specifically for file upload/download/deletion operations
    • Prevents unauthorized file access and manipulation
    • Integrates with file storage service for permission checks
  • admin_audit.rs: Audit logging middleware for admin operations

    • Records all admin API calls with timestamps and caller identification
    • Logs parameters and outcomes of sensitive operations
    • Provides audit trail for compliance and security monitoring
    • Configurable log levels and retention policies
  • cors.rs: CORS (Cross-Origin Resource Sharing) middleware

    • Enables embedding the router in web applications, Tauri apps, and Electron apps
    • Supports wildcard origins (*), exact origins, and port wildcards (http://localhost:*)
    • Custom scheme support for desktop apps (e.g., tauri://localhost)
    • Configurable methods, headers, credentials, and preflight cache duration
    • Applied early in the middleware stack for proper preflight handling
  • rate_limit_v2/: Enhanced rate limiting system (see Rate Limiting section)

    • Token bucket algorithm with per-client tracking
    • Separate limits for sustained rate and burst protection
    • Automatic cleanup of expired client entries
    • Detailed metrics for monitoring

2. Services Layer (src/services/)

Responsibility: Orchestrate business logic and coordinate between infrastructure components

Components

  • Backend Service (backend_service.rs): Manage backend pool, load balancing, health checks
  • Model Service (model_service.rs): Aggregate models from backends, handle caching, enrich with metadata
  • Proxy Service (proxy_service.rs): Route requests, handle retries, manage streaming
  • Health Service (health_service.rs): Monitor service health, track status
  • Service Registry (mod.rs): Manage service lifecycle and dependencies

Key Files

src/services/
├── mod.rs              # Service registry and management
├── backend_service.rs  # Backend management service
├── model_service.rs    # Model aggregation service
├── proxy_service.rs    # Request proxying and routing
├── health_service.rs   # Health monitoring service
├── deduplication.rs    # Request deduplication service
├── responses/          # Responses API support
│   ├── mod.rs
│   ├── converter.rs    # Response format conversion
│   ├── router.rs       # Routing strategy determination (pass-through vs conversion)
│   ├── passthrough.rs  # Direct pass-through for native OpenAI/Azure backends
│   ├── session.rs      # Session management
│   ├── stream_service.rs # Streaming service orchestration
│   └── streaming.rs    # Streaming response handling
└── streaming/          # Streaming utilities
    ├── mod.rs
    ├── parser.rs       # Stream parsing logic
    └── transformer.rs  # Stream transformation (OpenAI/Anthropic)

3. Infrastructure Layer (src/infrastructure/)

Responsibility: Provide concrete implementations of external systems and technical capabilities

Components

  • Backends (backends/): Specific backend implementations (OpenAI, vLLM, Ollama)
  • Cache (cache/): Caching implementations (LRU, TTL-based)
  • Configuration (config/): Configuration loading, watching, validation
  • HTTP Client (http_client.rs): HTTP client management and optimization

Key Files

src/infrastructure/
├── mod.rs              # Infrastructure exports and utilities
├── backends/           # Backend implementations
│   ├── mod.rs
│   ├── anthropic/      # Native Anthropic Claude backend
│   │   ├── mod.rs      # Backend implementation & request transformation
│   │   └── stream.rs   # SSE stream transformer (Anthropic → OpenAI)
│   ├── gemini/         # Native Google Gemini backend
│   │   ├── mod.rs      # Backend implementation with TTFB optimization
│   │   └── stream.rs   # SSE stream transformer (Gemini → OpenAI)
│   ├── openai/         # OpenAI-compatible backend
│   │   ├── mod.rs
│   │   ├── backend.rs  # OpenAI backend implementation
│   │   └── models/     # OpenAI-specific model definitions
│   ├── factory/        # Backend factory pattern
│   │   ├── mod.rs
│   │   └── backend_factory.rs  # Creates backends from config
│   ├── pool/           # Backend pooling and management
│   │   ├── mod.rs
│   │   ├── backend_pool.rs     # Connection pool management
│   │   └── backend_manager.rs  # Backend lifecycle management
│   ├── generic/        # Generic backend implementations
│   │   └── mod.rs
│   └── vllm.rs         # vLLM backend implementation
├── common/             # Shared infrastructure utilities
│   ├── mod.rs
│   ├── executor.rs     # Request execution with retry/metrics
│   ├── headers.rs      # HTTP header utilities
│   ├── http_client.rs  # HTTP client factory with pooling
│   ├── statistics.rs   # Backend statistics collection
│   └── url_validator.rs # URL validation and security
├── transport/          # Transport abstraction layer
│   ├── mod.rs          # Transport enum (HTTP, UnixSocket)
│   └── unix_socket.rs  # Unix Domain Socket client
├── cache/              # Caching implementations
│   ├── mod.rs
│   ├── lru_cache.rs    # LRU cache implementation
│   └── retry_cache.rs  # Retry-aware cache
├── config/             # Configuration management
│   ├── mod.rs
│   ├── loader.rs       # Configuration loading
│   ├── validator.rs    # Configuration validation
│   ├── timeout_validator.rs # Timeout configuration validation
│   ├── watcher.rs      # File watching for hot-reload
│   ├── migrator.rs     # Configuration migration orchestrator
│   ├── migration.rs    # Migration types and traits
│   ├── migrations.rs   # Specific migration implementations
│   ├── fixer.rs        # Auto-correction logic
│   ├── backup.rs       # Backup management
│   └── secrets.rs      # Secret/API key management
└── lock_optimization.rs # Lock and concurrency optimization

4. Core Layer (src/core/)

Responsibility: Define domain models, business rules, and fundamental abstractions

Components

  • Models (models/): Core domain entities (Backend, Model, Request, Response)
  • Traits (traits.rs): Core interfaces and contracts
  • Errors (errors.rs): Domain-specific error types and handling
  • Retry (retry/): Retry policies and strategies
  • Container (container.rs): Dependency injection container

Key Files

src/core/
├── mod.rs              # Core exports and utilities
├── models/             # Domain models
│   ├── mod.rs
│   ├── backend.rs      # Backend domain model
│   ├── model.rs        # LLM model representation
│   ├── request.rs      # Request models
│   └── responses.rs    # Response models (Responses API)
├── traits.rs           # Core traits and interfaces
├── errors.rs           # Error types and handling
├── container.rs        # Dependency injection container
├── async_utils.rs      # Async utility functions
├── duration_utils.rs   # Duration parsing utilities
├── streaming/          # Streaming models
│   ├── mod.rs
│   └── models.rs       # Streaming-specific models
├── retry/              # Retry mechanisms
│   ├── mod.rs
│   ├── policy.rs       # Retry policies
│   └── strategy.rs     # Retry strategies
├── circuit_breaker/    # Circuit breaker pattern
│   ├── mod.rs          # Module exports
│   ├── config.rs       # Configuration models
│   ├── state.rs        # State machine and breaker logic
│   ├── error.rs        # Circuit breaker errors
│   ├── metrics.rs      # Prometheus metrics
│   └── tests.rs        # Unit tests
├── files/              # File processing utilities
│   ├── mod.rs          # Module exports
│   ├── resolver.rs     # File reference resolution in chat requests
│   ├── transformer.rs  # Message transformation with file content
│   └── transformer_utils.rs # Transformation utility functions
├── tool_calling/       # Tool calling transformation
│   ├── mod.rs          # Module exports
│   ├── definitions.rs  # Tool/function type definitions
│   └── transform.rs    # Backend-specific transformations
└── config/             # Configuration models
    ├── mod.rs
    ├── models/            # Configuration data models (modular structure)
    │   ├── mod.rs         # Re-exports for backward compatibility
    │   ├── config.rs      # Main Config struct, ServerConfig, BackendConfig
    │   ├── backend_type.rs # BackendType enum definitions
    │   ├── model_metadata.rs # ModelMetadata, PricingInfo, CapabilityInfo
    │   ├── global_prompts.rs # GlobalPrompts configuration
    │   ├── samples.rs     # Sample generation configurations
    │   ├── validation.rs  # Configuration validation logic
    │   └── error.rs       # Configuration-specific errors
    ├── timeout_models.rs # Timeout configuration models
    ├── cached_timeout.rs # Cached timeout resolution
    ├── optimized_retry.rs # Optimized retry configuration
    ├── metrics.rs      # Metrics configuration
    └── rate_limit.rs   # Rate limit configuration

Core Components

Backend Pool

Location: src/backend.rs (legacy) → src/services/backend_service.rs

Purpose: Manages multiple LLM backends with intelligent load balancing

pub struct BackendPool {
    backends: Arc<RwLock<Vec<Backend>>>,
    load_balancer: LoadBalancingStrategy,
    health_checker: Option<Arc<HealthChecker>>,
}

impl BackendPool {
    // Round-robin load balancing with health awareness
    pub async fn select_backend(&self) -> Option<Backend> { /* ... */ }

    // Filter backends by model availability
    pub async fn backends_for_model(&self, model: &str) -> Vec<Backend> { /* ... */ }
}

Health Checker

Location: src/health.rs (legacy) → src/services/health_service.rs

Purpose: Monitor backend health with configurable thresholds, automatic recovery, and accelerated warmup detection

pub struct HealthChecker {
    backends: Arc<RwLock<Vec<Backend>>>,
    config: HealthConfig,
    status_map: Arc<RwLock<HashMap<String, HealthStatus>>>,
}

pub struct HealthConfig {
    pub interval: Duration,
    pub timeout: Duration,
    pub unhealthy_threshold: u32,     // Failures before marking unhealthy
    pub healthy_threshold: u32,       // Successes before marking healthy
    pub warmup_check_interval: Duration, // Accelerated interval during warmup (default: 1s)
    pub max_warmup_duration: Duration,   // Max time in warmup mode (default: 300s)
}

pub enum HealthStatus {
    Healthy,      // Backend responding with HTTP 200
    Unhealthy,    // Connection failure or error
    WarmingUp,    // HTTP 503 - backend loading (accelerated checks)
    Unknown,      // Initial state
}

Accelerated Warmup Health Checks

When a backend returns HTTP 503 (Service Unavailable), it enters the WarmingUp state. During this state:

  • Health checks run at warmup_check_interval (default: 1 second) instead of the normal interval
  • This reduces model availability detection from ~30 seconds to ~1 second
  • After max_warmup_duration, the backend is marked as Unhealthy
  • Particularly useful for llama.cpp backends that return HTTP 503 during model loading
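
The interval selection can be pictured with the following sketch (illustrative only; next_check_interval and the warmup_started argument are assumptions, not the actual HealthChecker API):

use std::time::{Duration, Instant};

impl HealthChecker {
    // Hypothetical helper: choose how long to wait before the next probe of a backend.
    fn next_check_interval(&self, status: &HealthStatus, warmup_started: Instant) -> Duration {
        match status {
            // Still warming up and within the allowed window: poll at the accelerated interval.
            HealthStatus::WarmingUp if warmup_started.elapsed() < self.config.max_warmup_duration => {
                self.config.warmup_check_interval
            }
            // Healthy, unhealthy, unknown, or warmup expired: fall back to the normal interval.
            _ => self.config.interval,
        }
    }
}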

Transport Layer

Location: src/infrastructure/transport/

Purpose: Provide a unified transport abstraction for backend communication over HTTP/HTTPS or Unix Domain Sockets

The transport layer enables secure local LLM communication via Unix sockets, eliminating the need for TCP port exposure.

URL Schemes

  • http:// (TCP/HTTP), e.g. http://localhost:8080/v1
  • https:// (TCP/HTTPS), e.g. https://api.openai.com/v1
  • unix:// (Unix Domain Socket), e.g. unix:///var/run/llama.sock

Transport Enum

pub enum Transport {
    Http { url: String },
    UnixSocket { socket_path: PathBuf },
}

impl Transport {
    pub fn from_url(url: &str) -> Result<Self, TransportError>;
    pub fn is_unix_socket(&self) -> bool;
    pub fn is_http(&self) -> bool;
}
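
A brief usage sketch of the abstraction above (the detect_transports helper is hypothetical; error handling is simplified):

// Hypothetical helper demonstrating Transport::from_url.
fn detect_transports() -> Result<(), TransportError> {
    let http = Transport::from_url("http://localhost:8080/v1")?;
    assert!(http.is_http());

    let local = Transport::from_url("unix:///var/run/llama.sock")?;
    assert!(local.is_unix_socket());
    // Requests to `local` are sent as HTTP over the Unix socket (see UnixSocketClient below).
    Ok(())
}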

Unix Socket Client

The UnixSocketClient provides HTTP-over-Unix-socket communication:

pub struct UnixSocketClient {
    socket_path: PathBuf,
    config: UnixSocketClientConfig,
}

impl UnixSocketClient {
    pub async fn get(&self, endpoint: &str, headers: Option<Vec<(String, String)>>)
        -> Result<UnixSocketResponse, UnixSocketError>;
    pub async fn post(&self, endpoint: &str, headers: Option<Vec<(String, String)>>, body: Bytes)
        -> Result<UnixSocketResponse, UnixSocketError>;
    pub async fn health_check(&self) -> Result<bool, UnixSocketError>;
}

Security Features

  • Path Traversal Protection: Validates socket paths to prevent directory traversal attacks
  • CRLF Injection Protection: Validates endpoints and headers for HTTP header injection
  • Response Size Limits: Configurable max response size (default: 100MB)

Platform Support

  • Linux: Full support via AF_UNIX
  • macOS: Full support via AF_UNIX
  • Windows: Planned for future releases

Model Aggregation Service

Location: src/models/ (modular structure)

Purpose: Aggregate and cache model information from all backends, enrich it with metadata, and prevent cache stampedes

Module Structure (refactored from single models.rs file):

src/models/
├── mod.rs             # Re-exports for backward compatibility
├── types.rs           # Model, AggregatedModel, ModelList, SingleModelResponse types
├── metrics.rs         # ModelMetrics tracking (includes stampede metrics)
├── cache.rs           # ModelCache with stale-while-revalidate & singleflight
├── config.rs          # ModelAggregationConfig
├── fetcher.rs         # Model fetching from backends
├── handlers.rs        # HTTP handlers for /v1/models and /v1/models/{model} endpoints
├── background_refresh.rs  # Background periodic cache refresh service
├── utils.rs           # Utility functions (normalize_model_id, etc.)
└── aggregation/       # Core aggregation logic
    ├── mod.rs         # ModelAggregationService implementation
    └── tests.rs       # Unit tests

Cache Stampede Prevention

The model aggregation service implements three strategies to prevent cache stampede (thundering herd problem):

  1. Singleflight Pattern: Only one aggregation request runs at a time. Concurrent requests wait for the ongoing aggregation to complete, then share the result.

  2. Stale-While-Revalidate: When cache is stale (between soft and hard TTL), return stale data immediately while triggering a background refresh. Clients get fast responses with potentially slightly stale data.

  3. Background Periodic Refresh: A background task proactively refreshes the cache before expiration. Requests never block on cache refresh (except cold start).

pub struct ModelAggregationService {
    cache: ModelCache,  // With singleflight lock and background refresh tracking
    config: ModelAggregationConfig,
    fetcher: ModelFetcher,
}

impl ModelAggregationService {
    // Aggregate models with singleflight protection
    pub async fn aggregate_models_with_singleflight(&self, state: &Arc<AppState>)
        -> Result<AggregatedModelsResponse, StatusCode> { /* ... */ }

    // Find backends with stale-while-revalidate support
    pub async fn find_backends_for_model(&self, state: &Arc<AppState>, model_id: &str)
        -> Vec<String> { /* ... */ }

    // Clear cache (used during hot reload)
    pub fn clear_cache(&self) { /* ... */ }
}

Cache TTL Configuration

The cache uses a dual-TTL approach:

  • Soft TTL (80% of the hard TTL): Triggers a background refresh and returns stale data
  • Hard TTL (configured value): Requires a blocking refresh
  • Empty Response TTL (5 seconds): Short TTL for empty responses to prevent DoS
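
The following sketch illustrates the dual-TTL decision (CacheDecision and classify are illustrative names, not the actual ModelCache API):

use std::time::Duration;

// Illustrative classification of a cached entry's age against the dual TTLs.
enum CacheDecision {
    Fresh,            // within the soft TTL: serve cached data as-is
    StaleRevalidate,  // between soft and hard TTL: serve stale data, refresh in background
    Expired,          // past the hard TTL: a blocking refresh is required
}

fn classify(age: Duration, hard_ttl: Duration) -> CacheDecision {
    let soft_ttl = hard_ttl.mul_f64(0.8); // soft TTL is 80% of the configured hard TTL
    if age < soft_ttl {
        CacheDecision::Fresh
    } else if age < hard_ttl {
        CacheDecision::StaleRevalidate
    } else {
        CacheDecision::Expired
    }
}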

Metrics

New metrics for cache stampede monitoring:

  • stale_while_revalidate: Requests that returned stale data during refresh
  • coalesced_requests: Requests that waited for ongoing aggregation
  • background_refreshes: Background refresh operations initiated
  • background_refresh_successes/failures: Background refresh outcomes
  • singleflight_lock_acquired: Times the aggregation lock was acquired

Proxy Module

Location: src/proxy/ (modular structure)

Purpose: Handle request proxying, backend selection, file resolution, and image generation/editing

Module Structure (refactored from single proxy.rs file):

src/proxy/
├── mod.rs             # Re-exports for backward compatibility
├── backend.rs         # Backend selection and routing logic
├── request.rs         # Request execution with retry logic
├── files.rs           # File reference resolution in requests
├── image_gen.rs       # Image generation handling (DALL-E, Gemini, GPT Image)
├── image_edit.rs      # Image editing support (/v1/images/edits)
├── image_utils.rs     # Image processing utilities (multipart, validation)
├── handlers.rs        # HTTP handlers for proxy endpoints
├── utils.rs           # Utility functions (error responses, etc.)
└── tests.rs           # Unit tests

Key Responsibilities

  • Backend Selection: Intelligent routing to available backends
  • File Resolution: Resolve file references in chat requests
  • Image Generation: Support for OpenAI (DALL-E, GPT Image) and Gemini (Nano Banana) image models
  • Image Editing: Image editing and variations endpoints
  • Request Retry: Automatic retry with exponential backoff
  • Error Handling: Standardized error responses in OpenAI format

Retry Handler

Location: src/services/deduplication.rs

Purpose: Implement exponential backoff with jitter and request deduplication

pub struct EnhancedRetryHandler {
    config: RetryConfig,
    dedup_cache: Arc<Mutex<HashMap<String, CachedResponse>>>,
    dedup_ttl: Duration,
}

pub struct RetryConfig {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
    pub exponential_backoff: bool,
    pub jitter: bool,
}
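
A sketch of how the per-attempt delay could be derived from RetryConfig, assuming the rand crate for jitter (delay_for_attempt is an illustrative helper, not the actual handler API):

use rand::Rng;
use std::time::Duration;

// Illustrative delay computation for a 1-based attempt number.
fn delay_for_attempt(cfg: &RetryConfig, attempt: u32) -> Duration {
    // Exponential backoff: base_delay * 2^(attempt - 1), capped at max_delay.
    let exponent = if cfg.exponential_backoff { attempt.saturating_sub(1) } else { 0 };
    let factor = 2u32.saturating_pow(exponent);
    let capped = cfg.base_delay.saturating_mul(factor).min(cfg.max_delay);

    if cfg.jitter {
        // Full jitter: pick a uniform delay in [0, capped] so retries do not synchronize.
        let millis = rand::thread_rng().gen_range(0..=capped.as_millis() as u64);
        Duration::from_millis(millis)
    } else {
        capped
    }
}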

Circuit Breaker

Location: src/core/circuit_breaker/

Purpose: Prevent cascading failures by automatically stopping requests to failing backends

pub struct CircuitBreaker {
    states: Arc<DashMap<String, BackendCircuitState>>,
    config: CircuitBreakerConfig,
    metrics: Option<CircuitBreakerMetrics>,
}

pub struct CircuitBreakerConfig {
    pub enabled: bool,
    pub failure_threshold: u32,           // Failures before opening (default: 5)
    pub failure_rate_threshold: f64,      // Failure rate threshold (default: 0.5)
    pub minimum_requests: u32,            // Min requests before rate calculation
    pub timeout_seconds: u64,             // How long circuit stays open (default: 60s)
    pub half_open_max_requests: u32,      // Max requests in half-open state
    pub half_open_success_threshold: u32, // Successes needed to close
}

pub enum CircuitState {
    Closed,    // Normal operation - requests pass through
    Open,      // Failing fast - requests rejected immediately
    HalfOpen,  // Testing recovery - limited requests allowed
}

Key Features

  • Per-backend circuit breakers with independent state
  • Atomic operations for lock-free state checking in hot path
  • Automatic state transitions based on success/failure patterns
  • Sliding window for failure rate calculation
  • Prometheus metrics for observability
  • Admin endpoints for manual control
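
A simplified sketch of the admission decision implied by the state machine (allow_request and the accessor names on the per-backend state are assumptions; the real hot path uses atomics as noted above):

impl CircuitBreaker {
    // Hypothetical admission check; accessor names on BackendCircuitState are illustrative.
    fn allow_request(&self, backend: &str) -> bool {
        let Some(state) = self.states.get(backend) else {
            return true; // no recorded state yet: allow the request
        };
        match state.current_state() {
            CircuitState::Closed => true,
            // Open: reject until the configured timeout has elapsed, then probe (half-open).
            CircuitState::Open => state.seconds_since_opened() >= self.config.timeout_seconds,
            // Half-open: allow only a bounded number of trial requests.
            CircuitState::HalfOpen => state.half_open_requests() < self.config.half_open_max_requests,
        }
    }
}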

Container (Dependency Injection)

Location: src/core/container.rs

Purpose: Manage service lifecycles and dependencies

pub struct Container {
    services: Arc<RwLock<HashMap<TypeId, Box<dyn Any + Send + Sync>>>>,
    singletons: Arc<RwLock<HashMap<TypeId, Arc<dyn Any + Send + Sync>>>>,
}

impl Container {
    // Register singleton service
    pub async fn register_singleton<T>(&self, instance: Arc<T>) -> CoreResult<()> 
    where T: 'static + Send + Sync { /* ... */ }

    // Resolve service dependency
    pub async fn resolve<T>(&self) -> CoreResult<Arc<T>>
    where T: 'static + Send + Sync { /* ... */ }
}

Data Flow

Request Processing Flow

sequenceDiagram
    participant Client
    participant HTTPLayer as HTTP Layer
    participant ProxyService as Proxy Service
    participant BackendService as Backend Service
    participant ModelService as Model Service
    participant Backend as LLM Backend

    Client->>HTTPLayer: POST /v1/chat/completions
    HTTPLayer->>HTTPLayer: Apply Middleware (auth, logging, metrics)
    HTTPLayer->>ProxyService: Forward Request

    ProxyService->>ModelService: Get Model Info
    ModelService->>ModelService: Check Cache
    alt Cache Miss
        ModelService->>BackendService: Get Backends for Model
        BackendService->>Backend: Query Models
        Backend-->>BackendService: Model List
        BackendService-->>ModelService: Filtered Backends
        ModelService->>ModelService: Update Cache
    end
    ModelService-->>ProxyService: Model Available on Backends

    ProxyService->>BackendService: Select Healthy Backend
    BackendService->>BackendService: Apply Load Balancing
    BackendService-->>ProxyService: Selected Backend

    ProxyService->>Backend: Forward Request
    Backend-->>ProxyService: Response (streaming or non-streaming)

    ProxyService->>ProxyService: Apply Response Processing
    ProxyService-->>HTTPLayer: Processed Response
    HTTPLayer-->>Client: HTTP Response

Health Check Flow

sequenceDiagram
    participant HealthService as Health Service
    participant BackendPool as Backend Pool
    participant Backend as LLM Backend
    participant Cache as Health Cache

    loop Every Interval
        HealthService->>BackendPool: Get All Backends
        BackendPool-->>HealthService: Backend List

        par For Each Backend
            HealthService->>Backend: GET /v1/models (or /health)
            alt Success
                Backend-->>HealthService: 200 OK + Model List
                HealthService->>Cache: Update: consecutive_successes++
                HealthService->>HealthService: Mark Healthy if threshold met
            else Failure
                Backend-->>HealthService: Error/Timeout
                HealthService->>Cache: Update: consecutive_failures++
                HealthService->>HealthService: Mark Unhealthy if threshold met
            end
        end

        HealthService->>BackendPool: Update Backend Health Status
    end

Hot Reload Service

Location: src/infrastructure/config/hot_reload.rs, src/infrastructure/config/hot_reload_service.rs

Purpose: Provide runtime configuration updates without server restart

The hot reload system enables zero-downtime configuration changes through automatic file watching and intelligent component updates.

Key Architecture Components

  • ConfigManager: File system watching using notify crate, publishes updates via tokio::sync::watch channel
  • HotReloadService: Computes configuration differences, classifies changes (immediate/gradual/restart)
  • Component Updates: Interior mutability patterns (RwLock) for atomic updates to HealthChecker, CircuitBreaker, RateLimitStore, BackendPool

Backend Pool Hot Reload

The BackendPool uses interior mutability (Arc<RwLock<Vec<Arc<Backend>>>>) to support runtime backend additions and removals without service interruption.

Backend States:

  • Active: Normal operation, accepts new requests
  • Draining: Removed from config, existing requests continue, no new requests
  • Removed: Fully cleaned up after all references released

Graceful Draining Process:

  1. When a backend is removed from config, it's marked as Draining
  2. New requests skip draining backends
  3. Existing in-flight requests/streams continue uninterrupted (Arc reference counting)
  4. Background cleanup task runs every 10 seconds
  5. Backends are removed when: no references remain OR 5-minute timeout exceeded

This ensures zero impact on ongoing connections during configuration changes.
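
A sketch of how selection can skip draining backends while in-flight requests keep their Arc references alive (selectable_backends, is_healthy, and is_draining are illustrative names):

use std::sync::Arc;

impl BackendPool {
    // Hypothetical selection helper: only healthy, non-draining backends receive new requests.
    async fn selectable_backends(&self) -> Vec<Arc<Backend>> {
        self.backends
            .read()
            .await
            .iter()
            .filter(|b| b.is_healthy() && !b.is_draining()) // draining backends only finish in-flight work
            .cloned()
            .collect()
    }
}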

Change Classification

  • Immediate Update: logging.level, rate_limiting.*, circuit_breaker.*, retry.*, global_prompts.*
  • Gradual Update: backends.*, health_checks.*, timeouts.*
  • Requires Restart: server.bind_address, server.workers

Admin API: /admin/config/hot-reload-status for inspecting hot reload capabilities

For detailed hot reload configuration, process flow, and usage examples, see configuration.md section on hot reload.

Configuration Migration System

Location: src/infrastructure/config/{migrator,migration,migrations,fixer,backup}.rs

Purpose: Automatically detect and fix configuration issues, migrate schemas, and ensure configuration validity

The configuration migration system provides a comprehensive solution for handling configuration evolution and maintenance. It automatically:

  • Detects and migrates outdated schema versions
  • Fixes common syntax errors in YAML/TOML files
  • Validates and corrects configuration values
  • Creates backups before making changes
  • Provides a dry-run capability for previewing changes

Architecture Components

1. Migration Orchestrator (migrator.rs)

  • Main entry point for migration operations
  • Coordinates the entire migration workflow
  • Manages backup creation and restoration
  • Implements security validations (path traversal, file size limits)

2. Migration Framework (migration.rs)

  • Defines core types and traits for migrations
  • Migration trait for implementing version upgrades
  • ConfigIssue enum for categorizing problems
  • MigrationResult for tracking changes

3. Schema Migrations (migrations.rs)

  • Concrete migration implementations (e.g., V1ToV2Migration)
  • Transforms configuration structure between versions
  • Example: converting backend_url to a backends array

4. Auto-Correction Engine (fixer.rs)

  • Detects and fixes common configuration errors
  • Duration format correction (e.g., "10 seconds" → "10s")
  • URL validation and protocol addition
  • Field deprecation handling

5. Backup Manager (backup.rs)

  • Creates timestamped backups before modifications
  • Implements resource limits (10MB per file, 100MB total, max 50 backups)
  • Automatic cleanup of old backups
  • Preserves file permissions

Migration Workflow

graph TD
    A[Read Config File] --> B[Validate Path & Size]
    B --> C[Create Backup]
    C --> D[Parse Configuration]
    D --> E{Parse Success?}
    E -->|No| F[Fix Syntax Errors]
    F --> D
    E -->|Yes| G[Detect Schema Version]
    G --> H{Needs Migration?}
    H -->|Yes| I[Apply Migrations]
    H -->|No| J[Validate Values]
    I --> J
    J --> K{Issues Found?}
    K -->|Yes| L[Apply Auto-Fixes]
    K -->|No| M[Return Config]
    L --> N[Write Updated Config]
    N --> M

Security Features

  1. Path Traversal Protection: Validates paths to prevent directory traversal attacks
  2. File Size Limits: Maximum 10MB configuration files to prevent DoS
  3. Format Validation: Only processes .yaml, .yml, and .toml files
  4. System Directory Protection: Blocks access to sensitive system paths
  5. Test Mode Relaxation: Uses conditional compilation for test-friendly validation

Example Migration: v1.0 to v2.0

// V1ToV2Migration implementation
fn migrate(&self, config: &mut Value) -> Result<(), MigrationError> {
    // Convert single backend_url to backends array
    if let Some(backend_url) = config.get("backend_url") {
        let mut backends = Vec::new();
        let mut backend = Map::new();
        backend.insert("url".to_string(), backend_url.clone());

        // Move models to backend
        if let Some(model) = config.get("model") {
            backend.insert("models".to_string(), 
                Value::Sequence(vec![model.clone()]));
        }

        backends.push(Value::Mapping(backend));
        config["backends"] = Value::Sequence(backends);

        // Remove old fields
        config.remove("backend_url");
        config.remove("model");
    }
    Ok(())
}

Configuration Loading Flow

graph TD
    A[Application Start] --> B[Config Manager Init]
    B --> C{Config File Specified?}
    C -->|Yes| D[Load Specified File]
    C -->|No| E[Search Standard Locations]

    E --> F{Config File Found?}
    F -->|Yes| G[Load Config File]
    F -->|No| H[Use CLI Args + Env Vars + Defaults]

    D --> I[Parse YAML]
    G --> I
    H --> J[Create Config from Args]

    I --> K[Apply Environment Variable Overrides]
    J --> K

    K --> L[Apply CLI Argument Overrides]
    L --> M[Validate Configuration]
    M --> N{Valid?}
    N -->|Yes| O[Return Config]
    N -->|No| P[Exit with Error]

    O --> Q[Start File Watcher for Hot Reload]
    Q --> R[Application Running]

    Q --> S[Config File Changed]
    S --> T[Reload and Validate]
    T --> U{Valid?}
    U -->|Yes| V[Apply New Config]
    U -->|No| W[Log Error, Keep Old Config]

    V --> R
    W --> R

Dependency Injection

Service Registration

Services are registered in the container during application startup:

// In main.rs
async fn setup_services(config: Config) -> Result<ServiceRegistry, Error> {
    let container = Arc::new(Container::new());

    // Register infrastructure services
    container.register_singleton(Arc::new(
        HttpClient::new(&config.http_client)?
    )).await?;

    container.register_singleton(Arc::new(
        BackendManager::new(&config.backends)?
    )).await?;

    // Register core services
    container.register_singleton(Arc::new(
        BackendServiceImpl::new(container.clone())
    )).await?;

    container.register_singleton(Arc::new(
        ModelServiceImpl::new(container.clone())
    )).await?;

    // Create service registry
    let registry = ServiceRegistry::new(container);
    registry.initialize().await?;

    Ok(registry)
}

Service Dependencies

Services declare their dependencies through constructor injection:

pub struct ProxyServiceImpl {
    backend_service: Arc<dyn BackendService>,
    model_service: Arc<dyn ModelService>,
    retry_handler: Arc<dyn RetryHandler>,
    http_client: Arc<HttpClient>,
}

impl ProxyServiceImpl {
    pub fn new(container: Arc<Container>) -> CoreResult<Self> {
        Ok(Self {
            backend_service: container.resolve()?,
            model_service: container.resolve()?,
            retry_handler: container.resolve()?,
            http_client: container.resolve()?,
        })
    }
}

Benefits

  • Testability: Services can be mocked for unit testing
  • Flexibility: Implementations can be swapped without code changes
  • Lifecycle Management: Container manages service initialization and cleanup
  • Circular Dependency Detection: Container prevents circular dependencies

Error Handling Strategy

The router implements a comprehensive error handling strategy with typed errors, intelligent recovery, and user-friendly responses.

Error Type Hierarchy

  • CoreError: Domain-level errors (validation, service failures, timeouts, configuration)
  • RouterError: Application-level errors combining Core, HTTP, Backend, and Model errors
  • HttpError: HTTP-specific errors (400 BadRequest, 401 Unauthorized, 404 NotFound, 500 InternalServerError, etc.)

Error Handling Principles

  1. Fail Fast: Validate inputs early with clear error messages
  2. Error Context: Include relevant context (field names, operation details)
  3. Retryable Classification: Distinguish between retryable (timeout, 503) and non-retryable (400, 401) errors
  4. User-Friendly Responses: Convert internal errors to OpenAI-compatible error format
  5. Structured Logging: Log errors with appropriate severity and context
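
A sketch of the retryable classification from principle 3 (the exact status-code mapping is an assumption based on the examples above, not the precise RouterError logic):

// Classify whether an upstream failure should be retried on another attempt/backend.
fn is_retryable(status: Option<u16>, is_timeout: bool) -> bool {
    if is_timeout {
        return true; // timeouts are treated as transient
    }
    match status {
        Some(429) | Some(502) | Some(503) | Some(504) => true,  // overload / transient upstream errors
        Some(400) | Some(401) | Some(403) | Some(404) => false, // client or configuration errors
        _ => false,
    }
}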

Error Recovery Mechanisms

  • Circuit Breaker: Prevent cascading failures (see Circuit Breaker)
  • Retry with Exponential Backoff: Automatically retry transient failures
  • Model Fallback: Route to alternative models when primary unavailable (see Model Fallback System)
  • Graceful Degradation: Continue with reduced functionality when components fail

For detailed error handling, recovery strategies, monitoring, and troubleshooting, see error-handling.md.

Extension Points

Backend Type Architecture

The router supports multiple backend types with different API formats. Each backend type handles request/response transformation automatically.

Supported Backend Types

  • openai: OpenAI Chat Completions API, authenticated with Authorization: Bearer. Used for OpenAI, Azure OpenAI, vLLM, and LocalAI.
  • anthropic: Anthropic Messages API, authenticated with the x-api-key header. Used for Claude models via the native API.
  • gemini: OpenAI-compatible API, authenticated with Authorization: Bearer. Used for Google Gemini via its OpenAI compatibility layer.

Anthropic Backend Architecture

The Anthropic backend provides native support for Claude models with automatic format translation:

Key Transformations

Request Format Differences:

  • System prompt: messages[0].role="system" (OpenAI) vs. a separate system parameter (Anthropic)
  • Auth header: Authorization: Bearer (OpenAI) vs. x-api-key (Anthropic)
  • Max tokens: optional (OpenAI) vs. required max_tokens (Anthropic)
  • Images: image_url.url (OpenAI) vs. source.type + source.data (Anthropic)

Extended Thinking Support:

// OpenAI reasoning_effort → Anthropic thinking
{
  "reasoning_effort": "high"  // OpenAI format
}
// Transforms to:
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768   // Mapped from effort level
  }
}

Tool Calling Transformation

The router provides automatic transformation of OpenAI-format tool definitions to backend-native formats, enabling cross-provider tool calling support.

Location: src/core/tool_calling/

Module Structure:

src/core/tool_calling/
├── mod.rs           # Module exports
├── definitions.rs   # Type definitions (ToolDefinition, FunctionDefinition, etc.)
├── transform.rs     # Transformation functions for each backend
├── tool_choice.rs   # Tool choice transformation for each backend
├── response.rs      # Tool call response extraction from backends
├── streaming.rs     # Streaming tool call transformation
└── messages.rs      # Multi-turn conversation message transformation

Supported Backends

  • OpenAI: native input passed through unchanged; tool names are validated
  • Anthropic: OpenAI input converted to the input_schema format (parameters → input_schema)
  • Gemini: OpenAI input converted to the nested functionDeclarations structure
  • llama.cpp: OpenAI input passed through; validated with the --jinja flag

OpenAI Tool Format (Input)

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    }
  ]
}

Anthropic Transformation

// transform_tools_to_anthropic()
// Input:  OpenAI format with `function.parameters`
// Output: Anthropic format with `input_schema`
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}

Gemini Transformation

// transform_tools_to_gemini()
// Input:  OpenAI format
// Output: Gemini nested functionDeclarations format
{
  "tools": [
    {
      "functionDeclarations": [
        {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": { "type": "string" }
            },
            "required": ["location"]
          }
        }
      ]
    }
  ]
}

Validation Rules

  • Tool Name: Must match pattern ^[a-zA-Z0-9_-]+$ (alphanumeric, underscore, hyphen only)
  • Empty Name: Rejected with validation error
  • Missing Fields: Logged as warnings, tool skipped (no silent data loss)
  • Type Field: Must be "function" for standard tool definitions
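
A sketch of the tool-name check described above, assuming the regex crate (validate_tool_name is an illustrative helper; the actual validation lives in transform.rs):

use regex::Regex;

// Validate a tool name against ^[a-zA-Z0-9_-]+$; empty or malformed names are rejected.
fn validate_tool_name(name: &str) -> Result<(), String> {
    let pattern = Regex::new(r"^[a-zA-Z0-9_-]+$").expect("static pattern is valid");
    if name.is_empty() {
        return Err("tool name must not be empty".to_string());
    }
    if !pattern.is_match(name) {
        return Err(format!("invalid tool name: {name}"));
    }
    Ok(())
}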

Integration with Fallback System

The tool calling transformation integrates with the model fallback system via ParameterTranslator:

// In src/core/fallback/translation.rs
impl ParameterTranslator {
    pub fn translate_tools(
        &self,
        tools: &Value,
        from_backend_type: BackendType,
        to_backend_type: BackendType,
    ) -> CoreResult<Value> {
        match (from_backend_type, to_backend_type) {
            (_, BackendType::Anthropic) => transform_tools_to_anthropic(tools),
            (_, BackendType::Gemini) => transform_tools_to_gemini(tools),
            (_, BackendType::LlamaCpp) => transform_tools_for_llamacpp(tools),
            _ => Ok(tools.clone()),
        }
    }
}

This enables seamless fallback from one provider to another while preserving tool definitions.

Tool Choice Transformation

In addition to tool definitions, the router also transforms the tool_choice parameter which controls model tool-calling behavior.

Supported Values:

  • "auto" (model decides whether to call tools): Anthropic {"type": "auto"}, Gemini mode: "AUTO"
  • "none" (model should not call tools): Anthropic removes tools entirely, Gemini mode: "NONE"
  • "required" (model must call at least one tool): Anthropic {"type": "any"}, Gemini mode: "ANY"
  • {"type": "function", "function": {"name": "X"}} (force a specific function): Anthropic {"type": "tool", "name": "X"}, Gemini mode: "ANY" with allowed_function_names: ["X"]

Edge Cases:

  • Anthropic "none" workaround: Anthropic API does not support tool_choice=none. When this value is detected, the router removes tools and tool_choice entirely from the request.
  • llama.cpp: Preserves parallel_tool_calls parameter for parallel function calling support.

Implementation:

// In src/core/tool_calling/tool_choice.rs
pub enum ToolChoiceValue {
    Auto,
    None,
    Required,
    Function(String),
}

impl ToolChoiceValue {
    pub fn from_openai(value: &Value) -> CoreResult<Self>;
    pub fn to_anthropic(&self) -> Option<Value>;  // Returns None for "none"
    pub fn to_gemini(&self) -> Value;
}

Streaming Tool Call Transformation

When streaming is enabled, tool calls are returned incrementally as delta events rather than complete objects. Different providers use distinct streaming formats, requiring real-time transformation to maintain OpenAI API compatibility.

Location: src/core/tool_calling/streaming.rs

Key Components:

  • ToolCallAccumulator: Tracks streaming tool call state (index, id, name, arguments)
  • StreamingToolCallTransformer: State machine for Anthropic → OpenAI delta transformation
  • transform_gemini_streaming_tool_call(): Gemini → OpenAI delta transformation

Streaming Format Comparison:

  • OpenAI: Delta chunks with tool_calls[].function.arguments; arguments arrive in fragments
  • Anthropic: Event-based (content_block_start, content_block_delta, content_block_stop); input_json_delta carries partial JSON
  • Gemini: Complete functionCall objects in each chunk; arguments arrive whole per chunk

State Machine (Anthropic):

IDLE → message_start → READY
READY → content_block_start (tool_use) → TOOL_STARTED
TOOL_STARTED → content_block_delta → ACCUMULATING
ACCUMULATING → content_block_delta → ACCUMULATING (loop)
ACCUMULATING → content_block_stop → TOOL_COMPLETE
TOOL_COMPLETE → message_delta → FINISHED

Safety Limits:

To prevent memory exhaustion from malformed streams:

  • MAX_TOOL_CALLS = 64: Maximum parallel tool calls per message
  • MAX_ARGUMENTS_SIZE = 1MB: Maximum accumulated arguments per tool call
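
A sketch of how the accumulator can enforce these limits while collecting argument fragments (the struct layout and push_arguments method are assumptions, not the actual ToolCallAccumulator API):

const MAX_TOOL_CALLS: usize = 64;
const MAX_ARGUMENTS_SIZE: usize = 1024 * 1024; // 1MB of accumulated arguments per tool call

struct ToolCallAccumulator {
    // One accumulated (id, name, arguments) entry per streaming tool call index.
    calls: Vec<(String, String, String)>,
}

impl ToolCallAccumulator {
    // Append an arguments fragment for the tool call at `index`, enforcing the safety limits.
    fn push_arguments(&mut self, index: usize, fragment: &str) -> Result<(), String> {
        if index >= MAX_TOOL_CALLS {
            return Err("too many parallel tool calls in one message".into());
        }
        let call = self.calls.get_mut(index).ok_or("unknown tool call index")?;
        if call.2.len() + fragment.len() > MAX_ARGUMENTS_SIZE {
            return Err("accumulated tool call arguments exceed 1MB".into());
        }
        call.2.push_str(fragment);
        Ok(())
    }
}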

Example Transformation (Anthropic → OpenAI):

// Anthropic input: content_block_start
{"type": "content_block_start", "index": 0,
 "content_block": {"type": "tool_use", "id": "toolu_123", "name": "get_weather"}}

// OpenAI output: delta chunk
{"id": "chatcmpl-...", "choices": [{"delta": {"tool_calls": [
  {"index": 0, "id": "toolu_123", "type": "function",
   "function": {"name": "get_weather", "arguments": ""}}
]}}]}

Finish Reason Mapping:

Source (Anthropic/Gemini) → Target (OpenAI):

  • tool_use → tool_calls
  • end_turn / STOP → stop
  • max_tokens / MAX_TOKENS → length
  • SAFETY / RECITATION → content_filter

Multi-Turn Conversation Message Transformation

When using tools in multi-turn conversations, the message history contains tool calls from the assistant and tool results from the client. These messages require format transformation when routing to different backends.

Location: src/core/tool_calling/messages.rs

Key Functions:

  • transform_tool_message_to_anthropic(): Converts an OpenAI tool result to Anthropic format
  • transform_tool_message_to_gemini(): Converts an OpenAI tool result to Gemini format
  • transform_assistant_tool_calls_to_anthropic(): Converts an assistant message with tool_calls to Anthropic format
  • transform_assistant_tool_calls_to_gemini(): Converts an assistant message with tool_calls to Gemini format
  • transform_messages_with_tools(): Transforms the entire conversation history
  • find_function_name_for_tool_call(): Looks up the function name by tool_call_id

Message Format Differences:

  • Tool result role: tool (OpenAI), user with a tool_result content block (Anthropic), function (Gemini)
  • ID reference: tool_call_id (OpenAI), tool_use_id (Anthropic), matched by function name (Gemini)
  • Content format: string (OpenAI), string or structured content (Anthropic), response object (Gemini)
  • Assistant role: assistant (OpenAI), assistant (Anthropic), model (Gemini)

OpenAI Format (Input):

{
  "messages": [
    {"role": "user", "content": "What's the weather in NYC?"},
    {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\": \"NYC\"}"}
      }]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
    }
  ]
}

Anthropic Transformation:

{
  "messages": [
    {"role": "user", "content": "What's the weather in NYC?"},
    {
      "role": "assistant",
      "content": [{
        "type": "tool_use",
        "id": "call_abc123",
        "name": "get_weather",
        "input": {"location": "NYC"}
      }]
    },
    {
      "role": "user",
      "content": [{
        "type": "tool_result",
        "tool_use_id": "call_abc123",
        "content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
      }]
    }
  ]
}

Gemini Transformation:

{
  "contents": [
    {"role": "user", "parts": [{"text": "What's the weather in NYC?"}]},
    {
      "role": "model",
      "parts": [{
        "functionCall": {"name": "get_weather", "args": {"location": "NYC"}}
      }]
    },
    {
      "role": "function",
      "parts": [{
        "functionResponse": {
          "name": "get_weather",
          "response": {"temperature": 72, "condition": "sunny"}
        }
      }]
    }
  ]
}

Multiple Tool Results:

When multiple tools are called in parallel, consecutive tool result messages are combined for Anthropic into a single user message:

// Multiple OpenAI tool results:
{"role": "tool", "tool_call_id": "call_1", "content": "72F, sunny"}
{"role": "tool", "tool_call_id": "call_2", "content": "3:00 PM EST"}

// Combined Anthropic format:
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "call_1", "content": "72F, sunny"},
    {"type": "tool_result", "tool_use_id": "call_2", "content": "3:00 PM EST"}
  ]
}

Error Handling:

  • Missing tool_call_id: Returns validation error
  • Malformed tool calls: Logged as warnings, skipped without failing entire request
  • Unknown function name for Gemini: Falls back to "unknown" with warning log
  • Malformed JSON arguments: Falls back to empty object {}
  • is_error indicator: Preserved when transforming to Anthropic format

Gemini 3 thoughtSignature Support

Gemini 3 models introduce a new thoughtSignature field for function calling that must be preserved and passed back in multi-turn conversations. The router automatically handles this through the transformation pipeline.

What is thoughtSignature?

When Gemini 3+ models return function calls, they include an encrypted thoughtSignature field that encapsulates the model's internal reasoning context. This signature must be included when sending the function result back to continue the conversation correctly.

Response Extraction (Gemini -> Client):

The router extracts thoughtSignature from Gemini responses and includes it in the OpenAI-compatible format:

// Gemini native response (with thoughtSignature)
{
  "candidates": [{
    "content": {
      "parts": [{
        "functionCall": {"name": "get_weather", "args": {"location": "NYC"}},
        "thoughtSignature": "encrypted_signature_abc123"
      }]
    }
  }]
}

// Router's OpenAI-compatible response
{
  "choices": [{
    "message": {
      "tool_calls": [{
        "id": "call_xyz789",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"NYC\"}"},
        "extra_content": {
          "google": {
            "thought_signature": "encrypted_signature_abc123"
          }
        }
      }]
    }
  }]
}

Request Injection (Client -> Gemini):

When clients send tool results back, the router extracts thought_signature from extra_content.google and injects it as thoughtSignature in the Gemini request:

// Client's OpenAI-format request
{
  "messages": [{
    "role": "assistant",
    "tool_calls": [{
      "id": "call_xyz789",
      "function": {"name": "get_weather", "arguments": "{}"},
      "extra_content": {
        "google": {"thought_signature": "encrypted_signature_abc123"}
      }
    }]
  }]
}

// Transformed Gemini request
{
  "contents": [{
    "role": "model",
    "parts": [{
      "functionCall": {"name": "get_weather", "args": {}},
      "thoughtSignature": "encrypted_signature_abc123"
    }]
  }]
}

Streaming Support:

The streaming transformer also handles thoughtSignature extraction, including it in tool call delta chunks.

Backward Compatibility:

  • Gemini 2.x models do not return thoughtSignature - the router gracefully handles this
  • Clients that don't preserve extra_content.google.thought_signature can still make function calls (but may have degraded conversation continuity on Gemini 3+)
  • The extra_content field is ignored by non-Gemini backends

Implementation Locations:

  • Response extraction: src/infrastructure/backends/gemini/transform.rs (transform_response_gemini(), transform_native_gemini_response())
  • Request injection: src/core/tool_calling/messages.rs (transform_assistant_tool_calls_to_gemini())
  • Streaming extraction: src/infrastructure/backends/gemini/stream.rs (transform_native_gemini_event(), create_tool_call_chunk_with_signature())

Adding New Backend Types

  1. Implement Backend Trait:

    // In src/infrastructure/backends/custom_backend.rs
    pub struct CustomBackend {
        client: Arc<HttpClient>,
        config: CustomBackendConfig,
    }
    
    #[async_trait]
    impl BackendTrait for CustomBackend {
        async fn health_check(&self) -> CoreResult<()> { /* ... */ }
        async fn list_models(&self) -> CoreResult<Vec<Model>> { /* ... */ }
        async fn chat_completion(&self, request: ChatRequest) -> CoreResult<Response> { /* ... */ }
    }
    

  2. Register in Backend Factory:

    // In src/infrastructure/backends/mod.rs
    pub fn create_backend(backend_type: &str, config: &BackendConfig) -> CoreResult<Box<dyn BackendTrait>> {
        match backend_type {
            "openai" => Ok(Box::new(OpenAIBackend::new(config)?)),
            "vllm" => Ok(Box::new(VLLMBackend::new(config)?)),
            "custom" => Ok(Box::new(CustomBackend::new(config)?)), // New backend
            _ => Err(CoreError::ValidationFailed {
                message: format!("Unknown backend type: {}", backend_type),
                field: Some("backend_type".to_string()),
            }),
        }
    }
    

Adding New Middleware

  1. Implement Middleware Trait:

    // In src/http/middleware/custom_middleware.rs
    pub struct CustomMiddleware {
        config: CustomConfig,
    }
    
    impl<S> tower::Layer<S> for CustomMiddleware {
        type Service = CustomMiddlewareService<S>;
    
        fn layer(&self, inner: S) -> Self::Service {
            CustomMiddlewareService { inner, config: self.config.clone() }
        }
    }
    

  2. Register in HTTP Router:

    // In src/main.rs
    let app = Router::new()
        .route("/v1/models", get(list_models))
        .layer(CustomMiddleware::new(config.custom))
        .layer(LoggingMiddleware::new())
        .with_state(state);
    

Adding New Cache Types

  1. Implement Cache Trait:

    // In src/infrastructure/cache/redis_cache.rs
    pub struct RedisCache {
        client: redis::Client,
        ttl: Duration,
    }
    
    #[async_trait]
    impl CacheTrait for RedisCache {
        async fn get<T>(&self, key: &str) -> CoreResult<Option<T>>
        where T: DeserializeOwned { /* ... */ }
    
        async fn set<T>(&self, key: &str, value: &T, ttl: Option<Duration>) -> CoreResult<()>
        where T: Serialize { /* ... */ }
    }
    

  2. Use in Service:

    // Services can use any cache implementation
    pub struct ModelServiceImpl<C: CacheTrait> {
        cache: Arc<C>,
        // ... other dependencies
    }
    

Adding New Load Balancing Strategies

// In src/services/load_balancer.rs
pub enum LoadBalancingStrategy {
    RoundRobin,
    WeightedRoundRobin,
    LeastConnections,  // New strategy
    Random,
}

impl LoadBalancingStrategy {
    pub fn select_backend(&self, backends: &[Backend]) -> Option<&Backend> {
        match self {
            Self::RoundRobin => /* ... */,
            Self::WeightedRoundRobin => /* ... */,
            Self::LeastConnections => self.select_least_connections(backends),
            Self::Random => /* ... */,
        }
    }
}
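
A sketch of how the new strategy could choose a backend, assuming each Backend exposes health and in-flight connection accessors (both are illustrative here):

impl LoadBalancingStrategy {
    // Hypothetical helper: pick the healthy backend with the fewest in-flight requests.
    fn select_least_connections<'a>(&self, backends: &'a [Backend]) -> Option<&'a Backend> {
        backends
            .iter()
            .filter(|b| b.is_healthy())             // skip unhealthy backends entirely
            .min_by_key(|b| b.active_connections()) // assumes a per-backend connection counter
    }
}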

Design Decisions

Why 4-Layer Architecture?

Decision: Use a 4-layer architecture (HTTP → Services → Infrastructure → Core)

Rationale

  • Clear Separation: Each layer has distinct responsibilities
  • Testability: Layers can be tested independently
  • Maintainability: Changes in one layer don't affect others
  • Flexibility: Easy to swap implementations (e.g., different cache backends)

Trade-offs

  • Pros: Clean, maintainable, testable, extensible
  • Cons: More complexity, slight performance overhead
  • Verdict: Benefits outweigh costs for a production system

Why Dependency Injection?

Decision: Use a custom DI container instead of compile-time injection

Rationale

  • Runtime Flexibility: Can swap implementations based on configuration
  • Service Lifecycle: Centralized management of service initialization/cleanup
  • Testing: Easy to inject mocks and test doubles

Alternatives Considered

  • Manual dependency passing: Too verbose and error-prone
  • Compile-time DI (generics): Less flexible, harder to configure

Why Arc<RwLock<T>> for Shared State?

Decision: Use Arc<RwLock<T>> for shared mutable state

Rationale

  • Reader-Writer Semantics: Multiple readers, exclusive writers
  • Performance: Better than Arc<Mutex<T>> for read-heavy workloads
  • Safety: Prevents data races at compile time

Alternatives Considered

  • Arc<Mutex<T>>: Simpler but worse performance for reads
  • Channels: Too complex for simple shared state
  • Atomic types: Not suitable for complex data structures

Why async/await Throughout?

Decision: Use async/await for all I/O operations

Rationale

  • Performance: Non-blocking I/O allows high concurrency
  • Resource Efficiency: Lower memory usage than thread-per-request
  • Ecosystem: Rust async ecosystem (Tokio, reqwest, axum) is mature

Trade-offs

  • Pros: High performance, low resource usage, good ecosystem
  • Cons: Complexity, learning curve, debugging challenges
  • Verdict: Essential for high-performance network services

Why Configuration Hot-Reload?

Decision: Support configuration hot-reload using file watching

Rationale

  • Zero Downtime: Update configuration without restarting
  • Operations Friendly: Easy to adjust settings in production
  • Development: Faster iteration during development

Implementation

  • File system watcher detects changes
  • Validate new configuration before applying
  • Atomic updates to avoid inconsistent state
  • Fallback to previous config on validation errors

Performance Considerations

Memory Management

  1. Connection Pooling: Reuse HTTP connections to reduce allocation overhead
  2. Smart Caching: LRU eviction prevents unbounded memory growth
  3. Arc Cloning: Cheap reference counting instead of deep cloning
  4. Streaming: Process responses in chunks to avoid loading large responses into memory

Concurrency

  1. RwLock for Read-Heavy Workloads: Multiple concurrent readers for backend pool and model cache
  2. Lock-Free Where Possible: Use atomics for counters and simple state
  3. Async Task Spawning: Background tasks for health checks and cache updates
  4. Bounded Channels: Prevent unbounded queuing of tasks

I/O Optimization

  1. Connection Keep-Alive: TCP connections stay open for reuse
  2. Streaming Responses: Forward SSE chunks without buffering
  3. Timeouts: Prevent hanging on slow backends
  4. Retry with Backoff: Avoid overwhelming failing backends

Memory Layout

// Optimized data structures for cache efficiency
pub struct Backend {
    pub name: String,          // Inline string for small names
    pub url: Arc<str>,         // Shared string for URL
    pub weight: u32,           // Compact integer
    pub is_healthy: AtomicBool, // Lock-free health status
}

// Cache-friendly model storage
pub struct ModelCache {
    models: HashMap<String, Arc<ModelInfo>>, // Shared model info
    last_updated: AtomicU64,                 // Lock-free timestamp
    ttl: Duration,
}

Benchmarking Results

Based on our benchmarks (see benches/performance_benchmarks.rs):

  • Request Latency: < 5ms overhead for routing decisions
  • Memory Usage: ~50MB base memory, scales linearly with backends
  • Throughput: 1000+ requests/second on modest hardware
  • Connection Efficiency: 100+ concurrent connections per backend with minimal memory overhead

Rate Limiting

The router implements sophisticated rate limiting to protect against abuse and ensure fair resource allocation across clients.

Key Features:

  • Dual-window approach: sustained limit (100 req/min) + burst protection (20 req/5s)
  • Client identification by API key (preferred) or IP address (fallback)
  • Per-client isolation with automatic cache cleanup
  • DoS prevention with short TTL for empty responses

Rate Limit V2 Architecture

The enhanced rate limiting system (rate_limit_v2/) provides a modular, high-performance implementation:

Module Structure

src/http/middleware/rate_limit_v2/
├── mod.rs              # Public API and module exports
├── middleware.rs       # Axum middleware integration
├── store.rs            # Rate limit storage and client tracking
└── token_bucket.rs     # Token bucket algorithm implementation

Components

  1. Token Bucket Algorithm (token_bucket.rs)
     • Configurable bucket capacity and refill rate
     • Atomic operations for lock-free token consumption
     • Automatic token replenishment based on elapsed time
     • Separate buckets for sustained and burst limits

  2. Rate Limit Store (store.rs)
     • Per-client state tracking with DashMap for concurrent access
     • Automatic cleanup of expired client entries
     • Configurable TTL for inactive clients (default: 1 hour)
     • Memory-efficient with bounded storage

  3. Middleware Integration (middleware.rs)
     • Extracts client identifier (API key → IP address fallback)
     • Checks both sustained and burst limits before processing
     • Returns HTTP 429 (Too Many Requests) with Retry-After header
     • Prometheus metrics for monitoring rate limit hits
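
The refill-and-consume logic of the token bucket can be sketched as below. This is a single-threaded illustration; the actual module wraps the state in atomics and keeps separate sustained and burst buckets per client.

// Illustrative token bucket: refill based on elapsed time, consume one token per request.
use std::time::Instant;

pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Returns true if a token was available and consumed.
    pub fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        // Replenish tokens for the time that has passed, capped at capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

// Example limits matching the configuration below: 100 req/min sustained, 20 req/5s burst.
// let mut sustained = TokenBucket::new(100.0, 100.0 / 60.0);
// let mut burst = TokenBucket::new(20.0, 20.0 / 5.0);
// let allowed = sustained.try_acquire() && burst.try_acquire();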

Configuration Example

rate_limiting:
  enabled: true
  sustained:
    max_requests: 100
    window_seconds: 60
  burst:
    max_requests: 20
    window_seconds: 5
  cleanup_interval_seconds: 300

Decision Flow

Request arrives
    ↓
Extract client ID (API key or IP)
    ↓
Check sustained limit (100 req/min)
    ↓ OK
Check burst limit (20 req/5s)
    ↓ OK
Process request

For detailed configuration information, see configuration.md section on rate limiting.

Model Fallback System

The router implements a configurable model fallback system that automatically routes requests to alternative models when the primary model is unavailable.

Key Features:

  • Automatic fallback chain execution (e.g., gpt-4o → gpt-4-turbo → gpt-3.5-turbo)
  • Cross-provider fallback support with parameter translation
  • Integration with circuit breaker for intelligent triggering
  • Prometheus metrics for monitoring fallback usage

For detailed configuration and implementation, see error-handling.md section on model fallback.
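
A minimal sketch of walking a fallback chain is shown below; try_model and FallbackError are hypothetical stand-ins for the router's real types, and the full behavior (circuit breaker integration, parameter translation) is documented in error-handling.md.

// Illustrative only: iterate the configured chain until one model succeeds.
#[derive(Debug)]
enum FallbackError {
    EmptyChain,
    Backend(String),
}

async fn try_model(model: &str, _prompt: &str) -> Result<String, FallbackError> {
    // Stand-in: the real code dispatches the request to the backend serving `model`.
    Err(FallbackError::Backend(format!("{model} unavailable")))
}

async fn complete_with_fallback(
    chain: &[String], // e.g. ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"]
    prompt: &str,
) -> Result<String, FallbackError> {
    let mut last_err = None;
    for model in chain {
        match try_model(model, prompt).await {
            Ok(response) => return Ok(response),
            // Record the failure and move on to the next model in the chain.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap_or(FallbackError::EmptyChain))
}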

Circuit Breaker

The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.

Three-State Machine:

  • Closed: Normal operation. Failures are counted.
  • Open: Fast-fail mode. Requests rejected immediately.
  • HalfOpen: Recovery testing. Limited requests allowed.

Key Features:

  • Per-backend isolation with independent state
  • Lock-free atomic operations for minimal hot-path overhead
  • Admin endpoints for manual control (/admin/circuit/*)
  • Prometheus metrics for observability

For detailed configuration and implementation, see error-handling.md section on circuit breaker.
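
The three-state machine can be sketched as follows. Thresholds and field names here are illustrative, and the sketch uses plain mutable state for clarity; the real implementation keeps per-backend state behind lock-free atomics.

// Illustrative state machine: counts failures while Closed, fast-fails while Open,
// and allows probe requests while HalfOpen.
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Debug)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    failures: u32,
    failure_threshold: u32,
    opened_at: Option<Instant>,
    open_timeout: Duration,
}

impl CircuitBreaker {
    fn allow_request(&mut self) -> bool {
        match self.state {
            CircuitState::Closed => true,
            CircuitState::HalfOpen => true, // the real implementation limits probe count
            CircuitState::Open => {
                // After the cool-down, move to HalfOpen and allow a probe request.
                if self.opened_at.map_or(false, |t| t.elapsed() >= self.open_timeout) {
                    self.state = CircuitState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.failure_threshold || self.state == CircuitState::HalfOpen {
            self.state = CircuitState::Open;
            self.opened_at = Some(Instant::now());
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.state = CircuitState::Closed;
    }
}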

File Storage

The router provides OpenAI Files API compatible file storage with persistent metadata.

Key Features:

  • Persistent metadata storage with sidecar JSON files
  • Automatic recovery on server restart
  • Orphan file detection and cleanup
  • Pluggable backends (memory/persistent)

For detailed architecture and implementation, see File Storage Guide.

Image Generation Architecture

The router provides a unified interface for image generation across multiple backends (OpenAI GPT Image, DALL-E, and Google Gemini/Nano Banana) with automatic parameter translation.

Multi-Backend Image Generation

OpenAI → Gemini Parameter Conversion

When using Nano Banana (Gemini) models, OpenAI-style parameters are automatically converted to Gemini's native format:

Size to Aspect Ratio Mapping

OpenAI size Gemini aspectRatio Gemini imageSize Notes
256x256 1:1 1K Minimum Gemini size
512x512 1:1 1K Minimum Gemini size
1024x1024 1:1 1K Default
1536x1024 3:2 1K Landscape (new)
1024x1536 2:3 1K Portrait (new)
1792x1024 16:9 1K Wide landscape
1024x1792 9:16 1K Tall portrait
2048x2048 1:1 2K Pro models only
4096x4096 1:1 4K Pro models only
auto 1:1 1K Default fallback
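
As a rough sketch, the mapping above can be expressed as a single match. The helper name is illustrative; the real converter returns typed values and additionally has to account for the Pro-only 2K/4K sizes.

// Illustrative mapping from an OpenAI `size` string to Gemini (aspectRatio, imageSize).
fn map_size_to_gemini(size: &str) -> (&'static str, &'static str) {
    match size {
        "256x256" | "512x512" | "1024x1024" => ("1:1", "1K"),
        "1536x1024" => ("3:2", "1K"),
        "1024x1536" => ("2:3", "1K"),
        "1792x1024" => ("16:9", "1K"),
        "1024x1792" => ("9:16", "1K"),
        "2048x2048" => ("1:1", "2K"), // Pro models only
        "4096x4096" => ("1:1", "4K"), // Pro models only
        _ => ("1:1", "1K"),           // "auto" and unknown values fall back to the default
    }
}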

Request Transformation

OpenAI Format (Input):

{
  "model": "nano-banana",
  "prompt": "A serene Japanese garden",
  "size": "1536x1024",
  "n": 1
}

Gemini Format (Converted):

{
  "contents": [
    {
      "parts": [{"text": "A serene Japanese garden"}]
    }
  ],
  "generationConfig": {
    "imageConfig": {
      "aspectRatio": "3:2",
      "imageSize": "1K"
    }
  }
}

Conversion Implementation

The conversion is handled by src/infrastructure/backends/gemini/image_generation.rs:

pub fn convert_openai_to_gemini(request: &OpenAIImageRequest)
    -> CoreResult<(String, GeminiImageRequest)>
{
    // 1. Map model name
    let gemini_model = map_model_to_gemini(&request.model);

    // 2. Parse size to aspect ratio and size category
    let parsed_size = parse_openai_size(&request.size, &request.model)?;

    // 3. Build Gemini request with imageConfig
    let gemini_request = GeminiImageRequest {
        contents: vec![GeminiContent { parts: vec![...] }],
        generation_config: Some(GeminiGenerationConfig {
            image_config: Some(GeminiImageConfig {
                aspect_ratio: Some(parsed_size.aspect_ratio.to_gemini_string()),
                image_size: Some(parsed_size.size_category.to_gemini_image_size()),
            }),
        }),
    };

    Ok((gemini_model, gemini_request))
}

Streaming Image Generation (SSE)

For GPT Image models, the router supports true SSE passthrough for streaming image generation:

┌─────────────┐                ┌─────────────┐                ┌─────────────┐
│   Client    │────stream:true─▶│   Router    │────stream:true─▶│   OpenAI    │
│             │                │             │                │             │
│             │◀───SSE events──│  Passthrough│◀───SSE events──│             │
└─────────────┘                └─────────────┘                └─────────────┘

SSE Event Types:

  • image_generation.partial_image: Intermediate preview during generation
  • image_generation.complete: Final image data
  • image_generation.usage: Token usage for billing
  • done: Stream completion

Implementation (src/proxy/image_gen.rs):

async fn handle_streaming_image_generation(...) -> Result<Response, StatusCode> {
    // 1. Keep stream: true in the backend request
    // 2. Make the streaming request and read it via bytes_stream()
    // 3. Forward SSE events to the client through a tokio channel

    let (tx, rx) = tokio::sync::mpsc::unbounded_channel();

    tokio::spawn(async move {
        let mut stream = backend_response.bytes_stream();
        let mut event_type = String::new();
        while let Some(Ok(chunk)) = stream.next().await {
            let chunk_str = String::from_utf8_lossy(&chunk);
            // Parse SSE framing (event:/data: lines) and forward each event to the client
            for line in chunk_str.lines() {
                if let Some(name) = line.strip_prefix("event:") {
                    event_type = name.trim().to_string();
                } else if let Some(data) = line.strip_prefix("data:") {
                    let event = Event::default().event(event_type.clone()).data(data.trim());
                    let _ = tx.send(Ok(event));
                }
            }
        }
    });

    Ok(Sse::new(UnboundedReceiverStream::new(rx)).into_response())
}

GPT Image Model Features

The router supports enhanced parameters for GPT Image models (gpt-image-1, gpt-image-1.5, gpt-image-1-mini):

  • output_format: Image file format (png, jpeg, webp)
  • output_compression: Compression level (0-100, jpeg/webp only)
  • background: Transparency control (transparent, opaque, auto)
  • quality: Generation quality (low, medium, high, auto)
  • stream: Enable SSE streaming (true, false)
  • partial_images: Number of preview images (0-3)

Model Support Matrix

Feature GPT Image 1.5 GPT Image 1 GPT Image 1 Mini DALL-E 3 DALL-E 2 Nano Banana Nano Banana Pro
Streaming
output_format
background
Custom quality standard/hd
Image Edit
Image Variations
Max Resolution 1536px 1536px 1536px 1792px 1024px 1024px 4096px

Image Edit and Variations

The router provides OpenAI-compatible image editing and variations endpoints through /v1/images/edits and /v1/images/variations.

Image Editing (/v1/images/edits)

Endpoint: POST /v1/images/edits

Allows editing an existing image with a text prompt and optional mask. Supported by GPT Image models and DALL-E 2.

Request Format (multipart/form-data):

image: <file>           # Original image (PNG, required)
prompt: <string>        # Edit instructions (required)
mask: <file>            # Optional mask image (PNG)
model: <string>         # Model name (e.g., "gpt-image-1", "dall-e-2")
n: <integer>            # Number of images (default: 1)
size: <string>          # Output size (e.g., "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Multipart form parsing for image and mask files
  • Image validation (format, size, aspect ratio)
  • Model-specific parameter transformation
  • Proper error handling for invalid inputs
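
The multipart parsing step can look roughly like the sketch below, using axum's Multipart extractor (requires axum's "multipart" feature). Field names follow the request format above; the EditRequest type and error handling are illustrative, not the router's actual code.

// Illustrative multipart extraction for /v1/images/edits: pull out the image,
// optional mask, and prompt fields; validation is elided here.
use axum::body::Bytes;
use axum::extract::Multipart;
use axum::http::StatusCode;

struct EditRequest {
    image: Bytes,
    mask: Option<Bytes>,
    prompt: String,
}

async fn parse_edit_form(mut form: Multipart) -> Result<EditRequest, StatusCode> {
    let (mut image, mut mask, mut prompt) = (None, None, None);

    while let Some(field) = form.next_field().await.map_err(|_| StatusCode::BAD_REQUEST)? {
        let name = field.name().unwrap_or_default().to_string();
        match name.as_str() {
            "image" => image = Some(field.bytes().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            "mask" => mask = Some(field.bytes().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            "prompt" => prompt = Some(field.text().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            _ => {} // other fields (model, n, size, response_format) are handled the same way
        }
    }

    Ok(EditRequest {
        image: image.ok_or(StatusCode::BAD_REQUEST)?,
        mask,
        prompt: prompt.ok_or(StatusCode::BAD_REQUEST)?,
    })
}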

Supported Features

  • Transparent PNG mask support for targeted editing
  • Multiple image generation (n parameter)
  • Flexible output sizes
  • Both URL and base64 response formats

Image Variations (/v1/images/variations)

Endpoint: POST /v1/images/variations

Creates variations of a given image. Supported by DALL-E 2 only.

Request Format (multipart/form-data):

image: <file>           # Source image (PNG, required)
model: <string>         # Model name (default: "dall-e-2")
n: <integer>            # Number of variations (default: 1, max: 10)
size: <string>          # Output size ("256x256", "512x512", "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Image file validation and preprocessing
  • DALL-E 2-specific routing
  • Error handling for unsupported models
  • Consistent response formatting

Key Features

  • Generate multiple variations in a single request
  • Automatic image format validation
  • Standard OpenAI response format compatibility

Image Utilities Module

The image_utils.rs module provides shared utilities for image processing:

Functions

  • validate_image_format(): Validates PNG/JPEG format and dimensions
  • parse_multipart_image_request(): Extracts images from multipart forms
  • check_image_dimensions(): Validates size constraints
  • format_image_error_response(): Standardized error responses

Validation Rules

  • Maximum file size: 4MB (configurable)
  • Supported formats: PNG (required for edits/variations), JPEG (generation only)
  • Aspect ratio constraints per model
  • Transparent PNG requirement for masks
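
A minimal sketch of the format and size check these rules imply is shown below. It only sniffs magic bytes and enforces the 4MB cap; the helper names are illustrative, and the real utilities also decode dimensions and apply per-model aspect-ratio constraints.

// Illustrative format/size check: detect PNG and JPEG by magic bytes, enforce the 4MB cap.
const MAX_IMAGE_BYTES: usize = 4 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum ImageFormat {
    Png,
    Jpeg,
}

fn sniff_image_format(data: &[u8]) -> Option<ImageFormat> {
    if data.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(ImageFormat::Png)
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else {
        None
    }
}

fn validate_upload(data: &[u8], require_png: bool) -> Result<ImageFormat, String> {
    if data.len() > MAX_IMAGE_BYTES {
        return Err("image exceeds the 4MB limit".into());
    }
    match sniff_image_format(data) {
        Some(ImageFormat::Jpeg) if require_png => Err("this endpoint requires PNG input".into()),
        Some(format) => Ok(format),
        None => Err("unsupported image format (expected PNG or JPEG)".into()),
    }
}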

This architecture provides a solid foundation for building a production-ready LLM router that can scale to handle thousands of requests while remaining maintainable and extensible. The clean separation of concerns makes it easy to add new features, swap implementations, and thoroughly test each component.