Architecture Guide

This document provides a comprehensive overview of Continuum Router's architecture, design decisions, and extension points.

Overview

Continuum Router is designed as a high-performance, production-ready LLM API router using a clean 4-layer architecture that provides clear separation of concerns, testability, and maintainability. The architecture follows Domain-Driven Design principles and dependency inversion to create a robust, extensible system.

Architecture Goals

  1. Separation of Concerns: Each layer has a single, well-defined responsibility
  2. Dependency Inversion: Higher layers depend on abstractions, not concrete implementations
  3. Testability: Each component can be unit tested in isolation
  4. Extensibility: New features can be added without modifying existing code
  5. Performance: Minimal overhead while maintaining clean architecture
  6. Reliability: Fail-fast design with comprehensive error handling

4-Layer Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        HTTP Layer                               │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │     Routes      │ │   Middleware    │ │   Handlers      │   │
│  │                 │ │                 │ │                 │   │
│  │ • /v1/models    │ │ • Logging       │ │ • Streaming     │   │
│  │ • /v1/chat/*    │ │ • Metrics       │ │ • Responses API │   │
│  │ • /v1/responses │ │ • Rate Limit    │ │ • DTOs          │   │
│  │ • /admin/*      │ │ • Auth          │ │ • Error         │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                     Services Layer                              │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │ Backend Service │ │  Model Service  │ │ Proxy Service   │   │
│  │                 │ │                 │ │                 │   │
│  │ • Pool Mgmt     │ │ • Aggregation   │ │ • Routing       │   │
│  │ • Load Balance  │ │ • Caching       │ │ • Streaming     │   │
│  │ • Health Check  │ │ • Discovery     │ │ • Retry Logic   │   │
│  │                 │ │ • Metadata      │ │                 │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │ Health Service  │ │ Service Registry│ │ Deduplication   │   │
│  │                 │ │                 │ │                 │   │
│  │ • Monitoring    │ │ • Lifecycle     │ │ • Cache         │   │
│  │ • Status Track  │ │ • Dependencies  │ │ • Request Hash  │   │
│  │ • Recovery      │ │ • Container     │ │ • TTL Mgmt      │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│  ┌─────────────────┐                                            │
│  │  File Service   │  See: architecture/file-storage.md         │
│  │                 │                                            │
│  │ • Upload/Delete │                                            │
│  │ • Metadata Mgmt │                                            │
│  │ • Persistence   │                                            │
│  └─────────────────┘                                            │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                 Infrastructure Layer                            │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │    Backends     │ │      Cache      │ │     Common      │   │
│  │                 │ │                 │ │                 │   │
│  │ • OpenAI        │ │ • LRU Cache     │ │ • HTTP Client   │   │
│  │ • Anthropic     │ │ • TTL Cache     │ │ • Executor      │   │
│  │ • Gemini        │ │ • Retry Cache   │ │ • Statistics    │   │
│  │ • vLLM          │ │                 │ │ • URL Validator │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │  Configuration  │ │  Backend Pool   │ │ Backend Factory │   │
│  │                 │ │                 │ │                 │   │
│  │ • File Watcher  │ │ • Pool Mgmt     │ │ • Create Backend│   │
│  │ • Env Override  │ │ • Connection    │ │ • Type Detection│   │
│  │ • Validation    │ │ • Pre-warming   │ │ • Config Parsing│   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Core Layer                                │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │     Models      │ │     Traits      │ │     Errors      │   │
│  │                 │ │                 │ │                 │   │
│  │ • Backend       │ │ • BackendTrait  │ │ • CoreError     │   │
│  │ • Model         │ │ • ServiceTrait  │ │ • RouterError   │   │
│  │ • Request       │ │ • CacheTrait    │ │ • ErrorSeverity │   │
│  │ • Response      │ │ • HealthTrait   │ │ • ErrorDetail   │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│  │  Retry Logic    │ │  Configuration  │ │   Container     │   │
│  │                 │ │                 │ │                 │   │
│  │ • Policies      │ │ • Models        │ │ • DI Container  │   │
│  │ • Strategies    │ │ • Validation    │ │ • Service Mgmt  │   │
│  │ • Backoff       │ │ • Defaults      │ │ • Lifecycle     │   │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│  ┌─────────────────┐                                           │
│  │ Circuit Breaker │                                           │
│  │                 │                                           │
│  │ • State Machine │                                           │
│  │ • Failure Track │                                           │
│  │ • Auto Recovery │                                           │
│  └─────────────────┘                                           │
└─────────────────────────────────────────────────────────────────┘

Layer Descriptions

1. HTTP Layer (src/http/)

Responsibility: Handle HTTP requests, responses, and web-specific concerns

Components

  • Routes (routes.rs): Define HTTP endpoints and route handling
  • Middleware (middleware/): Cross-cutting concerns (auth, logging, metrics, rate limiting)
  • DTOs (dto/): Data Transfer Objects for HTTP serialization/deserialization
  • Streaming (streaming/): Server-Sent Events (SSE) handling

Key Files

src/http/
├── mod.rs              # HTTP layer exports
├── routes.rs           # Route definitions and handlers
├── dto.rs              # Request/Response DTOs
├── handlers/           # Request handlers
│   ├── mod.rs
│   └── responses.rs    # Responses API handlers
├── middleware/         # HTTP middleware components
│   ├── mod.rs
│   ├── auth.rs         # API key authentication middleware
│   ├── admin_auth.rs   # Admin API authentication middleware
│   ├── files_auth.rs   # Files API authentication middleware
│   ├── admin_audit.rs  # Admin operations audit logging
│   ├── logging.rs      # Request/response logging
│   ├── metrics.rs      # Metrics collection
│   ├── metrics_auth.rs # Metrics endpoint authentication
│   ├── model_extractor.rs # Model extraction from requests
│   ├── prometheus.rs   # Prometheus metrics integration
│   ├── rate_limit.rs   # Rate limiting middleware (legacy)
│   └── rate_limit_v2/  # Enhanced rate limiting (modular)
│       ├── mod.rs      # Module exports
│       ├── middleware.rs # Rate limiting middleware
│       ├── store.rs    # Rate limit storage and tracking
│       └── token_bucket.rs # Token bucket algorithm
└── streaming/          # SSE streaming handlers
    ├── mod.rs
    └── handler.rs      # Streaming response handling

Middleware Components

The HTTP layer includes several middleware components that provide cross-cutting concerns:

  • auth.rs: API key authentication for main endpoints (/v1/chat/completions, /v1/models, etc.)

    • Validates API keys from Authorization: Bearer <key> header
    • Supports multiple API keys configured in config.yaml
    • Returns 401 Unauthorized for invalid/missing keys
  • admin_auth.rs: Separate authentication for admin endpoints (/admin/*)

    • Uses dedicated admin API keys distinct from user API keys
    • Protects sensitive operations (config reload, circuit breaker control, health management)
    • Configurable via admin.api_keys in configuration
  • files_auth.rs: Authentication middleware for Files API (/v1/files/*)

    • Validates API keys specifically for file upload/download/deletion operations
    • Prevents unauthorized file access and manipulation
    • Integrates with file storage service for permission checks
  • admin_audit.rs: Audit logging middleware for admin operations

    • Records all admin API calls with timestamps and caller identification
    • Logs parameters and outcomes of sensitive operations
    • Provides audit trail for compliance and security monitoring
    • Configurable log levels and retention policies
  • rate_limit_v2/: Enhanced rate limiting system (see Rate Limiting section)

    • Token bucket algorithm with per-client tracking
    • Separate limits for sustained rate and burst protection
    • Automatic cleanup of expired client entries
    • Detailed metrics for monitoring

2. Services Layer (src/services/)

Responsibility: Orchestrate business logic and coordinate between infrastructure components

Components

  • Backend Service (backend_service.rs): Manage backend pool, load balancing, health checks
  • Model Service (model_service.rs): Aggregate models from backends, handle caching, enrich with metadata
  • Proxy Service (proxy_service.rs): Route requests, handle retries, manage streaming
  • Health Service (health_service.rs): Monitor service health, track status
  • Service Registry (mod.rs): Manage service lifecycle and dependencies

Key Files

src/services/
├── mod.rs              # Service registry and management
├── backend_service.rs  # Backend management service
├── model_service.rs    # Model aggregation service
├── proxy_service.rs    # Request proxying and routing
├── health_service.rs   # Health monitoring service
├── deduplication.rs    # Request deduplication service
├── responses/          # Responses API support
│   ├── mod.rs
│   ├── converter.rs    # Response format conversion
│   ├── session.rs      # Session management
│   └── streaming.rs    # Streaming response handling
└── streaming/          # Streaming utilities
    ├── mod.rs
    ├── parser.rs       # Stream parsing logic
    └── transformer.rs  # Stream transformation (OpenAI/Anthropic)

3. Infrastructure Layer (src/infrastructure/)

Responsibility: Provide concrete implementations of external systems and technical capabilities

Components

  • Backends (backends/): Specific backend implementations (OpenAI, Anthropic, Gemini, vLLM)
  • Cache (cache/): Caching implementations (LRU, TTL-based)
  • Configuration (config/): Configuration loading, watching, validation
  • HTTP Client (http_client.rs): HTTP client management and optimization

Key Files

src/infrastructure/
├── mod.rs              # Infrastructure exports and utilities
├── backends/           # Backend implementations
│   ├── mod.rs
│   ├── anthropic/      # Native Anthropic Claude backend
│   │   ├── mod.rs      # Backend implementation & request transformation
│   │   └── stream.rs   # SSE stream transformer (Anthropic → OpenAI)
│   ├── gemini/         # Native Google Gemini backend
│   │   ├── mod.rs      # Backend implementation with TTFB optimization
│   │   └── stream.rs   # SSE stream transformer (Gemini → OpenAI)
│   ├── openai/         # OpenAI-compatible backend
│   │   ├── mod.rs
│   │   ├── backend.rs  # OpenAI backend implementation
│   │   └── models/     # OpenAI-specific model definitions
│   ├── factory/        # Backend factory pattern
│   │   ├── mod.rs
│   │   └── backend_factory.rs  # Creates backends from config
│   ├── pool/           # Backend pooling and management
│   │   ├── mod.rs
│   │   ├── backend_pool.rs     # Connection pool management
│   │   └── backend_manager.rs  # Backend lifecycle management
│   ├── generic/        # Generic backend implementations
│   │   └── mod.rs
│   └── vllm.rs         # vLLM backend implementation
├── common/             # Shared infrastructure utilities
│   ├── mod.rs
│   ├── executor.rs     # Request execution with retry/metrics
│   ├── headers.rs      # HTTP header utilities
│   ├── http_client.rs  # HTTP client factory with pooling
│   ├── statistics.rs   # Backend statistics collection
│   └── url_validator.rs # URL validation and security
├── cache/              # Caching implementations
│   ├── mod.rs
│   ├── lru_cache.rs    # LRU cache implementation
│   └── retry_cache.rs  # Retry-aware cache
├── config/             # Configuration management
│   ├── mod.rs
│   ├── loader.rs       # Configuration loading
│   ├── validator.rs    # Configuration validation
│   ├── timeout_validator.rs # Timeout configuration validation
│   ├── watcher.rs      # File watching for hot-reload
│   ├── migrator.rs     # Configuration migration orchestrator
│   ├── migration.rs    # Migration types and traits
│   ├── migrations.rs   # Specific migration implementations
│   ├── fixer.rs        # Auto-correction logic
│   ├── backup.rs       # Backup management
│   └── secrets.rs      # Secret/API key management
└── lock_optimization.rs # Lock and concurrency optimization

4. Core Layer (src/core/)

Responsibility: Define domain models, business rules, and fundamental abstractions

Components

  • Models (models/): Core domain entities (Backend, Model, Request, Response)
  • Traits (traits.rs): Core interfaces and contracts
  • Errors (errors.rs): Domain-specific error types and handling
  • Retry (retry/): Retry policies and strategies
  • Container (container.rs): Dependency injection container

Key Files

src/core/
├── mod.rs              # Core exports and utilities
├── models/             # Domain models
│   ├── mod.rs
│   ├── backend.rs      # Backend domain model
│   ├── model.rs        # LLM model representation
│   ├── request.rs      # Request models
│   └── responses.rs    # Response models (Responses API)
├── traits.rs           # Core traits and interfaces
├── errors.rs           # Error types and handling
├── container.rs        # Dependency injection container
├── async_utils.rs      # Async utility functions
├── duration_utils.rs   # Duration parsing utilities
├── streaming/          # Streaming models
│   ├── mod.rs
│   └── models.rs       # Streaming-specific models
├── retry/              # Retry mechanisms
│   ├── mod.rs
│   ├── policy.rs       # Retry policies
│   └── strategy.rs     # Retry strategies
├── circuit_breaker/    # Circuit breaker pattern
│   ├── mod.rs          # Module exports
│   ├── config.rs       # Configuration models
│   ├── state.rs        # State machine and breaker logic
│   ├── error.rs        # Circuit breaker errors
│   ├── metrics.rs      # Prometheus metrics
│   └── tests.rs        # Unit tests
├── files/              # File processing utilities
│   ├── mod.rs          # Module exports
│   ├── resolver.rs     # File reference resolution in chat requests
│   ├── transformer.rs  # Message transformation with file content
│   └── transformer_utils.rs # Transformation utility functions
└── config/             # Configuration models
    ├── mod.rs
    ├── models/            # Configuration data models (modular structure)
    │   ├── mod.rs         # Re-exports for backward compatibility
    │   ├── config.rs      # Main Config struct, ServerConfig, BackendConfig
    │   ├── backend_type.rs # BackendType enum definitions
    │   ├── model_metadata.rs # ModelMetadata, PricingInfo, CapabilityInfo
    │   ├── global_prompts.rs # GlobalPrompts configuration
    │   ├── samples.rs     # Sample generation configurations
    │   ├── validation.rs  # Configuration validation logic
    │   └── error.rs       # Configuration-specific errors
    ├── timeout_models.rs # Timeout configuration models
    ├── cached_timeout.rs # Cached timeout resolution
    ├── optimized_retry.rs # Optimized retry configuration
    ├── metrics.rs      # Metrics configuration
    └── rate_limit.rs   # Rate limit configuration

Core Components

Backend Pool

Location: src/backend.rs (legacy) → src/services/backend_service.rs

Purpose: Manages multiple LLM backends with intelligent load balancing

pub struct BackendPool {
    backends: Arc<RwLock<Vec<Backend>>>,
    load_balancer: LoadBalancingStrategy,
    health_checker: Option<Arc<HealthChecker>>,
}

impl BackendPool {
    // Round-robin load balancing with health awareness
    pub async fn select_backend(&self) -> Option<Backend> { /* ... */ }

    // Filter backends by model availability
    pub async fn backends_for_model(&self, model: &str) -> Vec<Backend> { /* ... */ }
}

Health Checker

Location: src/health.rs (legacy) → src/services/health_service.rs

Purpose: Monitor backend health with configurable thresholds and automatic recovery

pub struct HealthChecker {
    backends: Arc<RwLock<Vec<Backend>>>,
    config: HealthConfig,
    status_map: Arc<RwLock<HashMap<String, HealthStatus>>>,
}

pub struct HealthConfig {
    pub interval: Duration,
    pub timeout: Duration,
    pub unhealthy_threshold: u32,  // Failures before marking unhealthy
    pub healthy_threshold: u32,    // Successes before marking healthy
}
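
A minimal sketch of how these thresholds drive state transitions (the HealthStatus fields shown here are assumed for illustration):

pub struct HealthStatus {
    pub is_healthy: bool,
    pub consecutive_successes: u32,
    pub consecutive_failures: u32,
}

impl HealthStatus {
    // Record one probe outcome; flip state only once the configured
    // threshold of consecutive results is reached.
    pub fn record(&mut self, success: bool, config: &HealthConfig) {
        if success {
            self.consecutive_successes += 1;
            self.consecutive_failures = 0;
            if !self.is_healthy && self.consecutive_successes >= config.healthy_threshold {
                self.is_healthy = true;
            }
        } else {
            self.consecutive_failures += 1;
            self.consecutive_successes = 0;
            if self.is_healthy && self.consecutive_failures >= config.unhealthy_threshold {
                self.is_healthy = false;
            }
        }
    }
}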

Model Aggregation Service

Location: src/models/ (modular structure)

Purpose: Aggregate and cache model information from all backends, enrich with metadata

Module Structure (refactored from single models.rs file):

src/models/
├── mod.rs             # Re-exports for backward compatibility
├── types.rs           # Model, AggregatedModel, ModelList types
├── metrics.rs         # ModelMetrics tracking
├── cache.rs           # ModelCache implementation
├── config.rs          # ModelAggregationConfig
├── fetcher.rs         # Model fetching from backends
├── handlers.rs        # HTTP handlers for /v1/models endpoint
├── utils.rs           # Utility functions (normalize_model_id, etc.)
└── aggregation/       # Core aggregation logic
    ├── mod.rs         # ModelAggregationService implementation
    └── tests.rs       # Unit tests

pub struct ModelAggregationService {
    cache: Arc<RwLock<ModelCache>>,
    config: ModelAggregationConfig,
    backends: Arc<BackendPool>,
}

impl ModelAggregationService {
    // Aggregate models from all healthy backends
    pub async fn get_aggregated_models(&self) -> Result<ModelList, Error> { /* ... */ }

    // Enrich models with metadata from config
    pub fn merge_config_metadata(&self, models: &mut Vec<Model>) { /* ... */ }

    // Cache with TTL and deduplication
    pub async fn refresh_models(&self) -> Result<(), Error> { /* ... */ }
}

Proxy Module

Location: src/proxy/ (modular structure)

Purpose: Handle request proxying, backend selection, file resolution, and image generation/editing

Module Structure (refactored from single proxy.rs file):

src/proxy/
├── mod.rs             # Re-exports for backward compatibility
├── backend.rs         # Backend selection and routing logic
├── request.rs         # Request execution with retry logic
├── files.rs           # File reference resolution in requests
├── image_gen.rs       # Image generation handling (DALL-E, Gemini, GPT Image)
├── image_edit.rs      # Image editing support (/v1/images/edits)
├── image_utils.rs     # Image processing utilities (multipart, validation)
├── handlers.rs        # HTTP handlers for proxy endpoints
├── utils.rs           # Utility functions (error responses, etc.)
└── tests.rs           # Unit tests

Key Responsibilities

  • Backend Selection: Intelligent routing to available backends
  • File Resolution: Resolve file references in chat requests
  • Image Generation: Support for OpenAI (DALL-E, GPT Image) and Gemini (Nano Banana) image models
  • Image Editing: Image editing and variations endpoints
  • Request Retry: Automatic retry with exponential backoff
  • Error Handling: Standardized error responses in OpenAI format

Retry Handler

Location: src/services/deduplication.rs

Purpose: Implement exponential backoff with jitter and request deduplication

pub struct EnhancedRetryHandler {
    config: RetryConfig,
    dedup_cache: Arc<Mutex<HashMap<String, CachedResponse>>>,
    dedup_ttl: Duration,
}

pub struct RetryConfig {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
    pub exponential_backoff: bool,
    pub jitter: bool,
}
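
As a sketch, the delay schedule implied by RetryConfig can be computed like this (field names follow the struct above; the full-jitter strategy shown is an assumption, not necessarily the exact implementation):

use rand::Rng;
use std::time::Duration;

impl RetryConfig {
    // Delay before the given retry attempt (0-based).
    pub fn delay_for_attempt(&self, attempt: u32) -> Duration {
        let mut delay = if self.exponential_backoff {
            // base_delay * 2^attempt, capped at max_delay
            self.base_delay.saturating_mul(2u32.saturating_pow(attempt))
        } else {
            self.base_delay
        };
        delay = delay.min(self.max_delay);

        if self.jitter {
            // Full jitter: pick a uniformly random point in [0, delay]
            let millis = rand::thread_rng().gen_range(0..=delay.as_millis() as u64);
            delay = Duration::from_millis(millis);
        }
        delay
    }
}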

Circuit Breaker

Location: src/core/circuit_breaker/

Purpose: Prevent cascading failures by automatically stopping requests to failing backends

pub struct CircuitBreaker {
    states: Arc<DashMap<String, BackendCircuitState>>,
    config: CircuitBreakerConfig,
    metrics: Option<CircuitBreakerMetrics>,
}

pub struct CircuitBreakerConfig {
    pub enabled: bool,
    pub failure_threshold: u32,           // Failures before opening (default: 5)
    pub failure_rate_threshold: f64,      // Failure rate threshold (default: 0.5)
    pub minimum_requests: u32,            // Min requests before rate calculation
    pub timeout_seconds: u64,             // How long circuit stays open (default: 60s)
    pub half_open_max_requests: u32,      // Max requests in half-open state
    pub half_open_success_threshold: u32, // Successes needed to close
}

pub enum CircuitState {
    Closed,    // Normal operation - requests pass through
    Open,      // Failing fast - requests rejected immediately
    HalfOpen,  // Testing recovery - limited requests allowed
}

Key Features

  • Per-backend circuit breakers with independent state
  • Atomic operations for lock-free state checking in hot path
  • Automatic state transitions based on success/failure patterns
  • Sliding window for failure rate calculation
  • Prometheus metrics for observability
  • Admin endpoints for manual control
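
A simplified sketch of the hot-path admission decision across the three states (fields like opened_at and half_open_requests are assumed names; the real implementation uses atomics and DashMap as noted above):

use std::time::Instant;

impl BackendCircuitState {
    pub fn should_allow(&mut self, config: &CircuitBreakerConfig, now: Instant) -> bool {
        match self.state {
            CircuitState::Closed => true,
            CircuitState::Open => {
                // After timeout_seconds, move to HalfOpen to probe recovery
                if now.duration_since(self.opened_at).as_secs() >= config.timeout_seconds {
                    self.state = CircuitState::HalfOpen;
                    self.half_open_requests = 0;
                    true
                } else {
                    false
                }
            }
            CircuitState::HalfOpen => {
                // Allow only a bounded number of trial requests
                if self.half_open_requests < config.half_open_max_requests {
                    self.half_open_requests += 1;
                    true
                } else {
                    false
                }
            }
        }
    }
}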

Container (Dependency Injection)

Location: src/core/container.rs

Purpose: Manage service lifecycles and dependencies

pub struct Container {
    services: Arc<RwLock<HashMap<TypeId, Box<dyn Any + Send + Sync>>>>,
    singletons: Arc<RwLock<HashMap<TypeId, Arc<dyn Any + Send + Sync>>>>,
}

impl Container {
    // Register singleton service
    pub async fn register_singleton<T>(&self, instance: Arc<T>) -> CoreResult<()> 
    where T: 'static + Send + Sync { /* ... */ }

    // Resolve service dependency
    pub async fn resolve<T>(&self) -> CoreResult<Arc<T>>
    where T: 'static + Send + Sync { /* ... */ }
}
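
Hypothetical usage (HealthServiceImpl and check_all are placeholder names):

// During startup
let container = Arc::new(Container::new());
container.register_singleton(Arc::new(HealthServiceImpl::default())).await?;

// Later, anywhere the container is available
let health: Arc<HealthServiceImpl> = container.resolve().await?;
health.check_all().await?;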

Data Flow

Request Processing Flow

sequenceDiagram
    participant Client
    participant HTTPLayer as HTTP Layer
    participant ProxyService as Proxy Service
    participant BackendService as Backend Service
    participant ModelService as Model Service
    participant Backend as LLM Backend

    Client->>HTTPLayer: POST /v1/chat/completions
    HTTPLayer->>HTTPLayer: Apply Middleware (auth, logging, metrics)
    HTTPLayer->>ProxyService: Forward Request

    ProxyService->>ModelService: Get Model Info
    ModelService->>ModelService: Check Cache
    alt Cache Miss
        ModelService->>BackendService: Get Backends for Model
        BackendService->>Backend: Query Models
        Backend-->>BackendService: Model List
        BackendService-->>ModelService: Filtered Backends
        ModelService->>ModelService: Update Cache
    end
    ModelService-->>ProxyService: Model Available on Backends

    ProxyService->>BackendService: Select Healthy Backend
    BackendService->>BackendService: Apply Load Balancing
    BackendService-->>ProxyService: Selected Backend

    ProxyService->>Backend: Forward Request
    Backend-->>ProxyService: Response (streaming or non-streaming)

    ProxyService->>ProxyService: Apply Response Processing
    ProxyService-->>HTTPLayer: Processed Response
    HTTPLayer-->>Client: HTTP Response

Health Check Flow

sequenceDiagram
    participant HealthService as Health Service
    participant BackendPool as Backend Pool
    participant Backend as LLM Backend
    participant Cache as Health Cache

    loop Every Interval
        HealthService->>BackendPool: Get All Backends
        BackendPool-->>HealthService: Backend List

        par For Each Backend
            HealthService->>Backend: GET /v1/models (or /health)
            alt Success
                Backend-->>HealthService: 200 OK + Model List
                HealthService->>Cache: Update: consecutive_successes++
                HealthService->>HealthService: Mark Healthy if threshold met
            else Failure
                Backend-->>HealthService: Error/Timeout
                HealthService->>Cache: Update: consecutive_failures++
                HealthService->>HealthService: Mark Unhealthy if threshold met
            end
        end

        HealthService->>BackendPool: Update Backend Health Status
    end

Hot Reload Service

Location: src/infrastructure/config/hot_reload.rs, src/services/hot_reload_service.rs

Purpose: Provide runtime configuration updates without server restart

The hot reload system enables zero-downtime configuration changes through automatic file watching and intelligent component updates.

Key Architecture Components

  • ConfigManager: File system watching using notify crate, publishes updates via tokio::sync::watch channel (see the sketch after this list)
  • HotReloadService: Computes configuration differences, classifies changes (immediate/gradual/restart)
  • Component Updates: Interior mutability patterns (RwLock) for atomic updates to HealthChecker, CircuitBreaker, RateLimitStore
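
A minimal sketch of that watch-channel pattern (Config values and the update logic are placeholders; only the channel wiring is the point here):

use std::sync::Arc;
use tokio::sync::watch;

// Publisher side: ConfigManager pushes each validated config;
// on validation failure, nothing is sent and the old value stays.
let (tx, rx) = watch::channel(Arc::new(initial_config));
tx.send(Arc::new(new_config)).ok(); // on a validated file change

// Subscriber side: a service keeps a receiver and reacts to changes.
let mut rx = rx.clone();
tokio::spawn(async move {
    while rx.changed().await.is_ok() {
        let config = Arc::clone(&rx.borrow());
        // apply immediate/gradual updates from `config` here
    }
});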

Change Classification

  • Immediate Update: logging.level, rate_limiting.*, circuit_breaker.*, retry.*, global_prompts.*
  • Gradual Update: backends.*, health_checks.*, timeouts.*
  • Requires Restart: server.bind_address, server.workers

Admin API: /admin/config/hot-reload-status for inspecting hot reload capabilities

For detailed hot reload configuration, process flow, and usage examples, see configuration.md section on hot reload.

Configuration Migration System

Location: src/infrastructure/config/{migrator,migration,migrations,fixer,backup}.rs

Purpose: Automatically detect and fix configuration issues, migrate schemas, and ensure configuration validity

The configuration migration system provides a comprehensive solution for handling configuration evolution and maintenance. It automatically:

  • Detects and migrates outdated schema versions
  • Fixes common syntax errors in YAML/TOML files
  • Validates and corrects configuration values
  • Creates backups before making changes
  • Provides dry-run capability for previewing changes

Architecture Components

1. Migration Orchestrator (migrator.rs)

  • Main entry point for migration operations
  • Coordinates the entire migration workflow
  • Manages backup creation and restoration
  • Implements security validations (path traversal, file size limits)

2. Migration Framework (migration.rs)

  • Defines core types and traits for migrations
  • Migration trait for implementing version upgrades
  • ConfigIssue enum for categorizing problems
  • MigrationResult for tracking changes

3. Schema Migrations (migrations.rs)

  • Concrete migration implementations (e.g., V1ToV2Migration)
  • Transforms configuration structure between versions
  • Example: Converting backend_url to backends array

4. Auto-Correction Engine (fixer.rs)

  • Detects and fixes common configuration errors
  • Duration format correction (e.g., "10 seconds" → "10s")
  • URL validation and protocol addition
  • Field deprecation handling

5. Backup Manager (backup.rs)

  • Creates timestamped backups before modifications
  • Implements resource limits (10MB per file, 100MB total, max 50 backups)
  • Automatic cleanup of old backups
  • Preserves file permissions

Migration Workflow

graph TD
    A[Read Config File] --> B[Validate Path & Size]
    B --> C[Create Backup]
    C --> D[Parse Configuration]
    D --> E{Parse Success?}
    E -->|No| F[Fix Syntax Errors]
    F --> D
    E -->|Yes| G[Detect Schema Version]
    G --> H{Needs Migration?}
    H -->|Yes| I[Apply Migrations]
    H -->|No| J[Validate Values]
    I --> J
    J --> K{Issues Found?}
    K -->|Yes| L[Apply Auto-Fixes]
    K -->|No| M[Return Config]
    L --> N[Write Updated Config]
    N --> M

Security Features

  1. Path Traversal Protection: Validates paths to prevent directory traversal attacks
  2. File Size Limits: Maximum 10MB configuration files to prevent DoS
  3. Format Validation: Only processes .yaml, .yml, and .toml files
  4. System Directory Protection: Blocks access to sensitive system paths
  5. Test Mode Relaxation: Uses conditional compilation for test-friendly validation

Example Migration: v1.0 to v2.0

// V1ToV2Migration implementation (sketch using serde_yaml's Value/Mapping)
fn migrate(&self, config: &mut Value) -> Result<(), MigrationError> {
    // Convert single backend_url to backends array
    if let Some(backend_url) = config.get("backend_url").cloned() {
        let mut backend = Mapping::new();
        backend.insert("url".into(), backend_url);

        // Move the single model into the new backend entry
        if let Some(model) = config.get("model").cloned() {
            backend.insert("models".into(), Value::Sequence(vec![model]));
        }

        if let Some(map) = config.as_mapping_mut() {
            map.insert("backends".into(), Value::Sequence(vec![Value::Mapping(backend)]));

            // Remove old fields
            map.remove("backend_url");
            map.remove("model");
        }
    }
    Ok(())
}

Configuration Loading Flow

graph TD
    A[Application Start] --> B[Config Manager Init]
    B --> C{Config File Specified?}
    C -->|Yes| D[Load Specified File]
    C -->|No| E[Search Standard Locations]

    E --> F{Config File Found?}
    F -->|Yes| G[Load Config File]
    F -->|No| H[Use CLI Args + Env Vars + Defaults]

    D --> I[Parse YAML]
    G --> I
    H --> J[Create Config from Args]

    I --> K[Apply Environment Variable Overrides]
    J --> K

    K --> L[Apply CLI Argument Overrides]
    L --> M[Validate Configuration]
    M --> N{Valid?}
    N -->|Yes| O[Return Config]
    N -->|No| P[Exit with Error]

    O --> Q[Start File Watcher for Hot Reload]
    Q --> R[Application Running]

    Q --> S[Config File Changed]
    S --> T[Reload and Validate]
    T --> U{Valid?}
    U -->|Yes| V[Apply New Config]
    U -->|No| W[Log Error, Keep Old Config]

    V --> R
    W --> R

Dependency Injection

Service Registration

Services are registered in the container during application startup:

// In main.rs
async fn setup_services(config: Config) -> Result<ServiceRegistry, Error> {
    let container = Arc::new(Container::new());

    // Register infrastructure services
    container.register_singleton(Arc::new(
        HttpClient::new(&config.http_client)?
    )).await?;

    container.register_singleton(Arc::new(
        BackendManager::new(&config.backends)?
    )).await?;

    // Register core services
    container.register_singleton(Arc::new(
        BackendServiceImpl::new(container.clone())
    )).await?;

    container.register_singleton(Arc::new(
        ModelServiceImpl::new(container.clone())
    )).await?;

    // Create service registry
    let registry = ServiceRegistry::new(container);
    registry.initialize().await?;

    Ok(registry)
}

Service Dependencies

Services declare their dependencies through constructor injection:

pub struct ProxyServiceImpl {
    backend_service: Arc<dyn BackendService>,
    model_service: Arc<dyn ModelService>,
    retry_handler: Arc<dyn RetryHandler>,
    http_client: Arc<HttpClient>,
}

impl ProxyServiceImpl {
    pub async fn new(container: Arc<Container>) -> CoreResult<Self> {
        Ok(Self {
            backend_service: container.resolve().await?,
            model_service: container.resolve().await?,
            retry_handler: container.resolve().await?,
            http_client: container.resolve().await?,
        })
    }
}

Benefits

  • Testability: Services can be mocked for unit testing
  • Flexibility: Implementations can be swapped without code changes
  • Lifecycle Management: Container manages service initialization and cleanup
  • Circular Dependency Detection: Container prevents circular dependencies

Error Handling Strategy

The router implements a comprehensive error handling strategy with typed errors, intelligent recovery, and user-friendly responses.

Error Type Hierarchy

  • CoreError: Domain-level errors (validation, service failures, timeouts, configuration)
  • RouterError: Application-level errors combining Core, HTTP, Backend, and Model errors
  • HttpError: HTTP-specific errors (400 BadRequest, 401 Unauthorized, 404 NotFound, 500 InternalServerError, etc.)

Error Handling Principles

  1. Fail Fast: Validate inputs early with clear error messages
  2. Error Context: Include relevant context (field names, operation details)
  3. Retryable Classification: Distinguish between retryable (timeout, 503) and non-retryable (400, 401) errors
  4. User-Friendly Responses: Convert internal errors to OpenAI-compatible error format (see the sketch after this list)
  5. Structured Logging: Log errors with appropriate severity and context
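
As a sketch of principle 4, an internal error can be mapped to the OpenAI error envelope like this (the RouterError variants shown are illustrative placeholders, not the actual enum):

use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

fn to_openai_error(err: &RouterError) -> Response {
    // Variant names here are placeholders for the real RouterError.
    let (status, error_type, message) = match err {
        RouterError::BadRequest(msg) => (StatusCode::BAD_REQUEST, "invalid_request_error", msg.clone()),
        RouterError::Unauthorized => (StatusCode::UNAUTHORIZED, "authentication_error", "invalid API key".to_string()),
        _ => (StatusCode::INTERNAL_SERVER_ERROR, "api_error", "internal error".to_string()),
    };
    let body = json!({
        "error": { "message": message, "type": error_type, "code": null }
    });
    (status, Json(body)).into_response()
}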

Error Recovery Mechanisms

  • Circuit Breaker: Prevent cascading failures (see Circuit Breaker)
  • Retry with Exponential Backoff: Automatically retry transient failures
  • Model Fallback: Route to alternative models when primary unavailable (see Model Fallback System)
  • Graceful Degradation: Continue with reduced functionality when components fail

For detailed error handling, recovery strategies, monitoring, and troubleshooting, see error-handling.md.

Extension Points

Backend Type Architecture

The router supports multiple backend types with different API formats. Each backend type handles request/response transformation automatically.

Supported Backend Types

Backend Type  API Format               Authentication         Use Case
openai        OpenAI Chat Completions  Authorization: Bearer  OpenAI, Azure OpenAI, vLLM, LocalAI
anthropic     Anthropic Messages API   x-api-key header       Claude models via native API
gemini        OpenAI-compatible        Authorization: Bearer  Google Gemini via OpenAI compatibility layer

Anthropic Backend Architecture

The Anthropic backend provides native support for Claude models with automatic format translation:

┌─────────────────────────────────────────────────────────────────┐
│                   OpenAI Format Request                          │
│  POST /v1/chat/completions                                       │
│  { "model": "claude-haiku-4-5", "messages": [...] }             │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│              Request Transformation Layer                        │
│  transform_openai_to_anthropic_request()                        │
│  • Extract system messages → separate `system` parameter        │
│  • Transform image_url → Anthropic image format                 │
│  • Map max_tokens / max_completion_tokens                       │
│  • Convert reasoning_effort → thinking parameter                │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                  Anthropic Messages API                          │
│  POST https://api.anthropic.com/v1/messages                     │
│  Headers: x-api-key, anthropic-version: 2023-06-01              │
│  { "model": "...", "system": "...", "messages": [...] }         │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│              AnthropicStreamTransformer                          │
│  SSE Event Transformation (Anthropic → OpenAI format)           │
│  • message_start → initial chunk with role                      │
│  • content_block_delta → content chunks                         │
│  • thinking_delta → reasoning_content (extended thinking)       │
│  • message_delta → finish_reason mapping                        │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   OpenAI Format Response                         │
│  data: {"choices":[{"delta":{"content":"..."}}]}                │
└─────────────────────────────────────────────────────────────────┘

Key Transformations

Request Format Differences:

Aspect         OpenAI Format              Anthropic Format
System prompt  messages[0].role="system"  Separate system parameter
Auth header    Authorization: Bearer      x-api-key
Max tokens     Optional                   Required (max_tokens)
Images         image_url.url              source.type + source.data

Extended Thinking Support:

// OpenAI reasoning_effort → Anthropic thinking
{
  "reasoning_effort": "high"  // OpenAI format
}
// Transforms to:
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768   // Mapped from effort level
  }
}
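
A sketch of the effort-to-budget mapping (only high → 32768 is documented above; the low/medium values below are illustrative assumptions):

// Map OpenAI reasoning_effort to an Anthropic thinking budget.
fn thinking_budget(effort: &str) -> Option<u32> {
    match effort {
        "low" => Some(4096),     // assumption
        "medium" => Some(16384), // assumption
        "high" => Some(32768),   // per the example above
        _ => None,               // no thinking parameter sent
    }
}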

Adding New Backend Types

  1. Implement Backend Trait:

    // In src/infrastructure/backends/custom_backend.rs
    pub struct CustomBackend {
        client: Arc<HttpClient>,
        config: CustomBackendConfig,
    }
    
    #[async_trait]
    impl BackendTrait for CustomBackend {
        async fn health_check(&self) -> CoreResult<()> { /* ... */ }
        async fn list_models(&self) -> CoreResult<Vec<Model>> { /* ... */ }
        async fn chat_completion(&self, request: ChatRequest) -> CoreResult<Response> { /* ... */ }
    }
    

  2. Register in Backend Factory:

    // In src/infrastructure/backends/mod.rs
    pub fn create_backend(backend_type: &str, config: &BackendConfig) -> CoreResult<Box<dyn BackendTrait>> {
        match backend_type {
            "openai" => Ok(Box::new(OpenAIBackend::new(config)?)),
            "vllm" => Ok(Box::new(VLLMBackend::new(config)?)),
            "custom" => Ok(Box::new(CustomBackend::new(config)?)), // New backend
            _ => Err(CoreError::ValidationFailed {
                message: format!("Unknown backend type: {}", backend_type),
                field: Some("backend_type".to_string()),
            }),
        }
    }
    

Adding New Middleware

  1. Implement Middleware Trait:

    // In src/http/middleware/custom_middleware.rs
    pub struct CustomMiddleware {
        config: CustomConfig,
    }
    
    impl<S> tower::Layer<S> for CustomMiddleware {
        type Service = CustomMiddlewareService<S>;
    
        fn layer(&self, inner: S) -> Self::Service {
            CustomMiddlewareService { inner, config: self.config.clone() }
        }
    }
    

  2. Register in HTTP Router:

    // In src/main.rs
    let app = Router::new()
        .route("/v1/models", get(list_models))
        .layer(CustomMiddleware::new(config.custom))
        .layer(LoggingMiddleware::new())
        .with_state(state);
    

Adding New Cache Types

  1. Implement Cache Trait:

    // In src/infrastructure/cache/redis_cache.rs
    pub struct RedisCache {
        client: redis::Client,
        ttl: Duration,
    }
    
    #[async_trait]
    impl CacheTrait for RedisCache {
        async fn get<T>(&self, key: &str) -> CoreResult<Option<T>>
        where T: DeserializeOwned { /* ... */ }
    
        async fn set<T>(&self, key: &str, value: &T, ttl: Option<Duration>) -> CoreResult<()>
        where T: Serialize { /* ... */ }
    }
    

  2. Use in Service:

    // Services can use any cache implementation
    pub struct ModelServiceImpl<C: CacheTrait> {
        cache: Arc<C>,
        // ... other dependencies
    }
    

Adding New Load Balancing Strategies

// In src/services/load_balancer.rs
pub enum LoadBalancingStrategy {
    RoundRobin,
    WeightedRoundRobin,
    LeastConnections,  // New strategy
    Random,
}

impl LoadBalancingStrategy {
    pub fn select_backend(&self, backends: &[Backend]) -> Option<&Backend> {
        match self {
            Self::RoundRobin => self.select_round_robin(backends),
            Self::WeightedRoundRobin => self.select_weighted(backends),
            Self::LeastConnections => self.select_least_connections(backends),
            Self::Random => self.select_random(backends),
        }
    }
}

Design Decisions

Why 4-Layer Architecture?

Decision: Use a 4-layer architecture (HTTP → Services → Infrastructure → Core)

Rationale

  • Clear Separation: Each layer has distinct responsibilities
  • Testability: Layers can be tested independently
  • Maintainability: Changes in one layer don't affect others
  • Flexibility: Easy to swap implementations (e.g., different cache backends)

Trade-offs

  • Pros: Clean, maintainable, testable, extensible
  • Cons: More complexity, slight performance overhead
  • Verdict: Benefits outweigh costs for a production system

Why Dependency Injection?

Decision: Use a custom DI container instead of compile-time injection

Rationale

  • Runtime Flexibility: Can swap implementations based on configuration
  • Service Lifecycle: Centralized management of service initialization/cleanup
  • Testing: Easy to inject mocks and test doubles

Alternatives Considered

  • Manual dependency passing: Too verbose and error-prone
  • Compile-time DI (generics): Less flexible, harder to configure

Why Arc<RwLock<T>> for Shared State?

Decision: Use Arc<RwLock<T>> for shared mutable state

Rationale

  • Reader-Writer Semantics: Multiple readers, exclusive writers
  • Performance: Better than Arc<Mutex<T>> for read-heavy workloads
  • Safety: Prevents data races at compile time

Alternatives Considered

  • Arc<Mutex<T>>: Simpler but worse performance for reads
  • Channels: Too complex for simple shared state
  • Atomic types: Not suitable for complex data structures

Why async/await Throughout?

Decision: Use async/await for all I/O operations

Rationale

  • Performance: Non-blocking I/O allows high concurrency
  • Resource Efficiency: Lower memory usage than thread-per-request
  • Ecosystem: Rust async ecosystem (Tokio, reqwest, axum) is mature

Trade-offs

  • Pros: High performance, low resource usage, good ecosystem
  • Cons: Complexity, learning curve, debugging challenges
  • Verdict: Essential for high-performance network services

Why Configuration Hot-Reload?

Decision: Support configuration hot-reload using file watching

Rationale

  • Zero Downtime: Update configuration without restarting
  • Operations Friendly: Easy to adjust settings in production
  • Development: Faster iteration during development

Implementation

  • File system watcher detects changes
  • Validate new configuration before applying
  • Atomic updates to avoid inconsistent state
  • Fallback to previous config on validation errors

Performance Considerations

Memory Management

  1. Connection Pooling: Reuse HTTP connections to reduce allocation overhead
  2. Smart Caching: LRU eviction prevents unbounded memory growth
  3. Arc Cloning: Cheap reference counting instead of deep cloning
  4. Streaming: Process responses in chunks to avoid loading large responses into memory

Concurrency

  1. RwLock for Read-Heavy Workloads: Multiple concurrent readers for backend pool and model cache
  2. Lock-Free Where Possible: Use atomics for counters and simple state
  3. Async Task Spawning: Background tasks for health checks and cache updates
  4. Bounded Channels: Prevent unbounded queuing of tasks

I/O Optimization

  1. Connection Keep-Alive: TCP connections stay open for reuse
  2. Streaming Responses: Forward SSE chunks without buffering
  3. Timeouts: Prevent hanging on slow backends
  4. Retry with Backoff: Avoid overwhelming failing backends

Memory Layout

// Optimized data structures for cache efficiency
pub struct Backend {
    pub name: String,          // Inline string for small names
    pub url: Arc<str>,         // Shared string for URL
    pub weight: u32,           // Compact integer
    pub is_healthy: AtomicBool, // Lock-free health status
}

// Cache-friendly model storage
pub struct ModelCache {
    models: HashMap<String, Arc<ModelInfo>>, // Shared model info
    last_updated: AtomicU64,                 // Lock-free timestamp
    ttl: Duration,
}

Benchmarking Results

Based on our benchmarks (see benches/performance_benchmarks.rs):

  • Request Latency: < 5ms overhead for routing decisions
  • Memory Usage: ~50MB base memory, scales linearly with backends
  • Throughput: 1000+ requests/second on modest hardware
  • Connection Efficiency: 100+ concurrent connections per backend with minimal memory overhead

Rate Limiting

The router implements sophisticated rate limiting to protect against abuse and ensure fair resource allocation across clients.

Key Features:

  • Dual-window approach: sustained limit (100 req/min) + burst protection (20 req/5s)
  • Client identification by API key (preferred) or IP address (fallback)
  • Per-client isolation with automatic cache cleanup
  • DoS prevention with short TTL for empty responses

Rate Limit V2 Architecture

The enhanced rate limiting system (rate_limit_v2/) provides a modular, high-performance implementation:

Module Structure

src/http/middleware/rate_limit_v2/
├── mod.rs              # Public API and module exports
├── middleware.rs       # Axum middleware integration
├── store.rs            # Rate limit storage and client tracking
└── token_bucket.rs     # Token bucket algorithm implementation

Components

  1. Token Bucket Algorithm (token_bucket.rs) (see the sketch after this list)

    • Configurable bucket capacity and refill rate
    • Atomic operations for lock-free token consumption
    • Automatic token replenishment based on elapsed time
    • Separate buckets for sustained and burst limits

  2. Rate Limit Store (store.rs)

    • Per-client state tracking with DashMap for concurrent access
    • Automatic cleanup of expired client entries
    • Configurable TTL for inactive clients (default: 1 hour)
    • Memory-efficient with bounded storage

  3. Middleware Integration (middleware.rs)

    • Extracts client identifier (API key → IP address fallback)
    • Checks both sustained and burst limits before processing
    • Returns HTTP 429 (Too Many Requests) with Retry-After header
    • Prometheus metrics for monitoring rate limit hits
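
A minimal sketch of the bucket logic from component 1 (simplified with &mut self; the real implementation uses the atomics noted above):

use std::time::Instant;

pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    // Refill based on elapsed time, then try to consume one token.
    pub fn try_consume(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}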

Configuration Example

rate_limiting:
  enabled: true
  sustained:
    max_requests: 100
    window_seconds: 60
  burst:
    max_requests: 20
    window_seconds: 5
  cleanup_interval_seconds: 300

Decision Flow

Request arrives
Extract client ID (API key or IP)
Check sustained limit (100 req/min)
    ↓ OK
Check burst limit (20 req/5s)
    ↓ OK
Process request

For detailed configuration information, see configuration.md section on rate limiting.

Model Fallback System

The router implements a configurable model fallback system that automatically routes requests to alternative models when the primary model is unavailable.

Key Features:

  • Automatic fallback chain execution (e.g., gpt-4o → gpt-4-turbo → gpt-3.5-turbo)
  • Cross-provider fallback support with parameter translation
  • Integration with circuit breaker for intelligent triggering
  • Prometheus metrics for monitoring fallback usage

For detailed configuration and implementation, see error-handling.md section on model fallback.

Circuit Breaker

The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.

Three-State Machine:

State     Behavior
Closed    Normal operation. Failures are counted.
Open      Fast-fail mode. Requests rejected immediately.
HalfOpen  Recovery testing. Limited requests allowed.

Key Features:

  • Per-backend isolation with independent state
  • Lock-free atomic operations for minimal hot-path overhead
  • Admin endpoints for manual control (/admin/circuit/*)
  • Prometheus metrics for observability

For detailed configuration and implementation, see error-handling.md section on circuit breaker.

File Storage

The router provides OpenAI Files API compatible file storage with persistent metadata.

Key Features:

  • Persistent metadata storage with sidecar JSON files
  • Automatic recovery on server restart
  • Orphan file detection and cleanup
  • Pluggable backends (memory/persistent)

For detailed architecture and implementation, see File Storage Guide.

Image Generation Architecture

The router provides a unified interface for image generation across multiple backends (OpenAI GPT Image, DALL-E, and Google Gemini/Nano Banana) with automatic parameter translation.

Multi-Backend Image Generation

┌─────────────────────────────────────────────────────────────────┐
│                   OpenAI-Compatible Request                       │
│  POST /v1/images/generations                                     │
│  { "model": "...", "prompt": "...", "size": "1536x1024" }       │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                    Model Router (image_gen.rs)                    │
│  • Detects model type (GPT Image, DALL-E, Nano Banana)           │
│  • Routes to appropriate handler                                  │
│  • Handles streaming vs non-streaming                             │
└─────────────────────────────────────────────────────────────────┘
                   │                              │
        ┌──────────┘                              └──────────┐
        ▼                                                     ▼
┌───────────────────────────┐           ┌───────────────────────────┐
│    OpenAI Backend         │           │    Gemini Backend          │
│    (GPT Image, DALL-E)    │           │    (Nano Banana)           │
│                           │           │                            │
│  • Pass-through request   │           │  • Convert to Gemini API   │
│  • SSE streaming support  │           │  • Map size → aspectRatio  │
│  • output_format support  │           │  • imageConfig generation  │
└───────────────────────────┘           └───────────────────────────┘

OpenAI → Gemini Parameter Conversion

When using Nano Banana (Gemini) models, OpenAI-style parameters are automatically converted to Gemini's native format:

Size to Aspect Ratio Mapping

OpenAI size  Gemini aspectRatio  Gemini imageSize  Notes
256x256      1:1                 1K                Minimum Gemini size
512x512      1:1                 1K                Minimum Gemini size
1024x1024    1:1                 1K                Default
1536x1024    3:2                 1K                Landscape (new)
1024x1536    2:3                 1K                Portrait (new)
1792x1024    16:9                1K                Wide landscape
1024x1792    9:16                1K                Tall portrait
2048x2048    1:1                 2K                Pro models only
4096x4096    1:1                 4K                Pro models only
auto         1:1                 1K                Default fallback
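
The table above reduces to a small lookup. A sketch (returning Gemini string values directly, which simplifies the richer types used by the real converter):

fn map_size(size: &str) -> (&'static str, &'static str) {
    // (aspectRatio, imageSize) per the mapping table above
    match size {
        "256x256" | "512x512" | "1024x1024" | "auto" => ("1:1", "1K"),
        "1536x1024" => ("3:2", "1K"),
        "1024x1536" => ("2:3", "1K"),
        "1792x1024" => ("16:9", "1K"),
        "1024x1792" => ("9:16", "1K"),
        "2048x2048" => ("1:1", "2K"), // Pro models only
        "4096x4096" => ("1:1", "4K"), // Pro models only
        _ => ("1:1", "1K"),           // default fallback
    }
}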

Request Transformation

OpenAI Format (Input):

{
  "model": "nano-banana",
  "prompt": "A serene Japanese garden",
  "size": "1536x1024",
  "n": 1
}

Gemini Format (Converted):

{
  "contents": [
    {
      "parts": [{"text": "A serene Japanese garden"}]
    }
  ],
  "generationConfig": {
    "imageConfig": {
      "aspectRatio": "3:2",
      "imageSize": "1K"
    }
  }
}

Conversion Implementation

The conversion is handled by src/infrastructure/backends/gemini/image_generation.rs:

pub fn convert_openai_to_gemini(request: &OpenAIImageRequest)
    -> CoreResult<(String, GeminiImageRequest)>
{
    // 1. Map model name
    let gemini_model = map_model_to_gemini(&request.model);

    // 2. Parse size to aspect ratio and size category
    let parsed_size = parse_openai_size(&request.size, &request.model)?;

    // 3. Build Gemini request with imageConfig
    let gemini_request = GeminiImageRequest {
        contents: vec![GeminiContent { parts: vec![...] }],
        generation_config: Some(GeminiGenerationConfig {
            image_config: Some(GeminiImageConfig {
                aspect_ratio: Some(parsed_size.aspect_ratio.to_gemini_string()),
                image_size: Some(parsed_size.size_category.to_gemini_image_size()),
            }),
        }),
    };

    Ok((gemini_model, gemini_request))
}

Streaming Image Generation (SSE)

For GPT Image models, the router supports true SSE passthrough for streaming image generation:

┌─────────────┐                ┌─────────────┐                ┌─────────────┐
│   Client    │────stream:true─▶│   Router    │────stream:true─▶│   OpenAI    │
│             │                │             │                │             │
│             │◀───SSE events──│  Passthrough│◀───SSE events──│             │
└─────────────┘                └─────────────┘                └─────────────┘

SSE Event Types:

Event                           Description
image_generation.partial_image  Intermediate preview during generation
image_generation.complete       Final image data
image_generation.usage          Token usage for billing
done                            Stream completion

Implementation (src/proxy/image_gen.rs):

async fn handle_streaming_image_generation(...) -> Result<Response, StatusCode> {
    // 1. Keep stream: true in backend request
    // 2. Make streaming request via bytes_stream()
    // 3. Forward SSE events through tokio channel

    let (tx, rx) = tokio::sync::mpsc::unbounded_channel();

    tokio::spawn(async move {
        let mut stream = backend_response.bytes_stream();
        let mut event_type = String::new();
        while let Some(Ok(chunk)) = stream.next().await {
            let chunk_str = String::from_utf8_lossy(&chunk);
            // Parse SSE format (event:/data: lines) and forward to client
            for line in chunk_str.lines() {
                if let Some(name) = line.strip_prefix("event:") {
                    event_type = name.trim().to_string();
                } else if let Some(data) = line.strip_prefix("data:") {
                    let event = Event::default().event(&event_type).data(data.trim());
                    let _ = tx.send(Ok(event));
                }
            }
        }
    });

    Ok(Sse::new(UnboundedReceiverStream::new(rx)).into_response())
}

GPT Image Model Features

The router supports enhanced parameters for GPT Image models (gpt-image-1, gpt-image-1.5, gpt-image-1-mini):

Parameter           Description           Values
output_format       Image file format     png, jpeg, webp
output_compression  Compression level     0-100 (jpeg/webp only)
background          Transparency control  transparent, opaque, auto
quality             Generation quality    low, medium, high, auto
stream              Enable SSE streaming  true, false
partial_images      Preview count         0-3

Model Support Matrix

Feature           GPT Image 1.5  GPT Image 1  GPT Image 1 Mini  DALL-E 3     DALL-E 2  Nano Banana  Nano Banana Pro
Streaming
output_format
background
Custom quality                                                  standard/hd
Image Edit
Image Variations
Max Resolution    1536px         1536px       1536px            1792px       1024px    1024px       4096px

Image Edit and Variations

The router provides OpenAI-compatible image editing and variations endpoints through /v1/images/edits and /v1/images/variations.

Image Editing (/v1/images/edits)

Endpoint: POST /v1/images/edits

Allows editing an existing image with a text prompt and optional mask. Supported by GPT Image models and DALL-E 2.

Request Format (multipart/form-data):

image: <file>           # Original image (PNG, required)
prompt: <string>        # Edit instructions (required)
mask: <file>            # Optional mask image (PNG)
model: <string>         # Model name (e.g., "gpt-image-1", "dall-e-2")
n: <integer>            # Number of images (default: 1)
size: <string>          # Output size (e.g., "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Multipart form parsing for image and mask files
  • Image validation (format, size, aspect ratio)
  • Model-specific parameter transformation
  • Proper error handling for invalid inputs

Supported Features

  • Transparent PNG mask support for targeted editing
  • Multiple image generation (n parameter)
  • Flexible output sizes
  • Both URL and base64 response formats

Image Variations (/v1/images/variations)

Endpoint: POST /v1/images/variations

Creates variations of a given image. Supported by DALL-E 2 only.

Request Format (multipart/form-data):

image: <file>           # Source image (PNG, required)
model: <string>         # Model name (default: "dall-e-2")
n: <integer>            # Number of variations (default: 1, max: 10)
size: <string>          # Output size ("256x256", "512x512", "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Image file validation and preprocessing
  • DALL-E 2-specific routing
  • Error handling for unsupported models
  • Consistent response formatting

Key Features

  • Generate multiple variations in a single request
  • Automatic image format validation
  • Standard OpenAI response format compatibility

Image Utilities Module

The image_utils.rs module provides shared utilities for image processing:

Functions

  • validate_image_format(): Validates PNG/JPEG format and dimensions
  • parse_multipart_image_request(): Extracts images from multipart forms
  • check_image_dimensions(): Validates size constraints
  • format_image_error_response(): Standardized error responses

Validation Rules

  • Maximum file size: 4MB (configurable)
  • Supported formats: PNG (required for edits/variations), JPEG (generation only)
  • Aspect ratio constraints per model
  • Transparent PNG requirement for masks
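
A hedged sketch of the format check implied by these rules (the signature is assumed; the real function lives in image_utils.rs):

const MAX_IMAGE_BYTES: usize = 4 * 1024 * 1024; // 4MB default limit

fn validate_image_format(bytes: &[u8], allow_jpeg: bool) -> Result<(), String> {
    if bytes.len() > MAX_IMAGE_BYTES {
        return Err("image exceeds 4MB limit".into());
    }
    // Magic-byte checks: PNG = 89 50 4E 47, JPEG = FF D8 FF
    let is_png = bytes.starts_with(&[0x89, 0x50, 0x4E, 0x47]);
    let is_jpeg = bytes.starts_with(&[0xFF, 0xD8, 0xFF]);
    match (is_png, is_jpeg, allow_jpeg) {
        (true, _, _) => Ok(()),
        (_, true, true) => Ok(()),
        _ => Err("unsupported format (PNG required for edits/variations; JPEG for generation only)".into()),
    }
}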

This architecture provides a solid foundation for building a production-ready LLM router that can scale to handle thousands of requests while remaining maintainable and extensible. The clean separation of concerns makes it easy to add new features, swap implementations, and thoroughly test each component.