Architecture Guide

This document provides a comprehensive overview of Continuum Router's architecture, design decisions, and extension points.

Overview

Continuum Router is designed as a high-performance, production-ready LLM API router using a clean 4-layer architecture that provides clear separation of concerns, testability, and maintainability. The architecture follows Domain-Driven Design principles and dependency inversion to create a robust, extensible system.

Architecture Goals

  1. Separation of Concerns: Each layer has a single, well-defined responsibility
  2. Dependency Inversion: Higher layers depend on abstractions, not concrete implementations
  3. Testability: Each component can be unit tested in isolation
  4. Extensibility: New features can be added without modifying existing code
  5. Performance: Minimal overhead while maintaining clean architecture
  6. Reliability: Fail-fast design with comprehensive error handling

4-Layer Architecture

Layer Descriptions

1. HTTP Layer (src/http/)

Responsibility: Handle HTTP requests, responses, and web-specific concerns

Components

  • Routes (routes.rs): Define HTTP endpoints and route handling
  • Middleware (middleware/): Cross-cutting concerns (auth, logging, metrics, rate limiting)
  • DTOs (dto/): Data Transfer Objects for HTTP serialization/deserialization
  • Streaming (streaming/): Server-Sent Events (SSE) handling

Key Files

src/http/
├── mod.rs              # HTTP layer exports
├── routes.rs           # Route definitions and handlers
├── dto.rs              # Request/Response DTOs
├── handlers/           # Request handlers
│   ├── mod.rs
│   └── responses.rs    # Responses API handlers
├── middleware/         # HTTP middleware components
│   ├── mod.rs
│   ├── auth.rs         # API key authentication middleware
│   ├── admin_auth.rs   # Admin API authentication middleware
│   ├── files_auth.rs   # Files API authentication middleware
│   ├── admin_audit.rs  # Admin operations audit logging
│   ├── cors.rs         # CORS (Cross-Origin Resource Sharing) middleware
│   ├── logging.rs      # Request/response logging
│   ├── metrics.rs      # Metrics collection
│   ├── metrics_auth.rs # Metrics endpoint authentication
│   ├── model_extractor.rs # Model extraction from requests
│   ├── prometheus.rs   # Prometheus metrics integration
│   ├── rate_limit.rs   # Rate limiting middleware (legacy)
│   └── rate_limit_v2/  # Enhanced rate limiting (modular)
│       ├── mod.rs      # Module exports
│       ├── middleware.rs # Rate limiting middleware
│       ├── store.rs    # Rate limit storage and tracking
│       └── token_bucket.rs # Token bucket algorithm
└── streaming/          # SSE streaming handlers
    ├── mod.rs
    └── handler.rs      # Streaming response handling

Middleware Components

The HTTP layer includes several middleware components that provide cross-cutting concerns:

  • auth.rs: API key authentication for main endpoints (/v1/chat/completions, /v1/models, etc.)

    • Validates API keys from Authorization: Bearer <key> header
    • Supports multiple API keys configured in config.yaml
    • Returns 401 Unauthorized for invalid/missing keys
  • admin_auth.rs: Separate authentication for admin endpoints (/admin/*)

    • Uses dedicated admin API keys distinct from user API keys
    • Protects sensitive operations (config reload, circuit breaker control, health management)
    • Configurable via admin.api_keys in configuration
  • files_auth.rs: Authentication middleware for Files API (/v1/files/*)

    • Validates API keys specifically for file upload/download/deletion operations
    • Prevents unauthorized file access and manipulation
    • Integrates with file storage service for permission checks
  • admin_audit.rs: Audit logging middleware for admin operations

    • Records all admin API calls with timestamps and caller identification
    • Logs parameters and outcomes of sensitive operations
    • Provides audit trail for compliance and security monitoring
    • Configurable log levels and retention policies
  • cors.rs: CORS (Cross-Origin Resource Sharing) middleware

    • Enables embedding the router in web applications, Tauri apps, and Electron apps
    • Supports wildcard origins (*), exact origins, and port wildcards (http://localhost:*)
    • Custom scheme support for desktop apps (e.g., tauri://localhost)
    • Configurable methods, headers, credentials, and preflight cache duration
    • Applied early in the middleware stack for proper preflight handling
  • rate_limit_v2/: Enhanced rate limiting system (see Rate Limiting section)

    • Token bucket algorithm with per-client tracking
    • Separate limits for sustained rate and burst protection
    • Automatic cleanup of expired client entries
    • Detailed metrics for monitoring

2. Services Layer (src/services/)

Responsibility: Orchestrate business logic and coordinate between infrastructure components

Components

  • Backend Service (backend_service.rs): Manage backend pool, load balancing, health checks
  • Model Service (model_service.rs): Aggregate models from backends, handle caching, enrich with metadata
  • Proxy Service (proxy_service.rs): Route requests, handle retries, manage streaming
  • Health Service (health_service.rs): Monitor service health, track status
  • Service Registry (mod.rs): Manage service lifecycle and dependencies

Key Files

src/services/
├── mod.rs              # Service registry and management
├── backend_service.rs  # Backend management service
├── model_service.rs    # Model aggregation service
├── proxy_service.rs    # Request proxying and routing
├── health_service.rs   # Health monitoring service
├── deduplication.rs    # Request deduplication service
├── responses/          # Responses API support
│   ├── mod.rs
│   ├── converter.rs    # Response format conversion
│   ├── router.rs       # Routing strategy determination (pass-through vs conversion)
│   ├── passthrough.rs  # Direct pass-through for native OpenAI/Azure backends
│   ├── session.rs      # Session management
│   ├── stream_service.rs # Streaming service orchestration
│   └── streaming.rs    # Streaming response handling
└── streaming/          # Streaming utilities
    ├── mod.rs
    ├── parser.rs       # Stream parsing logic
    └── transformer.rs  # Stream transformation (OpenAI/Anthropic)

3. Infrastructure Layer (src/infrastructure/)

Responsibility: Provide concrete implementations of external systems and technical capabilities

Components

  • Backends (backends/): Specific backend implementations (OpenAI, vLLM, Ollama)
  • Cache (cache/): Caching implementations (LRU, TTL-based)
  • Configuration (config/): Configuration loading, watching, validation
  • HTTP Client (http_client.rs): HTTP client management and optimization

Key Files

src/infrastructure/
├── mod.rs              # Infrastructure exports and utilities
├── backends/           # Backend implementations
│   ├── mod.rs
│   ├── anthropic/      # Native Anthropic Claude backend
│   │   ├── mod.rs      # Backend implementation & request transformation
│   │   └── stream.rs   # SSE stream transformer (Anthropic → OpenAI)
│   ├── gemini/         # Native Google Gemini backend
│   │   ├── mod.rs      # Backend implementation with TTFB optimization
│   │   └── stream.rs   # SSE stream transformer (Gemini → OpenAI)
│   ├── openai/         # OpenAI-compatible backend
│   │   ├── mod.rs
│   │   ├── backend.rs  # OpenAI backend implementation
│   │   └── models/     # OpenAI-specific model definitions
│   ├── factory/        # Backend factory pattern
│   │   ├── mod.rs
│   │   └── backend_factory.rs  # Creates backends from config
│   ├── pool/           # Backend pooling and management
│   │   ├── mod.rs
│   │   ├── backend_pool.rs     # Connection pool management
│   │   └── backend_manager.rs  # Backend lifecycle management
│   ├── generic/        # Generic backend implementations
│   │   └── mod.rs
│   └── vllm.rs         # vLLM backend implementation
├── common/             # Shared infrastructure utilities
│   ├── mod.rs
│   ├── executor.rs     # Request execution with retry/metrics
│   ├── headers.rs      # HTTP header utilities
│   ├── http_client.rs  # HTTP client factory with pooling
│   ├── statistics.rs   # Backend statistics collection
│   └── url_validator.rs # URL validation and security
├── transport/          # Transport abstraction layer
│   ├── mod.rs          # Transport enum (HTTP, UnixSocket)
│   └── unix_socket.rs  # Unix Domain Socket client
├── cache/              # Caching implementations
│   ├── mod.rs
│   ├── lru_cache.rs    # LRU cache implementation
│   └── retry_cache.rs  # Retry-aware cache
├── config/             # Configuration management
│   ├── mod.rs
│   ├── loader.rs       # Configuration loading
│   ├── validator.rs    # Configuration validation
│   ├── timeout_validator.rs # Timeout configuration validation
│   ├── watcher.rs      # File watching for hot-reload
│   ├── migrator.rs     # Configuration migration orchestrator
│   ├── migration.rs    # Migration types and traits
│   ├── migrations.rs   # Specific migration implementations
│   ├── fixer.rs        # Auto-correction logic
│   ├── backup.rs       # Backup management
│   └── secrets.rs      # Secret/API key management
└── lock_optimization.rs # Lock and concurrency optimization

4. Core Layer (src/core/)

Responsibility: Define domain models, business rules, and fundamental abstractions

Components

  • Models (models/): Core domain entities (Backend, Model, Request, Response)
  • Traits (traits.rs): Core interfaces and contracts
  • Errors (errors.rs): Domain-specific error types and handling
  • Retry (retry/): Retry policies and strategies
  • Container (container.rs): Dependency injection container

Key Files

src/core/
├── mod.rs              # Core exports and utilities
├── models/             # Domain models
│   ├── mod.rs
│   ├── backend.rs      # Backend domain model
│   ├── model.rs        # LLM model representation
│   ├── request.rs      # Request models
│   └── responses.rs    # Response models (Responses API)
├── traits.rs           # Core traits and interfaces
├── errors.rs           # Error types and handling
├── container.rs        # Dependency injection container
├── async_utils.rs      # Async utility functions
├── duration_utils.rs   # Duration parsing utilities
├── streaming/          # Streaming models
│   ├── mod.rs
│   └── models.rs       # Streaming-specific models
├── retry/              # Retry mechanisms
│   ├── mod.rs
│   ├── policy.rs       # Retry policies
│   └── strategy.rs     # Retry strategies
├── circuit_breaker/    # Circuit breaker pattern
│   ├── mod.rs          # Module exports
│   ├── config.rs       # Configuration models
│   ├── state.rs        # State machine and breaker logic
│   ├── error.rs        # Circuit breaker errors
│   ├── metrics.rs      # Prometheus metrics
│   └── tests.rs        # Unit tests
├── files/              # File processing utilities
│   ├── mod.rs          # Module exports
│   ├── resolver.rs     # File reference resolution in chat requests
│   ├── transformer.rs  # Message transformation with file content
│   └── transformer_utils.rs # Transformation utility functions
├── tool_calling/       # Tool calling transformation
│   ├── mod.rs          # Module exports
│   ├── definitions.rs  # Tool/function type definitions
│   └── transform.rs    # Backend-specific transformations
└── config/             # Configuration models
    ├── mod.rs
    ├── models/            # Configuration data models (modular structure)
    │   ├── mod.rs         # Re-exports for backward compatibility
    │   ├── config.rs      # Main Config struct, ServerConfig, BackendConfig
    │   ├── backend_type.rs # BackendType enum definitions
    │   ├── model_metadata.rs # ModelMetadata, PricingInfo, CapabilityInfo
    │   ├── global_prompts.rs # GlobalPrompts configuration
    │   ├── samples.rs     # Sample generation configurations
    │   ├── validation.rs  # Configuration validation logic
    │   └── error.rs       # Configuration-specific errors
    ├── timeout_models.rs # Timeout configuration models
    ├── cached_timeout.rs # Cached timeout resolution
    ├── optimized_retry.rs # Optimized retry configuration
    ├── metrics.rs      # Metrics configuration
    └── rate_limit.rs   # Rate limit configuration

Core Components

Backend Pool

Location: src/backend.rs (legacy) → src/services/backend_service.rs

Purpose: Manages multiple LLM backends with intelligent load balancing

pub struct BackendPool {
    backends: Arc<RwLock<Vec<Backend>>>,
    load_balancer: LoadBalancingStrategy,
    health_checker: Option<Arc<HealthChecker>>,
}

impl BackendPool {
    // Round-robin load balancing with health awareness
    pub async fn select_backend(&self) -> Option<Backend> { /* ... */ }

    // Filter backends by model availability
    pub async fn backends_for_model(&self, model: &str) -> Vec<Backend> { /* ... */ }
}

Health Checker

Location: src/health.rs (legacy) → src/services/health_service.rs

Purpose: Monitor backend health with configurable thresholds, automatic recovery, and accelerated warmup detection

pub struct HealthChecker {
    backends: Arc<RwLock<Vec<Backend>>>,
    config: HealthConfig,
    status_map: Arc<RwLock<HashMap<String, HealthStatus>>>,
}

pub struct HealthConfig {
    pub interval: Duration,
    pub timeout: Duration,
    pub unhealthy_threshold: u32,     // Failures before marking unhealthy
    pub healthy_threshold: u32,       // Successes before marking healthy
    pub warmup_check_interval: Duration, // Accelerated interval during warmup (default: 1s)
    pub max_warmup_duration: Duration,   // Max time in warmup mode (default: 300s)
}

pub enum HealthStatus {
    Healthy,      // Backend responding with HTTP 200
    Unhealthy,    // Connection failure or error
    WarmingUp,    // HTTP 503 - backend loading (accelerated checks)
    Unknown,      // Initial state
}

Accelerated Warmup Health Checks

When a backend returns HTTP 503 (Service Unavailable), it enters the WarmingUp state. During this state:

  • Health checks run at warmup_check_interval (default: 1 second) instead of the normal interval
  • This reduces model availability detection from ~30 seconds to ~1 second
  • After max_warmup_duration, the backend is marked as Unhealthy
  • Particularly useful for llama.cpp backends that return HTTP 503 during model loading
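
The interval selection can be pictured with the following sketch (illustrative only; next_check_interval and the warmup_started argument are assumptions, not the actual HealthChecker API):

use std::time::{Duration, Instant};

impl HealthChecker {
    // Hypothetical helper: choose how long to wait before the next probe of a backend.
    fn next_check_interval(&self, status: &HealthStatus, warmup_started: Instant) -> Duration {
        match status {
            // Still warming up and within the allowed window: poll at the accelerated interval.
            HealthStatus::WarmingUp if warmup_started.elapsed() < self.config.max_warmup_duration => {
                self.config.warmup_check_interval
            }
            // Healthy, unhealthy, unknown, or warmup expired: fall back to the normal interval.
            _ => self.config.interval,
        }
    }
}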

Transport Layer

Location: src/infrastructure/transport/

Purpose: Provide a unified transport abstraction for backend communication over HTTP/HTTPS or Unix Domain Sockets

The transport layer enables secure local LLM communication via Unix sockets, eliminating the need for TCP port exposure.

URL Schemes

  • http:// (TCP/HTTP), e.g. http://localhost:8080/v1
  • https:// (TCP/HTTPS), e.g. https://api.openai.com/v1
  • unix:// (Unix Domain Socket), e.g. unix:///var/run/llama.sock

Transport Enum

pub enum Transport {
    Http { url: String },
    UnixSocket { socket_path: PathBuf },
}

impl Transport {
    pub fn from_url(url: &str) -> Result<Self, TransportError>;
    pub fn is_unix_socket(&self) -> bool;
    pub fn is_http(&self) -> bool;
}
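
A brief usage sketch of the abstraction above (the detect_transports helper is hypothetical; error handling is simplified):

// Hypothetical helper demonstrating Transport::from_url.
fn detect_transports() -> Result<(), TransportError> {
    let http = Transport::from_url("http://localhost:8080/v1")?;
    assert!(http.is_http());

    let local = Transport::from_url("unix:///var/run/llama.sock")?;
    assert!(local.is_unix_socket());
    // Requests to `local` are sent as HTTP over the Unix socket (see UnixSocketClient below).
    Ok(())
}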

Unix Socket Client

The UnixSocketClient provides HTTP-over-Unix-socket communication:

pub struct UnixSocketClient {
    socket_path: PathBuf,
    config: UnixSocketClientConfig,
}

impl UnixSocketClient {
    pub async fn get(&self, endpoint: &str, headers: Option<Vec<(String, String)>>)
        -> Result<UnixSocketResponse, UnixSocketError>;
    pub async fn post(&self, endpoint: &str, headers: Option<Vec<(String, String)>>, body: Bytes)
        -> Result<UnixSocketResponse, UnixSocketError>;
    pub async fn health_check(&self) -> Result<bool, UnixSocketError>;
}

Security Features

  • Path Traversal Protection: Validates socket paths to prevent directory traversal attacks
  • CRLF Injection Protection: Validates endpoints and headers for HTTP header injection
  • Response Size Limits: Configurable max response size (default: 100MB)

Platform Support

  • Linux: Full support via AF_UNIX
  • macOS: Full support via AF_UNIX
  • Windows: Planned for future releases

Model Aggregation Service

Location: src/models/ (modular structure)

Purpose: Aggregate and cache model information from all backends, enrich it with metadata, and prevent cache stampedes

Module Structure (refactored from single models.rs file):

src/models/
├── mod.rs             # Re-exports for backward compatibility
├── types.rs           # Model, AggregatedModel, ModelList, SingleModelResponse types
├── metrics.rs         # ModelMetrics tracking (includes stampede metrics)
├── cache.rs           # ModelCache with stale-while-revalidate & singleflight
├── config.rs          # ModelAggregationConfig
├── fetcher.rs         # Model fetching from backends
├── handlers.rs        # HTTP handlers for /v1/models and /v1/models/{model} endpoints
├── background_refresh.rs  # Background periodic cache refresh service
├── utils.rs           # Utility functions (normalize_model_id, etc.)
└── aggregation/       # Core aggregation logic
    ├── mod.rs         # ModelAggregationService implementation
    └── tests.rs       # Unit tests

Cache Stampede Prevention

The model aggregation service implements three strategies to prevent cache stampede (thundering herd problem):

  1. Singleflight Pattern: Only one aggregation request runs at a time. Concurrent requests wait for the ongoing aggregation to complete, then share the result.

  2. Stale-While-Revalidate: When cache is stale (between soft and hard TTL), return stale data immediately while triggering a background refresh. Clients get fast responses with potentially slightly stale data.

  3. Background Periodic Refresh: A background task proactively refreshes the cache before expiration. Requests never block on cache refresh (except cold start).

pub struct ModelAggregationService {
    cache: ModelCache,  // With singleflight lock and background refresh tracking
    config: ModelAggregationConfig,
    fetcher: ModelFetcher,
}

impl ModelAggregationService {
    // Aggregate models with singleflight protection
    pub async fn aggregate_models_with_singleflight(&self, state: &Arc<AppState>)
        -> Result<AggregatedModelsResponse, StatusCode> { /* ... */ }

    // Find backends with stale-while-revalidate support
    pub async fn find_backends_for_model(&self, state: &Arc<AppState>, model_id: &str)
        -> Vec<String> { /* ... */ }

    // Clear cache (used during hot reload)
    pub fn clear_cache(&self) { /* ... */ }
}

Cache TTL Configuration

The cache uses a dual-TTL approach:

  • Soft TTL (80% of the hard TTL): Triggers a background refresh and returns stale data
  • Hard TTL (configured value): Requires a blocking refresh
  • Empty Response TTL (5 seconds): Short TTL for empty responses to prevent DoS
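
The following sketch illustrates the dual-TTL decision (CacheDecision and classify are illustrative names, not the actual ModelCache API):

use std::time::Duration;

// Illustrative classification of a cached entry's age against the dual TTLs.
enum CacheDecision {
    Fresh,            // within the soft TTL: serve cached data as-is
    StaleRevalidate,  // between soft and hard TTL: serve stale data, refresh in background
    Expired,          // past the hard TTL: a blocking refresh is required
}

fn classify(age: Duration, hard_ttl: Duration) -> CacheDecision {
    let soft_ttl = hard_ttl.mul_f64(0.8); // soft TTL is 80% of the configured hard TTL
    if age < soft_ttl {
        CacheDecision::Fresh
    } else if age < hard_ttl {
        CacheDecision::StaleRevalidate
    } else {
        CacheDecision::Expired
    }
}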

Metrics

New metrics for cache stampede monitoring:

  • stale_while_revalidate: Requests that returned stale data during refresh
  • coalesced_requests: Requests that waited for ongoing aggregation
  • background_refreshes: Background refresh operations initiated
  • background_refresh_successes/failures: Background refresh outcomes
  • singleflight_lock_acquired: Times the aggregation lock was acquired

Proxy Module

Location: src/proxy/ (modular structure)

Purpose: Handle request proxying, backend selection, file resolution, and image generation/editing

Module Structure (refactored from single proxy.rs file):

src/proxy/
├── mod.rs             # Re-exports for backward compatibility
├── backend.rs         # Backend selection and routing logic
├── request.rs         # Request execution with retry logic
├── files.rs           # File reference resolution in requests
├── image_gen.rs       # Image generation handling (DALL-E, Gemini, GPT Image)
├── image_edit.rs      # Image editing support (/v1/images/edits)
├── image_utils.rs     # Image processing utilities (multipart, validation)
├── handlers.rs        # HTTP handlers for proxy endpoints
├── utils.rs           # Utility functions (error responses, etc.)
└── tests.rs           # Unit tests

Key Responsibilities

  • Backend Selection: Intelligent routing to available backends
  • File Resolution: Resolve file references in chat requests
  • Image Generation: Support for OpenAI (DALL-E, GPT Image) and Gemini (Nano Banana) image models
  • Image Editing: Image editing and variations endpoints
  • Request Retry: Automatic retry with exponential backoff
  • Error Handling: Standardized error responses in OpenAI format

Retry Handler

Location: src/services/deduplication.rs

Purpose: Implement exponential backoff with jitter and request deduplication

pub struct EnhancedRetryHandler {
    config: RetryConfig,
    dedup_cache: Arc<Mutex<HashMap<String, CachedResponse>>>,
    dedup_ttl: Duration,
}

pub struct RetryConfig {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
    pub exponential_backoff: bool,
    pub jitter: bool,
}
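
A sketch of how the per-attempt delay could be derived from RetryConfig, assuming the rand crate for jitter (delay_for_attempt is an illustrative helper, not the actual handler API):

use rand::Rng;
use std::time::Duration;

// Illustrative delay computation for a 1-based attempt number.
fn delay_for_attempt(cfg: &RetryConfig, attempt: u32) -> Duration {
    // Exponential backoff: base_delay * 2^(attempt - 1), capped at max_delay.
    let exponent = if cfg.exponential_backoff { attempt.saturating_sub(1) } else { 0 };
    let factor = 2u32.saturating_pow(exponent);
    let capped = cfg.base_delay.saturating_mul(factor).min(cfg.max_delay);

    if cfg.jitter {
        // Full jitter: pick a uniform delay in [0, capped] so retries do not synchronize.
        let millis = rand::thread_rng().gen_range(0..=capped.as_millis() as u64);
        Duration::from_millis(millis)
    } else {
        capped
    }
}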

Circuit Breaker

Location: src/core/circuit_breaker/

Purpose: Prevent cascading failures by automatically stopping requests to failing backends

pub struct CircuitBreaker {
    states: Arc<DashMap<String, BackendCircuitState>>,
    config: CircuitBreakerConfig,
    metrics: Option<CircuitBreakerMetrics>,
}

pub struct CircuitBreakerConfig {
    pub enabled: bool,
    pub failure_threshold: u32,           // Failures before opening (default: 5)
    pub failure_rate_threshold: f64,      // Failure rate threshold (default: 0.5)
    pub minimum_requests: u32,            // Min requests before rate calculation
    pub timeout_seconds: u64,             // How long circuit stays open (default: 60s)
    pub half_open_max_requests: u32,      // Max requests in half-open state
    pub half_open_success_threshold: u32, // Successes needed to close
}

pub enum CircuitState {
    Closed,    // Normal operation - requests pass through
    Open,      // Failing fast - requests rejected immediately
    HalfOpen,  // Testing recovery - limited requests allowed
}

Key Features

  • Per-backend circuit breakers with independent state
  • Atomic operations for lock-free state checking in hot path
  • Automatic state transitions based on success/failure patterns
  • Sliding window for failure rate calculation
  • Prometheus metrics for observability
  • Admin endpoints for manual control
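
A simplified sketch of the admission decision implied by the state machine (allow_request and the accessor names on the per-backend state are assumptions; the real hot path uses atomics as noted above):

impl CircuitBreaker {
    // Hypothetical admission check; accessor names on BackendCircuitState are illustrative.
    fn allow_request(&self, backend: &str) -> bool {
        let Some(state) = self.states.get(backend) else {
            return true; // no recorded state yet: allow the request
        };
        match state.current_state() {
            CircuitState::Closed => true,
            // Open: reject until the configured timeout has elapsed, then probe (half-open).
            CircuitState::Open => state.seconds_since_opened() >= self.config.timeout_seconds,
            // Half-open: allow only a bounded number of trial requests.
            CircuitState::HalfOpen => state.half_open_requests() < self.config.half_open_max_requests,
        }
    }
}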

Container (Dependency Injection)

Location: src/core/container.rs

Purpose: Manage service lifecycles and dependencies

pub struct Container {
    services: Arc<RwLock<HashMap<TypeId, Box<dyn Any + Send + Sync>>>>,
    singletons: Arc<RwLock<HashMap<TypeId, Arc<dyn Any + Send + Sync>>>>,
}

impl Container {
    // Register singleton service
    pub async fn register_singleton<T>(&self, instance: Arc<T>) -> CoreResult<()> 
    where T: 'static + Send + Sync { /* ... */ }

    // Resolve service dependency
    pub async fn resolve<T>(&self) -> CoreResult<Arc<T>>
    where T: 'static + Send + Sync { /* ... */ }
}

Data Flow

Request Processing Flow

sequenceDiagram
    participant Client
    participant HTTPLayer as HTTP Layer
    participant ProxyService as Proxy Service
    participant BackendService as Backend Service
    participant ModelService as Model Service
    participant Backend as LLM Backend

    Client->>HTTPLayer: POST /v1/chat/completions
    HTTPLayer->>HTTPLayer: Apply Middleware (auth, logging, metrics)
    HTTPLayer->>ProxyService: Forward Request

    ProxyService->>ModelService: Get Model Info
    ModelService->>ModelService: Check Cache
    alt Cache Miss
        ModelService->>BackendService: Get Backends for Model
        BackendService->>Backend: Query Models
        Backend-->>BackendService: Model List
        BackendService-->>ModelService: Filtered Backends
        ModelService->>ModelService: Update Cache
    end
    ModelService-->>ProxyService: Model Available on Backends

    ProxyService->>BackendService: Select Healthy Backend
    BackendService->>BackendService: Apply Load Balancing
    BackendService-->>ProxyService: Selected Backend

    ProxyService->>Backend: Forward Request
    Backend-->>ProxyService: Response (streaming or non-streaming)

    ProxyService->>ProxyService: Apply Response Processing
    ProxyService-->>HTTPLayer: Processed Response
    HTTPLayer-->>Client: HTTP Response

Health Check Flow

sequenceDiagram
    participant HealthService as Health Service
    participant BackendPool as Backend Pool
    participant Backend as LLM Backend
    participant Cache as Health Cache

    loop Every Interval
        HealthService->>BackendPool: Get All Backends
        BackendPool-->>HealthService: Backend List

        par For Each Backend
            HealthService->>Backend: GET /v1/models (or /health)
            alt Success
                Backend-->>HealthService: 200 OK + Model List
                HealthService->>Cache: Update: consecutive_successes++
                HealthService->>HealthService: Mark Healthy if threshold met
            else Failure
                Backend-->>HealthService: Error/Timeout
                HealthService->>Cache: Update: consecutive_failures++
                HealthService->>HealthService: Mark Unhealthy if threshold met
            end
        end

        HealthService->>BackendPool: Update Backend Health Status
    end

Hot Reload Service

Location: src/infrastructure/config/hot_reload.rs, src/infrastructure/config/hot_reload_service.rs

Purpose: Provide runtime configuration updates without server restart

The hot reload system enables zero-downtime configuration changes through automatic file watching and intelligent component updates.

Key Architecture Components

  • ConfigManager: File system watching using notify crate, publishes updates via tokio::sync::watch channel
  • HotReloadService: Computes configuration differences, classifies changes (immediate/gradual/restart)
  • Component Updates: Interior mutability patterns (RwLock) for atomic updates to HealthChecker, CircuitBreaker, RateLimitStore, BackendPool

Backend Pool Hot Reload

The BackendPool uses interior mutability (Arc<RwLock<Vec<Arc<Backend>>>>) to support runtime backend additions and removals without service interruption.

Backend States:

  • Active: Normal operation, accepts new requests
  • Draining: Removed from config, existing requests continue, no new requests
  • Removed: Fully cleaned up after all references released

Graceful Draining Process:

  1. When a backend is removed from config, it's marked as Draining
  2. New requests skip draining backends
  3. Existing in-flight requests/streams continue uninterrupted (Arc reference counting)
  4. Background cleanup task runs every 10 seconds
  5. Backends are removed when: no references remain OR 5-minute timeout exceeded

This ensures zero impact on ongoing connections during configuration changes.
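
A sketch of how selection can skip draining backends while in-flight requests keep their Arc references alive (selectable_backends, is_healthy, and is_draining are illustrative names):

use std::sync::Arc;

impl BackendPool {
    // Hypothetical selection helper: only healthy, non-draining backends receive new requests.
    async fn selectable_backends(&self) -> Vec<Arc<Backend>> {
        self.backends
            .read()
            .await
            .iter()
            .filter(|b| b.is_healthy() && !b.is_draining()) // draining backends only finish in-flight work
            .cloned()
            .collect()
    }
}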

Change Classification

  • Immediate Update: logging.level, rate_limiting.*, circuit_breaker.*, retry.*, global_prompts.*
  • Gradual Update: backends.*, health_checks.*, timeouts.*
  • Requires Restart: server.bind_address, server.workers

Admin API: /admin/config/hot-reload-status for inspecting hot reload capabilities

For detailed hot reload configuration, process flow, and usage examples, see configuration.md section on hot reload.

Configuration Migration System

Location: src/infrastructure/config/{migrator,migration,migrations,fixer,backup}.rs

Purpose: Automatically detect and fix configuration issues, migrate schemas, and ensure configuration validity

The configuration migration system provides a comprehensive solution for handling configuration evolution and maintenance. It automatically:

  • Detects and migrates outdated schema versions
  • Fixes common syntax errors in YAML/TOML files
  • Validates and corrects configuration values
  • Creates backups before making changes
  • Provides a dry-run capability for previewing changes

Architecture Components

1. Migration Orchestrator (migrator.rs)

  • Main entry point for migration operations
  • Coordinates the entire migration workflow
  • Manages backup creation and restoration
  • Implements security validations (path traversal, file size limits)

2. Migration Framework (migration.rs)

  • Defines core types and traits for migrations
  • Migration trait for implementing version upgrades
  • ConfigIssue enum for categorizing problems
  • MigrationResult for tracking changes

3. Schema Migrations (migrations.rs)

  • Concrete migration implementations (e.g., V1ToV2Migration)
  • Transforms configuration structure between versions
  • Example: converting backend_url to a backends array

4. Auto-Correction Engine (fixer.rs)

  • Detects and fixes common configuration errors
  • Duration format correction (e.g., "10 seconds" → "10s")
  • URL validation and protocol addition
  • Field deprecation handling

5. Backup Manager (backup.rs)

  • Creates timestamped backups before modifications
  • Implements resource limits (10MB per file, 100MB total, max 50 backups)
  • Automatic cleanup of old backups
  • Preserves file permissions

Migration Workflow

graph TD
    A[Read Config File] --> B[Validate Path & Size]
    B --> C[Create Backup]
    C --> D[Parse Configuration]
    D --> E{Parse Success?}
    E -->|No| F[Fix Syntax Errors]
    F --> D
    E -->|Yes| G[Detect Schema Version]
    G --> H{Needs Migration?}
    H -->|Yes| I[Apply Migrations]
    H -->|No| J[Validate Values]
    I --> J
    J --> K{Issues Found?}
    K -->|Yes| L[Apply Auto-Fixes]
    K -->|No| M[Return Config]
    L --> N[Write Updated Config]
    N --> M

Security Features

  1. Path Traversal Protection: Validates paths to prevent directory traversal attacks
  2. File Size Limits: Maximum 10MB configuration files to prevent DoS
  3. Format Validation: Only processes .yaml, .yml, and .toml files
  4. System Directory Protection: Blocks access to sensitive system paths
  5. Test Mode Relaxation: Uses conditional compilation for test-friendly validation

Example Migration: v1.0 to v2.0

// V1ToV2Migration implementation
fn migrate(&self, config: &mut Value) -> Result<(), MigrationError> {
    // Convert single backend_url to backends array
    if let Some(backend_url) = config.get("backend_url") {
        let mut backends = Vec::new();
        let mut backend = Map::new();
        backend.insert("url".to_string(), backend_url.clone());

        // Move models to backend
        if let Some(model) = config.get("model") {
            backend.insert("models".to_string(), 
                Value::Sequence(vec![model.clone()]));
        }

        backends.push(Value::Mapping(backend));
        config["backends"] = Value::Sequence(backends);

        // Remove old fields
        config.remove("backend_url");
        config.remove("model");
    }
    Ok(())
}

Configuration Loading Flow

graph TD
    A[Application Start] --> B[Config Manager Init]
    B --> C{Config File Specified?}
    C -->|Yes| D[Load Specified File]
    C -->|No| E[Search Standard Locations]

    E --> F{Config File Found?}
    F -->|Yes| G[Load Config File]
    F -->|No| H[Use CLI Args + Env Vars + Defaults]

    D --> I[Parse YAML]
    G --> I
    H --> J[Create Config from Args]

    I --> K[Apply Environment Variable Overrides]
    J --> K

    K --> L[Apply CLI Argument Overrides]
    L --> M[Validate Configuration]
    M --> N{Valid?}
    N -->|Yes| O[Return Config]
    N -->|No| P[Exit with Error]

    O --> Q[Start File Watcher for Hot Reload]
    Q --> R[Application Running]

    Q --> S[Config File Changed]
    S --> T[Reload and Validate]
    T --> U{Valid?}
    U -->|Yes| V[Apply New Config]
    U -->|No| W[Log Error, Keep Old Config]

    V --> R
    W --> R

Dependency Injection

Service Registration

Services are registered in the container during application startup:

// In main.rs
async fn setup_services(config: Config) -> Result<ServiceRegistry, Error> {
    let container = Arc::new(Container::new());

    // Register infrastructure services
    container.register_singleton(Arc::new(
        HttpClient::new(&config.http_client)?
    )).await?;

    container.register_singleton(Arc::new(
        BackendManager::new(&config.backends)?
    )).await?;

    // Register core services
    container.register_singleton(Arc::new(
        BackendServiceImpl::new(container.clone())
    )).await?;

    container.register_singleton(Arc::new(
        ModelServiceImpl::new(container.clone())
    )).await?;

    // Create service registry
    let registry = ServiceRegistry::new(container);
    registry.initialize().await?;

    Ok(registry)
}

Service Dependencies

Services declare their dependencies through constructor injection:

pub struct ProxyServiceImpl {
    backend_service: Arc<dyn BackendService>,
    model_service: Arc<dyn ModelService>,
    retry_handler: Arc<dyn RetryHandler>,
    http_client: Arc<HttpClient>,
}

impl ProxyServiceImpl {
    pub fn new(container: Arc<Container>) -> CoreResult<Self> {
        Ok(Self {
            backend_service: container.resolve()?,
            model_service: container.resolve()?,
            retry_handler: container.resolve()?,
            http_client: container.resolve()?,
        })
    }
}

Benefits

  • Testability: Services can be mocked for unit testing
  • Flexibility: Implementations can be swapped without code changes
  • Lifecycle Management: Container manages service initialization and cleanup
  • Circular Dependency Detection: Container prevents circular dependencies

Error Handling Strategy

The router implements a comprehensive error handling strategy with typed errors, intelligent recovery, and user-friendly responses.

Error Type Hierarchy

  • CoreError: Domain-level errors (validation, service failures, timeouts, configuration)
  • RouterError: Application-level errors combining Core, HTTP, Backend, and Model errors
  • HttpError: HTTP-specific errors (400 BadRequest, 401 Unauthorized, 404 NotFound, 500 InternalServerError, etc.)

Error Handling Principles

  1. Fail Fast: Validate inputs early with clear error messages
  2. Error Context: Include relevant context (field names, operation details)
  3. Retryable Classification: Distinguish between retryable (timeout, 503) and non-retryable (400, 401) errors
  4. User-Friendly Responses: Convert internal errors to OpenAI-compatible error format
  5. Structured Logging: Log errors with appropriate severity and context
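
A sketch of the retryable classification from principle 3 (the exact status-code mapping is an assumption based on the examples above, not the precise RouterError logic):

// Classify whether an upstream failure should be retried on another attempt/backend.
fn is_retryable(status: Option<u16>, is_timeout: bool) -> bool {
    if is_timeout {
        return true; // timeouts are treated as transient
    }
    match status {
        Some(429) | Some(502) | Some(503) | Some(504) => true,  // overload / transient upstream errors
        Some(400) | Some(401) | Some(403) | Some(404) => false, // client or configuration errors
        _ => false,
    }
}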

Error Recovery Mechanisms

  • Circuit Breaker: Prevent cascading failures (see Circuit Breaker)
  • Retry with Exponential Backoff: Automatically retry transient failures
  • Model Fallback: Route to alternative models when primary unavailable (see Model Fallback System)
  • Graceful Degradation: Continue with reduced functionality when components fail

For detailed error handling, recovery strategies, monitoring, and troubleshooting, see error-handling.md.

Extension Points

Backend Type Architecture

The router supports multiple backend types with different API formats. Each backend type handles request/response transformation automatically.

Supported Backend Types

  • openai: OpenAI Chat Completions API, authenticated with Authorization: Bearer. Used for OpenAI, Azure OpenAI, vLLM, and LocalAI.
  • anthropic: Anthropic Messages API, authenticated with the x-api-key header. Used for Claude models via the native API.
  • gemini: OpenAI-compatible API, authenticated with Authorization: Bearer. Used for Google Gemini via its OpenAI compatibility layer.

Anthropic Backend Architecture

The Anthropic backend provides native support for Claude models with automatic format translation:

Key Transformations

Request Format Differences:

  • System prompt: messages[0].role="system" (OpenAI) vs. a separate system parameter (Anthropic)
  • Auth header: Authorization: Bearer (OpenAI) vs. x-api-key (Anthropic)
  • Max tokens: optional (OpenAI) vs. required max_tokens (Anthropic)
  • Images: image_url.url (OpenAI) vs. source.type + source.data (Anthropic)

Extended Thinking Support:

// OpenAI reasoning_effort → Anthropic thinking
{
  "reasoning_effort": "high"  // OpenAI format
}
// Transforms to:
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768   // Mapped from effort level
  }
}

Tool Calling Transformation

The router provides automatic transformation of OpenAI-format tool definitions to backend-native formats, enabling cross-provider tool calling support.

Location: src/core/tool_calling/

Module Structure:

src/core/tool_calling/
├── mod.rs           # Module exports
├── definitions.rs   # Type definitions (ToolDefinition, FunctionDefinition, etc.)
├── transform.rs     # Transformation functions for each backend
├── tool_choice.rs   # Tool choice transformation for each backend
├── response.rs      # Tool call response extraction from backends
├── streaming.rs     # Streaming tool call transformation
└── messages.rs      # Multi-turn conversation message transformation

Supported Backends

  • OpenAI: native input passed through unchanged; tool names are validated
  • Anthropic: OpenAI input converted to the input_schema format (parameters → input_schema)
  • Gemini: OpenAI input converted to the nested functionDeclarations structure
  • llama.cpp: OpenAI input passed through; validated with the --jinja flag

OpenAI Tool Format (Input)

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    }
  ]
}

Anthropic Transformation

// transform_tools_to_anthropic()
// Input:  OpenAI format with `function.parameters`
// Output: Anthropic format with `input_schema`
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}

Gemini Transformation

// transform_tools_to_gemini()
// Input:  OpenAI format
// Output: Gemini nested functionDeclarations format
{
  "tools": [
    {
      "functionDeclarations": [
        {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": { "type": "string" }
            },
            "required": ["location"]
          }
        }
      ]
    }
  ]
}

Validation Rules

  • Tool Name: Must match pattern ^[a-zA-Z0-9_-]+$ (alphanumeric, underscore, hyphen only)
  • Empty Name: Rejected with validation error
  • Missing Fields: Logged as warnings, tool skipped (no silent data loss)
  • Type Field: Must be "function" for standard tool definitions
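
A sketch of the tool-name check described above, assuming the regex crate (validate_tool_name is an illustrative helper; the actual validation lives in transform.rs):

use regex::Regex;

// Validate a tool name against ^[a-zA-Z0-9_-]+$; empty or malformed names are rejected.
fn validate_tool_name(name: &str) -> Result<(), String> {
    let pattern = Regex::new(r"^[a-zA-Z0-9_-]+$").expect("static pattern is valid");
    if name.is_empty() {
        return Err("tool name must not be empty".to_string());
    }
    if !pattern.is_match(name) {
        return Err(format!("invalid tool name: {name}"));
    }
    Ok(())
}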

Integration with Fallback System

The tool calling transformation integrates with the model fallback system via ParameterTranslator:

// In src/core/fallback/translation.rs
impl ParameterTranslator {
    pub fn translate_tools(
        &self,
        tools: &Value,
        from_backend_type: BackendType,
        to_backend_type: BackendType,
    ) -> CoreResult<Value> {
        match (from_backend_type, to_backend_type) {
            (_, BackendType::Anthropic) => transform_tools_to_anthropic(tools),
            (_, BackendType::Gemini) => transform_tools_to_gemini(tools),
            (_, BackendType::LlamaCpp) => transform_tools_for_llamacpp(tools),
            _ => Ok(tools.clone()),
        }
    }
}

This enables seamless fallback from one provider to another while preserving tool definitions.

Tool Choice Transformation

In addition to tool definitions, the router also transforms the tool_choice parameter which controls model tool-calling behavior.

Supported Values:

  • "auto" (model decides whether to call tools): Anthropic {"type": "auto"}, Gemini mode: "AUTO"
  • "none" (model should not call tools): Anthropic removes tools entirely, Gemini mode: "NONE"
  • "required" (model must call at least one tool): Anthropic {"type": "any"}, Gemini mode: "ANY"
  • {"type": "function", "function": {"name": "X"}} (force a specific function): Anthropic {"type": "tool", "name": "X"}, Gemini mode: "ANY" with allowed_function_names: ["X"]

Edge Cases:

  • Anthropic "none" workaround: Anthropic API does not support tool_choice=none. When this value is detected, the router removes tools and tool_choice entirely from the request.
  • llama.cpp: Preserves parallel_tool_calls parameter for parallel function calling support.

Implementation:

// In src/core/tool_calling/tool_choice.rs
pub enum ToolChoiceValue {
    Auto,
    None,
    Required,
    Function(String),
}

impl ToolChoiceValue {
    pub fn from_openai(value: &Value) -> CoreResult<Self>;
    pub fn to_anthropic(&self) -> Option<Value>;  // Returns None for "none"
    pub fn to_gemini(&self) -> Value;
}

Streaming Tool Call Transformation

When streaming is enabled, tool calls are returned incrementally as delta events rather than complete objects. Different providers use distinct streaming formats, requiring real-time transformation to maintain OpenAI API compatibility.

Location: src/core/tool_calling/streaming.rs

Key Components:

  • ToolCallAccumulator: Tracks streaming tool call state (index, id, name, arguments)
  • StreamingToolCallTransformer: State machine for Anthropic → OpenAI delta transformation
  • transform_gemini_streaming_tool_call(): Gemini → OpenAI delta transformation

Streaming Format Comparison:

  • OpenAI: Delta chunks with tool_calls[].function.arguments; arguments arrive in fragments
  • Anthropic: Event-based (content_block_start, content_block_delta, content_block_stop); input_json_delta carries partial JSON
  • Gemini: Complete functionCall objects in each chunk; arguments arrive whole per chunk

State Machine (Anthropic):

IDLE → message_start → READY
READY → content_block_start (tool_use) → TOOL_STARTED
TOOL_STARTED → content_block_delta → ACCUMULATING
ACCUMULATING → content_block_delta → ACCUMULATING (loop)
ACCUMULATING → content_block_stop → TOOL_COMPLETE
TOOL_COMPLETE → message_delta → FINISHED

Safety Limits:

To prevent memory exhaustion from malformed streams:

  • MAX_TOOL_CALLS = 64: Maximum parallel tool calls per message
  • MAX_ARGUMENTS_SIZE = 1MB: Maximum accumulated arguments per tool call
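
A sketch of how the accumulator can enforce these limits while collecting argument fragments (the struct layout and push_arguments method are assumptions, not the actual ToolCallAccumulator API):

const MAX_TOOL_CALLS: usize = 64;
const MAX_ARGUMENTS_SIZE: usize = 1024 * 1024; // 1MB of accumulated arguments per tool call

struct ToolCallAccumulator {
    // One accumulated (id, name, arguments) entry per streaming tool call index.
    calls: Vec<(String, String, String)>,
}

impl ToolCallAccumulator {
    // Append an arguments fragment for the tool call at `index`, enforcing the safety limits.
    fn push_arguments(&mut self, index: usize, fragment: &str) -> Result<(), String> {
        if index >= MAX_TOOL_CALLS {
            return Err("too many parallel tool calls in one message".into());
        }
        let call = self.calls.get_mut(index).ok_or("unknown tool call index")?;
        if call.2.len() + fragment.len() > MAX_ARGUMENTS_SIZE {
            return Err("accumulated tool call arguments exceed 1MB".into());
        }
        call.2.push_str(fragment);
        Ok(())
    }
}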

Example Transformation (Anthropic → OpenAI):

// Anthropic input: content_block_start
{"type": "content_block_start", "index": 0,
 "content_block": {"type": "tool_use", "id": "toolu_123", "name": "get_weather"}}

// OpenAI output: delta chunk
{"id": "chatcmpl-...", "choices": [{"delta": {"tool_calls": [
  {"index": 0, "id": "toolu_123", "type": "function",
   "function": {"name": "get_weather", "arguments": ""}}
]}}]}

Finish Reason Mapping:

Source (Anthropic/Gemini) → Target (OpenAI):

  • tool_use → tool_calls
  • end_turn / STOP → stop
  • max_tokens / MAX_TOKENS → length
  • SAFETY / RECITATION → content_filter

Multi-Turn Conversation Message Transformation

When using tools in multi-turn conversations, the message history contains tool calls from the assistant and tool results from the client. These messages require format transformation when routing to different backends.

Location: src/core/tool_calling/messages.rs

Key Functions:

  • transform_tool_message_to_anthropic(): Converts an OpenAI tool result to Anthropic format
  • transform_tool_message_to_gemini(): Converts an OpenAI tool result to Gemini format
  • transform_assistant_tool_calls_to_anthropic(): Converts an assistant message with tool_calls to Anthropic format
  • transform_assistant_tool_calls_to_gemini(): Converts an assistant message with tool_calls to Gemini format
  • transform_messages_with_tools(): Transforms the entire conversation history
  • find_function_name_for_tool_call(): Looks up the function name by tool_call_id

Message Format Differences:

  • Tool result role: tool (OpenAI), user with a tool_result content block (Anthropic), function (Gemini)
  • ID reference: tool_call_id (OpenAI), tool_use_id (Anthropic), matched by function name (Gemini)
  • Content format: string (OpenAI), string or structured content (Anthropic), response object (Gemini)
  • Assistant role: assistant (OpenAI), assistant (Anthropic), model (Gemini)

OpenAI Format (Input):

{
  "messages": [
    {"role": "user", "content": "What's the weather in NYC?"},
    {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\": \"NYC\"}"}
      }]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
    }
  ]
}

Anthropic Transformation:

{
  "messages": [
    {"role": "user", "content": "What's the weather in NYC?"},
    {
      "role": "assistant",
      "content": [{
        "type": "tool_use",
        "id": "call_abc123",
        "name": "get_weather",
        "input": {"location": "NYC"}
      }]
    },
    {
      "role": "user",
      "content": [{
        "type": "tool_result",
        "tool_use_id": "call_abc123",
        "content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
      }]
    }
  ]
}

Gemini Transformation:

{
  "contents": [
    {"role": "user", "parts": [{"text": "What's the weather in NYC?"}]},
    {
      "role": "model",
      "parts": [{
        "functionCall": {"name": "get_weather", "args": {"location": "NYC"}}
      }]
    },
    {
      "role": "function",
      "parts": [{
        "functionResponse": {
          "name": "get_weather",
          "response": {"temperature": 72, "condition": "sunny"}
        }
      }]
    }
  ]
}

Multiple Tool Results:

When multiple tools are called in parallel, consecutive tool result messages are combined for Anthropic into a single user message:

// Multiple OpenAI tool results:
{"role": "tool", "tool_call_id": "call_1", "content": "72F, sunny"}
{"role": "tool", "tool_call_id": "call_2", "content": "3:00 PM EST"}

// Combined Anthropic format:
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "call_1", "content": "72F, sunny"},
    {"type": "tool_result", "tool_use_id": "call_2", "content": "3:00 PM EST"}
  ]
}

Error Handling:

  • Missing tool_call_id: Returns validation error
  • Malformed tool calls: Logged as warnings, skipped without failing entire request
  • Unknown function name for Gemini: Falls back to "unknown" with warning log
  • Malformed JSON arguments: Falls back to empty object {}
  • is_error indicator: Preserved when transforming to Anthropic format

Gemini 3 thoughtSignature Support

Gemini 3 models introduce a new thoughtSignature field for function calling that must be preserved and passed back in multi-turn conversations. The router automatically handles this through the transformation pipeline.

What is thoughtSignature?

When Gemini 3+ models return function calls, they include an encrypted thoughtSignature field that encapsulates the model's internal reasoning context. This signature must be included when sending the function result back to continue the conversation correctly.

Response Extraction (Gemini -> Client):

The router extracts thoughtSignature from Gemini responses and includes it in the OpenAI-compatible format:

// Gemini native response (with thoughtSignature)
{
  "candidates": [{
    "content": {
      "parts": [{
        "functionCall": {"name": "get_weather", "args": {"location": "NYC"}},
        "thoughtSignature": "encrypted_signature_abc123"
      }]
    }
  }]
}

// Router's OpenAI-compatible response
{
  "choices": [{
    "message": {
      "tool_calls": [{
        "id": "call_xyz789",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"NYC\"}"},
        "extra_content": {
          "google": {
            "thought_signature": "encrypted_signature_abc123"
          }
        }
      }]
    }
  }]
}

Request Injection (Client -> Gemini):

When clients send tool results back, the router extracts thought_signature from extra_content.google and injects it as thoughtSignature in the Gemini request:

// Client's OpenAI-format request
{
  "messages": [{
    "role": "assistant",
    "tool_calls": [{
      "id": "call_xyz789",
      "function": {"name": "get_weather", "arguments": "{}"},
      "extra_content": {
        "google": {"thought_signature": "encrypted_signature_abc123"}
      }
    }]
  }]
}

// Transformed Gemini request
{
  "contents": [{
    "role": "model",
    "parts": [{
      "functionCall": {"name": "get_weather", "args": {}},
      "thoughtSignature": "encrypted_signature_abc123"
    }]
  }]
}

Streaming Support:

The streaming transformer also handles thoughtSignature extraction, including it in tool call delta chunks.

Backward Compatibility:

  • Gemini 2.x models do not return thoughtSignature - the router gracefully handles this
  • Clients that don't preserve extra_content.google.thought_signature can still make function calls (but may have degraded conversation continuity on Gemini 3+)
  • The extra_content field is ignored by non-Gemini backends

Implementation Locations:

  • Response extraction: src/infrastructure/backends/gemini/transform.rs (transform_response_gemini(), transform_native_gemini_response())
  • Request injection: src/core/tool_calling/messages.rs (transform_assistant_tool_calls_to_gemini())
  • Streaming extraction: src/infrastructure/backends/gemini/stream.rs (transform_native_gemini_event(), create_tool_call_chunk_with_signature())

Adding New Backend Types

  1. Implement Backend Trait:

    // In src/infrastructure/backends/custom_backend.rs
    pub struct CustomBackend {
        client: Arc<HttpClient>,
        config: CustomBackendConfig,
    }
    
    #[async_trait]
    impl BackendTrait for CustomBackend {
        async fn health_check(&self) -> CoreResult<()> { /* ... */ }
        async fn list_models(&self) -> CoreResult<Vec<Model>> { /* ... */ }
        async fn chat_completion(&self, request: ChatRequest) -> CoreResult<Response> { /* ... */ }
    }
    

  2. Register in Backend Factory:

    // In src/infrastructure/backends/mod.rs
    pub fn create_backend(backend_type: &str, config: &BackendConfig) -> CoreResult<Box<dyn BackendTrait>> {
        match backend_type {
            "openai" => Ok(Box::new(OpenAIBackend::new(config)?)),
            "vllm" => Ok(Box::new(VLLMBackend::new(config)?)),
            "custom" => Ok(Box::new(CustomBackend::new(config)?)), // New backend
            _ => Err(CoreError::ValidationFailed {
                message: format!("Unknown backend type: {}", backend_type),
                field: Some("backend_type".to_string()),
            }),
        }
    }
    

Adding New Middleware

  1. Implement Middleware Trait:

    // In src/http/middleware/custom_middleware.rs
    pub struct CustomMiddleware {
        config: CustomConfig,
    }
    
    impl<S> tower::Layer<S> for CustomMiddleware {
        type Service = CustomMiddlewareService<S>;
    
        fn layer(&self, inner: S) -> Self::Service {
            CustomMiddlewareService { inner, config: self.config.clone() }
        }
    }
    

  2. Register in HTTP Router:

    // In src/main.rs
    let app = Router::new()
        .route("/v1/models", get(list_models))
        .layer(CustomMiddleware::new(config.custom))
        .layer(LoggingMiddleware::new())
        .with_state(state);
    

Adding New Cache Types

  1. Implement Cache Trait:

    // In src/infrastructure/cache/redis_cache.rs
    pub struct RedisCache {
        client: redis::Client,
        ttl: Duration,
    }
    
    #[async_trait]
    impl CacheTrait for RedisCache {
        async fn get<T>(&self, key: &str) -> CoreResult<Option<T>>
        where T: DeserializeOwned { /* ... */ }
    
        async fn set<T>(&self, key: &str, value: &T, ttl: Option<Duration>) -> CoreResult<()>
        where T: Serialize { /* ... */ }
    }
    

  2. Use in Service:

    // Services can use any cache implementation
    pub struct ModelServiceImpl<C: CacheTrait> {
        cache: Arc<C>,
        // ... other dependencies
    }
    

Adding New Load Balancing Strategies

// In src/services/load_balancer.rs
pub enum LoadBalancingStrategy {
    RoundRobin,
    WeightedRoundRobin,
    LeastConnections,  // New strategy
    Random,
}

impl LoadBalancingStrategy {
    pub fn select_backend(&self, backends: &[Backend]) -> Option<&Backend> {
        match self {
            Self::RoundRobin => /* ... */,
            Self::WeightedRoundRobin => /* ... */,
            Self::LeastConnections => self.select_least_connections(backends),
            Self::Random => /* ... */,
        }
    }
}
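
A sketch of how the new strategy could choose a backend, assuming each Backend exposes health and in-flight connection accessors (both are illustrative here):

impl LoadBalancingStrategy {
    // Hypothetical helper: pick the healthy backend with the fewest in-flight requests.
    fn select_least_connections<'a>(&self, backends: &'a [Backend]) -> Option<&'a Backend> {
        backends
            .iter()
            .filter(|b| b.is_healthy())             // skip unhealthy backends entirely
            .min_by_key(|b| b.active_connections()) // assumes a per-backend connection counter
    }
}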

Design Decisions

Why 4-Layer Architecture?

Decision: Use a 4-layer architecture (HTTP → Services → Infrastructure → Core)

Rationale

  • Clear Separation: Each layer has distinct responsibilities
  • Testability: Layers can be tested independently
  • Maintainability: Changes in one layer don't affect others
  • Flexibility: Easy to swap implementations (e.g., different cache backends)

Trade-offs

  • Pros: Clean, maintainable, testable, extensible
  • Cons: More complexity, slight performance overhead
  • Verdict: Benefits outweigh costs for a production system

Why Dependency Injection?

Decision: Use a custom DI container instead of compile-time injection

Rationale

  • Runtime Flexibility: Can swap implementations based on configuration
  • Service Lifecycle: Centralized management of service initialization/cleanup
  • Testing: Easy to inject mocks and test doubles

Alternatives Considered

  • Manual dependency passing: Too verbose and error-prone
  • Compile-time DI (generics): Less flexible, harder to configure

Why Arc<RwLock<T>> for Shared State?

Decision: Use Arc<RwLock<T>> for shared mutable state

Rationale

  • Reader-Writer Semantics: Multiple readers, exclusive writers
  • Performance: Better than Arc<Mutex<T>> for read-heavy workloads
  • Safety: Prevents data races at compile time

Alternatives Considered

  • Arc<Mutex<T>>: Simpler but worse performance for reads
  • Channels: Too complex for simple shared state
  • Atomic types: Not suitable for complex data structures

Why async/await Throughout?

Decision: Use async/await for all I/O operations

Rationale

  • Performance: Non-blocking I/O allows high concurrency
  • Resource Efficiency: Lower memory usage than thread-per-request
  • Ecosystem: Rust async ecosystem (Tokio, reqwest, axum) is mature

Trade-offs

  • Pros: High performance, low resource usage, good ecosystem
  • Cons: Complexity, learning curve, debugging challenges
  • Verdict: Essential for high-performance network services

Why Configuration Hot-Reload?

Decision: Support configuration hot-reload using file watching

Rationale

  • Zero Downtime: Update configuration without restarting
  • Operations Friendly: Easy to adjust settings in production
  • Development: Faster iteration during development

Implementation

  • File system watcher detects changes
  • Validate new configuration before applying
  • Atomic updates to avoid inconsistent state
  • Fallback to previous config on validation errors

Performance Considerations

Memory Management

  1. Connection Pooling: Reuse HTTP connections to reduce allocation overhead
  2. Smart Caching: LRU eviction prevents unbounded memory growth
  3. Arc Cloning: Cheap reference counting instead of deep cloning
  4. Streaming: Process responses in chunks to avoid loading large responses into memory

Concurrency

  1. RwLock for Read-Heavy Workloads: Multiple concurrent readers for backend pool and model cache
  2. Lock-Free Where Possible: Use atomics for counters and simple state
  3. Async Task Spawning: Background tasks for health checks and cache updates
  4. Bounded Channels: Prevent unbounded queuing of tasks

I/O Optimization

  1. Connection Keep-Alive: TCP connections stay open for reuse
  2. Streaming Responses: Forward SSE chunks without buffering
  3. Timeouts: Prevent hanging on slow backends
  4. Retry with Backoff: Avoid overwhelming failing backends

Memory Layout

// Optimized data structures for cache efficiency
pub struct Backend {
    pub name: String,          // Inline string for small names
    pub url: Arc<str>,         // Shared string for URL
    pub weight: u32,           // Compact integer
    pub is_healthy: AtomicBool, // Lock-free health status
}

// Cache-friendly model storage
pub struct ModelCache {
    models: HashMap<String, Arc<ModelInfo>>, // Shared model info
    last_updated: AtomicU64,                 // Lock-free timestamp
    ttl: Duration,
}

Benchmarking Results

Based on our benchmarks (see benches/performance_benchmarks.rs):

  • Request Latency: < 5ms overhead for routing decisions
  • Memory Usage: ~50MB base memory, scales linearly with backends
  • Throughput: 1000+ requests/second on modest hardware
  • Connection Efficiency: 100+ concurrent connections per backend with minimal memory overhead

Rate Limiting

The router implements sophisticated rate limiting to protect against abuse and ensure fair resource allocation across clients.

Key Features:

  • Dual-window approach: sustained limit (100 req/min) + burst protection (20 req/5s)
  • Client identification by API key (preferred) or IP address (fallback)
  • Per-client isolation with automatic cache cleanup
  • DoS prevention with short TTL for empty responses

Rate Limit V2 Architecture

The enhanced rate limiting system (rate_limit_v2/) provides a modular, high-performance implementation:

Module Structure

src/http/middleware/rate_limit_v2/
├── mod.rs              # Public API and module exports
├── middleware.rs       # Axum middleware integration
├── store.rs            # Rate limit storage and client tracking
└── token_bucket.rs     # Token bucket algorithm implementation

Components

  1. Token Bucket Algorithm (token_bucket.rs)
     • Configurable bucket capacity and refill rate
     • Atomic operations for lock-free token consumption
     • Automatic token replenishment based on elapsed time
     • Separate buckets for sustained and burst limits

  2. Rate Limit Store (store.rs)
     • Per-client state tracking with DashMap for concurrent access
     • Automatic cleanup of expired client entries
     • Configurable TTL for inactive clients (default: 1 hour)
     • Memory-efficient with bounded storage

  3. Middleware Integration (middleware.rs)
     • Extracts client identifier (API key → IP address fallback)
     • Checks both sustained and burst limits before processing
     • Returns HTTP 429 (Too Many Requests) with Retry-After header
     • Prometheus metrics for monitoring rate limit hits
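
The refill-and-consume logic of the token bucket can be sketched as below. This is a single-threaded illustration; the actual module wraps the state in atomics and keeps separate sustained and burst buckets per client.

// Illustrative token bucket: refill based on elapsed time, consume one token per request.
use std::time::Instant;

pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Returns true if a token was available and consumed.
    pub fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        // Replenish tokens for the time that has passed, capped at capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

// Example limits matching the configuration below: 100 req/min sustained, 20 req/5s burst.
// let mut sustained = TokenBucket::new(100.0, 100.0 / 60.0);
// let mut burst = TokenBucket::new(20.0, 20.0 / 5.0);
// let allowed = sustained.try_acquire() && burst.try_acquire();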

Configuration Example

rate_limiting:
  enabled: true
  sustained:
    max_requests: 100
    window_seconds: 60
  burst:
    max_requests: 20
    window_seconds: 5
  cleanup_interval_seconds: 300

Decision Flow

Request arrives
    ↓
Extract client ID (API key or IP)
    ↓
Check sustained limit (100 req/min)
    ↓ OK
Check burst limit (20 req/5s)
    ↓ OK
Process request

For detailed configuration information, see configuration.md section on rate limiting.

Model Fallback System

The router implements a configurable model fallback system that automatically routes requests to alternative models when the primary model is unavailable.

Key Features:

  • Automatic fallback chain execution (e.g., gpt-4o → gpt-4-turbo → gpt-3.5-turbo)
  • Cross-provider fallback support with parameter translation
  • Integration with circuit breaker for intelligent triggering
  • Prometheus metrics for monitoring fallback usage

For detailed configuration and implementation, see error-handling.md section on model fallback.
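
A minimal sketch of walking a fallback chain is shown below; try_model and FallbackError are hypothetical stand-ins for the router's real types, and the full behavior (circuit breaker integration, parameter translation) is documented in error-handling.md.

// Illustrative only: iterate the configured chain until one model succeeds.
#[derive(Debug)]
enum FallbackError {
    EmptyChain,
    Backend(String),
}

async fn try_model(model: &str, _prompt: &str) -> Result<String, FallbackError> {
    // Stand-in: the real code dispatches the request to the backend serving `model`.
    Err(FallbackError::Backend(format!("{model} unavailable")))
}

async fn complete_with_fallback(
    chain: &[String], // e.g. ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"]
    prompt: &str,
) -> Result<String, FallbackError> {
    let mut last_err = None;
    for model in chain {
        match try_model(model, prompt).await {
            Ok(response) => return Ok(response),
            // Record the failure and move on to the next model in the chain.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap_or(FallbackError::EmptyChain))
}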

Circuit Breaker

The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.

Three-State Machine:

  • Closed: Normal operation. Failures are counted.
  • Open: Fast-fail mode. Requests rejected immediately.
  • HalfOpen: Recovery testing. Limited requests allowed.

Key Features:

  • Per-backend isolation with independent state
  • Lock-free atomic operations for minimal hot-path overhead
  • Admin endpoints for manual control (/admin/circuit/*)
  • Prometheus metrics for observability

For detailed configuration and implementation, see error-handling.md section on circuit breaker.
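
The three-state machine can be sketched as follows. Thresholds and field names here are illustrative, and the sketch uses plain mutable state for clarity; the real implementation keeps per-backend state behind lock-free atomics.

// Illustrative state machine: counts failures while Closed, fast-fails while Open,
// and allows probe requests while HalfOpen.
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Debug)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    failures: u32,
    failure_threshold: u32,
    opened_at: Option<Instant>,
    open_timeout: Duration,
}

impl CircuitBreaker {
    fn allow_request(&mut self) -> bool {
        match self.state {
            CircuitState::Closed => true,
            CircuitState::HalfOpen => true, // the real implementation limits probe count
            CircuitState::Open => {
                // After the cool-down, move to HalfOpen and allow a probe request.
                if self.opened_at.map_or(false, |t| t.elapsed() >= self.open_timeout) {
                    self.state = CircuitState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.failure_threshold || self.state == CircuitState::HalfOpen {
            self.state = CircuitState::Open;
            self.opened_at = Some(Instant::now());
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.state = CircuitState::Closed;
    }
}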

File Storage

The router provides OpenAI Files API compatible file storage with persistent metadata.

Key Features:

  • Persistent metadata storage with sidecar JSON files
  • Automatic recovery on server restart
  • Orphan file detection and cleanup
  • Pluggable backends (memory/persistent)

For detailed architecture and implementation, see File Storage Guide.

Image Generation Architecture

The router provides a unified interface for image generation across multiple backends (OpenAI GPT Image, DALL-E, and Google Gemini/Nano Banana) with automatic parameter translation.

Multi-Backend Image Generation

OpenAI → Gemini Parameter Conversion

When using Nano Banana (Gemini) models, OpenAI-style parameters are automatically converted to Gemini's native format:

Size to Aspect Ratio Mapping

OpenAI size Gemini aspectRatio Gemini imageSize Notes
256x256 1:1 1K Minimum Gemini size
512x512 1:1 1K Minimum Gemini size
1024x1024 1:1 1K Default
1536x1024 3:2 1K Landscape (new)
1024x1536 2:3 1K Portrait (new)
1792x1024 16:9 1K Wide landscape
1024x1792 9:16 1K Tall portrait
2048x2048 1:1 2K Pro models only
4096x4096 1:1 4K Pro models only
auto 1:1 1K Default fallback
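
As a rough sketch, the mapping above can be expressed as a single match. The helper name is illustrative; the real converter returns typed values and additionally has to account for the Pro-only 2K/4K sizes.

// Illustrative mapping from an OpenAI `size` string to Gemini (aspectRatio, imageSize).
fn map_size_to_gemini(size: &str) -> (&'static str, &'static str) {
    match size {
        "256x256" | "512x512" | "1024x1024" => ("1:1", "1K"),
        "1536x1024" => ("3:2", "1K"),
        "1024x1536" => ("2:3", "1K"),
        "1792x1024" => ("16:9", "1K"),
        "1024x1792" => ("9:16", "1K"),
        "2048x2048" => ("1:1", "2K"), // Pro models only
        "4096x4096" => ("1:1", "4K"), // Pro models only
        _ => ("1:1", "1K"),           // "auto" and unknown values fall back to the default
    }
}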

Request Transformation

OpenAI Format (Input):

{
  "model": "nano-banana",
  "prompt": "A serene Japanese garden",
  "size": "1536x1024",
  "n": 1
}

Gemini Format (Converted):

{
  "contents": [
    {
      "parts": [{"text": "A serene Japanese garden"}]
    }
  ],
  "generationConfig": {
    "imageConfig": {
      "aspectRatio": "3:2",
      "imageSize": "1K"
    }
  }
}

Conversion Implementation

The conversion is handled by src/infrastructure/backends/gemini/image_generation.rs:

pub fn convert_openai_to_gemini(request: &OpenAIImageRequest)
    -> CoreResult<(String, GeminiImageRequest)>
{
    // 1. Map model name
    let gemini_model = map_model_to_gemini(&request.model);

    // 2. Parse size to aspect ratio and size category
    let parsed_size = parse_openai_size(&request.size, &request.model)?;

    // 3. Build Gemini request with imageConfig
    let gemini_request = GeminiImageRequest {
        contents: vec![GeminiContent { parts: vec![...] }],
        generation_config: Some(GeminiGenerationConfig {
            image_config: Some(GeminiImageConfig {
                aspect_ratio: Some(parsed_size.aspect_ratio.to_gemini_string()),
                image_size: Some(parsed_size.size_category.to_gemini_image_size()),
            }),
        }),
    };

    Ok((gemini_model, gemini_request))
}

Streaming Image Generation (SSE)

For GPT Image models, the router supports true SSE passthrough for streaming image generation:

┌─────────────┐                ┌─────────────┐                ┌─────────────┐
│   Client    │────stream:true─▶│   Router    │────stream:true─▶│   OpenAI    │
│             │                │             │                │             │
│             │◀───SSE events──│  Passthrough│◀───SSE events──│             │
└─────────────┘                └─────────────┘                └─────────────┘

SSE Event Types:

  • image_generation.partial_image: Intermediate preview during generation
  • image_generation.complete: Final image data
  • image_generation.usage: Token usage for billing
  • done: Stream completion

Implementation (src/proxy/image_gen.rs):

async fn handle_streaming_image_generation(...) -> Result<Response, StatusCode> {
    // 1. Keep stream: true in the backend request
    // 2. Make the streaming request and read it via bytes_stream()
    // 3. Forward SSE events to the client through a tokio channel

    let (tx, rx) = tokio::sync::mpsc::unbounded_channel();

    tokio::spawn(async move {
        let mut stream = backend_response.bytes_stream();
        let mut event_type = String::new();
        while let Some(Ok(chunk)) = stream.next().await {
            let chunk_str = String::from_utf8_lossy(&chunk);
            // Parse SSE framing (event:/data: lines) and forward each event to the client
            for line in chunk_str.lines() {
                if let Some(name) = line.strip_prefix("event:") {
                    event_type = name.trim().to_string();
                } else if let Some(data) = line.strip_prefix("data:") {
                    let event = Event::default().event(event_type.clone()).data(data.trim());
                    let _ = tx.send(Ok(event));
                }
            }
        }
    });

    Ok(Sse::new(UnboundedReceiverStream::new(rx)).into_response())
}

GPT Image Model Features

The router supports enhanced parameters for GPT Image models (gpt-image-1, gpt-image-1.5, gpt-image-1-mini):

  • output_format: Image file format (png, jpeg, webp)
  • output_compression: Compression level (0-100, jpeg/webp only)
  • background: Transparency control (transparent, opaque, auto)
  • quality: Generation quality (low, medium, high, auto)
  • stream: Enable SSE streaming (true, false)
  • partial_images: Number of preview images (0-3)

Model Support Matrix

Feature GPT Image 1.5 GPT Image 1 GPT Image 1 Mini DALL-E 3 DALL-E 2 Nano Banana Nano Banana Pro
Streaming
output_format
background
Custom quality standard/hd
Image Edit
Image Variations
Max Resolution 1536px 1536px 1536px 1792px 1024px 1024px 4096px

Image Edit and Variations

The router provides OpenAI-compatible image editing and variations endpoints through /v1/images/edits and /v1/images/variations.

Image Editing (/v1/images/edits)

Endpoint: POST /v1/images/edits

Allows editing an existing image with a text prompt and optional mask. Supported by GPT Image models and DALL-E 2.

Request Format (multipart/form-data):

image: <file>           # Original image (PNG, required)
prompt: <string>        # Edit instructions (required)
mask: <file>            # Optional mask image (PNG)
model: <string>         # Model name (e.g., "gpt-image-1", "dall-e-2")
n: <integer>            # Number of images (default: 1)
size: <string>          # Output size (e.g., "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Multipart form parsing for image and mask files
  • Image validation (format, size, aspect ratio)
  • Model-specific parameter transformation
  • Proper error handling for invalid inputs
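
The multipart parsing step can look roughly like the sketch below, using axum's Multipart extractor (requires axum's "multipart" feature). Field names follow the request format above; the EditRequest type and error handling are illustrative, not the router's actual code.

// Illustrative multipart extraction for /v1/images/edits: pull out the image,
// optional mask, and prompt fields; validation is elided here.
use axum::body::Bytes;
use axum::extract::Multipart;
use axum::http::StatusCode;

struct EditRequest {
    image: Bytes,
    mask: Option<Bytes>,
    prompt: String,
}

async fn parse_edit_form(mut form: Multipart) -> Result<EditRequest, StatusCode> {
    let (mut image, mut mask, mut prompt) = (None, None, None);

    while let Some(field) = form.next_field().await.map_err(|_| StatusCode::BAD_REQUEST)? {
        let name = field.name().unwrap_or_default().to_string();
        match name.as_str() {
            "image" => image = Some(field.bytes().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            "mask" => mask = Some(field.bytes().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            "prompt" => prompt = Some(field.text().await.map_err(|_| StatusCode::BAD_REQUEST)?),
            _ => {} // other fields (model, n, size, response_format) are handled the same way
        }
    }

    Ok(EditRequest {
        image: image.ok_or(StatusCode::BAD_REQUEST)?,
        mask,
        prompt: prompt.ok_or(StatusCode::BAD_REQUEST)?,
    })
}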

Supported Features

  • Transparent PNG mask support for targeted editing
  • Multiple image generation (n parameter)
  • Flexible output sizes
  • Both URL and base64 response formats

Image Variations (/v1/images/variations)

Endpoint: POST /v1/images/variations

Creates variations of a given image. Supported by DALL-E 2 only.

Request Format (multipart/form-data):

image: <file>           # Source image (PNG, required)
model: <string>         # Model name (default: "dall-e-2")
n: <integer>            # Number of variations (default: 1, max: 10)
size: <string>          # Output size ("256x256", "512x512", "1024x1024")
response_format: <string> # "url" or "b64_json"

Implementation (src/proxy/image_edit.rs):

  • Image file validation and preprocessing
  • DALL-E 2-specific routing
  • Error handling for unsupported models
  • Consistent response formatting

Key Features

  • Generate multiple variations in a single request
  • Automatic image format validation
  • Standard OpenAI response format compatibility

Image Utilities Module

The image_utils.rs module provides shared utilities for image processing:

Functions

  • validate_image_format(): Validates PNG/JPEG format and dimensions
  • parse_multipart_image_request(): Extracts images from multipart forms
  • check_image_dimensions(): Validates size constraints
  • format_image_error_response(): Standardized error responses

Validation Rules

  • Maximum file size: 4MB (configurable)
  • Supported formats: PNG (required for edits/variations), JPEG (generation only)
  • Aspect ratio constraints per model
  • Transparent PNG requirement for masks
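
A minimal sketch of the format and size check these rules imply is shown below. It only sniffs magic bytes and enforces the 4MB cap; the helper names are illustrative, and the real utilities also decode dimensions and apply per-model aspect-ratio constraints.

// Illustrative format/size check: detect PNG and JPEG by magic bytes, enforce the 4MB cap.
const MAX_IMAGE_BYTES: usize = 4 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum ImageFormat {
    Png,
    Jpeg,
}

fn sniff_image_format(data: &[u8]) -> Option<ImageFormat> {
    if data.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(ImageFormat::Png)
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else {
        None
    }
}

fn validate_upload(data: &[u8], require_png: bool) -> Result<ImageFormat, String> {
    if data.len() > MAX_IMAGE_BYTES {
        return Err("image exceeds the 4MB limit".into());
    }
    match sniff_image_format(data) {
        Some(ImageFormat::Jpeg) if require_png => Err("this endpoint requires PNG input".into()),
        Some(format) => Ok(format),
        None => Err("unsupported image format (expected PNG or JPEG)".into()),
    }
}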

This architecture provides a solid foundation for building a production-ready LLM router that can scale to handle thousands of requests while remaining maintainable and extensible. The clean separation of concerns makes it easy to add new features, swap implementations, and thoroughly test each component.