Architecture Guide¶
This document provides a comprehensive overview of Continuum Router's architecture, design decisions, and extension points.
Table of Contents¶
- Overview
- 4-Layer Architecture
- Core Components
- Data Flow
- Dependency Injection
- Error Handling Strategy
- Extension Points
- Design Decisions
- Performance Considerations
- Rate Limiting → Configuration
- Model Fallback System → Error Handling
- Circuit Breaker → Error Handling
- File Storage → Architecture Details
Overview¶
Continuum Router is designed as a high-performance, production-ready LLM API router using a clean 4-layer architecture that provides clear separation of concerns, testability, and maintainability. The architecture follows Domain-Driven Design principles and dependency inversion to create a robust, extensible system.
Architecture Goals¶
- Separation of Concerns: Each layer has a single, well-defined responsibility
- Dependency Inversion: Higher layers depend on abstractions, not concrete implementations
- Testability: Each component can be unit tested in isolation
- Extensibility: New features can be added without modifying existing code
- Performance: Minimal overhead while maintaining clean architecture
- Reliability: Fail-fast design with comprehensive error handling
4-Layer Architecture¶
Layer Descriptions¶
1. HTTP Layer (src/http/)¶
Responsibility: Handle HTTP requests, responses, and web-specific concerns
Components¶
- Routes (routes.rs): Define HTTP endpoints and route handling
- Middleware (middleware/): Cross-cutting concerns (auth, logging, metrics, rate limiting)
- DTOs (dto/): Data Transfer Objects for HTTP serialization/deserialization
- Streaming (streaming/): Server-Sent Events (SSE) handling
Key Files¶
src/http/
├── mod.rs # HTTP layer exports
├── routes.rs # Route definitions and handlers
├── dto.rs # Request/Response DTOs
├── handlers/ # Request handlers
│ ├── mod.rs
│ └── responses.rs # Responses API handlers
├── middleware/ # HTTP middleware components
│ ├── mod.rs
│ ├── auth.rs # API key authentication middleware
│ ├── admin_auth.rs # Admin API authentication middleware
│ ├── files_auth.rs # Files API authentication middleware
│ ├── admin_audit.rs # Admin operations audit logging
│ ├── cors.rs # CORS (Cross-Origin Resource Sharing) middleware
│ ├── logging.rs # Request/response logging
│ ├── metrics.rs # Metrics collection
│ ├── metrics_auth.rs # Metrics endpoint authentication
│ ├── model_extractor.rs # Model extraction from requests
│ ├── prometheus.rs # Prometheus metrics integration
│ ├── rate_limit.rs # Rate limiting middleware (legacy)
│ └── rate_limit_v2/ # Enhanced rate limiting (modular)
│ ├── mod.rs # Module exports
│ ├── middleware.rs # Rate limiting middleware
│ ├── store.rs # Rate limit storage and tracking
│ └── token_bucket.rs # Token bucket algorithm
└── streaming/ # SSE streaming handlers
├── mod.rs
└── handler.rs # Streaming response handling
Middleware Components¶
The HTTP layer includes several middleware components that provide cross-cutting concerns:
- auth.rs: API key authentication for main endpoints (/v1/chat/completions, /v1/models, etc.)
  - Validates API keys from the Authorization: Bearer <key> header
  - Supports multiple API keys configured in config.yaml
  - Returns 401 Unauthorized for invalid or missing keys
- admin_auth.rs: Separate authentication for admin endpoints (/admin/*)
  - Uses dedicated admin API keys distinct from user API keys
  - Protects sensitive operations (config reload, circuit breaker control, health management)
  - Configurable via admin.api_keys in configuration
- files_auth.rs: Authentication middleware for the Files API (/v1/files/*)
  - Validates API keys specifically for file upload/download/deletion operations
  - Prevents unauthorized file access and manipulation
  - Integrates with the file storage service for permission checks
- admin_audit.rs: Audit logging middleware for admin operations
  - Records all admin API calls with timestamps and caller identification
  - Logs parameters and outcomes of sensitive operations
  - Provides an audit trail for compliance and security monitoring
  - Configurable log levels and retention policies
- cors.rs: CORS (Cross-Origin Resource Sharing) middleware
  - Enables embedding the router in web applications, Tauri apps, and Electron apps
  - Supports wildcard origins (*), exact origins, and port wildcards (http://localhost:*)
  - Custom scheme support for desktop apps (e.g., tauri://localhost)
  - Configurable methods, headers, credentials, and preflight cache duration
  - Applied early in the middleware stack for proper preflight handling
- rate_limit_v2/: Enhanced rate limiting system (see Rate Limiting section)
  - Token bucket algorithm with per-client tracking
  - Separate limits for sustained rate and burst protection
  - Automatic cleanup of expired client entries
  - Detailed metrics for monitoring
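To make the layering order concrete, the sketch below shows how such a stack might be assembled with axum; the handler and middleware names are placeholders rather than the router's actual symbols.
// Hypothetical wiring sketch, not the actual routes.rs code.
use axum::{extract::Request, middleware::{self, Next}, response::Response, routing::post, Router};
use tower_http::cors::CorsLayer;

async fn chat_completions() -> &'static str { "ok" } // placeholder handler

async fn auth(req: Request, next: Next) -> Response {
    // ... validate the Authorization header here ...
    next.run(req).await
}

fn build_router() -> Router {
    Router::new()
        .route("/v1/chat/completions", post(chat_completions))
        // Layers added later wrap the earlier ones, so CORS (added last) is
        // outermost and answers OPTIONS preflights before auth ever runs.
        .layer(middleware::from_fn(auth))
        .layer(CorsLayer::permissive())
}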
2. Services Layer (src/services/)¶
Responsibility: Orchestrate business logic and coordinate between infrastructure components
Components¶
- Backend Service (backend_service.rs): Manage backend pool, load balancing, health checks
- Model Service (model_service.rs): Aggregate models from backends, handle caching, enrich with metadata
- Proxy Service (proxy_service.rs): Route requests, handle retries, manage streaming
- Health Service (health_service.rs): Monitor service health, track status
- Service Registry (mod.rs): Manage service lifecycle and dependencies
Key Files¶
src/services/
├── mod.rs # Service registry and management
├── backend_service.rs # Backend management service
├── model_service.rs # Model aggregation service
├── proxy_service.rs # Request proxying and routing
├── health_service.rs # Health monitoring service
├── deduplication.rs # Request deduplication service
├── responses/ # Responses API support
│ ├── mod.rs
│ ├── converter.rs # Response format conversion
│ ├── router.rs # Routing strategy determination (pass-through vs conversion)
│ ├── passthrough.rs # Direct pass-through for native OpenAI/Azure backends
│ ├── session.rs # Session management
│ ├── stream_service.rs # Streaming service orchestration
│ └── streaming.rs # Streaming response handling
└── streaming/ # Streaming utilities
├── mod.rs
├── parser.rs # Stream parsing logic
└── transformer.rs # Stream transformation (OpenAI/Anthropic)
3. Infrastructure Layer (src/infrastructure/)¶
Responsibility: Provide concrete implementations of external systems and technical capabilities
Components¶
- Backends (backends/): Specific backend implementations (OpenAI, vLLM, Ollama)
- Cache (cache/): Caching implementations (LRU, TTL-based)
- Configuration (config/): Configuration loading, watching, validation
- HTTP Client (http_client.rs): HTTP client management and optimization
Key Files¶
src/infrastructure/
├── mod.rs # Infrastructure exports and utilities
├── backends/ # Backend implementations
│ ├── mod.rs
│ ├── anthropic/ # Native Anthropic Claude backend
│ │ ├── mod.rs # Backend implementation & request transformation
│ │ └── stream.rs # SSE stream transformer (Anthropic → OpenAI)
│ ├── gemini/ # Native Google Gemini backend
│ │ ├── mod.rs # Backend implementation with TTFB optimization
│ │ └── stream.rs # SSE stream transformer (Gemini → OpenAI)
│ ├── openai/ # OpenAI-compatible backend
│ │ ├── mod.rs
│ │ ├── backend.rs # OpenAI backend implementation
│ │ └── models/ # OpenAI-specific model definitions
│ ├── factory/ # Backend factory pattern
│ │ ├── mod.rs
│ │ └── backend_factory.rs # Creates backends from config
│ ├── pool/ # Backend pooling and management
│ │ ├── mod.rs
│ │ ├── backend_pool.rs # Connection pool management
│ │ └── backend_manager.rs # Backend lifecycle management
│ ├── generic/ # Generic backend implementations
│ │ └── mod.rs
│ └── vllm.rs # vLLM backend implementation
├── common/ # Shared infrastructure utilities
│ ├── mod.rs
│ ├── executor.rs # Request execution with retry/metrics
│ ├── headers.rs # HTTP header utilities
│ ├── http_client.rs # HTTP client factory with pooling
│ ├── statistics.rs # Backend statistics collection
│ └── url_validator.rs # URL validation and security
├── transport/ # Transport abstraction layer
│ ├── mod.rs # Transport enum (HTTP, UnixSocket)
│ └── unix_socket.rs # Unix Domain Socket client
├── cache/ # Caching implementations
│ ├── mod.rs
│ ├── lru_cache.rs # LRU cache implementation
│ └── retry_cache.rs # Retry-aware cache
├── config/ # Configuration management
│ ├── mod.rs
│ ├── loader.rs # Configuration loading
│ ├── validator.rs # Configuration validation
│ ├── timeout_validator.rs # Timeout configuration validation
│ ├── watcher.rs # File watching for hot-reload
│ ├── migrator.rs # Configuration migration orchestrator
│ ├── migration.rs # Migration types and traits
│ ├── migrations.rs # Specific migration implementations
│ ├── fixer.rs # Auto-correction logic
│ ├── backup.rs # Backup management
│ └── secrets.rs # Secret/API key management
└── lock_optimization.rs # Lock and concurrency optimization
4. Core Layer (src/core/)¶
Responsibility: Define domain models, business rules, and fundamental abstractions
Components¶
- Models (models/): Core domain entities (Backend, Model, Request, Response)
- Traits (traits.rs): Core interfaces and contracts
- Errors (errors.rs): Domain-specific error types and handling
- Retry (retry/): Retry policies and strategies
- Container (container.rs): Dependency injection container
Key Files¶
src/core/
├── mod.rs # Core exports and utilities
├── models/ # Domain models
│ ├── mod.rs
│ ├── backend.rs # Backend domain model
│ ├── model.rs # LLM model representation
│ ├── request.rs # Request models
│ └── responses.rs # Response models (Responses API)
├── traits.rs # Core traits and interfaces
├── errors.rs # Error types and handling
├── container.rs # Dependency injection container
├── async_utils.rs # Async utility functions
├── duration_utils.rs # Duration parsing utilities
├── streaming/ # Streaming models
│ ├── mod.rs
│ └── models.rs # Streaming-specific models
├── retry/ # Retry mechanisms
│ ├── mod.rs
│ ├── policy.rs # Retry policies
│ └── strategy.rs # Retry strategies
├── circuit_breaker/ # Circuit breaker pattern
│ ├── mod.rs # Module exports
│ ├── config.rs # Configuration models
│ ├── state.rs # State machine and breaker logic
│ ├── error.rs # Circuit breaker errors
│ ├── metrics.rs # Prometheus metrics
│ └── tests.rs # Unit tests
├── files/ # File processing utilities
│ ├── mod.rs # Module exports
│ ├── resolver.rs # File reference resolution in chat requests
│ ├── transformer.rs # Message transformation with file content
│ └── transformer_utils.rs # Transformation utility functions
├── tool_calling/ # Tool calling transformation
│ ├── mod.rs # Module exports
│ ├── definitions.rs # Tool/function type definitions
│ └── transform.rs # Backend-specific transformations
└── config/ # Configuration models
├── mod.rs
├── models/ # Configuration data models (modular structure)
│ ├── mod.rs # Re-exports for backward compatibility
│ ├── config.rs # Main Config struct, ServerConfig, BackendConfig
│ ├── backend_type.rs # BackendType enum definitions
│ ├── model_metadata.rs # ModelMetadata, PricingInfo, CapabilityInfo
│ ├── global_prompts.rs # GlobalPrompts configuration
│ ├── samples.rs # Sample generation configurations
│ ├── validation.rs # Configuration validation logic
│ └── error.rs # Configuration-specific errors
├── timeout_models.rs # Timeout configuration models
├── cached_timeout.rs # Cached timeout resolution
├── optimized_retry.rs # Optimized retry configuration
├── metrics.rs # Metrics configuration
└── rate_limit.rs # Rate limit configuration
Core Components¶
Backend Pool¶
Location: src/backend.rs (legacy) → src/services/backend_service.rs
Purpose: Manages multiple LLM backends with intelligent load balancing
pub struct BackendPool {
backends: Arc<RwLock<Vec<Backend>>>,
load_balancer: LoadBalancingStrategy,
health_checker: Option<Arc<HealthChecker>>,
}
impl BackendPool {
// Round-robin load balancing with health awareness
pub async fn select_backend(&self) -> Option<Backend> { /* ... */ }
// Filter backends by model availability
pub async fn backends_for_model(&self, model: &str) -> Vec<Backend> { /* ... */ }
}
Health Checker¶
Location: src/health.rs → src/services/health_service.rs
Purpose: Monitor backend health with configurable thresholds, automatic recovery, and accelerated warmup detection
pub struct HealthChecker {
backends: Arc<RwLock<Vec<Backend>>>,
config: HealthConfig,
status_map: Arc<RwLock<HashMap<String, HealthStatus>>>,
}
pub struct HealthConfig {
pub interval: Duration,
pub timeout: Duration,
pub unhealthy_threshold: u32, // Failures before marking unhealthy
pub healthy_threshold: u32, // Successes before marking healthy
pub warmup_check_interval: Duration, // Accelerated interval during warmup (default: 1s)
pub max_warmup_duration: Duration, // Max time in warmup mode (default: 300s)
}
pub enum HealthStatus {
Healthy, // Backend responding with HTTP 200
Unhealthy, // Connection failure or error
WarmingUp, // HTTP 503 - backend loading (accelerated checks)
Unknown, // Initial state
}
Accelerated Warmup Health Checks¶
When a backend returns HTTP 503 (Service Unavailable), it enters the WarmingUp state. During this state:
- Health checks run at warmup_check_interval (default: 1 second) instead of the normal interval
- This reduces model availability detection from ~30 seconds to ~1 second
- After max_warmup_duration, the backend is marked as Unhealthy
- Particularly useful for llama.cpp backends that return HTTP 503 during model loading
Transport Layer¶
Location: src/infrastructure/transport/
Purpose: Provide a unified transport abstraction for backend communication over HTTP/HTTPS or Unix Domain Sockets
The transport layer enables secure local LLM communication via Unix sockets, eliminating the need for TCP port exposure.
URL Schemes¶
| Scheme | Transport | Example |
|---|---|---|
| http:// | TCP/HTTP | http://localhost:8080/v1 |
| https:// | TCP/HTTPS | https://api.openai.com/v1 |
| unix:// | Unix Socket | unix:///var/run/llama.sock |
Transport Enum¶
pub enum Transport {
Http { url: String },
UnixSocket { socket_path: PathBuf },
}
impl Transport {
pub fn from_url(url: &str) -> Result<Self, TransportError>;
pub fn is_unix_socket(&self) -> bool;
pub fn is_http(&self) -> bool;
}
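Assuming the API shown above, selecting a transport from a configured backend URL might look like:
// Usage sketch for the Transport abstraction above.
let remote = Transport::from_url("https://api.openai.com/v1")?;
assert!(remote.is_http());

let local = Transport::from_url("unix:///var/run/llama.sock")?;
assert!(local.is_unix_socket());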
Unix Socket Client¶
The UnixSocketClient provides HTTP-over-Unix-socket communication:
pub struct UnixSocketClient {
socket_path: PathBuf,
config: UnixSocketClientConfig,
}
impl UnixSocketClient {
pub async fn get(&self, endpoint: &str, headers: Option<Vec<(String, String)>>)
-> Result<UnixSocketResponse, UnixSocketError>;
pub async fn post(&self, endpoint: &str, headers: Option<Vec<(String, String)>>, body: Bytes)
-> Result<UnixSocketResponse, UnixSocketError>;
pub async fn health_check(&self) -> Result<bool, UnixSocketError>;
}
Security Features¶
- Path Traversal Protection: Validates socket paths to prevent directory traversal attacks
- CRLF Injection Protection: Validates endpoints and headers for HTTP header injection
- Response Size Limits: Configurable max response size (default: 100MB)
Platform Support¶
| Platform | Support |
|---|---|
| Linux | Full support via AF_UNIX |
| macOS | Full support via AF_UNIX |
| Windows | Planned for future releases |
Model Aggregation Service¶
Location: src/models/ (modular structure)
Purpose: Aggregate and cache model information from all backends, enrich with metadata, with cache stampede prevention
Module Structure (refactored from single models.rs file):
src/models/
├── mod.rs # Re-exports for backward compatibility
├── types.rs # Model, AggregatedModel, ModelList, SingleModelResponse types
├── metrics.rs # ModelMetrics tracking (includes stampede metrics)
├── cache.rs # ModelCache with stale-while-revalidate & singleflight
├── config.rs # ModelAggregationConfig
├── fetcher.rs # Model fetching from backends
├── handlers.rs # HTTP handlers for /v1/models and /v1/models/{model} endpoints
├── background_refresh.rs # Background periodic cache refresh service
├── utils.rs # Utility functions (normalize_model_id, etc.)
└── aggregation/ # Core aggregation logic
├── mod.rs # ModelAggregationService implementation
└── tests.rs # Unit tests
Cache Stampede Prevention¶
The model aggregation service implements three strategies to prevent cache stampede (thundering herd problem):
1. Singleflight Pattern: Only one aggregation request runs at a time. Concurrent requests wait for the ongoing aggregation to complete, then share the result.
2. Stale-While-Revalidate: When the cache is stale (between soft and hard TTL), return stale data immediately while triggering a background refresh. Clients get fast responses with potentially slightly stale data.
3. Background Periodic Refresh: A background task proactively refreshes the cache before expiration. Requests never block on cache refresh (except on cold start).
pub struct ModelAggregationService {
cache: ModelCache, // With singleflight lock and background refresh tracking
config: ModelAggregationConfig,
fetcher: ModelFetcher,
}
impl ModelAggregationService {
// Aggregate models with singleflight protection
pub async fn aggregate_models_with_singleflight(&self, state: &Arc<AppState>)
-> Result<AggregatedModelsResponse, StatusCode> { /* ... */ }
// Find backends with stale-while-revalidate support
pub async fn find_backends_for_model(&self, state: &Arc<AppState>, model_id: &str)
-> Vec<String> { /* ... */ }
// Clear cache (used during hot reload)
pub fn clear_cache(&self) { /* ... */ }
}
Cache TTL Configuration¶
The cache uses a dual-TTL approach:
| TTL Type | Duration | Behavior |
|---|---|---|
| Soft TTL | 80% of hard TTL | Triggers background refresh, returns stale data |
| Hard TTL | Configured value | Requires blocking refresh |
| Empty Response TTL | 5 seconds | Short TTL for empty responses to prevent DoS |
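A minimal sketch of the dual-TTL decision, assuming a cached entry that records when it was last refreshed (names here are illustrative, not the actual cache.rs types):
use std::time::{Duration, Instant};

enum CacheDecision {
    Fresh,            // serve directly from cache
    StaleButServable, // serve stale data, trigger a background refresh
    Expired,          // hard TTL exceeded: blocking refresh required
}

fn classify(last_refresh: Instant, hard_ttl: Duration) -> CacheDecision {
    let soft_ttl = hard_ttl.mul_f64(0.8); // soft TTL = 80% of the hard TTL
    let age = last_refresh.elapsed();
    if age < soft_ttl {
        CacheDecision::Fresh
    } else if age < hard_ttl {
        CacheDecision::StaleButServable
    } else {
        CacheDecision::Expired
    }
}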
Metrics¶
New metrics for cache stampede monitoring:
- stale_while_revalidate: Requests that returned stale data during refresh
- coalesced_requests: Requests that waited for ongoing aggregation
- background_refreshes: Background refresh operations initiated
- background_refresh_successes / background_refresh_failures: Background refresh outcomes
- singleflight_lock_acquired: Times the aggregation lock was acquired
Proxy Module¶
Location: src/proxy/ (modular structure)
Purpose: Handle request proxying, backend selection, file resolution, and image generation/editing
Module Structure (refactored from single proxy.rs file):
src/proxy/
├── mod.rs # Re-exports for backward compatibility
├── backend.rs # Backend selection and routing logic
├── request.rs # Request execution with retry logic
├── files.rs # File reference resolution in requests
├── image_gen.rs # Image generation handling (DALL-E, Gemini, GPT Image)
├── image_edit.rs # Image editing support (/v1/images/edits)
├── image_utils.rs # Image processing utilities (multipart, validation)
├── handlers.rs # HTTP handlers for proxy endpoints
├── utils.rs # Utility functions (error responses, etc.)
└── tests.rs # Unit tests
Key Responsibilities¶
- Backend Selection: Intelligent routing to available backends
- File Resolution: Resolve file references in chat requests
- Image Generation: Support for OpenAI (DALL-E, GPT Image) and Gemini (Nano Banana) image models
- Image Editing: Image editing and variations endpoints
- Request Retry: Automatic retry with exponential backoff
- Error Handling: Standardized error responses in OpenAI format
Retry Handler¶
Location: src/services/deduplication.rs
Purpose: Implement exponential backoff with jitter and request deduplication
pub struct EnhancedRetryHandler {
config: RetryConfig,
dedup_cache: Arc<Mutex<HashMap<String, CachedResponse>>>,
dedup_ttl: Duration,
}
pub struct RetryConfig {
pub max_attempts: u32,
pub base_delay: Duration,
pub max_delay: Duration,
pub exponential_backoff: bool,
pub jitter: bool,
}
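The delay schedule implied by RetryConfig can be sketched as follows (the exact formula in the router may differ slightly):
use rand::Rng;
use std::time::Duration;

fn retry_delay(config: &RetryConfig, attempt: u32) -> Duration {
    // Exponential growth: base_delay * 2^attempt, capped at max_delay.
    let mut delay = if config.exponential_backoff {
        config.base_delay.saturating_mul(2u32.saturating_pow(attempt))
    } else {
        config.base_delay
    };
    delay = delay.min(config.max_delay);
    if config.jitter {
        // Full jitter: pick a random point in [0, delay] to avoid synchronized retries.
        let millis = rand::thread_rng().gen_range(0..=delay.as_millis() as u64);
        delay = Duration::from_millis(millis);
    }
    delay
}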
Circuit Breaker¶
Location: src/core/circuit_breaker/
Purpose: Prevent cascading failures by automatically stopping requests to failing backends
pub struct CircuitBreaker {
states: Arc<DashMap<String, BackendCircuitState>>,
config: CircuitBreakerConfig,
metrics: Option<CircuitBreakerMetrics>,
}
pub struct CircuitBreakerConfig {
pub enabled: bool,
pub failure_threshold: u32, // Failures before opening (default: 5)
pub failure_rate_threshold: f64, // Failure rate threshold (default: 0.5)
pub minimum_requests: u32, // Min requests before rate calculation
pub timeout_seconds: u64, // How long circuit stays open (default: 60s)
pub half_open_max_requests: u32, // Max requests in half-open state
pub half_open_success_threshold: u32, // Successes needed to close
}
pub enum CircuitState {
Closed, // Normal operation - requests pass through
Open, // Failing fast - requests rejected immediately
HalfOpen, // Testing recovery - limited requests allowed
}
Key Features¶
- Per-backend circuit breakers with independent state
- Atomic operations for lock-free state checking in hot path
- Automatic state transitions based on success/failure patterns
- Sliding window for failure rate calculation
- Prometheus metrics for observability
- Admin endpoints for manual control
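A usage sketch of the hot path described above; allow_request, record_success, and record_failure are illustrative names rather than the exact CircuitBreaker API:
async fn call_backend(breaker: &CircuitBreaker, backend: &str) -> CoreResult<Response> {
    // Lock-free check in the hot path: an Open circuit rejects immediately.
    if !breaker.allow_request(backend) {
        return Err(circuit_open_error(backend)); // hypothetical fast-fail error helper
    }
    match send_request(backend).await {
        Ok(resp) => {
            breaker.record_success(backend); // may close a HalfOpen circuit
            Ok(resp)
        }
        Err(err) => {
            breaker.record_failure(backend); // may trip the circuit to Open
            Err(err)
        }
    }
}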
Container (Dependency Injection)¶
Location: src/core/container.rs
Purpose: Manage service lifecycles and dependencies
pub struct Container {
services: Arc<RwLock<HashMap<TypeId, Box<dyn Any + Send + Sync>>>>,
singletons: Arc<RwLock<HashMap<TypeId, Arc<dyn Any + Send + Sync>>>>,
}
impl Container {
// Register singleton service
pub async fn register_singleton<T>(&self, instance: Arc<T>) -> CoreResult<()>
where T: 'static + Send + Sync { /* ... */ }
// Resolve service dependency
pub async fn resolve<T>(&self) -> CoreResult<Arc<T>>
where T: 'static + Send + Sync { /* ... */ }
}
Data Flow¶
Request Processing Flow¶
sequenceDiagram
participant Client
participant HTTPLayer as HTTP Layer
participant ProxyService as Proxy Service
participant BackendService as Backend Service
participant ModelService as Model Service
participant Backend as LLM Backend
Client->>HTTPLayer: POST /v1/chat/completions
HTTPLayer->>HTTPLayer: Apply Middleware (auth, logging, metrics)
HTTPLayer->>ProxyService: Forward Request
ProxyService->>ModelService: Get Model Info
ModelService->>ModelService: Check Cache
alt Cache Miss
ModelService->>BackendService: Get Backends for Model
BackendService->>Backend: Query Models
Backend-->>BackendService: Model List
BackendService-->>ModelService: Filtered Backends
ModelService->>ModelService: Update Cache
end
ModelService-->>ProxyService: Model Available on Backends
ProxyService->>BackendService: Select Healthy Backend
BackendService->>BackendService: Apply Load Balancing
BackendService-->>ProxyService: Selected Backend
ProxyService->>Backend: Forward Request
Backend-->>ProxyService: Response (streaming or non-streaming)
ProxyService->>ProxyService: Apply Response Processing
ProxyService-->>HTTPLayer: Processed Response
HTTPLayer-->>Client: HTTP Response
Health Check Flow¶
sequenceDiagram
participant HealthService as Health Service
participant BackendPool as Backend Pool
participant Backend as LLM Backend
participant Cache as Health Cache
loop Every Interval
HealthService->>BackendPool: Get All Backends
BackendPool-->>HealthService: Backend List
par For Each Backend
HealthService->>Backend: GET /v1/models (or /health)
alt Success
Backend-->>HealthService: 200 OK + Model List
HealthService->>Cache: Update: consecutive_successes++
HealthService->>HealthService: Mark Healthy if threshold met
else Failure
Backend-->>HealthService: Error/Timeout
HealthService->>Cache: Update: consecutive_failures++
HealthService->>HealthService: Mark Unhealthy if threshold met
end
end
HealthService->>BackendPool: Update Backend Health Status
end
Hot Reload Service¶
Location: src/infrastructure/config/hot_reload.rs, src/infrastructure/config/hot_reload_service.rs
Purpose: Provide runtime configuration updates without server restart
The hot reload system enables zero-downtime configuration changes through automatic file watching and intelligent component updates.
Key Architecture Components¶
- ConfigManager: File system watching using the notify crate, publishes updates via a tokio::sync::watch channel
- HotReloadService: Computes configuration differences, classifies changes (immediate/gradual/restart)
- Component Updates: Interior mutability patterns (RwLock) for atomic updates to HealthChecker, CircuitBreaker, RateLimitStore, BackendPool
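A sketch of how a component might consume updates from that watch channel; the function and logging details are illustrative:
async fn apply_config_updates(mut rx: tokio::sync::watch::Receiver<Arc<Config>>) {
    while rx.changed().await.is_ok() {
        let new_config = rx.borrow_and_update().clone();
        // Immediate-update components swap their settings atomically behind RwLocks;
        // gradual-update components (e.g. backends) drain and rebuild instead.
        tracing::info!("configuration hot-reloaded");
        let _ = new_config;
    }
}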
Backend Pool Hot Reload¶
The BackendPool uses interior mutability (Arc<RwLock<Vec<Arc<Backend>>>>) to support runtime backend additions and removals without service interruption.
Backend States:
- Active: Normal operation, accepts new requests
- Draining: Removed from config, existing requests continue, no new requests
- Removed: Fully cleaned up after all references released
Graceful Draining Process:
- When a backend is removed from config, it's marked as Draining
- New requests skip draining backends
- Existing in-flight requests/streams continue uninterrupted (Arc reference counting)
- A background cleanup task runs every 10 seconds
- Backends are removed when no references remain OR the 5-minute timeout is exceeded
This ensures zero impact on ongoing connections during configuration changes.
Change Classification¶
- Immediate Update: logging.level, rate_limiting.*, circuit_breaker.*, retry.*, global_prompts.*
- Gradual Update: backends.*, health_checks.*, timeouts.*
- Requires Restart: server.bind_address, server.workers
Admin API: /admin/config/hot-reload-status for inspecting hot reload capabilities
For detailed hot reload configuration, process flow, and usage examples, see configuration.md section on hot reload.
Configuration Migration System¶
Location: src/infrastructure/config/{migrator,migration,migrations,fixer,backup}.rs
Purpose: Automatically detect and fix configuration issues, migrate schemas, and ensure configuration validity
The configuration migration system provides a comprehensive solution for handling configuration evolution and maintenance. It automatically:
- Detects and migrates outdated schema versions
- Fixes common syntax errors in YAML/TOML files
- Validates and corrects configuration values
- Creates backups before making changes
- Provides dry-run capability for previewing changes
Architecture Components¶
1. Migration Orchestrator (migrator.rs)
   - Main entry point for migration operations
   - Coordinates the entire migration workflow
   - Manages backup creation and restoration
   - Implements security validations (path traversal, file size limits)
2. Migration Framework (migration.rs)
   - Defines core types and traits for migrations
   - Migration trait for implementing version upgrades
   - ConfigIssue enum for categorizing problems
   - MigrationResult for tracking changes
3. Schema Migrations (migrations.rs)
   - Concrete migration implementations (e.g., V1ToV2Migration)
   - Transforms configuration structure between versions
   - Example: converting backend_url to a backends array
4. Auto-Correction Engine (fixer.rs)
   - Detects and fixes common configuration errors
   - Duration format correction (e.g., "10 seconds" → "10s"), see the sketch after this list
   - URL validation and protocol addition
   - Field deprecation handling
5. Backup Manager (backup.rs)
   - Creates timestamped backups before modifications
   - Implements resource limits (10MB per file, 100MB total, max 50 backups)
   - Automatic cleanup of old backups
   - Preserves file permissions
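The duration correction performed by the auto-correction engine can be illustrated with a small sketch (not the actual fixer.rs code):
fn normalize_duration(value: &str) -> String {
    let v = value.trim().to_lowercase();
    // "10 seconds" / "10 second" → "10s"; other units follow the same pattern.
    if let Some(num) = v.strip_suffix(" seconds").or_else(|| v.strip_suffix(" second")) {
        return format!("{}s", num.trim());
    }
    if let Some(num) = v.strip_suffix(" minutes").or_else(|| v.strip_suffix(" minute")) {
        return format!("{}m", num.trim());
    }
    v // already in canonical form (or unknown): leave unchanged
}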
Migration Workflow¶
graph TD
A[Read Config File] --> B[Validate Path & Size]
B --> C[Create Backup]
C --> D[Parse Configuration]
D --> E{Parse Success?}
E -->|No| F[Fix Syntax Errors]
F --> D
E -->|Yes| G[Detect Schema Version]
G --> H{Needs Migration?}
H -->|Yes| I[Apply Migrations]
H -->|No| J[Validate Values]
I --> J
J --> K{Issues Found?}
K -->|Yes| L[Apply Auto-Fixes]
K -->|No| M[Return Config]
L --> N[Write Updated Config]
N --> M
Security Features¶
- Path Traversal Protection: Validates paths to prevent directory traversal attacks
- File Size Limits: Maximum 10MB configuration files to prevent DoS
- Format Validation: Only processes .yaml, .yml, and .toml files
- System Directory Protection: Blocks access to sensitive system paths
- Test Mode Relaxation: Uses conditional compilation for test-friendly validation
Example Migration: v1.0 to v2.0¶
// V1ToV2Migration implementation
fn migrate(&self, config: &mut Value) -> Result<(), MigrationError> {
// Convert single backend_url to backends array
if let Some(backend_url) = config.get("backend_url") {
let mut backends = Vec::new();
let mut backend = Map::new();
backend.insert("url".to_string(), backend_url.clone());
// Move models to backend
if let Some(model) = config.get("model") {
backend.insert("models".to_string(),
Value::Sequence(vec![model.clone()]));
}
backends.push(Value::Mapping(backend));
config["backends"] = Value::Sequence(backends);
// Remove old fields
config.remove("backend_url");
config.remove("model");
}
Ok(())
}
Configuration Loading Flow¶
graph TD
A[Application Start] --> B[Config Manager Init]
B --> C{Config File Specified?}
C -->|Yes| D[Load Specified File]
C -->|No| E[Search Standard Locations]
E --> F{Config File Found?}
F -->|Yes| G[Load Config File]
F -->|No| H[Use CLI Args + Env Vars + Defaults]
D --> I[Parse YAML]
G --> I
H --> J[Create Config from Args]
I --> K[Apply Environment Variable Overrides]
J --> K
K --> L[Apply CLI Argument Overrides]
L --> M[Validate Configuration]
M --> N{Valid?}
N -->|Yes| O[Return Config]
N -->|No| P[Exit with Error]
O --> Q[Start File Watcher for Hot Reload]
Q --> R[Application Running]
Q --> S[Config File Changed]
S --> T[Reload and Validate]
T --> U{Valid?}
U -->|Yes| V[Apply New Config]
U -->|No| W[Log Error, Keep Old Config]
V --> R
W --> R
Dependency Injection¶
Service Registration¶
Services are registered in the container during application startup:
// In main.rs
async fn setup_services(config: Config) -> Result<ServiceRegistry, Error> {
let container = Arc::new(Container::new());
// Register infrastructure services
container.register_singleton(Arc::new(
HttpClient::new(&config.http_client)?
)).await?;
container.register_singleton(Arc::new(
BackendManager::new(&config.backends)?
)).await?;
// Register core services
container.register_singleton(Arc::new(
BackendServiceImpl::new(container.clone())
)).await?;
container.register_singleton(Arc::new(
ModelServiceImpl::new(container.clone())
)).await?;
// Create service registry
let registry = ServiceRegistry::new(container);
registry.initialize().await?;
Ok(registry)
}
Service Dependencies¶
Services declare their dependencies through constructor injection:
pub struct ProxyServiceImpl {
backend_service: Arc<dyn BackendService>,
model_service: Arc<dyn ModelService>,
retry_handler: Arc<dyn RetryHandler>,
http_client: Arc<HttpClient>,
}
impl ProxyServiceImpl {
pub fn new(container: Arc<Container>) -> CoreResult<Self> {
Ok(Self {
backend_service: container.resolve()?,
model_service: container.resolve()?,
retry_handler: container.resolve()?,
http_client: container.resolve()?,
})
}
}
Benefits¶
- Testability: Services can be mocked for unit testing
- Flexibility: Implementations can be swapped without code changes
- Lifecycle Management: Container manages service initialization and cleanup
- Circular Dependency Detection: Container prevents circular dependencies
Error Handling Strategy¶
The router implements a comprehensive error handling strategy with typed errors, intelligent recovery, and user-friendly responses.
Error Type Hierarchy¶
- CoreError: Domain-level errors (validation, service failures, timeouts, configuration)
- RouterError: Application-level errors combining Core, HTTP, Backend, and Model errors
- HttpError: HTTP-specific errors (400 BadRequest, 401 Unauthorized, 404 NotFound, 500 InternalServerError, etc.)
Error Handling Principles¶
- Fail Fast: Validate inputs early with clear error messages
- Error Context: Include relevant context (field names, operation details)
- Retryable Classification: Distinguish between retryable (timeout, 503) and non-retryable (400, 401) errors
- User-Friendly Responses: Convert internal errors to OpenAI-compatible error format
- Structured Logging: Log errors with appropriate severity and context
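As an illustration of the retryable classification above (the real logic lives in src/core/errors.rs and may differ in detail):
use http::StatusCode;

fn is_retryable(status: Option<StatusCode>, is_timeout: bool) -> bool {
    if is_timeout {
        return true; // transient: retry with backoff
    }
    match status {
        Some(s) if s == StatusCode::SERVICE_UNAVAILABLE => true, // 503: overloaded or warming up
        Some(s) if s == StatusCode::TOO_MANY_REQUESTS => true,   // 429: retry after backoff
        Some(s) if s.is_client_error() => false,                 // 400/401/404: caller must fix the request
        Some(s) if s.is_server_error() => true,                  // other 5xx: worth another attempt
        _ => false,
    }
}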
Error Recovery Mechanisms¶
- Circuit Breaker: Prevent cascading failures (see Circuit Breaker)
- Retry with Exponential Backoff: Automatically retry transient failures
- Model Fallback: Route to alternative models when primary unavailable (see Model Fallback System)
- Graceful Degradation: Continue with reduced functionality when components fail
For detailed error handling, recovery strategies, monitoring, and troubleshooting, see error-handling.md.
Extension Points¶
Backend Type Architecture¶
The router supports multiple backend types with different API formats. Each backend type handles request/response transformation automatically.
Supported Backend Types¶
| Backend Type | API Format | Authentication | Use Case |
|---|---|---|---|
| openai | OpenAI Chat Completions | Authorization: Bearer | OpenAI, Azure OpenAI, vLLM, LocalAI |
| anthropic | Anthropic Messages API | x-api-key header | Claude models via native API |
| gemini | OpenAI-compatible | Authorization: Bearer | Google Gemini via OpenAI compatibility layer |
Anthropic Backend Architecture¶
The Anthropic backend provides native support for Claude models with automatic format translation:
Key Transformations¶
Request Format Differences:
| Aspect | OpenAI Format | Anthropic Format |
|---|---|---|
| System prompt | messages[0].role="system" | Separate system parameter |
| Auth header | Authorization: Bearer | x-api-key |
| Max tokens | Optional | Required (max_tokens) |
| Images | image_url.url | source.type + source.data |
Extended Thinking Support:
// OpenAI reasoning_effort → Anthropic thinking
{
"reasoning_effort": "high" // OpenAI format
}
// Transforms to:
{
"thinking": {
"type": "enabled",
"budget_tokens": 32768 // Mapped from effort level
}
}
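A sketch of the effort-to-budget mapping; only the "high" → 32768 value comes from the example above, the other budgets are assumptions:
fn thinking_budget_tokens(reasoning_effort: &str) -> Option<u32> {
    match reasoning_effort {
        "low" => Some(1_024),    // assumed value
        "medium" => Some(8_192), // assumed value
        "high" => Some(32_768),  // matches the example above
        _ => None,               // unknown effort: omit the thinking block entirely
    }
}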
Tool Calling Transformation¶
The router provides automatic transformation of OpenAI-format tool definitions to backend-native formats, enabling cross-provider tool calling support.
Location: src/core/tool_calling/
Module Structure:
src/core/tool_calling/
├── mod.rs # Module exports
├── definitions.rs # Type definitions (ToolDefinition, FunctionDefinition, etc.)
├── transform.rs # Transformation functions for each backend
├── tool_choice.rs # Tool choice transformation for each backend
├── response.rs # Tool call response extraction from backends
├── streaming.rs # Streaming tool call transformation
└── messages.rs # Multi-turn conversation message transformation
Supported Backends¶
| Backend | Input Format | Output Format | Notes |
|---|---|---|---|
| OpenAI | Native | Pass-through | Validates tool names |
| Anthropic | OpenAI | input_schema format | parameters → input_schema |
| Gemini | OpenAI | functionDeclarations | Nested structure |
| llama.cpp | OpenAI | Pass-through | Validates with --jinja flag |
OpenAI Tool Format (Input)¶
{
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": { "type": "string" }
},
"required": ["location"]
}
}
}
]
}
Anthropic Transformation¶
// transform_tools_to_anthropic()
// Input: OpenAI format with `function.parameters`
// Output: Anthropic format with `input_schema`
{
"tools": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": { "type": "string" }
},
"required": ["location"]
}
}
]
}
Gemini Transformation¶
// transform_tools_to_gemini()
// Input: OpenAI format
// Output: Gemini nested functionDeclarations format
{
"tools": [
{
"functionDeclarations": [
{
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": { "type": "string" }
},
"required": ["location"]
}
}
]
}
]
}
Validation Rules¶
- Tool Name: Must match pattern ^[a-zA-Z0-9_-]+$ (alphanumeric, underscore, hyphen only)
- Empty Name: Rejected with a validation error
- Missing Fields: Logged as warnings, tool skipped (no silent data loss)
- Type Field: Must be "function" for standard tool definitions
Integration with Fallback System¶
The tool calling transformation integrates with the model fallback system via ParameterTranslator:
// In src/core/fallback/translation.rs
impl ParameterTranslator {
pub fn translate_tools(
&self,
tools: &Value,
from_backend_type: BackendType,
to_backend_type: BackendType,
) -> CoreResult<Value> {
match (from_backend_type, to_backend_type) {
(_, BackendType::Anthropic) => transform_tools_to_anthropic(tools),
(_, BackendType::Gemini) => transform_tools_to_gemini(tools),
(_, BackendType::LlamaCpp) => transform_tools_for_llamacpp(tools),
_ => Ok(tools.clone()),
}
}
}
This enables seamless fallback from one provider to another while preserving tool definitions.
Tool Choice Transformation¶
In addition to tool definitions, the router also transforms the tool_choice parameter which controls model tool-calling behavior.
Supported Values:
| OpenAI Value | Description | Anthropic | Gemini |
|---|---|---|---|
"auto" | Model decides whether to call tools | {"type": "auto"} | mode: "AUTO" |
"none" | Model should not call tools | Remove tools entirely | mode: "NONE" |
"required" | Model must call at least one tool | {"type": "any"} | mode: "ANY" |
{"type": "function", "function": {"name": "X"}} | Force specific function | {"type": "tool", "name": "X"} | mode: "ANY", allowed_function_names: ["X"] |
Edge Cases:
- Anthropic "none" workaround: Anthropic API does not support
tool_choice=none. When this value is detected, the router removestoolsandtool_choiceentirely from the request. - llama.cpp: Preserves
parallel_tool_callsparameter for parallel function calling support.
Implementation:
// In src/core/tool_calling/tool_choice.rs
pub enum ToolChoiceValue {
Auto,
None,
Required,
Function(String),
}
impl ToolChoiceValue {
pub fn from_openai(value: &Value) -> CoreResult<Self>;
pub fn to_anthropic(&self) -> Option<Value>; // Returns None for "none"
pub fn to_gemini(&self) -> Value;
}
Streaming Tool Call Transformation¶
When streaming is enabled, tool calls are returned incrementally as delta events rather than complete objects. Different providers use distinct streaming formats, requiring real-time transformation to maintain OpenAI API compatibility.
Location: src/core/tool_calling/streaming.rs
Key Components:
| Component | Purpose |
|---|---|
| ToolCallAccumulator | Tracks streaming tool call state (index, id, name, arguments) |
| StreamingToolCallTransformer | State machine for Anthropic → OpenAI delta transformation |
| transform_gemini_streaming_tool_call() | Gemini → OpenAI delta transformation |
Streaming Format Comparison:
| Provider | Format | Characteristics |
|---|---|---|
| OpenAI | Delta chunks with tool_calls[].function.arguments | Arguments arrive in fragments |
| Anthropic | Event-based: content_block_start, content_block_delta, content_block_stop | input_json_delta contains partial JSON |
| Gemini | Complete functionCall objects in each chunk | Arguments arrive complete per chunk |
State Machine (Anthropic):
IDLE → message_start → READY
READY → content_block_start (tool_use) → TOOL_STARTED
TOOL_STARTED → content_block_delta → ACCUMULATING
ACCUMULATING → content_block_delta → ACCUMULATING (loop)
ACCUMULATING → content_block_stop → TOOL_COMPLETE
TOOL_COMPLETE → message_delta → FINISHED
Safety Limits:
To prevent memory exhaustion from malformed streams:
- MAX_TOOL_CALLS = 64: Maximum parallel tool calls per message
- MAX_ARGUMENTS_SIZE = 1MB: Maximum accumulated arguments per tool call
Example Transformation (Anthropic → OpenAI):
// Anthropic input: content_block_start
{"type": "content_block_start", "index": 0,
"content_block": {"type": "tool_use", "id": "toolu_123", "name": "get_weather"}}
// OpenAI output: delta chunk
{"id": "chatcmpl-...", "choices": [{"delta": {"tool_calls": [
{"index": 0, "id": "toolu_123", "type": "function",
"function": {"name": "get_weather", "arguments": ""}}
]}}]}
Finish Reason Mapping:
| Source (Anthropic/Gemini) | Target (OpenAI) |
|---|---|
| tool_use | tool_calls |
| end_turn / STOP | stop |
| max_tokens / MAX_TOKENS | length |
| SAFETY / RECITATION | content_filter |
Multi-Turn Conversation Message Transformation¶
When using tools in multi-turn conversations, the message history contains tool calls from the assistant and tool results from the client. These messages require format transformation when routing to different backends.
Location: src/core/tool_calling/messages.rs
Key Functions:
| Function | Purpose |
|---|---|
| transform_tool_message_to_anthropic() | Converts OpenAI tool result to Anthropic format |
| transform_tool_message_to_gemini() | Converts OpenAI tool result to Gemini format |
| transform_assistant_tool_calls_to_anthropic() | Converts assistant message with tool_calls to Anthropic |
| transform_assistant_tool_calls_to_gemini() | Converts assistant message with tool_calls to Gemini |
| transform_messages_with_tools() | Transforms entire conversation history |
| find_function_name_for_tool_call() | Looks up function name by tool_call_id |
Message Format Differences:
| Aspect | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Tool result role | tool | user (with content block) | function |
| ID reference | tool_call_id | tool_use_id | By name matching |
| Content format | String | String or structured | response object |
| Assistant role | assistant | assistant | model |
OpenAI Format (Input):
{
"messages": [
{"role": "user", "content": "What's the weather in NYC?"},
{
"role": "assistant",
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {"name": "get_weather", "arguments": "{\"location\": \"NYC\"}"}
}]
},
{
"role": "tool",
"tool_call_id": "call_abc123",
"content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
}
]
}
Anthropic Transformation:
{
"messages": [
{"role": "user", "content": "What's the weather in NYC?"},
{
"role": "assistant",
"content": [{
"type": "tool_use",
"id": "call_abc123",
"name": "get_weather",
"input": {"location": "NYC"}
}]
},
{
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": "call_abc123",
"content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
}]
}
]
}
Gemini Transformation:
{
"contents": [
{"role": "user", "parts": [{"text": "What's the weather in NYC?"}]},
{
"role": "model",
"parts": [{
"functionCall": {"name": "get_weather", "args": {"location": "NYC"}}
}]
},
{
"role": "function",
"parts": [{
"functionResponse": {
"name": "get_weather",
"response": {"temperature": 72, "condition": "sunny"}
}
}]
}
]
}
Multiple Tool Results:
When multiple tools are called in parallel, consecutive tool result messages are combined for Anthropic into a single user message:
// Multiple OpenAI tool results:
{"role": "tool", "tool_call_id": "call_1", "content": "72F, sunny"}
{"role": "tool", "tool_call_id": "call_2", "content": "3:00 PM EST"}
// Combined Anthropic format:
{
"role": "user",
"content": [
{"type": "tool_result", "tool_use_id": "call_1", "content": "72F, sunny"},
{"type": "tool_result", "tool_use_id": "call_2", "content": "3:00 PM EST"}
]
}
Error Handling:
- Missing tool_call_id: Returns a validation error
- Malformed tool calls: Logged as warnings, skipped without failing the entire request
- Unknown function name for Gemini: Falls back to "unknown" with a warning log
- Malformed JSON arguments: Falls back to empty object {}
- is_error indicator: Preserved when transforming to Anthropic format
Gemini 3 thoughtSignature Support¶
Gemini 3 models introduce a new thoughtSignature field for function calling that must be preserved and passed back in multi-turn conversations. The router automatically handles this through the transformation pipeline.
What is thoughtSignature?
When Gemini 3+ models return function calls, they include an encrypted thoughtSignature field that encapsulates the model's internal reasoning context. This signature must be included when sending the function result back to continue the conversation correctly.
Response Extraction (Gemini -> Client):
The router extracts thoughtSignature from Gemini responses and includes it in the OpenAI-compatible format:
// Gemini native response (with thoughtSignature)
{
"candidates": [{
"content": {
"parts": [{
"functionCall": {"name": "get_weather", "args": {"location": "NYC"}},
"thoughtSignature": "encrypted_signature_abc123"
}]
}
}]
}
// Router's OpenAI-compatible response
{
"choices": [{
"message": {
"tool_calls": [{
"id": "call_xyz789",
"type": "function",
"function": {"name": "get_weather", "arguments": "{\"location\":\"NYC\"}"},
"extra_content": {
"google": {
"thought_signature": "encrypted_signature_abc123"
}
}
}]
}
}]
}
Request Injection (Client -> Gemini):
When clients send tool results back, the router extracts thought_signature from extra_content.google and injects it as thoughtSignature in the Gemini request:
// Client's OpenAI-format request
{
"messages": [{
"role": "assistant",
"tool_calls": [{
"id": "call_xyz789",
"function": {"name": "get_weather", "arguments": "{}"},
"extra_content": {
"google": {"thought_signature": "encrypted_signature_abc123"}
}
}]
}]
}
// Transformed Gemini request
{
"contents": [{
"role": "model",
"parts": [{
"functionCall": {"name": "get_weather", "args": {}},
"thoughtSignature": "encrypted_signature_abc123"
}]
}]
}
Streaming Support:
The streaming transformer also handles thoughtSignature extraction, including it in tool call delta chunks.
Backward Compatibility:
- Gemini 2.x models do not return thoughtSignature; the router gracefully handles this
- Clients that don't preserve extra_content.google.thought_signature can still make function calls (but may have degraded conversation continuity on Gemini 3+)
- The extra_content field is ignored by non-Gemini backends
Implementation Locations:
| Component | File | Function |
|---|---|---|
| Response extraction | src/infrastructure/backends/gemini/transform.rs | transform_response_gemini(), transform_native_gemini_response() |
| Request injection | src/core/tool_calling/messages.rs | transform_assistant_tool_calls_to_gemini() |
| Streaming extraction | src/infrastructure/backends/gemini/stream.rs | transform_native_gemini_event(), create_tool_call_chunk_with_signature() |
Adding New Backend Types¶
1. Implement Backend Trait:

// In src/infrastructure/backends/custom_backend.rs
pub struct CustomBackend {
    client: Arc<HttpClient>,
    config: CustomBackendConfig,
}

#[async_trait]
impl BackendTrait for CustomBackend {
    async fn health_check(&self) -> CoreResult<()> { /* ... */ }
    async fn list_models(&self) -> CoreResult<Vec<Model>> { /* ... */ }
    async fn chat_completion(&self, request: ChatRequest) -> CoreResult<Response> { /* ... */ }
}

2. Register in Backend Factory:

// In src/infrastructure/backends/mod.rs
pub fn create_backend(backend_type: &str, config: &BackendConfig) -> CoreResult<Box<dyn BackendTrait>> {
    match backend_type {
        "openai" => Ok(Box::new(OpenAIBackend::new(config)?)),
        "vllm" => Ok(Box::new(VLLMBackend::new(config)?)),
        "custom" => Ok(Box::new(CustomBackend::new(config)?)), // New backend
        _ => Err(CoreError::ValidationFailed {
            message: format!("Unknown backend type: {}", backend_type),
            field: Some("backend_type".to_string()),
        }),
    }
}
Adding New Middleware¶
1. Implement Middleware Trait:

// In src/http/middleware/custom_middleware.rs
pub struct CustomMiddleware {
    config: CustomConfig,
}

impl<S> tower::Layer<S> for CustomMiddleware {
    type Service = CustomMiddlewareService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        CustomMiddlewareService { inner, config: self.config.clone() }
    }
}

2. Register in HTTP Router:
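A minimal sketch of wiring the new layer into the axum router; handler and variable names are placeholders:
// In src/http/routes.rs (sketch)
let router = Router::new()
    .route("/v1/chat/completions", post(chat_completions_handler))
    .layer(CustomMiddleware { config: custom_config });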
Adding New Cache Types¶
1. Implement Cache Trait:

// In src/infrastructure/cache/redis_cache.rs
pub struct RedisCache {
    client: redis::Client,
    ttl: Duration,
}

#[async_trait]
impl CacheTrait for RedisCache {
    async fn get<T>(&self, key: &str) -> CoreResult<Option<T>>
    where T: DeserializeOwned { /* ... */ }

    async fn set<T>(&self, key: &str, value: &T, ttl: Option<Duration>) -> CoreResult<()>
    where T: Serialize { /* ... */ }
}

2. Use in Service:
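A sketch of consuming the new cache through the CacheTrait abstraction so the service stays unaware of the concrete backend; the struct and constructor names are illustrative:
pub struct CachedModelLookup<C: CacheTrait> {
    cache: Arc<C>,
}

impl<C: CacheTrait> CachedModelLookup<C> {
    pub fn new(cache: Arc<C>) -> Self {
        Self { cache }
    }
}

// Swapping implementations is then a construction-time change:
// let lookup = CachedModelLookup::new(Arc::new(RedisCache { client, ttl }));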
Adding New Load Balancing Strategies¶
// In src/services/load_balancer.rs
pub enum LoadBalancingStrategy {
RoundRobin,
WeightedRoundRobin,
LeastConnections, // New strategy
Random,
}
impl LoadBalancingStrategy {
pub fn select_backend(&self, backends: &[Backend]) -> Option<&Backend> {
match self {
Self::RoundRobin => /* ... */,
Self::WeightedRoundRobin => /* ... */,
Self::LeastConnections => self.select_least_connections(backends),
Self::Random => /* ... */,
}
}
}
Design Decisions¶
Why 4-Layer Architecture?¶
Decision: Use a 4-layer architecture (HTTP → Services → Infrastructure → Core)
Rationale¶
- Clear Separation: Each layer has distinct responsibilities
- Testability: Layers can be tested independently
- Maintainability: Changes in one layer don't affect others
- Flexibility: Easy to swap implementations (e.g., different cache backends)
Trade-offs¶
- ✅ Pros: Clean, maintainable, testable, extensible
- ❌ Cons: More complexity, slight performance overhead
- Verdict: Benefits outweigh costs for a production system
Why Dependency Injection?¶
Decision: Use a custom DI container instead of compile-time injection
Rationale¶
- Runtime Flexibility: Can swap implementations based on configuration
- Service Lifecycle: Centralized management of service initialization/cleanup
- Testing: Easy to inject mocks and test doubles
Alternatives Considered¶
- Manual dependency passing: Too verbose and error-prone
- Compile-time DI (generics): Less flexible, harder to configure
Why Arc<RwLock<T>> for Shared State?¶
Decision: Use Arc<RwLock<T>> for shared mutable state
Rationale¶
- Reader-Writer Semantics: Multiple readers, exclusive writers
- Performance: Better than Arc<Mutex<T>> for read-heavy workloads
- Safety: Prevents data races at compile time
Alternatives Considered¶
- Arc<Mutex<T>>: Simpler but worse performance for reads
- Channels: Too complex for simple shared state
- Atomic types: Not suitable for complex data structures
Why async/await Throughout?¶
Decision: Use async/await for all I/O operations
Rationale¶
- Performance: Non-blocking I/O allows high concurrency
- Resource Efficiency: Lower memory usage than thread-per-request
- Ecosystem: Rust async ecosystem (Tokio, reqwest, axum) is mature
Trade-offs¶
- ✅ Pros: High performance, low resource usage, good ecosystem
- ❌ Cons: Complexity, learning curve, debugging challenges
- Verdict: Essential for high-performance network services
Why Configuration Hot-Reload?¶
Decision: Support configuration hot-reload using file watching
Rationale¶
- Zero Downtime: Update configuration without restarting
- Operations Friendly: Easy to adjust settings in production
- Development: Faster iteration during development
Implementation¶
- File system watcher detects changes
- Validate new configuration before applying
- Atomic updates to avoid inconsistent state
- Fallback to previous config on validation errors
Performance Considerations¶
Memory Management¶
- Connection Pooling: Reuse HTTP connections to reduce allocation overhead
- Smart Caching: LRU eviction prevents unbounded memory growth
- Arc Cloning: Cheap reference counting instead of deep cloning
- Streaming: Process responses in chunks to avoid loading large responses into memory
Concurrency¶
- RwLock for Read-Heavy Workloads: Multiple concurrent readers for backend pool and model cache
- Lock-Free Where Possible: Use atomics for counters and simple state
- Async Task Spawning: Background tasks for health checks and cache updates
- Bounded Channels: Prevent unbounded queuing of tasks
I/O Optimization¶
- Connection Keep-Alive: TCP connections stay open for reuse
- Streaming Responses: Forward SSE chunks without buffering
- Timeouts: Prevent hanging on slow backends
- Retry with Backoff: Avoid overwhelming failing backends
Memory Layout¶
// Optimized data structures for cache efficiency
pub struct Backend {
pub name: String, // Inline string for small names
pub url: Arc<str>, // Shared string for URL
pub weight: u32, // Compact integer
pub is_healthy: AtomicBool, // Lock-free health status
}
// Cache-friendly model storage
pub struct ModelCache {
models: HashMap<String, Arc<ModelInfo>>, // Shared model info
last_updated: AtomicU64, // Lock-free timestamp
ttl: Duration,
}
Benchmarking Results¶
Based on our benchmarks (see benches/performance_benchmarks.rs):
- Request Latency: < 5ms overhead for routing decisions
- Memory Usage: ~50MB base memory, scales linearly with backends
- Throughput: 1000+ requests/second on modest hardware
- Connection Efficiency: 100+ concurrent connections per backend with minimal memory overhead
Rate Limiting¶
The router implements sophisticated rate limiting to protect against abuse and ensure fair resource allocation across clients.
Key Features:
- Dual-window approach: sustained limit (100 req/min) + burst protection (20 req/5s)
- Client identification by API key (preferred) or IP address (fallback)
- Per-client isolation with automatic cache cleanup
- DoS prevention with short TTL for empty responses
Rate Limit V2 Architecture¶
The enhanced rate limiting system (rate_limit_v2/) provides a modular, high-performance implementation:
Module Structure¶
src/http/middleware/rate_limit_v2/
├── mod.rs # Public API and module exports
├── middleware.rs # Axum middleware integration
├── store.rs # Rate limit storage and client tracking
└── token_bucket.rs # Token bucket algorithm implementation
Components¶
1. Token Bucket Algorithm (token_bucket.rs), see the sketch after this list
   - Configurable bucket capacity and refill rate
   - Atomic operations for lock-free token consumption
   - Automatic token replenishment based on elapsed time
   - Separate buckets for sustained and burst limits
2. Rate Limit Store (store.rs)
   - Per-client state tracking with DashMap for concurrent access
   - Automatic cleanup of expired client entries
   - Configurable TTL for inactive clients (default: 1 hour)
   - Memory-efficient with bounded storage
3. Middleware Integration (middleware.rs)
   - Extracts the client identifier (API key → IP address fallback)
   - Checks both sustained and burst limits before processing
   - Returns HTTP 429 (Too Many Requests) with a Retry-After header
   - Prometheus metrics for monitoring rate limit hits
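A minimal token-bucket sketch illustrating the algorithm; the real implementation uses atomics, this version uses plain fields for clarity:
use std::time::Instant;

struct TokenBucket {
    capacity: f64,       // maximum burst size
    tokens: f64,         // currently available tokens
    refill_per_sec: f64, // sustained rate
    last_refill: Instant,
}

impl TokenBucket {
    fn try_consume(&mut self, now: Instant) -> bool {
        // Replenish based on elapsed time, capped at capacity.
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0; // one request consumes one token
            true
        } else {
            false // over the limit → respond with HTTP 429
        }
    }
}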
Configuration Example¶
rate_limiting:
enabled: true
sustained:
max_requests: 100
window_seconds: 60
burst:
max_requests: 20
window_seconds: 5
cleanup_interval_seconds: 300
Decision Flow¶
Request arrives
↓
Extract client ID (API key or IP)
↓
Check sustained limit (100 req/min)
↓ OK
Check burst limit (20 req/5s)
↓ OK
Process request
For detailed configuration information, see configuration.md section on rate limiting.
Model Fallback System¶
The router implements a configurable model fallback system that automatically routes requests to alternative models when the primary model is unavailable.
Key Features:
- Automatic fallback chain execution (e.g., gpt-4o → gpt-4-turbo → gpt-3.5-turbo)
- Cross-provider fallback support with parameter translation
- Integration with circuit breaker for intelligent triggering
- Prometheus metrics for monitoring fallback usage
For detailed configuration and implementation, see error-handling.md section on model fallback.
Circuit Breaker¶
The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.
Three-State Machine:
| State | Behavior |
|---|---|
| Closed | Normal operation. Failures are counted. |
| Open | Fast-fail mode. Requests rejected immediately. |
| HalfOpen | Recovery testing. Limited requests allowed. |
Key Features:
- Per-backend isolation with independent state
- Lock-free atomic operations for minimal hot-path overhead
- Admin endpoints for manual control (/admin/circuit/*)
- Prometheus metrics for observability
For detailed configuration and implementation, see error-handling.md section on circuit breaker.
File Storage¶
The router provides OpenAI Files API compatible file storage with persistent metadata.
Key Features:
- Persistent metadata storage with sidecar JSON files
- Automatic recovery on server restart
- Orphan file detection and cleanup
- Pluggable backends (memory/persistent)
For detailed architecture and implementation, see File Storage Guide.
Image Generation Architecture¶
The router provides a unified interface for image generation across multiple backends (OpenAI GPT Image, DALL-E, and Google Gemini/Nano Banana) with automatic parameter translation.
Multi-Backend Image Generation¶
OpenAI → Gemini Parameter Conversion¶
When using Nano Banana (Gemini) models, OpenAI-style parameters are automatically converted to Gemini's native format:
Size to Aspect Ratio Mapping¶
| OpenAI size | Gemini aspectRatio | Gemini imageSize | Notes |
|---|---|---|---|
| 256x256 | 1:1 | 1K | Minimum Gemini size |
| 512x512 | 1:1 | 1K | Minimum Gemini size |
| 1024x1024 | 1:1 | 1K | Default |
| 1536x1024 | 3:2 | 1K | Landscape (new) |
| 1024x1536 | 2:3 | 1K | Portrait (new) |
| 1792x1024 | 16:9 | 1K | Wide landscape |
| 1024x1792 | 9:16 | 1K | Tall portrait |
| 2048x2048 | 1:1 | 2K | Pro models only |
| 4096x4096 | 1:1 | 4K | Pro models only |
| auto | 1:1 | 1K | Default fallback |
Request Transformation¶
OpenAI Format (Input):
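For reference, an OpenAI-style request that converts to the Gemini payload below (1536x1024 maps to a 3:2 aspect ratio per the table above) might look roughly like this; the model name is illustrative:
{
  "model": "nano-banana",
  "prompt": "A serene Japanese garden",
  "size": "1536x1024",
  "n": 1
}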
Gemini Format (Converted):
{
"contents": [
{
"parts": [{"text": "A serene Japanese garden"}]
}
],
"generationConfig": {
"imageConfig": {
"aspectRatio": "3:2",
"imageSize": "1K"
}
}
}
Conversion Implementation¶
The conversion is handled by src/infrastructure/backends/gemini/image_generation.rs:
pub fn convert_openai_to_gemini(request: &OpenAIImageRequest)
-> CoreResult<(String, GeminiImageRequest)>
{
// 1. Map model name
let gemini_model = map_model_to_gemini(&request.model);
// 2. Parse size to aspect ratio and size category
let parsed_size = parse_openai_size(&request.size, &request.model)?;
// 3. Build Gemini request with imageConfig
let gemini_request = GeminiImageRequest {
contents: vec![GeminiContent { parts: vec![...] }],
generation_config: Some(GeminiGenerationConfig {
image_config: Some(GeminiImageConfig {
aspect_ratio: Some(parsed_size.aspect_ratio.to_gemini_string()),
image_size: Some(parsed_size.size_category.to_gemini_image_size()),
}),
}),
};
Ok((gemini_model, gemini_request))
}
Streaming Image Generation (SSE)¶
For GPT Image models, the router supports true SSE passthrough for streaming image generation:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────stream:true─▶│ Router │────stream:true─▶│ OpenAI │
│ │ │ │ │ │
│ │◀───SSE events──│ Passthrough│◀───SSE events──│ │
└─────────────┘ └─────────────┘ └─────────────┘
SSE Event Types:
| Event | Description |
|---|---|
| image_generation.partial_image | Intermediate preview during generation |
| image_generation.complete | Final image data |
| image_generation.usage | Token usage for billing |
| done | Stream completion |
Implementation (src/proxy/image_gen.rs):
async fn handle_streaming_image_generation(...) -> Result<Response, StatusCode> {
// 1. Keep stream: true in backend request
// 2. Make streaming request via bytes_stream()
// 3. Forward SSE events through tokio channel
let (tx, rx) = tokio::sync::mpsc::unbounded_channel();
tokio::spawn(async move {
let mut stream = backend_response.bytes_stream();
while let Some(chunk) = stream.next().await {
// Parse SSE format (event:/data: lines)
// Forward events to client
for line in chunk_str.lines() {
if let Some(event_type) = line.strip_prefix("event:") { ... }
if let Some(data) = line.strip_prefix("data:") {
let event = Event::default().event(event_type).data(data);
tx.send(Ok(event));
}
}
}
});
Ok(Sse::new(UnboundedReceiverStream::new(rx)).into_response())
}
GPT Image Model Features¶
The router supports enhanced parameters for GPT Image models (gpt-image-1, gpt-image-1.5, gpt-image-1-mini):
| Parameter | Description | Values |
|---|---|---|
| output_format | Image file format | png, jpeg, webp |
| output_compression | Compression level | 0-100 (jpeg/webp only) |
| background | Transparency control | transparent, opaque, auto |
| quality | Generation quality | low, medium, high, auto |
| stream | Enable SSE streaming | true, false |
| partial_images | Preview count | 0-3 |
Model Support Matrix¶
| Feature | GPT Image 1.5 | GPT Image 1 | GPT Image 1 Mini | DALL-E 3 | DALL-E 2 | Nano Banana | Nano Banana Pro |
|---|---|---|---|---|---|---|---|
| Streaming | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| output_format | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| background | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Custom quality | ✅ | ✅ | ✅ | standard/hd | ❌ | ❌ | ❌ |
| Image Edit | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| Image Variations | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Max Resolution | 1536px | 1536px | 1536px | 1792px | 1024px | 1024px | 4096px |
Image Edit and Variations¶
The router provides OpenAI-compatible image editing and variations endpoints through /v1/images/edits and /v1/images/variations.
Image Editing (/v1/images/edits)¶
Endpoint: POST /v1/images/edits
Allows editing an existing image with a text prompt and optional mask. Supported by GPT Image models and DALL-E 2.
Request Format (multipart/form-data):
image: <file> # Original image (PNG, required)
prompt: <string> # Edit instructions (required)
mask: <file> # Optional mask image (PNG)
model: <string> # Model name (e.g., "gpt-image-1", "dall-e-2")
n: <integer> # Number of images (default: 1)
size: <string> # Output size (e.g., "1024x1024")
response_format: <string> # "url" or "b64_json"
Implementation (src/proxy/image_edit.rs):
- Multipart form parsing for image and mask files
- Image validation (format, size, aspect ratio)
- Model-specific parameter transformation
- Proper error handling for invalid inputs
Supported Features¶
- Transparent PNG mask support for targeted editing
- Multiple image generation (n parameter)
- Flexible output sizes
- Both URL and base64 response formats
Image Variations (/v1/images/variations)¶
Endpoint: POST /v1/images/variations
Creates variations of a given image. Supported by DALL-E 2 only.
Request Format (multipart/form-data):
image: <file> # Source image (PNG, required)
model: <string> # Model name (default: "dall-e-2")
n: <integer> # Number of variations (default: 1, max: 10)
size: <string> # Output size ("256x256", "512x512", "1024x1024")
response_format: <string> # "url" or "b64_json"
Implementation (src/proxy/image_edit.rs):
- Image file validation and preprocessing
- DALL-E 2-specific routing
- Error handling for unsupported models
- Consistent response formatting
Key Features¶
- Generate multiple variations in a single request
- Automatic image format validation
- Standard OpenAI response format compatibility
Image Utilities Module¶
The image_utils.rs module provides shared utilities for image processing:
Functions¶
- validate_image_format(): Validates PNG/JPEG format and dimensions
- parse_multipart_image_request(): Extracts images from multipart forms
- check_image_dimensions(): Validates size constraints
- format_image_error_response(): Standardized error responses
Validation Rules¶
- Maximum file size: 4MB (configurable)
- Supported formats: PNG (required for edits/variations), JPEG (generation only)
- Aspect ratio constraints per model
- Transparent PNG requirement for masks
This architecture provides a solid foundation for building a production-ready LLM router that can scale to handle thousands of requests while remaining maintainable and extensible. The clean separation of concerns makes it easy to add new features, swap implementations, and thoroughly test each component.