Circuit Breaker
The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.
Three-State Machine
┌─────────────────────────────────────────────────────────────────┐
│                     Circuit Breaker States                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────┐   failure_threshold   ┌──────────┐               │
│   │          │       exceeded        │          │               │
│   │  CLOSED  │ ────────────────────▶ │   OPEN   │               │
│   │          │                       │          │               │
│   └──────────┘                       └──────────┘               │
│        ▲                                │    ▲                  │
│        │ half_open_success      timeout │    │ failure          │
│        │ threshold met          expires │    │ in half-open     │
│        │                                ▼    │ reopens circuit  │
│        │                             ┌──────────┐               │
│        │                             │          │               │
│        └──────────────────────────── │ HALFOPEN │               │
│                                      │          │               │
│                                      └──────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
State Descriptions
| State | Behavior | Transition Trigger |
|---|---|---|
| Closed | Normal operation. All requests pass through, and failures are counted. | Opens when failure_threshold is exceeded, or when failure_rate_threshold is exceeded after at least minimum_requests requests have been recorded. |
| Open | Fast-fail mode. All requests are rejected immediately with 503 Service Unavailable. | Transitions to HalfOpen after timeout_seconds expires. |
| HalfOpen | Recovery testing. At most half_open_max_requests concurrent requests are allowed through. | Closes after half_open_success_threshold successes. Reopens on any failure. |
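The transitions in the table can be sketched as a small single-threaded model. This is an illustrative sketch only, not the router's implementation: the `Breaker` type and its `poll`, `on_success`, and `on_failure` methods are hypothetical names, only the consecutive-failure trigger is modeled, and it assumes the circuit opens once the count reaches `failure_threshold` and that a success in the closed state resets the count.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical single-threaded model of one backend's breaker.
pub struct Breaker {
    pub state: CircuitState,
    failures: u32,
    half_open_successes: u32,
    opened_at: Option<Instant>,
    failure_threshold: u32,
    half_open_success_threshold: u32,
    timeout: Duration,
}

impl Breaker {
    pub fn new(failure_threshold: u32, half_open_success_threshold: u32, timeout: Duration) -> Self {
        Self {
            state: CircuitState::Closed,
            failures: 0,
            half_open_successes: 0,
            opened_at: None,
            failure_threshold,
            half_open_success_threshold,
            timeout,
        }
    }

    /// Open -> HalfOpen once the timeout has elapsed.
    pub fn poll(&mut self, now: Instant) {
        if self.state == CircuitState::Open {
            if let Some(opened) = self.opened_at {
                if now.duration_since(opened) >= self.timeout {
                    self.state = CircuitState::HalfOpen;
                    self.half_open_successes = 0;
                }
            }
        }
    }

    pub fn on_success(&mut self) {
        match self.state {
            // Assumption: a success breaks the run of consecutive failures.
            CircuitState::Closed => self.failures = 0,
            // HalfOpen -> Closed after enough successful probes.
            CircuitState::HalfOpen => {
                self.half_open_successes += 1;
                if self.half_open_successes >= self.half_open_success_threshold {
                    self.state = CircuitState::Closed;
                    self.failures = 0;
                }
            }
            CircuitState::Open => {}
        }
    }

    pub fn on_failure(&mut self, now: Instant) {
        match self.state {
            // Closed -> Open once the threshold is reached (assumed semantics).
            CircuitState::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.open(now);
                }
            }
            // Any failure while probing reopens the circuit.
            CircuitState::HalfOpen => self.open(now),
            CircuitState::Open => {}
        }
    }

    fn open(&mut self, now: Instant) {
        self.state = CircuitState::Open;
        self.opened_at = Some(now);
    }
}
```

Passing `now` in explicitly keeps the model deterministic, which makes the timeout transition easy to exercise in tests without sleeping.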
Configuration
circuit_breaker:
  enabled: true
  # Failure detection
  failure_threshold: 5            # Consecutive failures before opening
  failure_rate_threshold: 0.5     # 50% failure rate threshold
  minimum_requests: 10            # Min requests before rate calculation
  # Timing
  timeout_seconds: 60             # How long circuit stays open
  half_open_max_requests: 3       # Max concurrent requests in half-open
  half_open_success_threshold: 2  # Successes needed to close
  # What counts as failure
  failure_status_codes:           # HTTP status codes treated as failures
    - 500
    - 502
    - 503
    - 504
  # Per-backend overrides
  backends:
    openai-primary:
      failure_threshold: 10       # More tolerant for stable backend
      timeout_seconds: 30         # Faster recovery attempts
    local-llm:
      failure_threshold: 3        # Less tolerant for local service
      timeout_seconds: 120        # Longer wait before recovery
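As a rough illustration of how these settings might be consumed, the sketch below resolves per-backend overrides against the global values and evaluates the failure-rate trigger. The `Settings` and `Override` types and the `effective` and `rate_exceeded` functions are hypothetical names, and the strict "greater than" comparison is an assumed reading of "threshold exceeded".

```rust
use std::collections::HashMap;

/// Hypothetical subset of the global circuit breaker settings.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Settings {
    pub failure_threshold: u32,
    pub timeout_seconds: u64,
}

/// Hypothetical per-backend override; `None` means "inherit the global value".
#[derive(Debug, Clone, Copy, Default)]
pub struct Override {
    pub failure_threshold: Option<u32>,
    pub timeout_seconds: Option<u64>,
}

/// Merge the override for `backend`, if any, over the global settings.
pub fn effective(global: Settings, overrides: &HashMap<String, Override>, backend: &str) -> Settings {
    let o = overrides.get(backend).copied().unwrap_or_default();
    Settings {
        failure_threshold: o.failure_threshold.unwrap_or(global.failure_threshold),
        timeout_seconds: o.timeout_seconds.unwrap_or(global.timeout_seconds),
    }
}

/// Failure-rate trigger: only evaluated once `minimum_requests` have been seen.
pub fn rate_exceeded(failures: u64, total: u64, minimum_requests: u64, threshold: f64) -> bool {
    if total == 0 || total < minimum_requests {
        return false;
    }
    (failures as f64) / (total as f64) > threshold
}
```

With the example configuration, `local-llm` would resolve to a threshold of 3 and a 120-second timeout, while any backend without an override keeps the global values.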
Per-Backend Isolation
Each backend maintains its own independent circuit breaker state:
use std::sync::Arc;
use std::sync::atomic::{AtomicU8, AtomicU32, AtomicU64};
use dashmap::DashMap;

pub struct CircuitBreaker {
    states: Arc<DashMap<String, BackendCircuitState>>, // keyed by backend name
    config: CircuitBreakerConfig,
}
// Each backend has independent state
pub struct BackendCircuitState {
    state: AtomicU8,               // 0=Closed, 1=Open, 2=HalfOpen
    failure_count: AtomicU32,
    success_count: AtomicU32,
    total_requests: AtomicU64,
    last_failure_time: AtomicU64,
    last_state_change: AtomicU64,
    half_open_requests: AtomicU32, // Current requests in half-open
    consecutive_successes: AtomicU32,
}
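The `AtomicU8` encoding (0=Closed, 1=Open, 2=HalfOpen) could be decoded and updated along the following lines. This is a sketch under assumptions: `StateCell`, `from_u8`, and `transition` are hypothetical names, and the memory orderings are one reasonable choice rather than the router's documented behavior.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CircuitState {
    Closed = 0,
    Open = 1,
    HalfOpen = 2,
}

impl CircuitState {
    /// Decode the stored byte; unknown values fall back to Closed (assumption).
    fn from_u8(v: u8) -> CircuitState {
        match v {
            1 => CircuitState::Open,
            2 => CircuitState::HalfOpen,
            _ => CircuitState::Closed,
        }
    }
}

/// Hypothetical wrapper around the `state` field of a per-backend entry.
pub struct StateCell {
    state: AtomicU8,
}

impl StateCell {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(CircuitState::Closed as u8) }
    }

    /// Single atomic load, suitable for the request hot path.
    pub fn get_state(&self) -> CircuitState {
        CircuitState::from_u8(self.state.load(Ordering::Acquire))
    }

    /// Attempt `from -> to`; returns true only for the caller that performed it,
    /// so concurrent failures cannot record the same transition twice.
    pub fn transition(&self, from: CircuitState, to: CircuitState) -> bool {
        self.state
            .compare_exchange(from as u8, to as u8, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

Using `compare_exchange` for transitions means that when several requests fail at once, exactly one of them observes a successful Closed to Open change, which is convenient for counting transitions.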
Admin Endpoints
Manual control of circuit breakers is available through admin endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /admin/circuit/all | GET | List all circuit breaker states |
| /admin/circuit/:backend/status | GET | Get specific backend status |
| /admin/circuit/:backend/open | POST | Force circuit open |
| /admin/circuit/:backend/close | POST | Force circuit closed |
| /admin/circuit/:backend/reset | POST | Reset circuit to initial state |
Example Response (GET /admin/circuit/openai-primary/status):
{
  "backend": "openai-primary",
  "state": "closed",
  "failure_count": 2,
  "success_count": 1547,
  "total_requests": 1549,
  "failure_rate": 0.0013,
  "last_failure_time": "2024-01-15T10:30:00Z",
  "last_state_change": "2024-01-15T08:00:00Z",
  "half_open_requests": 0,
  "consecutive_successes": 1547
}
Prometheus Metrics
# Current state of each circuit (0=Closed, 1=Open, 2=HalfOpen)
circuit_breaker_state{backend="openai-primary"} 0
# Total state transitions
circuit_breaker_transitions_total{backend="openai-primary", from="closed", to="open"} 3
circuit_breaker_transitions_total{backend="openai-primary", from="open", to="half_open"} 3
circuit_breaker_transitions_total{backend="openai-primary", from="half_open", to="closed"} 3
# Recorded outcomes
circuit_breaker_successes_total{backend="openai-primary"} 15470
circuit_breaker_failures_total{backend="openai-primary"} 12
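For illustration, a state gauge sample like the first line above could be rendered in the Prometheus text exposition format as follows. The `state_gauge_line` and `escape_label` helpers are hypothetical and use only the standard library; a real deployment would more likely rely on a metrics crate.

```rust
/// Escape a label value for the Prometheus text exposition format
/// (backslash, double quote, and newline must be escaped).
fn escape_label(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Render one `circuit_breaker_state` sample (0=Closed, 1=Open, 2=HalfOpen).
pub fn state_gauge_line(backend: &str, state: u8) -> String {
    format!(
        "circuit_breaker_state{{backend=\"{}\"}} {}",
        escape_label(backend),
        state
    )
}
```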
Integration with Backend Selection
impl BackendManager {
    pub async fn select_backend(&self, model: Option<&str>) -> CoreResult<Arc<dyn Backend>> {
        let backends = self.pool.list_backends().await;
        // Filter by model support (Option::is_none_or requires Rust 1.82 or later)
        let compatible = backends.iter()
            .filter(|b| model.is_none_or(|m| b.supports_model(m)));
        // Filter by circuit breaker state (skip backends with open circuits)
        let available: Vec<_> = compatible
            .filter(|b| !self.circuit_breaker.is_open(b.name()))
            .collect();
        if available.is_empty() {
            return Err(CoreError::ServiceUnavailable {
                message: "All backends unavailable (circuit breakers open)".into(),
                // 60 s matches timeout_seconds in the example configuration
                retry_after: Some(Duration::from_secs(60)),
            });
        }
        // Apply load balancing strategy to available backends
        self.apply_strategy(&available, model).await
    }
}
Fallback Strategies
When a circuit opens, the system can apply different fallback strategies:
pub enum FallbackStrategy {
    /// Use next available backend with same model support
    NextAvailable,
    /// Use a specific backup backend
    SpecificBackend(String),
    /// Return cached response if available
    CachedResponse,
    /// Use degraded service (e.g., smaller/faster model)
    DegradedService(String),
    /// Fail immediately with error
    FailFast,
}
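One way such a strategy might be turned into a routing decision is sketched below. The `FallbackAction` type and the `resolve` function are hypothetical, the enum is repeated only to keep the example self-contained, and treating the `DegradedService` payload as a model name is an assumption based on its doc comment.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FallbackStrategy {
    NextAvailable,
    SpecificBackend(String),
    CachedResponse,
    DegradedService(String),
    FailFast,
}

/// Hypothetical outcome of applying a fallback strategy.
#[derive(Debug, PartialEq, Eq)]
pub enum FallbackAction {
    RouteTo(String),
    ServeCached,
    RouteToModel(String),
    Reject,
}

/// `healthy` lists backends whose circuits are not open, in preference order.
pub fn resolve(strategy: &FallbackStrategy, healthy: &[&str], cache_hit: bool) -> FallbackAction {
    match strategy {
        FallbackStrategy::NextAvailable => match healthy.first() {
            Some(name) => FallbackAction::RouteTo((*name).to_string()),
            None => FallbackAction::Reject,
        },
        // Only use the named backup if its own circuit is not open.
        FallbackStrategy::SpecificBackend(name) => {
            if healthy.contains(&name.as_str()) {
                FallbackAction::RouteTo(name.clone())
            } else {
                FallbackAction::Reject
            }
        }
        FallbackStrategy::CachedResponse => {
            if cache_hit {
                FallbackAction::ServeCached
            } else {
                FallbackAction::Reject
            }
        }
        // Assumption: the payload names the smaller or faster model to use.
        FallbackStrategy::DegradedService(model) => FallbackAction::RouteToModel(model.clone()),
        FallbackStrategy::FailFast => FallbackAction::Reject,
    }
}
```

Keeping the resolution a pure function of the strategy and the current set of healthy backends makes each branch straightforward to test in isolation.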
Performance Considerations
The circuit breaker is designed for minimal overhead in the hot path:
- Atomic Operations: All state checks use lock-free atomic operations
- DashMap: Concurrent hashmap for per-backend state without global locks
- Lazy Initialization: Backend states created on first access
- CAS Loop for Half-Open: Compare-and-swap prevents race conditions in request limiting
// Hot path: checking whether the circuit is open. The state read is a single
// atomic load; DashMap::get takes only a per-shard read lock, never a global one.
pub fn is_open(&self, backend: &str) -> bool {
    if let Some(state) = self.states.get(backend) {
        matches!(state.get_state(), CircuitState::Open)
    } else {
        false // No state = circuit is closed
    }
}
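The compare-and-swap loop mentioned in the list above, used to cap concurrent half-open probes, might look like the following sketch. `HalfOpenGate`, `try_acquire`, and `release` are hypothetical names; the limit corresponds to `half_open_max_requests`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical admission gate for requests while a circuit is half-open.
pub struct HalfOpenGate {
    in_flight: AtomicU32,
    max: u32,
}

impl HalfOpenGate {
    pub fn new(max: u32) -> Self {
        Self { in_flight: AtomicU32::new(0), max }
    }

    /// Try to reserve a probe slot. The CAS loop retries when another thread
    /// changed the counter between our read and our write, so the count can
    /// never exceed `max` even under contention.
    pub fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.max {
                return false;
            }
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed,
            }
        }
    }

    /// Release a slot; must be called exactly once per successful `try_acquire`.
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

A plain `fetch_add` followed by a check would briefly let the counter overshoot the limit; the CAS loop checks and increments in a single atomic step instead.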
Error Response Format
When a request is rejected due to an open circuit:
{
  "error": {
    "message": "Service temporarily unavailable due to circuit breaker",
    "type": "circuit_breaker_open",
    "code": 503,
    "details": {
      "backend": "openai-primary",
      "circuit_state": "open",
      "retry_after": 60,
      "alternative_backends": ["openai-secondary", "azure-openai"]
    }
  }
}
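The `retry_after` value shown here equals the example `timeout_seconds`. If a deployment preferred to report the time remaining until the circuit becomes half-open, it could be computed as in this sketch; that behavior is an assumption for illustration, not something the router is documented to do, and `retry_after_secs` is a hypothetical helper.

```rust
use std::time::Duration;

/// Seconds until the open circuit is due to become half-open, rounded up so
/// that clients never retry early. Returns 0 once the timeout has elapsed.
pub fn retry_after_secs(timeout: Duration, elapsed_open: Duration) -> u64 {
    let remaining = timeout.saturating_sub(elapsed_open);
    let secs = remaining.as_secs();
    if remaining.subsec_nanos() > 0 {
        secs + 1
    } else {
        secs
    }
}
```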
Best Practices
- Tune thresholds per backend: Stable backends can have higher thresholds; flaky services need lower ones
- Monitor state transitions: Alert on frequent open/close cycles (circuit flapping)
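- Set appropriate timeouts: A short timeout_seconds probes for recovery sooner but risks sending traffic to a backend that is still overloaded; a long one protects the backend but delays recovery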
- Use admin endpoints sparingly: Manual overrides bypass automatic protection
- Combine with health checks: Circuit breaker complements but doesn't replace health monitoring
Related Documentation
- Model Fallback - Fallback chain configuration
- Rate Limiting - Request rate limiting
- Architecture Overview - Main architecture guide