Circuit Breaker

The router implements the circuit breaker pattern to prevent cascading failures and provide automatic failover when backends become unhealthy.

Three-State Machine

┌─────────────────────────────────────────────────────────────────┐
│                     Circuit Breaker States                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    ┌──────────┐     failure_threshold     ┌──────────┐          │
│    │          │        exceeded           │          │          │
│    │  CLOSED  │ ───────────────────────▶  │   OPEN   │          │
│    │          │                           │          │          │
│    └──────────┘                           └──────────┘          │
│         ▲                                    │    ▲             │
│         │  half_open_success         timeout │    │ failure in  │
│         │    threshold met           expires │    │ half-open   │
│         │                                    ▼    │ (reopens)   │
│         │                                 ┌──────────┐          │
│         │                                 │          │          │
│         └──────────────────────────────── │ HALFOPEN │          │
│                                           │          │          │
│                                           └──────────┘          │
└─────────────────────────────────────────────────────────────────┘

State Descriptions

| State | Behavior | Transition Trigger |
|---|---|---|
| Closed | Normal operation. All requests pass through. Failures are counted. | Opens when failure_threshold is exceeded, or when failure_rate_threshold is exceeded (with minimum_requests met). |
| Open | Fast-fail mode. All requests to the backend are rejected immediately with 503 Service Unavailable. | Transitions to HalfOpen after timeout_seconds expires. |
| HalfOpen | Recovery testing. Limited requests allowed (half_open_max_requests). | Closes after half_open_success_threshold successes. Reopens on any failure. |
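
The transitions in the table can be modelled as a small state machine. The sketch below is a single-threaded illustration, not the router's implementation: the Breaker type and its method names are hypothetical, it covers only the consecutive-failure rule (the failure-rate rule is left out), and it assumes the circuit opens once the count reaches failure_threshold.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical single-threaded model of one backend's breaker.
pub struct Breaker {
    pub state: CircuitState,
    consecutive_failures: u32,
    half_open_successes: u32,
    opened_at: Option<Instant>,
    // The three values below mirror the configuration keys of the same name.
    failure_threshold: u32,
    half_open_success_threshold: u32,
    timeout: Duration,
}

impl Breaker {
    pub fn new(failure_threshold: u32, half_open_success_threshold: u32, timeout: Duration) -> Self {
        Breaker {
            state: CircuitState::Closed,
            consecutive_failures: 0,
            half_open_successes: 0,
            opened_at: None,
            failure_threshold,
            half_open_success_threshold,
            timeout,
        }
    }

    /// Open -> HalfOpen once the timeout has elapsed. Call before admitting a request.
    pub fn tick(&mut self, now: Instant) {
        if self.state == CircuitState::Open {
            if let Some(opened) = self.opened_at {
                if now.duration_since(opened) >= self.timeout {
                    self.state = CircuitState::HalfOpen;
                    self.half_open_successes = 0;
                }
            }
        }
    }

    pub fn on_success(&mut self) {
        match self.state {
            CircuitState::Closed => self.consecutive_failures = 0,
            CircuitState::HalfOpen => {
                self.half_open_successes += 1;
                // HalfOpen -> Closed after enough successful probes.
                if self.half_open_successes >= self.half_open_success_threshold {
                    self.state = CircuitState::Closed;
                    self.consecutive_failures = 0;
                }
            }
            CircuitState::Open => {} // requests are rejected while open
        }
    }

    pub fn on_failure(&mut self, now: Instant) {
        match self.state {
            CircuitState::Closed => {
                self.consecutive_failures += 1;
                // Assumption: ">=" rather than ">" decides when the threshold trips.
                if self.consecutive_failures >= self.failure_threshold {
                    self.trip(now);
                }
            }
            // Any failure while probing reopens the circuit.
            CircuitState::HalfOpen => self.trip(now),
            CircuitState::Open => {}
        }
    }

    fn trip(&mut self, now: Instant) {
        self.state = CircuitState::Open;
        self.opened_at = Some(now);
    }
}
```

Passing the current time in explicitly keeps the model deterministic, so the timeout transition can be exercised without sleeping.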

Configuration

circuit_breaker:
  enabled: true

  # Failure detection
  failure_threshold: 5          # Consecutive failures before opening
  failure_rate_threshold: 0.5   # 50% failure rate threshold
  minimum_requests: 10          # Min requests before rate calculation

  # Timing
  timeout_seconds: 60           # How long circuit stays open
  half_open_max_requests: 3     # Max concurrent requests in half-open
  half_open_success_threshold: 2 # Successes needed to close

  # What counts as failure
  failure_status_codes:         # HTTP status codes treated as failures
    - 500
    - 502
    - 503
    - 504

  # Per-backend overrides
  backends:
    openai-primary:
      failure_threshold: 10     # More tolerant for stable backend
      timeout_seconds: 30       # Faster recovery attempts
    local-llm:
      failure_threshold: 3      # Less tolerant for local service
      timeout_seconds: 120      # Longer wait before recovery
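
One way to read the configuration above is that a per-backend entry overrides only the keys it names and inherits the rest from the top-level values. The sketch below illustrates that reading together with the Closed -> Open decision. BreakerSettings, BackendOverride, effective_settings and should_open are hypothetical names, and both the inheritance behaviour and the exact comparison operators are assumptions rather than documented behaviour.

```rust
use std::collections::HashMap;

/// Top-level values; field names mirror the YAML keys above.
#[derive(Debug, Clone, PartialEq)]
pub struct BreakerSettings {
    pub failure_threshold: u32,
    pub failure_rate_threshold: f64,
    pub minimum_requests: u64,
    pub timeout_seconds: u64,
}

/// Per-backend override (hypothetical type): unset fields fall back to the defaults.
#[derive(Debug, Clone, Default)]
pub struct BackendOverride {
    pub failure_threshold: Option<u32>,
    pub timeout_seconds: Option<u64>,
}

/// Merge the defaults with the override for `backend`, if one exists.
pub fn effective_settings(
    defaults: &BreakerSettings,
    overrides: &HashMap<String, BackendOverride>,
    backend: &str,
) -> BreakerSettings {
    let mut settings = defaults.clone();
    if let Some(o) = overrides.get(backend) {
        if let Some(v) = o.failure_threshold {
            settings.failure_threshold = v;
        }
        if let Some(v) = o.timeout_seconds {
            settings.timeout_seconds = v;
        }
    }
    settings
}

/// Decide whether a closed circuit should open.
/// Assumption: the count trips at ">=" and the rate trips at ">".
pub fn should_open(s: &BreakerSettings, consecutive_failures: u32, failures: u64, total: u64) -> bool {
    if consecutive_failures >= s.failure_threshold {
        return true;
    }
    // The rate is only considered once enough requests have been observed.
    total >= s.minimum_requests && (failures as f64 / total as f64) > s.failure_rate_threshold
}
```

With the values shown above, openai-primary would resolve to failure_threshold 10 and timeout_seconds 30 while keeping minimum_requests 10 from the top level.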

Per-Backend Isolation

Each backend maintains its own independent circuit breaker state:

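use std::sync::Arc;
use std::sync::atomic::{AtomicU8, AtomicU32, AtomicU64};

use dashmap::DashMap;

// Shared breaker: one map entry per backend name
pub struct CircuitBreaker {
    states: Arc<DashMap<String, BackendCircuitState>>,
    config: CircuitBreakerConfig,
}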

// Each backend has independent state
pub struct BackendCircuitState {
    state: AtomicU8,                  // 0=Closed, 1=Open, 2=HalfOpen
    failure_count: AtomicU32,
    success_count: AtomicU32,
    total_requests: AtomicU64,
    last_failure_time: AtomicU64,
    last_state_change: AtomicU64,
    half_open_requests: AtomicU32,    // Current requests in half-open
    consecutive_successes: AtomicU32,
}
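
Because the state field is an AtomicU8, reads and transitions need a mapping between the raw byte and the three states. The following is a minimal sketch of how that encoding could look, using the 0/1/2 values from the comment above. StateCell, from_u8 and transition are hypothetical names, and treating an unknown byte as Closed is an assumption.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CircuitState {
    Closed = 0,
    Open = 1,
    HalfOpen = 2,
}

impl CircuitState {
    fn from_u8(raw: u8) -> CircuitState {
        match raw {
            1 => CircuitState::Open,
            2 => CircuitState::HalfOpen,
            // Assumption: any other value is treated as Closed.
            _ => CircuitState::Closed,
        }
    }
}

/// Hypothetical wrapper around the atomic state byte.
pub struct StateCell {
    state: AtomicU8,
}

impl StateCell {
    pub fn new() -> Self {
        StateCell { state: AtomicU8::new(CircuitState::Closed as u8) }
    }

    /// Read the current state without taking a lock.
    pub fn get_state(&self) -> CircuitState {
        CircuitState::from_u8(self.state.load(Ordering::Acquire))
    }

    /// Attempt `from -> to`. Returns true only for the caller that performed
    /// the transition, so concurrent callers cannot both "win".
    pub fn transition(&self, from: CircuitState, to: CircuitState) -> bool {
        self.state
            .compare_exchange(from as u8, to as u8, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

Using compare-and-exchange for transitions means that when several requests observe an expired timeout at once, only one of them moves the circuit from Open to HalfOpen.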

Admin Endpoints

Manual control of circuit breakers is available through admin endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /admin/circuit/all | GET | List all circuit breaker states |
| /admin/circuit/:backend/status | GET | Get specific backend status |
| /admin/circuit/:backend/open | POST | Force circuit open |
| /admin/circuit/:backend/close | POST | Force circuit closed |
| /admin/circuit/:backend/reset | POST | Reset circuit to initial state |

Example Response (GET /admin/circuit/openai-primary/status):

{
  "backend": "openai-primary",
  "state": "closed",
  "failure_count": 2,
  "success_count": 1547,
  "total_requests": 1549,
  "failure_rate": 0.0013,
  "last_failure_time": "2024-01-15T10:30:00Z",
  "last_state_change": "2024-01-15T08:00:00Z",
  "half_open_requests": 0,
  "consecutive_successes": 1547
}

Prometheus Metrics

# Current state of each circuit (0=Closed, 1=Open, 2=HalfOpen)
circuit_breaker_state{backend="openai-primary"} 0

# Total state transitions
circuit_breaker_transitions_total{backend="openai-primary", from="closed", to="open"} 3
circuit_breaker_transitions_total{backend="openai-primary", from="open", to="half_open"} 3
circuit_breaker_transitions_total{backend="openai-primary", from="half_open", to="closed"} 3

# Recorded outcomes
circuit_breaker_successes_total{backend="openai-primary"} 15470
circuit_breaker_failures_total{backend="openai-primary"} 12

Integration with Backend Selection

impl BackendManager {
    pub async fn select_backend(&self, model: Option<&str>) -> CoreResult<Arc<dyn Backend>> {
        let backends = self.pool.list_backends().await;

        // Filter by model support
        let compatible = backends.iter()
            .filter(|b| model.is_none_or(|m| b.supports_model(m)));

        // Filter by circuit breaker state (skip backends with open circuits)
        let available: Vec<_> = compatible
            .filter(|b| !self.circuit_breaker.is_open(b.name()))
            .collect();

        if available.is_empty() {
            return Err(CoreError::ServiceUnavailable {
                message: "All backends unavailable (circuit breakers open)".into(),
                // 60s equals the default timeout_seconds shown under Configuration
                retry_after: Some(Duration::from_secs(60)),
            });
        }

        // Apply load balancing strategy to available backends
        self.apply_strategy(&available, model).await
    }
}

Fallback Strategies

When a circuit opens, the system can apply different fallback strategies:

pub enum FallbackStrategy {
    /// Use next available backend with same model support
    NextAvailable,
    /// Use a specific backup backend
    SpecificBackend(String),
    /// Return cached response if available
    CachedResponse,
    /// Use degraded service (e.g., smaller/faster model)
    DegradedService(String),
    /// Fail immediately with error
    FailFast,
}
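
How each variant is acted on is not specified here. As a rough illustration only, the sketch below maps a strategy to a routing decision. FallbackAction and resolve_fallback are hypothetical, the enum is repeated so the block compiles on its own, and the reading of DegradedService's string as a model name is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum FallbackStrategy {
    NextAvailable,
    SpecificBackend(String),
    CachedResponse,
    DegradedService(String),
    FailFast,
}

/// Hypothetical outcome of applying a strategy.
#[derive(Debug, PartialEq)]
pub enum FallbackAction {
    RouteTo(String),
    ServeCached,
    RouteWithModel(String),
    Reject,
}

/// `healthy` lists backends whose circuits are not open, in preference order.
pub fn resolve_fallback(
    strategy: &FallbackStrategy,
    healthy: &[&str],
    cache_hit: bool,
) -> FallbackAction {
    match strategy {
        // Take the first healthy backend, or reject if there is none.
        FallbackStrategy::NextAvailable => healthy
            .first()
            .map(|b| FallbackAction::RouteTo(b.to_string()))
            .unwrap_or(FallbackAction::Reject),
        // Only use the named backup if its own circuit is not open.
        FallbackStrategy::SpecificBackend(name) => {
            if healthy.iter().any(|b| *b == name.as_str()) {
                FallbackAction::RouteTo(name.clone())
            } else {
                FallbackAction::Reject
            }
        }
        FallbackStrategy::CachedResponse if cache_hit => FallbackAction::ServeCached,
        FallbackStrategy::CachedResponse => FallbackAction::Reject,
        // Assumption: the payload names the smaller/faster model to use.
        FallbackStrategy::DegradedService(model) => FallbackAction::RouteWithModel(model.clone()),
        FallbackStrategy::FailFast => FallbackAction::Reject,
    }
}
```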

Performance Considerations

The circuit breaker is designed for minimal overhead in the hot path:

  1. Atomic Operations: All state checks use lock-free atomic operations
  2. DashMap: Concurrent hashmap for per-backend state without global locks
  3. Lazy Initialization: Backend states created on first access
  4. CAS Loop for Half-Open: Compare-and-swap prevents race conditions in request limiting

// Hot path: check whether the circuit is open (atomic state read, no global lock;
// DashMap locks only the shard that holds this backend's entry)
pub fn is_open(&self, backend: &str) -> bool {
    if let Some(state) = self.states.get(backend) {
        matches!(state.get_state(), CircuitState::Open)
    } else {
        false  // No state = circuit is closed
    }
}
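
The CAS loop named in the list above can be sketched as follows. This is an illustration of the technique under stated assumptions, not the router's code: HalfOpenGate, try_acquire and release are hypothetical names, and max_requests stands in for half_open_max_requests.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical admission gate for probe requests in the half-open state.
pub struct HalfOpenGate {
    in_flight: AtomicU32,
    max_requests: u32,
}

impl HalfOpenGate {
    pub fn new(max_requests: u32) -> Self {
        HalfOpenGate { in_flight: AtomicU32::new(0), max_requests }
    }

    /// Try to reserve one probe slot. Returns false when the limit is reached.
    pub fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.max_requests {
                return false;
            }
            // Increment only if nobody changed the counter since we read it.
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Lost the race (or spurious failure): retry with the observed value.
                Err(observed) => current = observed,
            }
        }
    }

    /// Release a slot. Must only be called after a successful try_acquire.
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

A plain load followed by a separate increment would let two threads both see a free slot and exceed the limit. Re-checking the limit on every retry of the loop is what keeps the number of admitted requests at or below the cap.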

Error Response Format

When a request is rejected due to an open circuit:

{
  "error": {
    "message": "Service temporarily unavailable due to circuit breaker",
    "type": "circuit_breaker_open",
    "code": 503,
    "details": {
      "backend": "openai-primary",
      "circuit_state": "open",
      "retry_after": 60,
      "alternative_backends": ["openai-secondary", "azure-openai"]
    }
  }
}

Best Practices

  1. Tune thresholds per backend: Stable backends can have higher thresholds; flaky services need lower ones
  2. Monitor state transitions: Alert on frequent open/close cycles (circuit flapping)
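  3. Set appropriate timeouts: A short timeout_seconds speeds recovery but risks probing a backend that is still overloaded; a long one delays recovery after the backend is healthy again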
  4. Use admin endpoints sparingly: Manual overrides bypass automatic protection
  5. Combine with health checks: Circuit breaker complements but doesn't replace health monitoring