Load Balancing Guide¶
Continuum Router provides advanced load balancing capabilities to distribute requests across multiple backends efficiently. The router supports six different strategies, each optimized for specific use cases.
Table of Contents¶
- Available Strategies
- Configuration Examples
- Dynamic Strategy Switching
- Monitoring Load Distribution
- Advanced Configuration
- Best Practices
- Strategy Selection Guide
Available Strategies¶
1. Round-Robin (Default)¶
- Description: Distributes requests evenly across all healthy backends in sequential order
- Use Case: General-purpose load distribution when all backends have similar capabilities
- Behavior: Each backend gets exactly one request before the cycle repeats
- Weight Support: No (weights are ignored)
- State Management: Minimal (only tracks current backend index)
Pros:
- Simple and predictable
- Fair distribution
- Low overhead
- No configuration required

Cons:
- Ignores backend performance differences
- Doesn't account for request complexity
- May overload slower backends
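The sequential cycle described above can be captured in a minimal sketch (illustrative only, not the router's actual implementation; backend names are placeholders):

```python
class RoundRobin:
    """Minimal round-robin selector: the only state is the current index."""

    def __init__(self, backends):
        self.backends = backends
        self._idx = 0

    def select(self):
        backend = self.backends[self._idx % len(self.backends)]
        self._idx += 1
        return backend

rr = RoundRobin(["backend-1", "backend-2", "backend-3"])
picks = [rr.select() for _ in range(6)]
# Each backend gets exactly one request before the cycle repeats.
```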
2. Weighted Round-Robin¶
```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # Gets 3x more traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
```
- Description: Distributes requests proportionally based on backend weights
- Use Case: When backends have different performance characteristics or capacity
- Behavior: Backends with higher weights receive proportionally more requests
- Weight Support: Yes (required)
- Weight Range: 1-100 (recommended)
Pros:
- Respects backend capabilities
- Flexible traffic distribution
- Easy to adjust load ratios

Cons:
- Requires manual weight tuning
- Static weights don't adapt to real-time conditions
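One simple way to realize the proportional distribution above is to expand each backend into `weight` slots and cycle through them (a sketch, not the router's actual algorithm; production implementations often use smooth weighted round-robin instead to avoid bursts):

```python
from collections import Counter

class WeightedRoundRobin:
    """Sketch: expand each backend into `weight` slots, then cycle."""

    def __init__(self, backends):
        # backends: list of (name, weight) pairs
        self._slots = [name for name, w in backends for _ in range(w)]
        self._idx = 0

    def select(self):
        backend = self._slots[self._idx % len(self._slots)]
        self._idx += 1
        return backend

wrr = WeightedRoundRobin([("high-performance", 3), ("standard", 1)])
picks = [wrr.select() for _ in range(8)]
# With weights 3:1, high-performance receives 3x the traffic of standard.
assert Counter(picks) == Counter({"high-performance": 6, "standard": 2})
```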
3. Least-Latency¶
- Description: Routes requests to the backend with the lowest average response time
- Use Case: Optimizing for response speed in production environments
- Behavior: Continuously tracks response times and adapts routing accordingly
- Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)
Pros:
- Automatically optimizes for performance
- Adapts to real-time conditions
- No manual tuning required

Cons:
- Needs warm-up period
- May concentrate load on fastest backend
- Sensitive to temporary latency spikes
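The warm-up fallback described above can be sketched as follows (illustrative; backend names, the sample threshold, and the plain-average smoothing are assumptions, not the router's exact behavior):

```python
class LeastLatency:
    """Sketch: route to the backend with the lowest average latency;
    fall back to round-robin until every backend has min_samples
    observations (the warm-up period)."""

    def __init__(self, backends, min_samples=10):
        self.samples = {b: [] for b in backends}
        self.min_samples = min_samples
        self._rr = 0

    def record(self, backend, latency_ms):
        self.samples[backend].append(latency_ms)

    def select(self):
        if any(len(s) < self.min_samples for s in self.samples.values()):
            names = list(self.samples)  # warm-up: plain round-robin
            backend = names[self._rr % len(names)]
            self._rr += 1
            return backend
        return min(self.samples,
                   key=lambda b: sum(self.samples[b]) / len(self.samples[b]))

ll = LeastLatency(["local-gpu", "remote-gpu"], min_samples=2)
first, second = ll.select(), ll.select()   # warm-up: round-robins
for ms in (120, 130):
    ll.record("local-gpu", ms)
for ms in (340, 360):
    ll.record("remote-gpu", ms)
# After warm-up, the lowest-average backend wins.
```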
4. Random¶
- Description: Randomly selects a healthy backend for each request
- Use Case: Simple load distribution without state tracking
- Behavior: Each request has equal probability of going to any healthy backend
- Advantages: No state management overhead, good for stateless workloads
Pros:
- No state management overhead
- Simple implementation
- Good for stateless workloads
- Natural load distribution

Cons:
- Less predictable
- May have uneven distribution in short term
- No performance optimization
5. Consistent-Hash¶
- Description: Uses consistent hashing to ensure the same model requests go to the same backend
- Use Case: When you need session affinity or want to maximize cache efficiency
- Behavior: Hashes the model name to consistently select the same backend
- Benefits: Improves model caching, reduces model loading overhead
Pros:
- Maximizes cache efficiency
- Reduces model loading overhead
- Predictable routing
- Good for stateful models

Cons:
- May cause load imbalance
- Less flexible for dynamic scaling
- Model-specific routing only
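The mechanism can be sketched as a hash ring with virtual nodes, where the model name deterministically picks the backend (a minimal illustration; the router's actual hash function and virtual-node count may differ):

```python
import hashlib
from bisect import bisect

class ConsistentHash:
    """Sketch: hash ring with virtual nodes; the model name selects
    the first ring position at or after its own hash."""

    def __init__(self, backends, virtual_nodes=100):
        self._ring = sorted(
            (self._hash(f"{b}#{v}"), b)
            for b in backends for v in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def select(self, model):
        i = bisect(self._keys, self._hash(model)) % len(self._ring)
        return self._ring[i][1]

ch = ConsistentHash(["backend-1", "backend-2", "backend-3"])
# The same model always routes to the same backend.
assert ch.select("llama3") == ch.select("llama3")
```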
6. Prefix-Aware Hash (KV Cache Optimized)¶
```yaml
selection_strategy: PrefixAwareHash
prefix_routing:
  enabled: true
  max_prefix_length: 1024
  load_factor_epsilon: 0.25
  virtual_nodes: 150
  anthropic_cache_control_injection: true
```
- Description: Routes requests sharing the same prompt prefix to the same backend, maximizing KV cache reuse on inference engines (vLLM, SGLang, TensorRT-LLM). Uses Consistent Hashing with Bounded Loads (CHWBL) to prevent hotspots.
- Use Case: Multi-backend vLLM/SGLang deployments where KV cache reuse is critical for latency
- Behavior: Extracts a prefix key (SHA256) from the system prompt or first user message, then uses the hash to consistently route to the same backend. CHWBL caps per-backend load at `ceil(avg_load * (1 + epsilon))`, overflowing to the next ring node when a backend is overloaded.
- Fallback: Falls back to model-based `ConsistentHash` when no prefix key is available (e.g., non-chat requests).
Pros:
- 40-60% TTFT reduction with KV cache hits
- 40%+ throughput improvement for shared-prefix workloads
- Automatic hotspot prevention via CHWBL
- Composable with KV cache index scoring (Tier 4)
Cons:
- Requires backends with KV cache support (vLLM, SGLang)
- Less benefit for unique/random prompts
- Additional configuration required
Prefix Key Extraction¶
The router extracts a prefix key from chat completion requests:
- With system prompt: `SHA256(model + "\0" + "S" + system_prompt[:max_prefix_length])`
- Without system prompt: `SHA256(model + "\0" + "M" + first_user_message[:max_prefix_length])`
This ensures that requests with the same system prompt are routed to the same backend, while the model name prevents cross-model collisions.
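The extraction rule can be sketched directly from the two formulas above (illustrative; the function name and message shape are assumptions, and the router's byte-level encoding may differ):

```python
import hashlib

MAX_PREFIX_LENGTH = 1024  # matches max_prefix_length in the config above

def prefix_key(model, messages):
    """Prefer the system prompt ("S" tag); otherwise use the first
    user message ("M" tag). Returns None when no key is available,
    which is the fallback-to-ConsistentHash case."""
    system = next((m["content"] for m in messages if m["role"] == "system"), None)
    if system is not None:
        payload = f"{model}\0S{system[:MAX_PREFIX_LENGTH]}"
    else:
        user = next((m["content"] for m in messages if m["role"] == "user"), None)
        if user is None:
            return None
        payload = f"{model}\0M{user[:MAX_PREFIX_LENGTH]}"
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the model name is part of the payload, identical system prompts on different models hash to different keys.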
CHWBL Load Balancing¶
When the preferred backend's in-flight request count exceeds the load cap, the router walks clockwise around the hash ring to find the next eligible backend.
With the default epsilon = 0.25, a backend can handle up to 25% more than the average load before overflow. Lower epsilon values provide stricter balance; higher values preserve more prefix affinity.
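The cap and overflow behavior can be sketched as follows (a minimal illustration of CHWBL using in-flight counts; the ring order and counts are made up for the example):

```python
import math

def load_cap(in_flight, epsilon=0.25):
    """Per-backend cap: ceil(avg_load * (1 + epsilon))."""
    avg = sum(in_flight.values()) / len(in_flight)
    return math.ceil(avg * (1 + epsilon))

def select_with_bounded_load(ring_order, in_flight, epsilon=0.25):
    """Walk the ring starting at the hash-preferred backend (first
    entry) and return the first backend under the cap."""
    cap = load_cap(in_flight, epsilon)
    for backend in ring_order:
        if in_flight[backend] < cap:
            return backend
    return ring_order[0]  # everyone at cap: keep prefix affinity

in_flight = {"backend-1": 10, "backend-2": 4, "backend-3": 4}
# avg = 6, cap = ceil(6 * 1.25) = 8, so the preferred but overloaded
# backend-1 (10 in flight) overflows to the next ring node.
assert select_with_bounded_load(["backend-1", "backend-2", "backend-3"], in_flight) == "backend-2"
```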
KV Cache Index Integration (Tier 4)¶
When a KV cache index is available, the router can use real-time cache state data to make even more precise routing decisions. The KvOverlapScorer runs before the PrefixAwareHash strategy and selects a backend based on actual cached token overlap:
```
final_score = overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score
```
If the best score exceeds the minimum threshold (default: 0.3), the scorer selects that backend directly. Otherwise, the PrefixAwareHash strategy takes over. See KV Cache Architecture for details.
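The scoring step can be sketched directly from the formula (the weight values below are illustrative assumptions; only the 0.3 threshold is documented here):

```python
def kv_overlap_score(overlap_score, load_ratio, health_score,
                     overlap_weight=0.6, load_weight=0.25, health_weight=0.15):
    """Combine cache overlap, inverse load, and health into one score.
    The default weights here are hypothetical, not the router's."""
    return (overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score)

MIN_SCORE = 0.3  # documented default threshold

score = kv_overlap_score(overlap_score=0.8, load_ratio=0.5, health_score=1.0)
# score = 0.6*0.8 + 0.25*0.5 + 0.15*1.0 = 0.755 > 0.3,
# so the scorer would select this backend directly.
```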
Configuration Examples¶
High-Performance Setup¶
Optimized for lowest latency:
```yaml
# Automatically routes to fastest backend
selection_strategy: LeastLatency
backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s
```
Weighted Distribution¶
Distribute load based on server capacity:
```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # Handles 5x more traffic
    models: ["gpt-5.4", "claude-opus-4-6"]
  - name: medium-server
    url: http://medium:8000
    weight: 2  # Handles 2x base traffic
    models: ["gpt-5.4-mini", "claude-sonnet-4-6"]
  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-5.4-nano"]
```
Cache-Optimized Setup¶
Maximize model cache hits:
```yaml
selection_strategy: ConsistentHash
# Models always go to same backend for cache efficiency
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-2
    url: http://server2:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus-4-6", "claude-sonnet-4-6"]
```
Mixed Strategy with Fallback¶
```yaml
routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient
  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash       # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin  # Distribute based on weights
backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend
  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
```
Dynamic Strategy Switching¶
Via Environment Variable¶
```bash
# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml
```
Via Configuration Hot-Reload¶
```bash
# Update config.yaml
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml
# Router automatically reloads configuration
```
Checking Current Strategy¶
```bash
# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# View full configuration details
curl http://localhost:8080/admin/config
```
Monitoring Load Distribution¶
Backend Statistics¶
Response from `GET /admin/backends`:
```json
{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
```
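As a quick sanity check, the `distribution_ratio` in the sample response is consistent with the configured 2:1 weights:

```python
# Request counts taken from the sample response above
b1, b2 = 1523, 761
total = b1 + b2  # 2284, matching total_requests

# With weights 2 and 1, expect roughly a 2/3 : 1/3 split
assert round(b1 / total, 3) == 0.667
assert round(b2 / total, 3) == 0.333
```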
Prometheus Metrics¶
```
# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)

# Backend selection latency (p95)
histogram_quantile(0.95, rate(routing_backend_selection_duration_seconds_bucket[5m]))

# Load balancing effectiveness (spread of per-backend request rates)
stddev(sum by (backend_id) (rate(backend_request_total[5m])))
```
Advanced Configuration¶
Health-Aware Load Balancing¶
```yaml
health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  # Adjust weight based on health
  dynamic_weight_adjustment:
    enabled: true
    degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded

selection_strategy: WeightedRoundRobin
backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9    # >90% success rate
      degraded: 0.7   # 70-90% success rate
      unhealthy: 0.0  # <70% success rate
```
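The effect of `dynamic_weight_adjustment` with the thresholds above can be sketched as (illustrative; the function is hypothetical, not a router API):

```python
def effective_weight(weight, success_rate,
                     healthy=0.9, degraded=0.7, degraded_factor=0.5):
    """Full weight when healthy (>90% success), weight halved when
    degraded (70-90%), zero when unhealthy (<70%)."""
    if success_rate > healthy:
        return weight
    if success_rate >= degraded:
        return weight * degraded_factor
    return 0

assert effective_weight(100, 0.95) == 100   # healthy: full weight
assert effective_weight(100, 0.80) == 50.0  # degraded: halved
assert effective_weight(100, 0.50) == 0     # unhealthy: removed
```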
Request-Aware Routing¶
```yaml
routing:
  strategy: Custom
  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend
    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend
    - condition:
        default: true
      strategy: WeightedRoundRobin
```
Geographic Load Balancing¶
```yaml
routing:
  strategy: Geographic
backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1
  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1
  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1
geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
```
Best Practices¶
1. Start Simple¶
- Begin with Round-Robin for initial deployments
- Monitor performance metrics
- Switch strategies based on observed patterns
2. Monitor and Adjust¶
- Use `/admin/backends` to track backend performance
- Watch for load imbalances
- Adjust weights incrementally (±10% at a time)
3. Consider Your Workload¶
- Uniform requests: Round-Robin or Random
- Variable capacity: WeightedRoundRobin
- Performance critical: LeastLatency
- Cache-heavy: ConsistentHash
4. Health Check Configuration¶
- Enable health checks for automatic failover
- Set appropriate thresholds based on SLAs
- Use shorter intervals for critical backends
5. Testing Strategies¶
```bash
# Test load distribution
for i in {1..100}; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done

# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'
```
6. Gradual Migration¶
When changing strategies:

1. Test in staging environment
2. Monitor for 24 hours
3. Gradually roll out to production
4. Keep previous configuration for rollback
Strategy Selection Guide¶
| Strategy | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| RoundRobin | Equal backends | Simple, fair distribution | Ignores backend capacity | Default choice, homogeneous backends |
| WeightedRoundRobin | Mixed capacity backends | Respects backend capabilities | Requires weight tuning | Known performance differences |
| LeastLatency | Performance optimization | Adapts to real conditions | Needs warm-up period | Production environments, SLA critical |
| Random | Stateless workloads | No state overhead | Less predictable | Simple deployments, testing |
| ConsistentHash | Cache optimization | Maximizes cache hits | Can cause imbalance | Model-heavy workloads, stateful services |
| PrefixAwareHash | KV cache optimization | 40-60% TTFT reduction | Requires KV cache backends | vLLM/SGLang with shared prompts |
Troubleshooting¶
Uneven Load Distribution¶
```bash
# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'

# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'
```
High Latency with LeastLatency¶
- Check if warm-up period has completed
- Verify latency measurements are accurate
- Consider increasing health check frequency
- Check for network issues
ConsistentHash Imbalance¶
- Review model distribution across backends
- Consider adding more backends
- Use weight adjustments to compensate
- Monitor cache hit rates
See Also¶
- Configuration Guide - Full configuration options
- Metrics Guide - Monitor load balancing effectiveness
- Performance Guide - Optimize routing performance