
Load Balancing Guide

Continuum Router provides advanced load balancing to distribute requests efficiently across multiple backends. The router supports five selection strategies, each suited to a specific use case.

Available Strategies

1. Round-Robin (Default)

selection_strategy: RoundRobin
  • Description: Distributes requests evenly across all healthy backends in sequential order
  • Use Case: General-purpose load distribution when all backends have similar capabilities
  • Behavior: Each backend gets exactly one request before the cycle repeats
  • Weight Support: No (weights are ignored)
  • State Management: Minimal (only tracks current backend index)

Pros: - Simple and predictable - Fair distribution - Low overhead - No configuration required

Cons: - Ignores backend performance differences - Doesn't account for request complexity - May overload slower backends
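
The mechanics are simple enough to sketch. The following Python fragment is illustrative only, not the router's actual implementation: the strategy needs nothing more than a counter stepped modulo the number of healthy backends.

class RoundRobin:
    """Minimal round-robin selector (illustrative, not the router's source)."""

    def __init__(self, backends):
        self.backends = backends
        self.index = 0  # the only state this strategy keeps

    def select(self):
        healthy = [b for b in self.backends if b["healthy"]]
        choice = healthy[self.index % len(healthy)]
        self.index += 1
        return choice

picker = RoundRobin([{"name": "a", "healthy": True},
                     {"name": "b", "healthy": True}])
print([picker.select()["name"] for _ in range(4)])  # ['a', 'b', 'a', 'b']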

2. Weighted Round-Robin

selection_strategy: WeightedRoundRobin

backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # 3x the base traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
  • Description: Distributes requests proportionally based on backend weights
  • Use Case: When backends have different performance characteristics or capacity
  • Behavior: Backends with higher weights receive proportionally more requests
  • Weight Support: Yes (required)
  • Weight Range: 1-100 (recommended)

Pros: - Respects backend capabilities - Flexible traffic distribution - Easy to adjust load ratios

Cons: - Requires manual weight tuning - Static weights don't adapt to real-time conditions
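
As a mental model, weighted round-robin behaves like plain round-robin over a list in which each backend appears `weight` times. The sketch below is illustrative, not the router's code:

import itertools

def weighted_cycle(backends):
    """Repeat each backend `weight` times, then cycle (illustrative only)."""
    expanded = [b["name"] for b in backends for _ in range(b["weight"])]
    return itertools.cycle(expanded)

cycle = weighted_cycle([{"name": "high-performance", "weight": 3},
                        {"name": "standard", "weight": 1}])
print([next(cycle) for _ in range(8)])
# ['high-performance', 'high-performance', 'high-performance', 'standard', ...]

Production-grade implementations typically use a smoothing variant (such as Nginx's smooth weighted round-robin) so that a high-weight backend's requests are interleaved with the others rather than sent back-to-back.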

3. Least-Latency

selection_strategy: LeastLatency
  • Description: Routes requests to the backend with the lowest average response time
  • Use Case: Optimizing for response speed in production environments
  • Behavior: Continuously tracks response times and adapts routing accordingly
  • Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)

Pros: - Automatically optimizes for performance - Adapts to real-time conditions - No manual tuning required

Cons: - Needs warm-up period - May concentrate load on fastest backend - Sensitive to temporary latency spikes
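
A minimal sketch of this behavior, assuming a bounded window of recent latency samples per backend (illustrative, not the router's implementation):

from collections import defaultdict, deque

WARMUP_SAMPLES = 10  # matches the warm-up threshold noted above

class LeastLatency:
    """Illustrative selector: average of recent latencies, round-robin warm-up."""

    def __init__(self, names):
        self.names = names
        self.samples = defaultdict(lambda: deque(maxlen=100))  # bounded window
        self.rr_index = 0

    def record(self, name, latency_ms):
        self.samples[name].append(latency_ms)

    def select(self):
        # Fall back to round-robin until every backend has enough samples.
        if any(len(self.samples[n]) < WARMUP_SAMPLES for n in self.names):
            name = self.names[self.rr_index % len(self.names)]
            self.rr_index += 1
            return name
        # Then route to the backend with the lowest average latency.
        return min(self.names,
                   key=lambda n: sum(self.samples[n]) / len(self.samples[n]))

picker = LeastLatency(["local-gpu", "remote-gpu"])
print(picker.select())  # round-robin while warming up

Bounding the window (here, the last 100 samples) also limits how long a temporary latency spike can skew routing.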

4. Random

selection_strategy: Random
  • Description: Randomly selects a healthy backend for each request
  • Use Case: Simple load distribution without state tracking
  • Behavior: Each request has equal probability of going to any healthy backend
  • State Management: None (each selection is an independent uniform draw)

Pros: - No state management overhead - Simple implementation - Good for stateless workloads - Natural load distribution

Cons: - Less predictable - May have uneven distribution in short term - No performance optimization
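
The entire strategy reduces to a uniform draw over the healthy set; a sketch, assuming a healthy flag per backend:

import random

def select(backends):
    healthy = [b for b in backends if b["healthy"]]
    return random.choice(healthy)  # uniform over healthy backends

print(select([{"name": "a", "healthy": True}, {"name": "b", "healthy": False}]))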

5. Consistent-Hash

selection_strategy: ConsistentHash
  • Description: Uses consistent hashing to ensure the same model requests go to the same backend
  • Use Case: When you need session affinity or want to maximize cache efficiency
  • Behavior: Hashes the model name to consistently select the same backend
  • Benefits: Improves model caching, reduces model loading overhead

Pros: - Maximizes cache efficiency - Reduces model loading overhead - Predictable routing - Good for stateful models

Cons: - May cause load imbalance - Less flexible for dynamic scaling - Model-specific routing only
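
The standard construction is a hash ring: backend names (usually with virtual nodes to smooth the distribution) are hashed onto a ring, and each request's model name is routed to the next backend clockwise. A self-contained sketch, illustrative rather than the router's actual hashing:

import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative hash ring; the router's actual hashing may differ."""

    def __init__(self, backends, vnodes=100):
        self.ring = []  # sorted list of (hash, backend_name)
        for name in backends:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{name}#{i}"), name))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def select(self, model):
        # Find the first ring position at or after the model's hash.
        idx = bisect.bisect(self.ring, (self._hash(model),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["backend-1", "backend-2", "backend-3"])
print(ring.select("gpt-4"))   # always the same backend for "gpt-4"
print(ring.select("llama3"))  # may differ, but stable across calls

The key property is stability: adding or removing a backend remaps only the models nearest to it on the ring, rather than reshuffling everything.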

Configuration Examples

High-Performance Setup

Optimized for lowest latency:

# Automatically routes to fastest backend
selection_strategy: LeastLatency

backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s

Weighted Distribution

Distribute load based on server capacity:

selection_strategy: WeightedRoundRobin

backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # 5x the base traffic
    models: ["gpt-4", "claude-opus"]

  - name: medium-server
    url: http://medium:8000
    weight: 2  # 2x the base traffic
    models: ["gpt-3.5-turbo", "claude-sonnet"]

  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-3.5-turbo"]

Cache-Optimized Setup

Maximize model cache hits:

selection_strategy: ConsistentHash

# Models always go to same backend for cache efficiency
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-4", "gpt-3.5-turbo"]

  - name: backend-2
    url: http://server2:8000
    models: ["gpt-4", "gpt-3.5-turbo"]

  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus", "claude-sonnet"]

Mixed Strategy with Fallback

routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient

  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash  # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin  # Distribute based on weights

backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend

  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
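
The sketch below shows one plausible resolution order for this configuration: per-model overrides win, and the fallback strategy is used while latency data is insufficient. The exact semantics are an assumption, and `resolve_strategy` is a hypothetical helper for illustration:

def resolve_strategy(model, overrides, default, fallback, has_latency_data):
    """Assumed resolution order for the routing block above (illustrative)."""
    if model in overrides:                       # per-model override wins
        return overrides[model]
    if default == "LeastLatency" and not has_latency_data:
        return fallback                          # warm-up: latency data insufficient
    return default

overrides = {"gpt-4": "ConsistentHash", "llama3": "WeightedRoundRobin"}
print(resolve_strategy("gpt-4", overrides, "LeastLatency", "RoundRobin", False))
# ConsistentHash
print(resolve_strategy("mistral", overrides, "LeastLatency", "RoundRobin", False))
# RoundRobin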

Dynamic Strategy Switching

Via Environment Variable

# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml

Via Configuration Hot-Reload

# Update config.yaml
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml

# Router automatically reloads configuration

Checking Current Strategy

# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# View full configuration details
curl http://localhost:8080/admin/config

Monitoring Load Distribution

Backend Statistics

# Check backend statistics
curl http://localhost:8080/admin/backends

Response:

{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
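
With WeightedRoundRobin, the reported distribution should track the configured weights: here backend-1 (weight 2) receives roughly 2/3 of traffic and backend-2 (weight 1) the remaining 1/3, which matches the distribution_ratio. Given the parsed JSON above, a quick consistency check might look like this (a sketch; check_distribution is a hypothetical helper):

def check_distribution(stats):
    """Compare observed request share to the weight-implied share (illustrative)."""
    total_weight = sum(b["weight"] for b in stats["backends"])
    for b in stats["backends"]:
        expected = b["weight"] / total_weight
        observed = stats["distribution_ratio"][b["name"]]
        print(f'{b["name"]}: expected {expected:.3f}, observed {observed:.3f}')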

Prometheus Metrics

# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)

# Backend selection latency
histogram_quantile(0.95, rate(routing_backend_selection_duration_seconds_bucket[5m]))

# Load balancing effectiveness
stddev(sum by (backend_id) (rate(backend_request_total[5m])))

Advanced Configuration

Health-Aware Load Balancing

health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2

  # Adjust weight based on health
  dynamic_weight_adjustment:
    enabled: true
    degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded

selection_strategy: WeightedRoundRobin

backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9     # >90% success rate
      degraded: 0.7    # 70-90% success rate
      unhealthy: 0.0   # <70% success rate
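
Assuming the thresholds above map success rate to an effective weight, the adjustment can be summarized like this (a sketch of assumed semantics, not the router's code):

def effective_weight(base_weight, success_rate, degraded_factor=0.5):
    """Illustrative mapping from success rate to effective weight,
    mirroring the thresholds configured above (assumed semantics)."""
    if success_rate >= 0.9:                    # healthy: full weight
        return base_weight
    if success_rate >= 0.7:                    # degraded: weight reduced
        return base_weight * degraded_factor
    return 0                                   # unhealthy: out of rotation

print(effective_weight(100, 0.95))  # 100
print(effective_weight(100, 0.80))  # 50.0
print(effective_weight(100, 0.50))  # 0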

Request-Aware Routing

routing:
  strategy: Custom

  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend

    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend

    - condition:
        default: true
      strategy: WeightedRoundRobin
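
One plausible evaluation model is first-match-wins, with the default rule as a catch-all. The following sketch assumes that semantics; match_rule is a hypothetical helper for illustration:

def match_rule(request, rules):
    """First-match rule evaluation for the config above (assumed semantics)."""
    for rule in rules:
        cond = rule["condition"]
        if cond.get("default"):
            return rule["strategy"]
        if "model" in cond and request.get("model") != cond["model"]:
            continue
        if "stream" in cond and request.get("stream") != cond["stream"]:
            continue
        if "max_tokens" in cond:
            if request.get("max_tokens", 0) <= cond["max_tokens"]["greater_than"]:
                continue
        return rule["strategy"]

rules = [
    {"condition": {"model": "gpt-4", "max_tokens": {"greater_than": 2000}},
     "strategy": "ConsistentHash"},
    {"condition": {"model": "gpt-3.5-turbo", "stream": True},
     "strategy": "LeastLatency"},
    {"condition": {"default": True}, "strategy": "WeightedRoundRobin"},
]
print(match_rule({"model": "gpt-4", "max_tokens": 4000}, rules))  # ConsistentHash
print(match_rule({"model": "mistral"}, rules))                    # WeightedRoundRobin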

Geographic Load Balancing

routing:
  strategy: Geographic

backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1

  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1

  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1

geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
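
A rough model of the selection logic, under the assumption that detect_client_region prefers a same-region backend and fallback_to_nearest picks the lowest measured latency otherwise (illustrative only; latency_ms is a hypothetical input):

def select_region(client_region, backends, latency_ms):
    """Same-region first, then lowest measured latency (assumed semantics)."""
    regions = {b["region"] for b in backends}
    if client_region in regions:
        return client_region                          # detect_client_region
    return min(regions, key=lambda r: latency_ms[r])  # fallback_to_nearest

backends = [{"region": "us-west"}, {"region": "us-east"}, {"region": "eu-central"}]
print(select_region("eu-west", backends,
                    {"us-west": 150, "us-east": 90, "eu-central": 25}))
# eu-central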

Best Practices

1. Start Simple

  • Begin with Round-Robin for initial deployments
  • Monitor performance metrics
  • Switch strategies based on observed patterns

2. Monitor and Adjust

  • Use /admin/backends to track backend performance
  • Watch for load imbalances
  • Adjust weights incrementally (±10% at a time)

3. Consider Your Workload

  • Uniform requests: Round-Robin or Random
  • Variable capacity: WeightedRoundRobin
  • Performance critical: LeastLatency
  • Cache-heavy: ConsistentHash

4. Health Check Configuration

  • Enable health checks for automatic failover
  • Set appropriate thresholds based on SLAs
  • Use shorter intervals for critical backends

5. Testing Strategies

# Test load distribution
for i in {1..100}; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done

# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'

6. Gradual Migration

When changing strategies:

  1. Test in a staging environment
  2. Monitor for 24 hours
  3. Gradually roll out to production
  4. Keep the previous configuration for rollback

Strategy Selection Guide

| Strategy           | Best For                 | Pros                           | Cons                     | When to Use                              |
|--------------------|--------------------------|--------------------------------|--------------------------|------------------------------------------|
| RoundRobin         | Equal backends           | Simple, fair distribution      | Ignores backend capacity | Default choice, homogeneous backends     |
| WeightedRoundRobin | Mixed capacity backends  | Respects backend capabilities  | Requires weight tuning   | Known performance differences            |
| LeastLatency       | Performance optimization | Adapts to real conditions      | Needs warm-up period     | Production environments, SLA critical    |
| Random             | Stateless workloads      | No state overhead              | Less predictable         | Simple deployments, testing              |
| ConsistentHash     | Cache optimization       | Maximizes cache hits           | Can cause imbalance      | Model-heavy workloads, stateful services |

Troubleshooting

Uneven Load Distribution

# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'

# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'

High Latency with LeastLatency

  • Check if warm-up period has completed
  • Verify latency measurements are accurate
  • Consider increasing health check frequency
  • Check for network issues

ConsistentHash Imbalance

  • Review model distribution across backends
  • Consider adding more backends
  • Use weight adjustments to compensate
  • Monitor cache hit rates

See Also