Load Balancing Guide
Continuum Router provides advanced load balancing capabilities to distribute requests across multiple backends efficiently. The router supports five different strategies, each optimized for specific use cases.
Table of Contents

- Available Strategies
- Configuration Examples
- Dynamic Strategy Switching
- Monitoring Load Distribution
- Advanced Configuration
- Best Practices
- Strategy Selection Guide
- Troubleshooting
- See Also
Available Strategies

1. Round-Robin (Default)
- Description: Distributes requests evenly across all healthy backends in sequential order
- Use Case: General-purpose load distribution when all backends have similar capabilities
- Behavior: Each backend gets exactly one request before the cycle repeats
- Weight Support: No (weights are ignored)
- State Management: Minimal (only tracks current backend index)
Pros:

- Simple and predictable
- Fair distribution
- Low overhead
- No configuration required

Cons:

- Ignores backend performance differences
- Doesn't account for request complexity
- May overload slower backends
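The behavior above can be sketched in a few lines of Python. This is an illustrative model of the described strategy (the backend dicts and `healthy` flag are hypothetical), not Continuum Router's actual code:

```python
class RoundRobinSelector:
    """Minimal round-robin model: one index, cycled over healthy backends."""

    def __init__(self, backends):
        self.backends = backends
        self.index = 0  # the only state this strategy needs

    def select(self):
        healthy = [b for b in self.backends if b.get("healthy", True)]
        if not healthy:
            raise RuntimeError("no healthy backends available")
        choice = healthy[self.index % len(healthy)]
        self.index += 1
        return choice

selector = RoundRobinSelector([{"name": "a"}, {"name": "b"}, {"name": "c"}])
print([selector.select()["name"] for _ in range(6)])  # ['a', 'b', 'c', 'a', 'b', 'c']
```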
2. Weighted Round-Robin

```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # Gets 3x more traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
```
- Description: Distributes requests proportionally based on backend weights
- Use Case: When backends have different performance characteristics or capacity
- Behavior: Backends with higher weights receive proportionally more requests
- Weight Support: Yes (required)
- Weight Range: 1-100 (recommended)
Pros:

- Respects backend capabilities
- Flexible traffic distribution
- Easy to adjust load ratios

Cons:

- Requires manual weight tuning
- Static weights don't adapt to real-time conditions
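Proportional distribution without long runs to a single backend is commonly implemented with the "smooth" weighted round-robin algorithm (the approach nginx popularized). The sketch below is an illustrative Python model, not the router's implementation:

```python
from collections import Counter

def smooth_weighted_round_robin(backends):
    """Yield backend names in proportion to their weights, smoothly interleaved."""
    current = {b["name"]: 0 for b in backends}
    total = sum(b["weight"] for b in backends)
    while True:
        for b in backends:
            current[b["name"]] += b["weight"]   # every backend earns its weight
        best = max(backends, key=lambda b: current[b["name"]])
        current[best["name"]] -= total          # the winner pays the total
        yield best["name"]

gen = smooth_weighted_round_robin(
    [{"name": "high-performance", "weight": 3}, {"name": "standard", "weight": 1}])
print(Counter(next(gen) for _ in range(8)))  # high-performance: 6, standard: 2
```

Over any window of four picks, the weight-3 backend receives three requests and the weight-1 backend one, matching the 3:1 ratio in the config above.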
3. Least-Latency
- Description: Routes requests to the backend with the lowest average response time
- Use Case: Optimizing for response speed in production environments
- Behavior: Continuously tracks response times and adapts routing accordingly
- Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)
Pros:

- Automatically optimizes for performance
- Adapts to real-time conditions
- No manual tuning required

Cons:

- Needs warm-up period
- May concentrate load on fastest backend
- Sensitive to temporary latency spikes
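A minimal sketch of this behavior, assuming mean latency per backend and the 10-sample round-robin warm-up described above (backend names and method names are hypothetical):

```python
class LeastLatencySelector:
    """Route to the backend with the lowest mean observed latency;
    fall back to round-robin until every backend has min_samples observations."""

    def __init__(self, backends, min_samples=10):
        self.backends = backends
        self.min_samples = min_samples
        self.latencies = {b: [] for b in backends}
        self.rr_index = 0

    def record(self, backend, latency_ms):
        self.latencies[backend].append(latency_ms)

    def select(self):
        if any(len(v) < self.min_samples for v in self.latencies.values()):
            choice = self.backends[self.rr_index % len(self.backends)]  # warm-up
            self.rr_index += 1
            return choice
        return min(self.backends,
                   key=lambda b: sum(self.latencies[b]) / len(self.latencies[b]))

sel = LeastLatencySelector(["local-gpu", "remote-gpu"], min_samples=3)
for ms in (120, 110, 130):
    sel.record("local-gpu", ms)
for ms in (250, 240, 260):
    sel.record("remote-gpu", ms)
print(sel.select())  # local-gpu (lower mean latency)
```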
4. Random
- Description: Randomly selects a healthy backend for each request
- Use Case: Simple load distribution without state tracking
- Behavior: Each request has equal probability of going to any healthy backend
- Advantages: No state management overhead, good for stateless workloads
Pros:

- No state management overhead
- Simple implementation
- Good for stateless workloads
- Natural load distribution

Cons:

- Less predictable
- May have uneven distribution in short term
- No performance optimization
5. Consistent-Hash
- Description: Uses consistent hashing to ensure the same model requests go to the same backend
- Use Case: When you need session affinity or want to maximize cache efficiency
- Behavior: Hashes the model name to consistently select the same backend
- Benefits: Improves model caching, reduces model loading overhead
Pros:

- Maximizes cache efficiency
- Reduces model loading overhead
- Predictable routing
- Good for stateful models

Cons:

- May cause load imbalance
- Less flexible for dynamic scaling
- Model-specific routing only
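Consistent hashing is typically implemented as a hash ring with virtual nodes. The sketch below illustrates the stable model-to-backend mapping; MD5 and 100 virtual nodes are arbitrary choices for the example, not the router's documented internals:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: a model name always maps to the same
    backend, and adding/removing a backend only remaps a fraction of keys."""

    def __init__(self, backends, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def select(self, model):
        # first ring position clockwise of the model's hash
        idx = bisect.bisect(self.keys, self._hash(model)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["backend-1", "backend-2", "backend-3"])
print(ring.select("llama3") == ring.select("llama3"))  # True: stable routing
```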
Configuration Examples

High-Performance Setup

Optimized for lowest latency:

```yaml
# Automatically routes to fastest backend
selection_strategy: LeastLatency
backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s
```
Weighted Distribution

Distribute load based on server capacity:

```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # Handles 5x more traffic
    models: ["gpt-4", "claude-opus"]
  - name: medium-server
    url: http://medium:8000
    weight: 2  # Handles 2x base traffic
    models: ["gpt-3.5-turbo", "claude-sonnet"]
  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-3.5-turbo"]
```
Cache-Optimized Setup

Maximize model cache hits:

```yaml
# Models always go to same backend for cache efficiency
selection_strategy: ConsistentHash
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-4", "gpt-3.5-turbo"]
  - name: backend-2
    url: http://server2:8000
    models: ["gpt-4", "gpt-3.5-turbo"]
  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus", "claude-sonnet"]
```
Mixed Strategy with Fallback

```yaml
routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient
  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash       # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin  # Distribute based on weights
backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend
  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
```
Dynamic Strategy Switching

Via Environment Variable

```bash
# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml
```

Via Configuration Hot-Reload

```bash
# Update config.yaml; the router automatically reloads the configuration
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml
```

Checking Current Strategy

```bash
# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# View full configuration details
curl http://localhost:8080/admin/config
```
Monitoring Load Distribution

Backend Statistics

```bash
curl http://localhost:8080/admin/backends
```

Response:

```json
{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
```
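The reported distribution can be sanity-checked against the configured weights (expected share = weight / total weight). Using the numbers from the sample response above:

```python
# Figures taken from the sample /admin/backends response above.
stats = {
    "backend-1": {"total_requests": 1523, "weight": 2},
    "backend-2": {"total_requests": 761, "weight": 1},
}
total_requests = sum(s["total_requests"] for s in stats.values())
total_weight = sum(s["weight"] for s in stats.values())

for name, s in stats.items():
    observed = s["total_requests"] / total_requests
    expected = s["weight"] / total_weight
    print(f"{name}: observed {observed:.3f}, expected {expected:.3f}")
# backend-1: observed 0.667, expected 0.667
# backend-2: observed 0.333, expected 0.333
```

A sustained gap between observed and expected shares is the signal to re-tune weights or check backend health.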
Prometheus Metrics

```promql
# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)

# Backend selection latency (p95)
histogram_quantile(0.95, rate(routing_backend_selection_duration_seconds_bucket[5m]))

# Load balancing effectiveness (spread of per-backend request rates)
stddev(sum(rate(backend_request_total[5m])) by (backend_id))
```
Advanced Configuration

Health-Aware Load Balancing

```yaml
health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  # Adjust weight based on health
  dynamic_weight_adjustment:
    enabled: true
    degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded

selection_strategy: WeightedRoundRobin
backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9    # >90% success rate
      degraded: 0.7   # 70-90% success rate
      unhealthy: 0.0  # <70% success rate
```
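Working the thresholds and `degraded_weight_factor` through numerically, under the semantics stated in the comments above (this helper is illustrative, not a router API):

```python
def effective_weight(weight, success_rate, degraded_factor=0.5):
    """Illustrative: scale a backend's weight by health score thresholds."""
    if success_rate > 0.9:      # healthy: full weight
        return weight
    if success_rate >= 0.7:     # degraded: weight scaled by the factor
        return int(weight * degraded_factor)
    return 0                    # unhealthy: removed from rotation

print(effective_weight(100, 0.95))  # 100
print(effective_weight(100, 0.80))  # 50
print(effective_weight(100, 0.60))  # 0
```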
Request-Aware Routing

```yaml
routing:
  strategy: Custom
  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend
    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend
    - condition:
        default: true
      strategy: WeightedRoundRobin
```
Geographic Load Balancing

```yaml
routing:
  strategy: Geographic
backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1
  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1
  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1
geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
```
Best Practices

1. Start Simple
- Begin with Round-Robin for initial deployments
- Monitor performance metrics
- Switch strategies based on observed patterns
2. Monitor and Adjust

- Use `/admin/backends` to track backend performance
- Watch for load imbalances
- Adjust weights incrementally (±10% at a time)
3. Consider Your Workload
- Uniform requests: Round-Robin or Random
- Variable capacity: WeightedRoundRobin
- Performance critical: LeastLatency
- Cache-heavy: ConsistentHash
4. Health Check Configuration
- Enable health checks for automatic failover
- Set appropriate thresholds based on SLAs
- Use shorter intervals for critical backends
5. Testing Strategies

```bash
# Test load distribution
for i in {1..100}; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done

# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'
```
6. Gradual Migration

When changing strategies:

1. Test in staging environment
2. Monitor for 24 hours
3. Gradually roll out to production
4. Keep previous configuration for rollback
Strategy Selection Guide
| Strategy | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| RoundRobin | Equal backends | Simple, fair distribution | Ignores backend capacity | Default choice, homogeneous backends |
| WeightedRoundRobin | Mixed capacity backends | Respects backend capabilities | Requires weight tuning | Known performance differences |
| LeastLatency | Performance optimization | Adapts to real conditions | Needs warm-up period | Production environments, SLA critical |
| Random | Stateless workloads | No state overhead | Less predictable | Simple deployments, testing |
| ConsistentHash | Cache optimization | Maximizes cache hits | Can cause imbalance | Model-heavy workloads, stateful services |
Troubleshooting

Uneven Load Distribution

```bash
# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'

# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'
```
High Latency with LeastLatency
- Check if warm-up period has completed
- Verify latency measurements are accurate
- Consider increasing health check frequency
- Check for network issues
ConsistentHash Imbalance
- Review model distribution across backends
- Consider adding more backends
- Use weight adjustments to compensate
- Monitor cache hit rates
See Also
- Configuration Guide - Full configuration options
- Metrics Guide - Monitor load balancing effectiveness
- Performance Guide - Optimize routing performance