Load Balancing Guide¶
Continuum Router provides advanced load balancing capabilities to distribute requests across multiple backends efficiently. The router supports six different strategies, each optimized for specific use cases.
Table of Contents¶
- Available Strategies
- Configuration Examples
- Dynamic Strategy Switching
- Monitoring Load Distribution
- Advanced Configuration
- Best Practices
- Strategy Selection Guide
Available Strategies¶
1. Round-Robin (Default)¶
- Description: Distributes requests evenly across all healthy backends in sequential order
- Use Case: General-purpose load distribution when all backends have similar capabilities
- Behavior: Each backend gets exactly one request before the cycle repeats
- Weight Support: No (weights are ignored)
- State Management: Minimal (only tracks current backend index)
Pros:
- Simple and predictable
- Fair distribution
- Low overhead
- No configuration required

Cons:
- Ignores backend performance differences
- Doesn't account for request complexity
- May overload slower backends
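The sequential cycle described above can be captured in a minimal sketch (illustrative only, not the router's actual implementation; backend names are placeholders):

```python
class RoundRobin:
    """Minimal round-robin selector: the only state is the current index."""

    def __init__(self, backends):
        self.backends = backends
        self._idx = 0

    def select(self):
        backend = self.backends[self._idx % len(self.backends)]
        self._idx += 1
        return backend

rr = RoundRobin(["backend-1", "backend-2", "backend-3"])
picks = [rr.select() for _ in range(6)]
# Each backend gets exactly one request before the cycle repeats.
```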
2. Weighted Round-Robin¶
```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # Gets 3x more traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
```
- Description: Distributes requests proportionally based on backend weights
- Use Case: When backends have different performance characteristics or capacity
- Behavior: Backends with higher weights receive proportionally more requests
- Weight Support: Yes (required)
- Weight Range: 1-100 (recommended)
Pros:
- Respects backend capabilities
- Flexible traffic distribution
- Easy to adjust load ratios

Cons:
- Requires manual weight tuning
- Static weights don't adapt to real-time conditions
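One simple way to realize the proportional distribution above is to expand each backend into `weight` slots and cycle through them (a sketch, not the router's actual algorithm; production implementations often use smooth weighted round-robin instead to avoid bursts):

```python
from collections import Counter

class WeightedRoundRobin:
    """Sketch: expand each backend into `weight` slots, then cycle."""

    def __init__(self, backends):
        # backends: list of (name, weight) pairs
        self._slots = [name for name, w in backends for _ in range(w)]
        self._idx = 0

    def select(self):
        backend = self._slots[self._idx % len(self._slots)]
        self._idx += 1
        return backend

wrr = WeightedRoundRobin([("high-performance", 3), ("standard", 1)])
picks = [wrr.select() for _ in range(8)]
# With weights 3:1, high-performance receives 3x the traffic of standard.
assert Counter(picks) == Counter({"high-performance": 6, "standard": 2})
```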
3. Least-Latency¶
- Description: Routes requests to the backend with the lowest average response time
- Use Case: Optimizing for response speed in production environments
- Behavior: Continuously tracks response times and adapts routing accordingly
- Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)
Pros:
- Automatically optimizes for performance
- Adapts to real-time conditions
- No manual tuning required

Cons:
- Needs warm-up period
- May concentrate load on fastest backend
- Sensitive to temporary latency spikes
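The warm-up fallback described above can be sketched as follows (illustrative; backend names, the sample threshold, and the plain-average smoothing are assumptions, not the router's exact behavior):

```python
class LeastLatency:
    """Sketch: route to the backend with the lowest average latency;
    fall back to round-robin until every backend has min_samples
    observations (the warm-up period)."""

    def __init__(self, backends, min_samples=10):
        self.samples = {b: [] for b in backends}
        self.min_samples = min_samples
        self._rr = 0

    def record(self, backend, latency_ms):
        self.samples[backend].append(latency_ms)

    def select(self):
        if any(len(s) < self.min_samples for s in self.samples.values()):
            names = list(self.samples)  # warm-up: plain round-robin
            backend = names[self._rr % len(names)]
            self._rr += 1
            return backend
        return min(self.samples,
                   key=lambda b: sum(self.samples[b]) / len(self.samples[b]))

ll = LeastLatency(["local-gpu", "remote-gpu"], min_samples=2)
first, second = ll.select(), ll.select()   # warm-up: round-robins
for ms in (120, 130):
    ll.record("local-gpu", ms)
for ms in (340, 360):
    ll.record("remote-gpu", ms)
# After warm-up, the lowest-average backend wins.
```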
4. Random¶
- Description: Randomly selects a healthy backend for each request
- Use Case: Simple load distribution without state tracking
- Behavior: Each request has equal probability of going to any healthy backend
- Advantages: No state management overhead, good for stateless workloads
Pros:
- No state management overhead
- Simple implementation
- Good for stateless workloads
- Natural load distribution

Cons:
- Less predictable
- May have uneven distribution in short term
- No performance optimization
5. Consistent-Hash¶
- Description: Uses consistent hashing to ensure the same model requests go to the same backend
- Use Case: When you need session affinity or want to maximize cache efficiency
- Behavior: Hashes the model name to consistently select the same backend
- Benefits: Improves model caching, reduces model loading overhead
Pros:
- Maximizes cache efficiency
- Reduces model loading overhead
- Predictable routing
- Good for stateful models

Cons:
- May cause load imbalance
- Less flexible for dynamic scaling
- Model-specific routing only
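The mechanism can be sketched as a hash ring with virtual nodes, where the model name deterministically picks the backend (a minimal illustration; the router's actual hash function and virtual-node count may differ):

```python
import hashlib
from bisect import bisect

class ConsistentHash:
    """Sketch: hash ring with virtual nodes; the model name selects
    the first ring position at or after its own hash."""

    def __init__(self, backends, virtual_nodes=100):
        self._ring = sorted(
            (self._hash(f"{b}#{v}"), b)
            for b in backends for v in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def select(self, model):
        i = bisect(self._keys, self._hash(model)) % len(self._ring)
        return self._ring[i][1]

ch = ConsistentHash(["backend-1", "backend-2", "backend-3"])
# The same model always routes to the same backend.
assert ch.select("llama3") == ch.select("llama3")
```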
6. Prefix-Aware Hash (KV Cache Optimized)¶
```yaml
selection_strategy: PrefixAwareHash
prefix_routing:
  enabled: true
  max_prefix_length: 1024
  load_factor_epsilon: 0.25
  virtual_nodes: 150
  anthropic_cache_control_injection: true
```
- Description: Routes requests sharing the same prompt prefix to the same backend, maximizing KV cache reuse on inference engines (vLLM, SGLang, TensorRT-LLM). Uses Consistent Hashing with Bounded Loads (CHWBL) to prevent hotspots.
- Use Case: Multi-backend vLLM/SGLang deployments where KV cache reuse is critical for latency
- Behavior: Extracts a prefix key (SHA256) from the system prompt or first user message, then uses the hash to consistently route to the same backend. CHWBL caps per-backend load at `ceil(avg_load * (1 + epsilon))`, overflowing to the next ring node when a backend is overloaded.
- Fallback: Falls back to model-based `ConsistentHash` when no prefix key is available (e.g., non-chat requests).
Pros:
- 40-60% TTFT reduction with KV cache hits
- 40%+ throughput improvement for shared-prefix workloads
- Automatic hotspot prevention via CHWBL
- Composable with KV cache index scoring (Tier 4)
Cons:
- Requires backends with KV cache support (vLLM, SGLang)
- Less benefit for unique/random prompts
- Additional configuration required
Prefix Key Extraction¶
The router extracts a prefix key from chat completion requests:
- With system prompt: `SHA256(model + "\0" + "S" + system_prompt[:max_prefix_length])`
- Without system prompt: `SHA256(model + "\0" + "M" + first_user_message[:max_prefix_length])`
This ensures that requests with the same system prompt are routed to the same backend, while the model name prevents cross-model collisions.
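The extraction rule can be sketched directly from the two formulas above (illustrative; the function name and message shape are assumptions, and the router's byte-level encoding may differ):

```python
import hashlib

MAX_PREFIX_LENGTH = 1024  # matches max_prefix_length in the config above

def prefix_key(model, messages):
    """Prefer the system prompt ("S" tag); otherwise use the first
    user message ("M" tag). Returns None when no key is available,
    which is the fallback-to-ConsistentHash case."""
    system = next((m["content"] for m in messages if m["role"] == "system"), None)
    if system is not None:
        payload = f"{model}\0S{system[:MAX_PREFIX_LENGTH]}"
    else:
        user = next((m["content"] for m in messages if m["role"] == "user"), None)
        if user is None:
            return None
        payload = f"{model}\0M{user[:MAX_PREFIX_LENGTH]}"
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the model name is part of the payload, identical system prompts on different models hash to different keys.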
CHWBL Load Balancing¶
When the preferred backend's in-flight request count exceeds the load cap, the router walks clockwise around the hash ring to find the next eligible backend.
With the default epsilon = 0.25, a backend can handle up to 25% more than the average load before overflow. Lower epsilon values provide stricter balance; higher values preserve more prefix affinity.
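The cap and overflow behavior can be sketched as follows (a minimal illustration of CHWBL using in-flight counts; the ring order and counts are made up for the example):

```python
import math

def load_cap(in_flight, epsilon=0.25):
    """Per-backend cap: ceil(avg_load * (1 + epsilon))."""
    avg = sum(in_flight.values()) / len(in_flight)
    return math.ceil(avg * (1 + epsilon))

def select_with_bounded_load(ring_order, in_flight, epsilon=0.25):
    """Walk the ring starting at the hash-preferred backend (first
    entry) and return the first backend under the cap."""
    cap = load_cap(in_flight, epsilon)
    for backend in ring_order:
        if in_flight[backend] < cap:
            return backend
    return ring_order[0]  # everyone at cap: keep prefix affinity

in_flight = {"backend-1": 10, "backend-2": 4, "backend-3": 4}
# avg = 6, cap = ceil(6 * 1.25) = 8, so the preferred but overloaded
# backend-1 (10 in flight) overflows to the next ring node.
assert select_with_bounded_load(["backend-1", "backend-2", "backend-3"], in_flight) == "backend-2"
```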
KV Cache Index Integration (Tier 4)¶
When a KV cache index is available, the router can use real-time cache state data to make even more precise routing decisions. The KvOverlapScorer runs before the PrefixAwareHash strategy and selects a backend based on actual cached token overlap:
```
final_score = overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score
```
If the best score exceeds the minimum threshold (default: 0.3), the scorer selects that backend directly. Otherwise, the PrefixAwareHash strategy takes over. See KV Cache Architecture for details.
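The scoring step can be sketched directly from the formula (the weight values below are illustrative assumptions; only the 0.3 threshold is documented here):

```python
def kv_overlap_score(overlap_score, load_ratio, health_score,
                     overlap_weight=0.6, load_weight=0.25, health_weight=0.15):
    """Combine cache overlap, inverse load, and health into one score.
    The default weights here are hypothetical, not the router's."""
    return (overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score)

MIN_SCORE = 0.3  # documented default threshold

score = kv_overlap_score(overlap_score=0.8, load_ratio=0.5, health_score=1.0)
# score = 0.6*0.8 + 0.25*0.5 + 0.15*1.0 = 0.755 > 0.3,
# so the scorer would select this backend directly.
```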
Configuration Examples¶
High-Performance Setup¶
Optimized for lowest latency:
```yaml
# Automatically routes to fastest backend
selection_strategy: LeastLatency
backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s
```
Weighted Distribution¶
Distribute load based on server capacity:
```yaml
selection_strategy: WeightedRoundRobin
backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # Handles 5x more traffic
    models: ["gpt-5.4", "claude-opus-4-6"]
  - name: medium-server
    url: http://medium:8000
    weight: 2  # Handles 2x base traffic
    models: ["gpt-5.4-mini", "claude-sonnet-4-6"]
  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-5.4-nano"]
```
Cache-Optimized Setup¶
Maximize model cache hits:
```yaml
selection_strategy: ConsistentHash
# Models always go to same backend for cache efficiency
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-2
    url: http://server2:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus-4-6", "claude-sonnet-4-6"]
```
Mixed Strategy with Fallback¶
```yaml
routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient
  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash       # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin  # Distribute based on weights
backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend
  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
```
Dynamic Strategy Switching¶
Via Environment Variable¶
```bash
# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml
```
Via Configuration Hot-Reload¶
```bash
# Update config.yaml
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml
# Router automatically reloads configuration
```
Checking Current Strategy¶
```bash
# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# View full configuration details
curl http://localhost:8080/admin/config
```
Monitoring Load Distribution¶
Backend Statistics¶
Response from `GET /admin/backends`:
```json
{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
```
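As a quick sanity check, the `distribution_ratio` in the sample response is consistent with the configured 2:1 weights:

```python
# Request counts taken from the sample response above
b1, b2 = 1523, 761
total = b1 + b2  # 2284, matching total_requests

# With weights 2 and 1, expect roughly a 2/3 : 1/3 split
assert round(b1 / total, 3) == 0.667
assert round(b2 / total, 3) == 0.333
```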
Prometheus Metrics¶
```
# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)

# Backend selection latency (p95)
histogram_quantile(0.95, rate(routing_backend_selection_duration_seconds_bucket[5m]))

# Load balancing effectiveness (spread of per-backend request rates)
stddev(sum by (backend_id) (rate(backend_request_total[5m])))
```
Advanced Configuration¶
Health-Aware Load Balancing¶
```yaml
health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  # Adjust weight based on health
  dynamic_weight_adjustment:
    enabled: true
    degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded

selection_strategy: WeightedRoundRobin
backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9    # >90% success rate
      degraded: 0.7   # 70-90% success rate
      unhealthy: 0.0  # <70% success rate
```
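The effect of `dynamic_weight_adjustment` with the thresholds above can be sketched as (illustrative; the function is hypothetical, not a router API):

```python
def effective_weight(weight, success_rate,
                     healthy=0.9, degraded=0.7, degraded_factor=0.5):
    """Full weight when healthy (>90% success), weight halved when
    degraded (70-90%), zero when unhealthy (<70%)."""
    if success_rate > healthy:
        return weight
    if success_rate >= degraded:
        return weight * degraded_factor
    return 0

assert effective_weight(100, 0.95) == 100   # healthy: full weight
assert effective_weight(100, 0.80) == 50.0  # degraded: halved
assert effective_weight(100, 0.50) == 0     # unhealthy: removed
```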
Request-Aware Routing¶
```yaml
routing:
  strategy: Custom
  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend
    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend
    - condition:
        default: true
      strategy: WeightedRoundRobin
```
Geographic Load Balancing¶
```yaml
routing:
  strategy: Geographic
backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1
  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1
  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1
geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
```
Best Practices¶
1. Start Simple¶
- Begin with Round-Robin for initial deployments
- Monitor performance metrics
- Switch strategies based on observed patterns
2. Monitor and Adjust¶
- Use `/admin/backends` to track backend performance
- Watch for load imbalances
- Adjust weights incrementally (±10% at a time)
3. Consider Your Workload¶
- Uniform requests: Round-Robin or Random
- Variable capacity: WeightedRoundRobin
- Performance critical: LeastLatency
- Cache-heavy: ConsistentHash
4. Health Check Configuration¶
- Enable health checks for automatic failover
- Set appropriate thresholds based on SLAs
- Use shorter intervals for critical backends
5. Testing Strategies¶
```bash
# Test load distribution
for i in {1..100}; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done

# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'
```
6. Gradual Migration¶
When changing strategies:

1. Test in staging environment
2. Monitor for 24 hours
3. Gradually roll out to production
4. Keep previous configuration for rollback
Strategy Selection Guide¶
| Strategy | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| RoundRobin | Equal backends | Simple, fair distribution | Ignores backend capacity | Default choice, homogeneous backends |
| WeightedRoundRobin | Mixed capacity backends | Respects backend capabilities | Requires weight tuning | Known performance differences |
| LeastLatency | Performance optimization | Adapts to real conditions | Needs warm-up period | Production environments, SLA critical |
| Random | Stateless workloads | No state overhead | Less predictable | Simple deployments, testing |
| ConsistentHash | Cache optimization | Maximizes cache hits | Can cause imbalance | Model-heavy workloads, stateful services |
| PrefixAwareHash | KV cache optimization | 40-60% TTFT reduction | Requires KV cache backends | vLLM/SGLang with shared prompts |
Troubleshooting¶
Uneven Load Distribution¶
```bash
# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'

# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'

# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'
```
High Latency with LeastLatency¶
- Check if warm-up period has completed
- Verify latency measurements are accurate
- Consider increasing health check frequency
- Check for network issues
ConsistentHash Imbalance¶
- Review model distribution across backends
- Consider adding more backends
- Use weight adjustments to compensate
- Monitor cache hit rates
See Also¶
- Configuration Guide - Full configuration options
- Metrics Guide - Monitor load balancing effectiveness
- Performance Guide - Optimize routing performance