Load Balancing Guide¶
Continuum Router distributes requests across multiple backends using one of six selection strategies, each optimized for a different traffic pattern.
Table of Contents¶
- Available Strategies
- Configuration Examples
- Dynamic Strategy Switching
- Monitoring Load Distribution
- Advanced Configuration
- Best Practices
- Strategy Selection Guide
Available Strategies¶
1. Round-Robin (Default)¶
- Description: Distributes requests evenly across all healthy backends in sequential order
- Use Case: General-purpose load distribution when all backends have similar capabilities
- Behavior: Each backend gets exactly one request before the cycle repeats
- Weight Support: No (weights are ignored)
- State Management: Minimal (only tracks current backend index)
Pros:
- Simple and predictable
- Fair distribution
- Low overhead
- No configuration required
Cons:
- Ignores backend performance differences
- Doesn't account for request complexity
- May overload slower backends
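Conceptually, round-robin needs nothing more than a counter over the healthy backend list. The snippet below is an illustrative sketch, not the router's actual implementation:

```python
# Illustrative sketch only -- not Continuum Router's implementation.
class RoundRobin:
    def __init__(self, backends):
        self.backends = backends  # assumed to already be filtered to healthy backends
        self.index = 0            # the only state: position of the next backend

    def select(self):
        backend = self.backends[self.index % len(self.backends)]
        self.index += 1
        return backend
```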
2. Weighted Round-Robin¶
selection_strategy: WeightedRoundRobin
backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # Gets 3x more traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
- Description: Distributes requests proportionally based on backend weights
- Use Case: When backends have different performance characteristics or capacity
- Behavior: Backends with higher weights receive proportionally more requests
- Weight Support: Yes (required)
- Weight Range: 1-100 (recommended)
Pros:
- Respects backend capabilities
- Flexible traffic distribution
- Easy to adjust load ratios
Cons:
- Requires manual weight tuning
- Static weights don't adapt to real-time conditions
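A rough mental model for the weights (an illustrative sketch, not necessarily how the router implements it): expand each backend by its weight and cycle through the result, so weights 3 and 1 yield roughly a 75%/25% split. Production implementations usually interleave picks more smoothly, but the proportions are the same.

```python
# Naive sketch of weighted round-robin: repeat each backend `weight` times and cycle.
import itertools

def weighted_cycle(backends):
    # backends: list of (name, weight) pairs
    expanded = [name for name, weight in backends for _ in range(weight)]
    return itertools.cycle(expanded)

picker = weighted_cycle([("high-performance", 3), ("standard", 1)])
print([next(picker) for _ in range(8)])
# ['high-performance', 'high-performance', 'high-performance', 'standard', ...] -> 3:1 ratio
```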
3. Least-Latency¶
- Description: Routes requests to the backend with the lowest average response time
- Use Case: Optimizing for response speed in production environments
- Behavior: Continuously tracks response times and adapts routing accordingly
- Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)
Pros:
- Automatically optimizes for performance
- Adapts to real-time conditions
- No manual tuning required
Cons:
- Needs warm-up period
- May concentrate load on fastest backend
- Sensitive to temporary latency spikes
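The idea can be sketched in a few lines (illustrative only; the router's actual bookkeeping and averaging window may differ): record latencies per backend, use round-robin during warm-up, then pick the lowest average.

```python
# Illustrative sketch of least-latency selection with a round-robin warm-up phase.
from collections import defaultdict

WARMUP_SAMPLES = 10  # mirrors the "typically 10 requests per backend" note above

class LeastLatency:
    def __init__(self, backends):
        self.backends = backends
        self.samples = defaultdict(list)  # backend -> recorded latencies (ms)
        self.rr_index = 0

    def record(self, backend, latency_ms):
        self.samples[backend].append(latency_ms)

    def select(self):
        if any(len(self.samples[b]) < WARMUP_SAMPLES for b in self.backends):
            backend = self.backends[self.rr_index % len(self.backends)]  # warm-up: round-robin
            self.rr_index += 1
            return backend
        # After warm-up: lowest average response time wins
        return min(self.backends, key=lambda b: sum(self.samples[b]) / len(self.samples[b]))
```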
4. Random¶
- Description: Randomly selects a healthy backend for each request
- Use Case: Simple load distribution without state tracking
- Behavior: Each request has equal probability of going to any healthy backend
- Advantages: No state management overhead, good for stateless workloads
Pros:
- No state management overhead
- Simple implementation
- Good for stateless workloads
- Natural load distribution
Cons:
- Less predictable
- May have uneven distribution in the short term
- No performance optimization
5. Consistent-Hash¶
- Description: Uses consistent hashing to ensure the same model requests go to the same backend
- Use Case: When you need session affinity or want to maximize cache efficiency
- Behavior: Hashes the model name to consistently select the same backend
- Benefits: Improves model caching, reduces model loading overhead
Pros:
- Maximizes cache efficiency
- Reduces model loading overhead
- Predictable routing
- Good for stateful models
Cons:
- May cause load imbalance
- Less flexible for dynamic scaling
- Model-specific routing only
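The core mechanism can be sketched as hashing the model name onto a ring of backend positions (the virtual-node count below is arbitrary; this is an illustration, not the router's code):

```python
# Illustrative sketch: consistent hashing of the model name onto a ring of backends.
import hashlib
from bisect import bisect

def ring_position(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

def build_ring(backends, virtual_nodes=100):
    # Each backend gets several positions ("virtual nodes") for a more even spread.
    return sorted((ring_position(f"{b}#{i}"), b) for b in backends for i in range(virtual_nodes))

def select(ring, model: str):
    positions = [pos for pos, _ in ring]
    idx = bisect(positions, ring_position(model)) % len(ring)  # first node clockwise of the hash
    return ring[idx][1]

ring = build_ring(["backend-1", "backend-2", "backend-3"])
print(select(ring, "gpt-5.4"))  # the same model always maps to the same backend
```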
6. Prefix-Aware Hash (KV Cache Optimized)¶
selection_strategy: PrefixAwareHash
prefix_routing:
  enabled: true
  max_prefix_length: 1024
  load_factor_epsilon: 0.25
  virtual_nodes: 150
  anthropic_cache_control_injection: true
- Description: Routes requests sharing the same prompt prefix to the same backend, maximizing KV cache reuse on inference engines (vLLM, SGLang, TensorRT-LLM). Uses Consistent Hashing with Bounded Loads (CHWBL) to prevent hotspots.
- Use Case: Multi-backend vLLM/SGLang deployments where KV cache reuse is critical for latency
- Behavior: Extracts a prefix key (SHA256) from the system prompt or first user message, then uses the hash to consistently route to the same backend. CHWBL caps per-backend load at ceil(avg_load * (1 + epsilon)), overflowing to the next ring node when a backend is overloaded.
- Fallback: Falls back to model-based ConsistentHash when no prefix key is available (e.g., non-chat requests).
Pros:
- 40-60% TTFT reduction with KV cache hits
- 40%+ throughput improvement for shared-prefix workloads
- Automatic hotspot prevention via CHWBL
- Composable with KV cache index scoring (Tier 4)
Cons:
- Requires backends with KV cache support (vLLM, SGLang)
- Less benefit for unique/random prompts
- Additional configuration required
Prefix Key Extraction¶
The router extracts a prefix key from chat completion requests:
- With system prompt: SHA256(model + "\0" + "S" + system_prompt[:max_prefix_length])
- Without system prompt: SHA256(model + "\0" + "M" + first_user_message[:max_prefix_length])
This ensures that requests with the same system prompt are routed to the same backend, while the model name prevents cross-model collisions.
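A sketch of that derivation (illustrative; the message handling assumes an OpenAI-style chat payload, and the real extraction logic may differ in detail):

```python
# Illustrative sketch of prefix-key extraction as described above.
import hashlib

MAX_PREFIX_LENGTH = 1024  # matches max_prefix_length in the config example

def prefix_key(model: str, messages: list) -> str | None:
    system = next((m["content"] for m in messages if m.get("role") == "system"), None)
    if system is not None:
        payload = model + "\0" + "S" + system[:MAX_PREFIX_LENGTH]
    else:
        user = next((m["content"] for m in messages if m.get("role") == "user"), None)
        if user is None:
            return None  # no prefix key -> fall back to model-based ConsistentHash
        payload = model + "\0" + "M" + user[:MAX_PREFIX_LENGTH]
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```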
CHWBL Load Balancing¶
When the preferred backend's in-flight request count exceeds the load cap, the router walks clockwise around the hash ring to find the next eligible backend.
With the default epsilon = 0.25, a backend can handle up to 25% more than the average load before overflow. Lower epsilon values provide stricter balance; higher values preserve more prefix affinity.
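The overflow rule can be sketched as follows (illustrative only; ring_order stands for the backends in clockwise order starting at the prefix key's hash position):

```python
# Illustrative sketch of Consistent Hashing with Bounded Loads (CHWBL).
import math

def load_cap(in_flight: dict, epsilon: float = 0.25) -> int:
    avg_load = sum(in_flight.values()) / len(in_flight)
    return math.ceil(avg_load * (1 + epsilon))  # cap = ceil(avg_load * (1 + epsilon))

def select_with_bounded_load(ring_order: list, in_flight: dict, epsilon: float = 0.25):
    cap = load_cap(in_flight, epsilon)
    for backend in ring_order:           # walk clockwise from the preferred backend
        if in_flight[backend] <= cap:    # eligible: not over the load cap
            return backend
    return ring_order[0]                 # everything overloaded: keep the preferred backend

print(select_with_bounded_load(["b1", "b2", "b3"], {"b1": 8, "b2": 2, "b3": 2}))
# avg = 4, cap = ceil(4 * 1.25) = 5 -> "b1" exceeds the cap, so the request overflows to "b2"
```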
KV Cache Index Integration (Tier 4)¶
When a KV cache index is available, the router can use real-time cache state data to make even more precise routing decisions. The KvOverlapScorer runs before the PrefixAwareHash strategy and selects a backend based on actual cached token overlap:
final_score = overlap_weight * overlap_score + load_weight * (1 - load_ratio) + health_weight * health_score
If the best score exceeds the minimum threshold (default: 0.3), the scorer selects that backend directly. Otherwise, the PrefixAwareHash strategy takes over. See KV Cache Architecture for details.
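A sketch of how that scoring could be applied (the 0.3 threshold is the documented default; the individual weight values below are placeholders, not documented defaults):

```python
# Illustrative sketch of Tier 4 scoring; the weight values are placeholder assumptions.
def final_score(overlap_score, load_ratio, health_score,
                overlap_weight=0.5, load_weight=0.3, health_weight=0.2):
    return (overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score)

MIN_SCORE = 0.3  # default minimum threshold

def pick_backend(candidates):
    # candidates: {backend_name: (overlap_score, load_ratio, health_score)}
    best = max(candidates, key=lambda b: final_score(*candidates[b]))
    if final_score(*candidates[best]) >= MIN_SCORE:
        return best
    return None  # below threshold: the PrefixAwareHash strategy takes over
```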
Configuration Examples¶
High-Performance Setup¶
Optimized for lowest latency:
# Automatically routes to fastest backend
selection_strategy: LeastLatency
backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s
Weighted Distribution¶
Distribute load based on server capacity:
selection_strategy: WeightedRoundRobin
backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # Handles 5x more traffic
    models: ["gpt-5.4", "claude-opus-4-6"]
  - name: medium-server
    url: http://medium:8000
    weight: 2  # Handles 2x base traffic
    models: ["gpt-5.4-mini", "claude-sonnet-4-6"]
  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-5.4-nano"]
Cache-Optimized Setup¶
Maximize model cache hits:
selection_strategy: ConsistentHash
# Models always go to same backend for cache efficiency
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-2
    url: http://server2:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus-4-6", "claude-sonnet-4-6"]
Mixed Strategy with Fallback¶
routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient
  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash        # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin   # Distribute based on weights
backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend
  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
Dynamic Strategy Switching¶
Via Environment Variable¶
# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml
Via Configuration Hot-Reload¶
# Update config.yaml
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml
# Router automatically reloads configuration
Checking Current Strategy¶
# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'
# View full configuration details
curl http://localhost:8080/admin/config
Monitoring Load Distribution¶
Backend Statistics¶
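The statistics below are returned by the /admin/backends endpoint (the same endpoint used in the testing and troubleshooting sections later in this guide):

```bash
curl http://localhost:8080/admin/backends
```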
Response:
{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
Prometheus Metrics¶
# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)
# Backend selection latency
histogram_quantile(0.95, sum by (le) (rate(routing_backend_selection_duration_seconds_bucket[5m])))
# Load balancing effectiveness
stddev(sum by (backend_id) (rate(backend_request_total[5m])))
Advanced Configuration¶
Health-Aware Load Balancing¶
health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
# Adjust weight based on health
dynamic_weight_adjustment:
  enabled: true
  degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded
selection_strategy: WeightedRoundRobin
backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9    # >90% success rate
      degraded: 0.7   # 70-90% success rate
      unhealthy: 0.0  # <70% success rate
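One way to read those thresholds together with degraded_weight_factor (an illustrative sketch; the exact adjustment logic is internal to the router):

```python
# Illustrative sketch: derive an effective weight from the configured health bands.
def effective_weight(base_weight: int, success_rate: float,
                     degraded_weight_factor: float = 0.5) -> float:
    if success_rate > 0.9:      # healthy: > 90% success rate
        return base_weight
    if success_rate >= 0.7:     # degraded: 70-90% success rate
        return base_weight * degraded_weight_factor
    return 0.0                  # unhealthy: effectively removed from rotation

print(effective_weight(100, 0.85))  # 50.0 -> a degraded backend receives half its traffic
```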
Request-Aware Routing¶
routing:
  strategy: Custom
  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend
    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend
    - condition:
        default: true
      strategy: WeightedRoundRobin
Geographic Load Balancing¶
routing:
  strategy: Geographic
backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1
  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1
  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1
geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
Best Practices¶
1. Start Simple¶
- Begin with Round-Robin for initial deployments
- Monitor performance metrics
- Switch strategies based on observed patterns
2. Monitor and Adjust¶
- Use /admin/backends to track backend performance
- Watch for load imbalances
- Adjust weights incrementally (±10% at a time)
3. Consider Your Workload¶
- Uniform requests: Round-Robin or Random
- Variable capacity: WeightedRoundRobin
- Performance critical: LeastLatency
- Cache-heavy: ConsistentHash
4. Health Check Configuration¶
- Enable health checks for automatic failover
- Set appropriate thresholds based on SLAs
- Use shorter intervals for critical backends
5. Testing Strategies¶
# Test load distribution
for i in {1..100}; do
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done
# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'
6. Gradual Migration¶
When changing strategies:
1. Test in staging environment
2. Monitor for 24 hours
3. Gradually roll out to production
4. Keep previous configuration for rollback
Strategy Selection Guide¶
| Strategy | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| RoundRobin | Equal backends | Simple, fair distribution | Ignores backend capacity | Default choice, homogeneous backends |
| WeightedRoundRobin | Mixed capacity backends | Respects backend capabilities | Requires weight tuning | Known performance differences |
| LeastLatency | Performance optimization | Adapts to real conditions | Needs warm-up period | Production environments, SLA critical |
| Random | Stateless workloads | No state overhead | Less predictable | Simple deployments, testing |
| ConsistentHash | Cache optimization | Maximizes cache hits | Can cause imbalance | Model-heavy workloads, stateful services |
| PrefixAwareHash | KV cache optimization | 40-60% TTFT reduction | Requires KV cache backends | vLLM/SGLang with shared prompts |
Troubleshooting¶
Uneven Load Distribution¶
# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'
# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'
# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'
High Latency with LeastLatency¶
- Check if warm-up period has completed
- Verify latency measurements are accurate
- Consider increasing health check frequency
- Check for network issues
ConsistentHash Imbalance¶
- Review model distribution across backends
- Consider adding more backends
- Use weight adjustments to compensate
- Monitor cache hit rates
See Also¶
- Configuration Guide - Full configuration options
- Metrics Guide - Monitor load balancing effectiveness
- Performance Guide - Optimize routing performance