Load Balancing Guide¶
Continuum Router distributes requests across multiple backends using one of six selection strategies, each optimized for a different traffic pattern.
Table of Contents¶
- Available Strategies
- Configuration Examples
- Dynamic Strategy Switching
- Monitoring Load Distribution
- Advanced Configuration
- Best Practices
- Strategy Selection Guide
Available Strategies¶
1. Round-Robin (Default)¶
- Description: Distributes requests evenly across all healthy backends in sequential order
- Use Case: General-purpose load distribution when all backends have similar capabilities
- Behavior: Each backend gets exactly one request before the cycle repeats
- Weight Support: No (weights are ignored)
- State Management: Minimal (only tracks current backend index)
Pros:
- Simple and predictable
- Fair distribution
- Low overhead
- No configuration required
Cons:
- Ignores backend performance differences
- Doesn't account for request complexity
- May overload slower backends
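Conceptually, round-robin needs nothing more than a counter over the healthy backend list. The snippet below is an illustrative sketch, not the router's actual implementation:

```python
# Illustrative sketch only -- not Continuum Router's implementation.
class RoundRobin:
    def __init__(self, backends):
        self.backends = backends  # assumed to already be filtered to healthy backends
        self.index = 0            # the only state: position of the next backend

    def select(self):
        backend = self.backends[self.index % len(self.backends)]
        self.index += 1
        return backend
```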
2. Weighted Round-Robin¶
selection_strategy: WeightedRoundRobin
backends:
  - name: high-performance
    url: http://gpu-server:8000
    weight: 3  # Gets 3x more traffic
  - name: standard
    url: http://cpu-server:8000
    weight: 1  # Base traffic level
- Description: Distributes requests proportionally based on backend weights
- Use Case: When backends have different performance characteristics or capacity
- Behavior: Backends with higher weights receive proportionally more requests
- Weight Support: Yes (required)
- Weight Range: 1-100 (recommended)
Pros:
- Respects backend capabilities
- Flexible traffic distribution
- Easy to adjust load ratios
Cons:
- Requires manual weight tuning
- Static weights don't adapt to real-time conditions
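A rough mental model for the weights (an illustrative sketch, not necessarily how the router implements it): expand each backend by its weight and cycle through the result, so weights 3 and 1 yield roughly a 75%/25% split. Production implementations usually interleave picks more smoothly, but the proportions are the same.

```python
# Naive sketch of weighted round-robin: repeat each backend `weight` times and cycle.
import itertools

def weighted_cycle(backends):
    # backends: list of (name, weight) pairs
    expanded = [name for name, weight in backends for _ in range(weight)]
    return itertools.cycle(expanded)

picker = weighted_cycle([("high-performance", 3), ("standard", 1)])
print([next(picker) for _ in range(8)])
# ['high-performance', 'high-performance', 'high-performance', 'standard', ...] -> 3:1 ratio
```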
3. Least-Latency¶
- Description: Routes requests to the backend with the lowest average response time
- Use Case: Optimizing for response speed in production environments
- Behavior: Continuously tracks response times and adapts routing accordingly
- Note: Falls back to round-robin until sufficient latency data is collected (typically 10 requests per backend)
Pros:
- Automatically optimizes for performance
- Adapts to real-time conditions
- No manual tuning required
Cons:
- Needs warm-up period
- May concentrate load on fastest backend
- Sensitive to temporary latency spikes
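The idea can be sketched in a few lines (illustrative only; the router's actual bookkeeping and averaging window may differ): record latencies per backend, use round-robin during warm-up, then pick the lowest average.

```python
# Illustrative sketch of least-latency selection with a round-robin warm-up phase.
from collections import defaultdict

WARMUP_SAMPLES = 10  # mirrors the "typically 10 requests per backend" note above

class LeastLatency:
    def __init__(self, backends):
        self.backends = backends
        self.samples = defaultdict(list)  # backend -> recorded latencies (ms)
        self.rr_index = 0

    def record(self, backend, latency_ms):
        self.samples[backend].append(latency_ms)

    def select(self):
        if any(len(self.samples[b]) < WARMUP_SAMPLES for b in self.backends):
            backend = self.backends[self.rr_index % len(self.backends)]  # warm-up: round-robin
            self.rr_index += 1
            return backend
        # After warm-up: lowest average response time wins
        return min(self.backends, key=lambda b: sum(self.samples[b]) / len(self.samples[b]))
```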
4. Random¶
- Description: Randomly selects a healthy backend for each request
- Use Case: Simple load distribution without state tracking
- Behavior: Each request has equal probability of going to any healthy backend
- Advantages: No state management overhead, good for stateless workloads
Pros:
- No state management overhead
- Simple implementation
- Good for stateless workloads
- Natural load distribution
Cons:
- Less predictable
- May have uneven distribution in the short term
- No performance optimization
5. Consistent-Hash¶
- Description: Uses consistent hashing to ensure the same model requests go to the same backend
- Use Case: When you need session affinity or want to maximize cache efficiency
- Behavior: Hashes the model name to consistently select the same backend
- Benefits: Improves model caching, reduces model loading overhead
Pros:
- Maximizes cache efficiency
- Reduces model loading overhead
- Predictable routing
- Good for stateful models
Cons:
- May cause load imbalance
- Less flexible for dynamic scaling
- Model-specific routing only
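The core mechanism can be sketched as hashing the model name onto a ring of backend positions (the virtual-node count below is arbitrary; this is an illustration, not the router's code):

```python
# Illustrative sketch: consistent hashing of the model name onto a ring of backends.
import hashlib
from bisect import bisect

def ring_position(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

def build_ring(backends, virtual_nodes=100):
    # Each backend gets several positions ("virtual nodes") for a more even spread.
    return sorted((ring_position(f"{b}#{i}"), b) for b in backends for i in range(virtual_nodes))

def select(ring, model: str):
    positions = [pos for pos, _ in ring]
    idx = bisect(positions, ring_position(model)) % len(ring)  # first node clockwise of the hash
    return ring[idx][1]

ring = build_ring(["backend-1", "backend-2", "backend-3"])
print(select(ring, "gpt-5.4"))  # the same model always maps to the same backend
```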
6. Prefix-Aware Hash (KV Cache Optimized)¶
selection_strategy: PrefixAwareHash
prefix_routing:
  enabled: true
  max_prefix_length: 1024
  load_factor_epsilon: 0.25
  virtual_nodes: 150
  anthropic_cache_control_injection: true
- Description: Routes requests sharing the same prompt prefix to the same backend, maximizing KV cache reuse on inference engines (vLLM, SGLang, TensorRT-LLM). Uses Consistent Hashing with Bounded Loads (CHWBL) to prevent hotspots.
- Use Case: Multi-backend vLLM/SGLang deployments where KV cache reuse is critical for latency
- Behavior: Extracts a prefix key (SHA256) from the system prompt or first user message, then uses the hash to consistently route to the same backend. CHWBL caps per-backend load at ceil(avg_load * (1 + epsilon)), overflowing to the next ring node when a backend is overloaded.
- Fallback: Falls back to model-based ConsistentHash when no prefix key is available (e.g., non-chat requests).
Pros:
- 40-60% TTFT reduction with KV cache hits
- 40%+ throughput improvement for shared-prefix workloads
- Automatic hotspot prevention via CHWBL
- Composable with KV cache index scoring (Tier 4)
Cons:
- Requires backends with KV cache support (vLLM, SGLang)
- Less benefit for unique/random prompts
- Additional configuration required
Prefix Key Extraction¶
The router extracts a prefix key from chat completion requests:
- With system prompt: SHA256(model + "\0" + "S" + system_prompt[:max_prefix_length])
- Without system prompt: SHA256(model + "\0" + "M" + first_user_message[:max_prefix_length])
This ensures that requests with the same system prompt are routed to the same backend, while the model name prevents cross-model collisions.
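A sketch of that derivation (illustrative; the message handling assumes an OpenAI-style chat payload, and the real extraction logic may differ in detail):

```python
# Illustrative sketch of prefix-key extraction as described above.
import hashlib

MAX_PREFIX_LENGTH = 1024  # matches max_prefix_length in the config example

def prefix_key(model: str, messages: list) -> str | None:
    system = next((m["content"] for m in messages if m.get("role") == "system"), None)
    if system is not None:
        payload = model + "\0" + "S" + system[:MAX_PREFIX_LENGTH]
    else:
        user = next((m["content"] for m in messages if m.get("role") == "user"), None)
        if user is None:
            return None  # no prefix key -> fall back to model-based ConsistentHash
        payload = model + "\0" + "M" + user[:MAX_PREFIX_LENGTH]
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```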
CHWBL Load Balancing¶
When the preferred backend's in-flight request count exceeds the load cap, the router walks clockwise around the hash ring to find the next eligible backend.
With the default epsilon = 0.25, a backend can handle up to 25% more than the average load before overflow. Lower epsilon values provide stricter balance; higher values preserve more prefix affinity.
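The overflow rule can be sketched as follows (illustrative only; ring_order stands for the backends in clockwise order starting at the prefix key's hash position):

```python
# Illustrative sketch of Consistent Hashing with Bounded Loads (CHWBL).
import math

def load_cap(in_flight: dict, epsilon: float = 0.25) -> int:
    avg_load = sum(in_flight.values()) / len(in_flight)
    return math.ceil(avg_load * (1 + epsilon))  # cap = ceil(avg_load * (1 + epsilon))

def select_with_bounded_load(ring_order: list, in_flight: dict, epsilon: float = 0.25):
    cap = load_cap(in_flight, epsilon)
    for backend in ring_order:           # walk clockwise from the preferred backend
        if in_flight[backend] <= cap:    # eligible: not over the load cap
            return backend
    return ring_order[0]                 # everything overloaded: keep the preferred backend

print(select_with_bounded_load(["b1", "b2", "b3"], {"b1": 8, "b2": 2, "b3": 2}))
# avg = 4, cap = ceil(4 * 1.25) = 5 -> "b1" exceeds the cap, so the request overflows to "b2"
```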
KV Cache Index Integration (Tier 4)¶
When a KV cache index is available, the router can use real-time cache state data to make even more precise routing decisions. The KvOverlapScorer runs before the PrefixAwareHash strategy and selects a backend based on actual cached token overlap:
final_score = overlap_weight * overlap_score + load_weight * (1 - load_ratio) + health_weight * health_score
If the best score exceeds the minimum threshold (default: 0.3), the scorer selects that backend directly. Otherwise, the PrefixAwareHash strategy takes over. See KV Cache Architecture for details.
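A sketch of how that scoring could be applied (the 0.3 threshold is the documented default; the individual weight values below are placeholders, not documented defaults):

```python
# Illustrative sketch of Tier 4 scoring; the weight values are placeholder assumptions.
def final_score(overlap_score, load_ratio, health_score,
                overlap_weight=0.5, load_weight=0.3, health_weight=0.2):
    return (overlap_weight * overlap_score
            + load_weight * (1 - load_ratio)
            + health_weight * health_score)

MIN_SCORE = 0.3  # default minimum threshold

def pick_backend(candidates):
    # candidates: {backend_name: (overlap_score, load_ratio, health_score)}
    best = max(candidates, key=lambda b: final_score(*candidates[b]))
    if final_score(*candidates[best]) >= MIN_SCORE:
        return best
    return None  # below threshold: the PrefixAwareHash strategy takes over
```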
Configuration Examples¶
High-Performance Setup¶
Optimized for lowest latency:
# Automatically routes to fastest backend
selection_strategy: LeastLatency
backends:
  - name: local-gpu
    url: http://localhost:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s  # Frequent checks for accurate latency data
  - name: remote-gpu
    url: http://gpu-cluster:8000
    models: ["llama3", "mistral"]
    health_check:
      interval: 10s
Weighted Distribution¶
Distribute load based on server capacity:
selection_strategy: WeightedRoundRobin
backends:
  - name: powerful-server
    url: http://high-end:8000
    weight: 5  # Handles 5x more traffic
    models: ["gpt-5.4", "claude-opus-4-6"]
  - name: medium-server
    url: http://medium:8000
    weight: 2  # Handles 2x base traffic
    models: ["gpt-5.4-mini", "claude-sonnet-4-6"]
  - name: basic-server
    url: http://basic:8000
    weight: 1  # Base traffic level
    models: ["gpt-5.4-nano"]
Cache-Optimized Setup¶
Maximize model cache hits:
selection_strategy: ConsistentHash
# Models always go to same backend for cache efficiency
backends:
  - name: backend-1
    url: http://server1:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-2
    url: http://server2:8000
    models: ["gpt-5.4", "gpt-5.4-mini"]
  - name: backend-3
    url: http://server3:8000
    models: ["claude-opus-4-6", "claude-sonnet-4-6"]
Mixed Strategy with Fallback¶
routing:
  strategy: LeastLatency
  fallback_strategy: RoundRobin  # Used when latency data insufficient
  # Override for specific models
  model_overrides:
    "gpt-4": ConsistentHash        # Always use same backend for GPT-4
    "llama3": WeightedRoundRobin   # Distribute based on weights
backends:
  - name: primary
    url: http://primary:8000
    weight: 3
    priority: 1  # Preferred backend
  - name: secondary
    url: http://secondary:8000
    weight: 1
    priority: 2  # Fallback backend
Dynamic Strategy Switching¶
Via Environment Variable¶
# Change strategy at startup
export CONTINUUM_SELECTION_STRATEGY=LeastLatency
continuum-router --config config.yaml
Via Configuration Hot-Reload¶
# Update config.yaml
sed -i 's/selection_strategy: .*/selection_strategy: WeightedRoundRobin/' config.yaml
# Router automatically reloads configuration
Checking Current Strategy¶
# View current configuration including load balancing strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'
# View full configuration details
curl http://localhost:8080/admin/config
Monitoring Load Distribution¶
Backend Statistics¶
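The statistics below are returned by the /admin/backends endpoint (the same endpoint used in the testing and troubleshooting sections later in this guide):

```bash
curl http://localhost:8080/admin/backends
```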
Response:
{
  "backends": [
    {
      "name": "backend-1",
      "url": "http://server1:8000",
      "status": "healthy",
      "total_requests": 1523,
      "successful_requests": 1520,
      "failed_requests": 3,
      "average_latency_ms": 245,
      "p95_latency_ms": 450,
      "p99_latency_ms": 890,
      "weight": 2,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:45Z"
    },
    {
      "name": "backend-2",
      "url": "http://server2:8000",
      "status": "healthy",
      "total_requests": 761,
      "successful_requests": 760,
      "failed_requests": 1,
      "average_latency_ms": 312,
      "p95_latency_ms": 520,
      "p99_latency_ms": 950,
      "weight": 1,
      "models_served": ["gpt-4", "gpt-3.5-turbo"],
      "last_selected": "2024-01-15T10:30:44Z"
    }
  ],
  "strategy": "WeightedRoundRobin",
  "total_requests": 2284,
  "distribution_ratio": {
    "backend-1": 0.667,
    "backend-2": 0.333
  }
}
Prometheus Metrics¶
# Request distribution by backend
sum(rate(routing_decisions_total[5m])) by (selected_backend)
# Backend selection latency
histogram_quantile(0.95, sum by (le) (rate(routing_backend_selection_duration_seconds_bucket[5m])))
# Load balancing effectiveness
stddev(sum by (backend_id) (rate(backend_request_total[5m])))
Advanced Configuration¶
Health-Aware Load Balancing¶
health_checks:
  enabled: true
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2
# Adjust weight based on health
dynamic_weight_adjustment:
  enabled: true
  degraded_weight_factor: 0.5  # Reduce weight by 50% when degraded
selection_strategy: WeightedRoundRobin
backends:
  - name: primary
    url: http://primary:8000
    weight: 100
    health_score_threshold:
      healthy: 0.9    # >90% success rate
      degraded: 0.7   # 70-90% success rate
      unhealthy: 0.0  # <70% success rate
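One way to read those thresholds together with degraded_weight_factor (an illustrative sketch; the exact adjustment logic is internal to the router):

```python
# Illustrative sketch: derive an effective weight from the configured health bands.
def effective_weight(base_weight: int, success_rate: float,
                     degraded_weight_factor: float = 0.5) -> float:
    if success_rate > 0.9:      # healthy: > 90% success rate
        return base_weight
    if success_rate >= 0.7:     # degraded: 70-90% success rate
        return base_weight * degraded_weight_factor
    return 0.0                  # unhealthy: effectively removed from rotation

print(effective_weight(100, 0.85))  # 50.0 -> a degraded backend receives half its traffic
```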
Request-Aware Routing¶
routing:
  strategy: Custom
  # Route based on request characteristics
  rules:
    - condition:
        model: "gpt-4"
        max_tokens: { greater_than: 2000 }
      strategy: ConsistentHash  # Long requests to same backend
    - condition:
        model: "gpt-3.5-turbo"
        stream: true
      strategy: LeastLatency  # Streaming to fastest backend
    - condition:
        default: true
      strategy: WeightedRoundRobin
Geographic Load Balancing¶
routing:
  strategy: Geographic
backends:
  - name: us-west
    url: http://us-west.example.com:8000
    region: us-west
    weight: 1
  - name: us-east
    url: http://us-east.example.com:8000
    region: us-east
    weight: 1
  - name: eu-central
    url: http://eu-central.example.com:8000
    region: eu-central
    weight: 1
geographic_routing:
  detect_client_region: true
  fallback_to_nearest: true
  latency_based_selection: true
Best Practices¶
1. Start Simple¶
- Begin with Round-Robin for initial deployments
- Monitor performance metrics
- Switch strategies based on observed patterns
2. Monitor and Adjust¶
- Use /admin/backends to track backend performance
- Watch for load imbalances
- Adjust weights incrementally (±10% at a time)
3. Consider Your Workload¶
- Uniform requests: Round-Robin or Random
- Variable capacity: WeightedRoundRobin
- Performance critical: LeastLatency
- Cache-heavy: ConsistentHash
4. Health Check Configuration¶
- Enable health checks for automatic failover
- Set appropriate thresholds based on SLAs
- Use shorter intervals for critical backends
5. Testing Strategies¶
# Test load distribution
for i in {1..100}; do
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}'
done
# Check distribution
curl http://localhost:8080/admin/backends | jq '.distribution_ratio'
6. Gradual Migration¶
When changing strategies:
1. Test in staging environment
2. Monitor for 24 hours
3. Gradually roll out to production
4. Keep previous configuration for rollback
Strategy Selection Guide¶
| Strategy | Best For | Pros | Cons | When to Use |
|---|---|---|---|---|
| RoundRobin | Equal backends | Simple, fair distribution | Ignores backend capacity | Default choice, homogeneous backends |
| WeightedRoundRobin | Mixed capacity backends | Respects backend capabilities | Requires weight tuning | Known performance differences |
| LeastLatency | Performance optimization | Adapts to real conditions | Needs warm-up period | Production environments, SLA critical |
| Random | Stateless workloads | No state overhead | Less predictable | Simple deployments, testing |
| ConsistentHash | Cache optimization | Maximizes cache hits | Can cause imbalance | Model-heavy workloads, stateful services |
| PrefixAwareHash | KV cache optimization | 40-60% TTFT reduction | Requires KV cache backends | vLLM/SGLang with shared prompts |
Troubleshooting¶
Uneven Load Distribution¶
# Check strategy
curl http://localhost:8080/admin/config | jq '.selection_strategy'
# Verify backend weights
curl http://localhost:8080/admin/backends | jq '.backends[].weight'
# Check health status
curl http://localhost:8080/admin/backends | jq '.backends[].status'
High Latency with LeastLatency¶
- Check if warm-up period has completed
- Verify latency measurements are accurate
- Consider increasing health check frequency
- Check for network issues
ConsistentHash Imbalance¶
- Review model distribution across backends
- Consider adding more backends
- Use weight adjustments to compensate
- Monitor cache hit rates
See Also¶
- Configuration Guide - Full configuration options
- Metrics Guide - Monitor load balancing effectiveness
- Performance Guide - Optimize routing performance