Performance Guide¶
This guide covers performance characteristics, optimization strategies, benchmarking, and tuning for the Continuum Router.
Table of Contents¶
- Performance Characteristics
- Benchmarks
- Performance Tuning
- Optimization Strategies
- Caching Strategy
- Connection Pooling
- TTFB Optimization
- Load Testing
- Monitoring Performance
- Troubleshooting Performance Issues
- Best Practices
- Future Improvements
Performance Characteristics¶
Target Metrics¶
| Metric | Target | Notes |
|---|---|---|
| Routing Latency | < 10ms | Overhead added by router |
| Throughput | 1000+ req/s | Per instance |
| Memory Usage | < 100MB | Without caching |
| CPU Usage | < 20% | Under moderate load |
| Connection Pool | 500 connections | Per backend |
| Concurrent Requests | 1000+ | With proper tuning |
| Cache Hit Rate | > 80% | Model list caching |
| Startup Time | < 5s | Cold start |
| Health Check Latency | < 100ms | Per backend |
Scalability Limits¶
- Backends: Multiple simultaneous backends are supported and tested
- Models: Hundreds of models can be aggregated and cached
- Request Size: Configurable maximum request size
- Response Streaming: No buffering, minimal memory overhead
- Concurrent Connections: Thousands of concurrent connections supported with tuning
Resource Requirements¶
Minimum (Development)¶
- CPU: 1 core
- Memory: 256MB
- Network: 100 Mbps
- Disk: 10GB
Recommended (Production)¶
- CPU: 4 cores
- Memory: 2GB
- Network: 1 Gbps
- Disk: 50GB SSD
High-Performance¶
- CPU: 16+ cores
- Memory: 8GB
- Network: 10 Gbps
- Disk: 100GB NVMe SSD
Benchmarks¶
Running Benchmarks¶
The project includes performance benchmarks in the benches/ directory:
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench performance
# Generate benchmark report
cargo bench -- --save-baseline main
Example Throughput Test¶
Using wrk for throughput testing:
# Install wrk
brew install wrk # macOS
apt-get install wrk # Ubuntu
# Basic throughput test
wrk -t12 -c400 -d30s --latency \
http://localhost:8080/v1/chat/completions \
-s scripts/chat_completion.lua
# Note: Create a Lua script for POST requests with proper payloads
Example Lua script (scripts/chat_completion.lua):
wrk.method = "POST"
wrk.body = '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}'
wrk.headers["Content-Type"] = "application/json"
Example Latency Test¶
Using hey for latency testing:
# Install hey
go install github.com/rakyll/hey@latest
# Latency test
hey -z 30s -c 100 -m POST \
-H "Content-Type: application/json" \
-d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
http://localhost:8080/v1/chat/completions
Memory Usage Monitoring¶
# Monitor memory usage during operation
while true; do
ps aux | grep continuum-router | grep -v grep | awk '{print $6/1024 " MB"}'
sleep 5
done
Streaming Performance Test¶
# Test SSE streaming performance
curl -N -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}' | pv > /dev/null
Performance Tuning¶
Operating System Tuning¶
Linux Kernel Parameters¶
# /etc/sysctl.conf
# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# File descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152
# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# Apply settings
sudo sysctl -p
User Limits¶
# /etc/security/limits.conf
continuum soft nofile 65535
continuum hard nofile 65535
continuum soft nproc 32768
continuum hard nproc 32768
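A quick way to confirm the settings took effect (the continuum user is the service account from the example above and needs a login shell for the sudo check):
# Kernel parameters
sysctl net.core.somaxconn fs.file-max
# File descriptor limit in a fresh session for the service user
sudo -u continuum bash -c 'ulimit -n'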
Application Tuning¶
High-Throughput Configuration¶
# config-high-throughput.yaml
server:
bind_address: "0.0.0.0:8080"
connection_pool_size: 1000
keepalive_timeout: 75
request:
timeout: 30
max_retries: 1
cache:
model_cache_ttl: 900 # 15 minutes
http_client:
pool_idle_timeout: 90
pool_max_idle_per_host: 100
Low-Latency Configuration¶
# config-low-latency.yaml
server:
bind_address: "0.0.0.0:8080"
routing:
strategy: "LeastLatency" # Available: RoundRobin, WeightedRoundRobin, LeastLatency, Random, ConsistentHash
fallback_strategy: "RoundRobin"
request:
timeout: 10
max_retries: 0 # No retries for lowest latency
cache:
model_cache_ttl: 3600 # 1 hour cache for consistency
health_checks:
interval: 10 # Frequent checks for accurate routing
Memory-Constrained Configuration¶
# config-low-memory.yaml
server:
connection_pool_size: 10
cache:
model_cache_ttl: 60 # Short TTL to reduce memory
request:
timeout: 30
logging:
level: "error" # Reduce log volume
Optimization Strategies¶
Request Optimization¶
- Request Deduplication
- Connection Pooling (client-side keep-alive can be observed with the curl sketch below)
- Timeout Configuration (request.timeout and max_retries in the tuning configurations above)
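As a rough, client-side check of connection reuse and per-request timing, the following curl sketch sends the same request twice over one connection and prints timings per transfer; the endpoint and model are the ones used throughout this guide. The second line should show a near-zero connect time when the keep-alive connection is reused (this observes client-to-router keep-alive; the router's own backend pools are covered under Connection Pooling below):
# Two requests over one client connection; -w prints timings after each transfer
curl -s -o /dev/null -o /dev/null \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://localhost:8080/v1/chat/completions \
  http://localhost:8080/v1/chat/completions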
Response Optimization¶
- Streaming
  - SSE streaming is automatically enabled for requests with "stream": true
  - No buffering ensures minimal memory overhead
  - Chunks are forwarded immediately
- Model Caching
Backend Optimization¶
- Load Balancing Strategies
  Available strategies:
  - RoundRobin: Simple round-robin selection
  - WeightedRoundRobin: Weight-based distribution
  - LeastLatency: Route to fastest backend
  - Random: Random backend selection
  - ConsistentHash: Hash-based routing for session affinity
Example configuration:
routing:
strategy: "LeastLatency"
fallback_strategy: "RoundRobin"
# For WeightedRoundRobin
weights:
"http://backend1:11434": 3
"http://backend2:11434": 1
- Health Checks (current backend health can be inspected via the metrics endpoint, as shown below)
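Backend health feeds routing decisions. When the Prometheus metrics endpoint is enabled (see Monitoring Performance below), current backend health can be read directly; the credentials here are the ones from the example metrics configuration and will differ in a real deployment:
# Backend health as currently seen by the router
curl -s -u metrics:secure_password http://localhost:8080/metrics | grep backend_health_status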
Caching Strategy¶
Model List Caching¶
The router caches aggregated model lists in memory; entries expire after cache.model_cache_ttl seconds (see the configuration examples above). A quick way to observe the cache in action is sketched below.
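A simple check, assuming the router exposes an aggregated OpenAI-compatible /v1/models endpoint (adjust the path if your deployment differs): the first call populates the cache from all backends, and a second call within model_cache_ttl should return noticeably faster:
# First call aggregates model lists from all backends (cache miss)
time curl -s http://localhost:8080/v1/models > /dev/null
# Second call within model_cache_ttl should be served from the cache
time curl -s http://localhost:8080/v1/models > /dev/null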
Cache Metrics¶
Monitor cache effectiveness through metrics:
- Cache hit rate
- Cache miss rate
- Cache eviction rate
- Average cache entry size
Connection Pooling¶
Configuration¶
server:
connection_pool_size: 500 # Maximum connections per backend
http_client:
pool_idle_timeout: 90 # seconds
pool_max_idle_per_host: 100
Best Practices¶
- Set pool size based on expected concurrent requests
- Monitor connection reuse rate (see the ss sketch below)
- Adjust idle timeout based on request patterns
- Use HTTP/2 when supported by backends
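One way to monitor reuse is to watch established connections from the router host to a backend; the port below comes from the earlier routing example. A stable count well below connection_pool_size under steady load suggests connections are being reused rather than churned:
# Count established connections to a backend (skip the ss header line)
ss -tn state established '( dport = :11434 )' | tail -n +2 | wc -l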
TTFB Optimization¶
Time To First Byte (TTFB) is critical for streaming LLM responses. The router implements several optimizations to minimize TTFB overhead.
Connection Pre-warming¶
The router pre-warms connections to all backends during startup:
// Automatic connection pre-warming on startup
// - Establishes HTTP/2 connections early
// - Reduces cold-start latency for first requests
// - Includes authentication headers for proper connection state
Pre-warming Behavior by Backend Type:
| Backend Type | Pre-warm Endpoint | Headers |
|---|---|---|
| OpenAI | GET /v1/models | Authorization: Bearer |
| Anthropic | POST /v1/messages (empty) | x-api-key, anthropic-version |
| Gemini | GET /v1/models | Authorization: Bearer |
Streaming Client Optimization¶
For streaming requests, the router uses an optimized HTTP client:
// HttpClientFactory::optimized_streaming()
// - HTTP/2 with aggressive keep-alive
// - Large connection pool (100 per host)
// - 600s timeout for extended thinking models
// - TCP keepalive enabled
TTFB Test Scripts¶
The project includes TTFB comparison scripts in tests/scripts/:
# Test individual backends
./tests/scripts/test_anthropic_ttfb.sh claude-haiku-4-5 5
./tests/scripts/test_openai_ttfb.sh gpt-4o-mini 5
./tests/scripts/test_gemini_ttfb.sh gemini-2.5-flash 5
# Test all backends
./tests/scripts/test_all_ttfb.sh 5
Example Output:
=== Anthropic TTFB Test ===
Model: claude-haiku-4-5
Requests per test: 5
Direct to Anthropic API:
Request 1: TTFB=1.106s
Request 2: TTFB=0.904s
Average: 1.005s
Through Router:
Request 1: TTFB=1.186s
Request 2: TTFB=0.980s
Average: 1.083s
Router overhead: 0.078s
TTFB Metrics¶
| Backend | Direct API | Through Router | Overhead |
|---|---|---|---|
| Anthropic (claude-haiku-4-5) | ~1.0s | ~1.1s | ~0.1s |
| OpenAI (gpt-4o-mini) | ~0.7s | ~0.3s* | -0.4s |
| Gemini (gemini-2.5-flash) | ~0.9s | ~0.8s | -0.1s |
*The router is often faster than the direct API due to connection pooling and HTTP/2 connection reuse
Reducing TTFB¶
- Enable Connection Pre-warming: Enabled by default on startup
- Use HTTP/2: All backends support HTTP/2, enabled by default
- Minimize TLS Handshakes: Connection pooling reuses TLS sessions
- Backend Selection: Use the LeastLatency routing strategy for optimal TTFB. A quick manual TTFB check is sketched below.
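For a quick manual check without the test scripts, curl's time_starttransfer write-out variable approximates TTFB for a streaming request; run the same command against a backend directly to estimate router overhead. The model and endpoint are the ones used throughout this guide:
# TTFB through the router for a streaming request
curl -s -N -o /dev/null \
  -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}],"stream":true}' \
  http://localhost:8080/v1/chat/completions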
Load Testing¶
Using k6¶
Install k6:
# macOS
brew install k6
# Linux
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update
sudo apt-get install k6
Create a test script (load-test.js):
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up
{ duration: '5m', target: 100 }, // Stay at 100
{ duration: '2m', target: 200 }, // Ramp to 200
{ duration: '5m', target: 200 }, // Stay at 200
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% under 500ms
http_req_failed: ['rate<0.1'], // Error rate < 10%
},
};
export default function() {
let payload = JSON.stringify({
model: 'gpt-3.5-turbo',
messages: [{ role: 'user', content: 'Hello' }],
});
let params = {
headers: { 'Content-Type': 'application/json' },
};
let res = http.post('http://localhost:8080/v1/chat/completions', payload, params);
check(res, {
'status is 200': (r) => r.status === 200,
'response time < 500ms': (r) => r.timings.duration < 500,
});
sleep(1);
}
Run the test:
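k6 run load-test.js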
Using Apache Bench (ab)¶
For simple load testing:
# Install ab (usually comes with Apache)
apt-get install apache2-utils # Ubuntu
brew install httpd # macOS
# Simple test
ab -n 1000 -c 100 -p payload.json -T application/json \
http://localhost:8080/v1/chat/completions
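The -p flag expects the request body in a file; a minimal payload.json matching the payloads used in this guide can be created as follows:
# Create the request body used by ab
cat > payload.json <<'EOF'
{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}
EOF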
Monitoring Performance¶
Prometheus Metrics¶
The router exposes Prometheus metrics at the /metrics endpoint when enabled:
metrics:
enabled: true
endpoint: "/metrics"
auth:
enabled: true
username: "metrics"
password: "secure_password"
Key metrics to monitor:
- http_requests_total: Total number of requests
- http_request_duration_seconds: Request latency histogram
- backend_request_duration_seconds: Backend latency
- backend_health_status: Health status of backends
- active_connections: Current active connections
Example Prometheus Queries¶
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Backend health
backend_health_status
Grafana Dashboard¶
Create a dashboard with these panels:
1. Request rate (req/s)
2. Error rate (%)
3. Latency percentiles (p50, p95, p99)
4. Active connections
5. Backend health status
6. Cache hit rate
7. Memory usage
8. CPU usage
Troubleshooting Performance Issues¶
High Latency¶
- Check Backend Latency
- Enable Debug Logging
- Check Health Status
The commands below sketch these checks.
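A rough sketch; the backend URL, port, and model come from the earlier examples and should be replaced with values your deployment actually serves, and the debug log level assumes the usual error/warn/info/debug names:
# 1. Backend latency measured directly, bypassing the router
curl -s -o /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://backend1:11434/v1/chat/completions
# 2. Enable debug logging: set logging.level to "debug" in the config and restart
# 3. Check health status via the metrics endpoint (see Backend Optimization above)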
High Memory Usage¶
- Check Cache Size
- Reduce Cache TTL (lower cache.model_cache_ttl, as in the memory-constrained configuration)
- Profile Memory Usage
The commands below sketch the memory checks.
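A minimal sketch, assuming the continuum-router process name used elsewhere in this guide:
PID=$(pgrep -n -f continuum-router)
# Resident memory in MB
ps -o rss= -p "$PID" | awk '{print $1/1024 " MB"}'
# Per-mapping totals for a rough memory profile
pmap -x "$PID" | tail -n 3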
Low Throughput¶
- Check Connection Pool
- Verify System Limits
- Test Backend Directly
See the sketch below for one way to run these checks.
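One way to run these checks, with the backend URL, port, and model taken from the earlier examples as placeholders:
# 1. Established connections toward a backend (compare against connection_pool_size)
ss -tn state established '( dport = :11434 )' | tail -n +2 | wc -l
# 2. System limits that commonly cap throughput
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
ulimit -n
# 3. Drive the backend directly with the same load tool used against the router
hey -z 30s -c 100 -m POST -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://backend1:11434/v1/chat/completions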
Connection Issues¶
- Check File Descriptor Limits
- Monitor Connection States
The commands below show both checks.
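Both checks sketched for a Linux host, assuming the continuum-router process name and the backend port from the earlier examples:
# Effective open-file limit of the running router process
grep -i 'open files' /proc/"$(pgrep -n -f continuum-router)"/limits
# Overall socket state summary (watch for large TIME-WAIT or SYN-SENT counts)
ss -s
# Connection states toward a backend
ss -tan '( dport = :11434 )' | awk 'NR>1 {print $1}' | sort | uniq -c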
Best Practices¶
Development¶
- Run benchmarks using cargo bench before merging performance-critical changes
- Use profiling tools during development
- Set up performance regression tests
- Monitor resource usage in staging
Production¶
- Start with conservative settings and tune gradually
- Monitor key metrics continuously
- Set up alerting for performance degradation
- Plan capacity based on peak load + 20% buffer
- Use horizontal scaling for high availability
- Implement graceful degradation under load
Testing¶
- Test with realistic workloads
- Include streaming and non-streaming requests
- Test with various model configurations
- Simulate network issues and backend failures
- Perform regular load testing in staging
Future Improvements¶
The following features are planned for future releases:
- Redis-based distributed caching (L2 cache)
- Advanced cache warming strategies
- Locality-aware routing
- WebSocket support for bidirectional streaming
- Built-in distributed tracing
- Auto-scaling based on load metrics