Performance Guide¶
This guide covers performance characteristics, optimization strategies, benchmarking, and tuning for the Continuum Router.
Table of Contents¶
- Performance Characteristics
- Benchmarks
- Performance Tuning
- Optimization Strategies
- Caching Strategy
- Connection Pooling
- TTFB Optimization
- Load Testing
- Monitoring Performance
- Troubleshooting Performance Issues
- Best Practices
- Future Improvements
Performance Characteristics¶
Target Metrics¶
| Metric | Target | Notes |
|---|---|---|
| Routing Latency | < 10ms | Overhead added by router |
| Throughput | 1000+ req/s | Per instance |
| Memory Usage | < 100MB | Without caching |
| CPU Usage | < 20% | Under moderate load |
| Connection Pool | 500 connections | Per backend |
| Concurrent Requests | 1000+ | With proper tuning |
| Cache Hit Rate | > 80% | Model list caching |
| Startup Time | < 5s | Cold start |
| Health Check Latency | < 100ms | Per backend |
Scalability Limits¶
- Backends: Multiple simultaneous backends are supported and tested
- Models: Hundreds of models can be aggregated and cached
- Request Size: Configurable maximum request size
- Response Streaming: No buffering, minimal memory overhead
- Concurrent Connections: Thousands of concurrent connections supported with tuning
Resource Requirements¶
Minimum (Development)¶
- CPU: 1 core
- Memory: 256MB
- Network: 100 Mbps
- Disk: 10GB
Recommended (Production)¶
- CPU: 4 cores
- Memory: 2GB
- Network: 1 Gbps
- Disk: 50GB SSD
High-Performance¶
- CPU: 16+ cores
- Memory: 8GB
- Network: 10 Gbps
- Disk: 100GB NVMe SSD
Benchmarks¶
Running Benchmarks¶
The project includes performance benchmarks in the benches/ directory:
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench performance
# Generate benchmark report
cargo bench -- --save-baseline main
Example Throughput Test¶
Using wrk for throughput testing:
# Install wrk
brew install wrk # macOS
apt-get install wrk # Ubuntu
# Basic throughput test
wrk -t12 -c400 -d30s --latency \
http://localhost:8080/v1/chat/completions \
-s scripts/chat_completion.lua
# Note: Create a Lua script for POST requests with proper payloads
Example Lua script (scripts/chat_completion.lua):
wrk.method = "POST"
wrk.body = '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}'
wrk.headers["Content-Type"] = "application/json"
Example Latency Test¶
Using hey for latency testing:
# Install hey
go install github.com/rakyll/hey@latest
# Latency test
hey -z 30s -c 100 -m POST \
-H "Content-Type: application/json" \
-d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
http://localhost:8080/v1/chat/completions
Memory Usage Monitoring¶
# Monitor memory usage during operation
while true; do
ps aux | grep continuum-router | grep -v grep | awk '{print $6/1024 " MB"}'
sleep 5
done
Streaming Performance Test¶
# Test SSE streaming performance
curl -N -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}' | pv > /dev/null
Performance Tuning¶
Operating System Tuning¶
Linux Kernel Parameters¶
# /etc/sysctl.conf
# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# File descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152
# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# Apply settings
sudo sysctl -p
User Limits¶
# /etc/security/limits.conf
continuum soft nofile 65535
continuum hard nofile 65535
continuum soft nproc 32768
continuum hard nproc 32768
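A quick way to confirm the settings took effect (the continuum user is the service account from the example above and needs a login shell for the sudo check):
# Kernel parameters
sysctl net.core.somaxconn fs.file-max
# File descriptor limit in a fresh session for the service user
sudo -u continuum bash -c 'ulimit -n'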
Application Tuning¶
High-Throughput Configuration¶
# config-high-throughput.yaml
server:
bind_address: "0.0.0.0:8080"
connection_pool_size: 1000
keepalive_timeout: 75
request:
timeout: 30
max_retries: 1
cache:
model_cache_ttl: 900 # 15 minutes
http_client:
pool_idle_timeout: 90
pool_max_idle_per_host: 100
Low-Latency Configuration¶
# config-low-latency.yaml
server:
bind_address: "0.0.0.0:8080"
routing:
strategy: "LeastLatency" # Available: RoundRobin, WeightedRoundRobin, LeastLatency, Random, ConsistentHash
fallback_strategy: "RoundRobin"
request:
timeout: 10
max_retries: 0 # No retries for lowest latency
cache:
model_cache_ttl: 3600 # 1 hour cache for consistency
health_checks:
interval: 10 # Frequent checks for accurate routing
Memory-Constrained Configuration¶
# config-low-memory.yaml
server:
connection_pool_size: 10
cache:
model_cache_ttl: 60 # Short TTL to reduce memory
request:
timeout: 30
logging:
level: "error" # Reduce log volume
Optimization Strategies¶
Request Optimization¶
- Request Deduplication
- Connection Pooling (client-side keep-alive can be observed with the curl sketch below)
- Timeout Configuration (request.timeout and max_retries in the tuning configurations above)
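As a rough, client-side check of connection reuse and per-request timing, the following curl sketch sends the same request twice over one connection and prints timings per transfer; the endpoint and model are the ones used throughout this guide. The second line should show a near-zero connect time when the keep-alive connection is reused (this observes client-to-router keep-alive; the router's own backend pools are covered under Connection Pooling below):
# Two requests over one client connection; -w prints timings after each transfer
curl -s -o /dev/null -o /dev/null \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://localhost:8080/v1/chat/completions \
  http://localhost:8080/v1/chat/completions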
Response Optimization¶
- Streaming
  - SSE streaming is automatically enabled for requests with "stream": true
  - No buffering ensures minimal memory overhead
  - Chunks are forwarded immediately
- Model Caching
Backend Optimization¶
- Load Balancing Strategies
  Available strategies:
  - RoundRobin: Simple round-robin selection
  - WeightedRoundRobin: Weight-based distribution
  - LeastLatency: Route to fastest backend
  - Random: Random backend selection
  - ConsistentHash: Hash-based routing for session affinity
Example configuration:
routing:
strategy: "LeastLatency"
fallback_strategy: "RoundRobin"
# For WeightedRoundRobin
weights:
"http://backend1:11434": 3
"http://backend2:11434": 1
- Health Checks (current backend health can be inspected via the metrics endpoint, as shown below)
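Backend health feeds routing decisions. When the Prometheus metrics endpoint is enabled (see Monitoring Performance below), current backend health can be read directly; the credentials here are the ones from the example metrics configuration and will differ in a real deployment:
# Backend health as currently seen by the router
curl -s -u metrics:secure_password http://localhost:8080/metrics | grep backend_health_status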
Caching Strategy¶
Model List Caching¶
The router caches aggregated model lists in memory; entries expire after cache.model_cache_ttl seconds (see the configuration examples above). A quick way to observe the cache in action is sketched below.
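A simple check, assuming the router exposes an aggregated OpenAI-compatible /v1/models endpoint (adjust the path if your deployment differs): the first call populates the cache from all backends, and a second call within model_cache_ttl should return noticeably faster:
# First call aggregates model lists from all backends (cache miss)
time curl -s http://localhost:8080/v1/models > /dev/null
# Second call within model_cache_ttl should be served from the cache
time curl -s http://localhost:8080/v1/models > /dev/null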
Cache Metrics¶
Monitor cache effectiveness through metrics:
- Cache hit rate
- Cache miss rate
- Cache eviction rate
- Average cache entry size
Connection Pooling¶
Configuration¶
server:
connection_pool_size: 500 # Maximum connections per backend
http_client:
pool_idle_timeout: 90 # seconds
pool_max_idle_per_host: 100
Best Practices¶
- Set pool size based on expected concurrent requests
- Monitor connection reuse rate (see the ss sketch below)
- Adjust idle timeout based on request patterns
- Use HTTP/2 when supported by backends
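One way to monitor reuse is to watch established connections from the router host to a backend; the port below comes from the earlier routing example. A stable count well below connection_pool_size under steady load suggests connections are being reused rather than churned:
# Count established connections to a backend (skip the ss header line)
ss -tn state established '( dport = :11434 )' | tail -n +2 | wc -l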
TTFB Optimization¶
Time To First Byte (TTFB) is critical for streaming LLM responses. The router implements several optimizations to minimize TTFB overhead.
Connection Pre-warming¶
The router pre-warms connections to all backends during startup:
// Automatic connection pre-warming on startup
// - Establishes HTTP/2 connections early
// - Reduces cold-start latency for first requests
// - Includes authentication headers for proper connection state
Pre-warming Behavior by Backend Type:
| Backend Type | Pre-warm Endpoint | Headers |
|---|---|---|
| OpenAI | GET /v1/models | Authorization: Bearer |
| Anthropic | POST /v1/messages (empty) | x-api-key, anthropic-version |
| Gemini | GET /v1/models | Authorization: Bearer |
Streaming Client Optimization¶
For streaming requests, the router uses an optimized HTTP client:
// HttpClientFactory::optimized_streaming()
// - HTTP/2 with aggressive keep-alive
// - Large connection pool (100 per host)
// - 600s timeout for extended thinking models
// - TCP keepalive enabled
TTFB Test Scripts¶
The project includes TTFB comparison scripts in tests/scripts/:
# Test individual backends
./tests/scripts/test_anthropic_ttfb.sh claude-haiku-4-5 5
./tests/scripts/test_openai_ttfb.sh gpt-4o-mini 5
./tests/scripts/test_gemini_ttfb.sh gemini-2.5-flash 5
# Test all backends
./tests/scripts/test_all_ttfb.sh 5
Example Output:
=== Anthropic TTFB Test ===
Model: claude-haiku-4-5
Requests per test: 5
Direct to Anthropic API:
Request 1: TTFB=1.106s
Request 2: TTFB=0.904s
Average: 1.005s
Through Router:
Request 1: TTFB=1.186s
Request 2: TTFB=0.980s
Average: 1.083s
Router overhead: 0.078s
TTFB Metrics¶
| Backend | Direct API | Through Router | Overhead |
|---|---|---|---|
| Anthropic (claude-haiku-4-5) | ~1.0s | ~1.1s | ~0.1s |
| OpenAI (gpt-4o-mini) | ~0.7s | ~0.3s* | -0.4s |
| Gemini (gemini-2.5-flash) | ~0.9s | ~0.8s | -0.1s |
*The router is often faster than the direct API due to connection pooling and HTTP/2 connection reuse
Reducing TTFB¶
- Enable Connection Pre-warming: Enabled by default on startup
- Use HTTP/2: All backends support HTTP/2, enabled by default
- Minimize TLS Handshakes: Connection pooling reuses TLS sessions
- Backend Selection: Use the LeastLatency routing strategy for optimal TTFB. A quick manual TTFB check is sketched below.
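For a quick manual check without the test scripts, curl's time_starttransfer write-out variable approximates TTFB for a streaming request; run the same command against a backend directly to estimate router overhead. The model and endpoint are the ones used throughout this guide:
# TTFB through the router for a streaming request
curl -s -N -o /dev/null \
  -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}],"stream":true}' \
  http://localhost:8080/v1/chat/completions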
Load Testing¶
Using k6¶
Install k6:
# macOS
brew install k6
# Linux
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update
sudo apt-get install k6
Create a test script (load-test.js):
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up
{ duration: '5m', target: 100 }, // Stay at 100
{ duration: '2m', target: 200 }, // Ramp to 200
{ duration: '5m', target: 200 }, // Stay at 200
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% under 500ms
http_req_failed: ['rate<0.1'], // Error rate < 10%
},
};
export default function() {
let payload = JSON.stringify({
model: 'gpt-3.5-turbo',
messages: [{ role: 'user', content: 'Hello' }],
});
let params = {
headers: { 'Content-Type': 'application/json' },
};
let res = http.post('http://localhost:8080/v1/chat/completions', payload, params);
check(res, {
'status is 200': (r) => r.status === 200,
'response time < 500ms': (r) => r.timings.duration < 500,
});
sleep(1);
}
Run the test:
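k6 run load-test.js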
Using Apache Bench (ab)¶
For simple load testing:
# Install ab (usually comes with Apache)
apt-get install apache2-utils # Ubuntu
brew install httpd # macOS
# Simple test
ab -n 1000 -c 100 -p payload.json -T application/json \
http://localhost:8080/v1/chat/completions
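The -p flag expects the request body in a file; a minimal payload.json matching the payloads used in this guide can be created as follows:
# Create the request body used by ab
cat > payload.json <<'EOF'
{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}
EOF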
Monitoring Performance¶
Prometheus Metrics¶
The router exposes Prometheus metrics at the /metrics endpoint when enabled:
metrics:
enabled: true
endpoint: "/metrics"
auth:
enabled: true
username: "metrics"
password: "secure_password"
Key metrics to monitor:
- http_requests_total: Total number of requests
- http_request_duration_seconds: Request latency histogram
- backend_request_duration_seconds: Backend latency
- backend_health_status: Health status of backends
- active_connections: Current active connections
Example Prometheus Queries¶
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Backend health
backend_health_status
Grafana Dashboard¶
Create a dashboard with these panels:
1. Request rate (req/s)
2. Error rate (%)
3. Latency percentiles (p50, p95, p99)
4. Active connections
5. Backend health status
6. Cache hit rate
7. Memory usage
8. CPU usage
Troubleshooting Performance Issues¶
High Latency¶
- Check Backend Latency
- Enable Debug Logging
- Check Health Status
The commands below sketch these checks.
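A rough sketch; the backend URL, port, and model come from the earlier examples and should be replaced with values your deployment actually serves, and the debug log level assumes the usual error/warn/info/debug names:
# 1. Backend latency measured directly, bypassing the router
curl -s -o /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://backend1:11434/v1/chat/completions
# 2. Enable debug logging: set logging.level to "debug" in the config and restart
# 3. Check health status via the metrics endpoint (see Backend Optimization above)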
High Memory Usage¶
- Check Cache Size
- Reduce Cache TTL (lower cache.model_cache_ttl, as in the memory-constrained configuration)
- Profile Memory Usage
The commands below sketch the memory checks.
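A minimal sketch, assuming the continuum-router process name used elsewhere in this guide:
PID=$(pgrep -n -f continuum-router)
# Resident memory in MB
ps -o rss= -p "$PID" | awk '{print $1/1024 " MB"}'
# Per-mapping totals for a rough memory profile
pmap -x "$PID" | tail -n 3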
Low Throughput¶
- Check Connection Pool
- Verify System Limits
- Test Backend Directly
See the sketch below for one way to run these checks.
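One way to run these checks, with the backend URL, port, and model taken from the earlier examples as placeholders:
# 1. Established connections toward a backend (compare against connection_pool_size)
ss -tn state established '( dport = :11434 )' | tail -n +2 | wc -l
# 2. System limits that commonly cap throughput
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
ulimit -n
# 3. Drive the backend directly with the same load tool used against the router
hey -z 30s -c 100 -m POST -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://backend1:11434/v1/chat/completions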
Connection Issues¶
- Check File Descriptor Limits
- Monitor Connection States
The commands below show both checks.
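Both checks sketched for a Linux host, assuming the continuum-router process name and the backend port from the earlier examples:
# Effective open-file limit of the running router process
grep -i 'open files' /proc/"$(pgrep -n -f continuum-router)"/limits
# Overall socket state summary (watch for large TIME-WAIT or SYN-SENT counts)
ss -s
# Connection states toward a backend
ss -tan '( dport = :11434 )' | awk 'NR>1 {print $1}' | sort | uniq -c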
Best Practices¶
Development¶
- Run benchmarks using cargo bench before merging performance-critical changes
- Use profiling tools during development
- Set up performance regression tests
- Monitor resource usage in staging
Production¶
- Start with conservative settings and tune gradually
- Monitor key metrics continuously
- Set up alerting for performance degradation
- Plan capacity based on peak load + 20% buffer
- Use horizontal scaling for high availability
- Implement graceful degradation under load
Testing¶
- Test with realistic workloads
- Include streaming and non-streaming requests
- Test with various model configurations
- Simulate network issues and backend failures
- Perform regular load testing in staging
Future Improvements¶
The following features are planned for future releases:
- Redis-based distributed caching (L2 cache)
- Advanced cache warming strategies
- Locality-aware routing
- WebSocket support for bidirectional streaming
- Built-in distributed tracing
- Auto-scaling based on load metrics