Performance Guide

This guide covers performance characteristics, optimization strategies, benchmarking, and tuning for the Continuum Router.

Performance Characteristics

Target Metrics

Metric                  Target            Notes
Routing Latency         < 10ms            Overhead added by router
Throughput              1000+ req/s       Per instance
Memory Usage            < 100MB           Without caching
CPU Usage               < 20%             Under moderate load
Connection Pool         500 connections   Per backend
Concurrent Requests     1000+             With proper tuning
Cache Hit Rate          > 80%             Model list caching
Startup Time            < 5s              Cold start
Health Check Latency    < 100ms           Per backend

Scalability Limits

  • Backends: Multiple backends are supported and tested
  • Models: Hundreds of models can be aggregated and cached
  • Request Size: Configurable maximum request size
  • Response Streaming: No buffering, minimal memory overhead
  • Concurrent Connections: Thousands of concurrent connections supported with tuning

Resource Requirements

Minimum (Development)

  • CPU: 1 core
  • Memory: 256MB
  • Network: 100 Mbps
  • Disk: 10GB

Recommended (Production)

  • CPU: 4 cores
  • Memory: 2GB
  • Network: 1 Gbps
  • Disk: 50GB SSD

High-Performance

  • CPU: 16+ cores
  • Memory: 8GB
  • Network: 10 Gbps
  • Disk: 100GB NVMe SSD

Benchmarks

Running Benchmarks

The project includes performance benchmarks in the benches/ directory:

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench performance

# Generate benchmark report
cargo bench -- --save-baseline main
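
If the benchmarks are Criterion-based (an assumption here, implied by the --save-baseline flag above), later runs can be compared against the saved baseline to catch regressions:

# Compare the current run against the saved "main" baseline
cargo bench -- --baseline main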

Example Throughput Test

Using wrk for throughput testing:

# Install wrk
brew install wrk  # macOS
apt-get install wrk  # Ubuntu

# Basic throughput test
wrk -t12 -c400 -d30s --latency \
  -s scripts/chat_completion.lua \
  http://localhost:8080/v1/chat/completions

# Note: Create a Lua script for POST requests with proper payloads

Example Lua script (scripts/chat_completion.lua):

wrk.method = "POST"
wrk.body   = '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}'
wrk.headers["Content-Type"] = "application/json"

Example Latency Test

Using hey for latency testing:

# Install hey
go install github.com/rakyll/hey@latest

# Latency test
hey -z 30s -c 100 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}' \
  http://localhost:8080/v1/chat/completions

Memory Usage Monitoring

# Monitor memory usage during operation
while true; do
  ps aux | grep continuum-router | grep -v grep | awk '{print $6/1024 " MB"}'
  sleep 5
done
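
A one-shot alternative to the loop above, assuming a single router process (pgrep -f matches against the full command line and -o picks the oldest match):

# Sample the router's resident memory once, in MB
ps -o rss= -p "$(pgrep -f -o continuum-router)" | awk '{printf "%.1f MB\n", $1/1024}'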

Streaming Performance Test

# Test SSE streaming performance
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }' | pv > /dev/null

Performance Tuning

Operating System Tuning

Linux Kernel Parameters

# /etc/sysctl.conf

# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

# File descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152

# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

# Apply settings
sudo sysctl -p
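
After applying, the active values can be spot-checked directly (sysctl accepts multiple keys):

# Verify the settings took effect
sysctl net.core.somaxconn net.ipv4.tcp_tw_reuse fs.file-max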

User Limits

# /etc/security/limits.conf
continuum soft nofile 65535
continuum hard nofile 65535
continuum soft nproc 32768
continuum hard nproc 32768
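
To confirm the limits are picked up for the continuum user (this assumes pam_limits is active for the login path used; a fresh login session is the most reliable check):

# Open-file and process limits as seen by the continuum user
sudo su -s /bin/bash - continuum -c 'ulimit -n; ulimit -u'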

Application Tuning

High-Throughput Configuration

# config-high-throughput.yaml
server:
  bind_address: "0.0.0.0:8080"
  connection_pool_size: 1000
  keepalive_timeout: 75

request:
  timeout: 30
  max_retries: 1

cache:
  model_cache_ttl: 900  # 15 minutes

http_client:
  pool_idle_timeout: 90
  pool_max_idle_per_host: 100

Low-Latency Configuration

# config-low-latency.yaml
server:
  bind_address: "0.0.0.0:8080"

routing:
  strategy: "LeastLatency"  # Available: RoundRobin, WeightedRoundRobin, LeastLatency, Random, ConsistentHash
  fallback_strategy: "RoundRobin"

request:
  timeout: 10
  max_retries: 0  # No retries for lowest latency

cache:
  model_cache_ttl: 3600  # 1 hour cache for consistency

health_checks:
  interval: 10  # Frequent checks for accurate routing

Memory-Constrained Configuration

# config-low-memory.yaml
server:
  connection_pool_size: 10

cache:
  model_cache_ttl: 60  # Short TTL to reduce memory

request:
  timeout: 30

logging:
  level: "error"  # Reduce log volume

Optimization Strategies

Request Optimization

  1. Request Deduplication

    deduplication:
      enabled: true
      ttl: 60  # seconds
    

  2. Connection Pooling

    server:
      connection_pool_size: 500  # Adjust based on load
    

  3. Timeout Configuration

    request:
      timeout: 30  # seconds
      streaming_timeout: 300  # 5 minutes for streaming
    

Response Optimization

  1. Streaming

    • SSE streaming is automatically enabled for requests with "stream": true
    • No buffering ensures minimal memory overhead
    • Chunks are forwarded immediately

  2. Model Caching

    cache:
      model_cache_ttl: 300  # 5 minutes
    

Backend Optimization

  1. Load Balancing Strategies

Available strategies:

  • RoundRobin: Simple round-robin selection
  • WeightedRoundRobin: Weight-based distribution
  • LeastLatency: Route to fastest backend
  • Random: Random backend selection
  • ConsistentHash: Hash-based routing for session affinity

Example configuration:

routing:
  strategy: "LeastLatency"
  fallback_strategy: "RoundRobin"

  # For WeightedRoundRobin
  weights:
    "http://backend1:11434": 3
    "http://backend2:11434": 1

  2. Health Checks
    health_checks:
      enabled: true
      interval: 30  # seconds
      timeout: 5    # seconds
      unhealthy_threshold: 3
      healthy_threshold: 2
    

Caching Strategy

Model List Caching

The router implements in-memory caching for model lists:

cache:
  model_cache_ttl: 300  # 5 minutes
  deduplication:
    enabled: true
    ttl: 60  # 1 minute

Cache Metrics

Monitor cache effectiveness through metrics:

  • Cache hit rate
  • Cache miss rate
  • Cache eviction rate
  • Average cache entry size
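
The exact metric names depend on the build, so as a quick check the cache-related series can be listed straight from the metrics endpoint described under Monitoring Performance below:

# List cache-related metrics currently exported by the router
# (add -u <user>:<pass> if metrics auth is enabled)
curl -s http://localhost:8080/metrics | grep -i cache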

Connection Pooling

Configuration

server:
  connection_pool_size: 500  # Maximum connections per backend

http_client:
  pool_idle_timeout: 90  # seconds
  pool_max_idle_per_host: 100

Best Practices

  1. Set pool size based on expected concurrent requests
  2. Monitor connection reuse rate (see the sketch after this list)
  3. Adjust idle timeout based on request patterns
  4. Use HTTP/2 when supported by backends
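
Connection reuse (item 2) can be estimated by watching how many established connections the router holds to each backend; a stable count under sustained load indicates pooled connections are being reused rather than re-opened. A minimal sketch, using the backend port from the earlier examples (adjust to your deployment):

# Established connections from this host to a backend on port 11434
ss -nt state established '( dport = :11434 )' | tail -n +2 | wc -l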

TTFB Optimization

Time To First Byte (TTFB) is critical for streaming LLM responses. The router implements several optimizations to minimize TTFB overhead.

Connection Pre-warming

The router pre-warms connections to all backends during startup:

// Automatic connection pre-warming on startup
// - Establishes HTTP/2 connections early
// - Reduces cold-start latency for first requests
// - Includes authentication headers for proper connection state

Pre-warming Behavior by Backend Type:

Backend Type   Pre-warm Endpoint           Headers
OpenAI         GET /v1/models              Authorization: Bearer
Anthropic      POST /v1/messages (empty)   x-api-key, anthropic-version
Gemini         GET /v1/models              Authorization: Bearer

Streaming Client Optimization

For streaming requests, the router uses an optimized HTTP client:

// HttpClientFactory::optimized_streaming()
// - HTTP/2 with aggressive keep-alive
// - Large connection pool (100 per host)
// - 600s timeout for extended thinking models
// - TCP keepalive enabled

TTFB Test Scripts

The project includes TTFB comparison scripts in tests/scripts/:

# Test individual backends
./tests/scripts/test_anthropic_ttfb.sh claude-haiku-4-5 5
./tests/scripts/test_openai_ttfb.sh gpt-4o-mini 5
./tests/scripts/test_gemini_ttfb.sh gemini-2.5-flash 5

# Test all backends
./tests/scripts/test_all_ttfb.sh 5

Example Output:

=== Anthropic TTFB Test ===
Model: claude-haiku-4-5
Requests per test: 5

Direct to Anthropic API:
  Request 1: TTFB=1.106s
  Request 2: TTFB=0.904s
  Average: 1.005s

Through Router:
  Request 1: TTFB=1.186s
  Request 2: TTFB=0.980s
  Average: 1.083s

Router overhead: 0.078s

TTFB Metrics

Backend                        Direct API   Through Router   Overhead
Anthropic (claude-haiku-4-5)   ~1.0s        ~1.1s            ~0.1s
OpenAI (gpt-4o-mini)           ~0.7s        ~0.3s*           -0.4s
Gemini (gemini-2.5-flash)      ~0.9s        ~0.8s            -0.1s

*Router often faster than direct due to connection pooling and HTTP/2 reuse

Reducing TTFB

  1. Enable Connection Pre-warming: Enabled by default on startup
  2. Use HTTP/2: All backends support HTTP/2, enabled by default
  3. Minimize TLS Handshakes: Connection pooling reuses TLS sessions
  4. Backend Selection: Use LeastLatency routing strategy for optimal TTFB
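
As a quick alternative to the bundled TTFB scripts, router TTFB can be spot-checked with curl's timing variables, using the same endpoint and payload as the earlier examples:

# Measure TTFB for a streaming request through the router
curl -s -o /dev/null -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}],"stream":true}' \
  -w "TTFB: %{time_starttransfer}s (TCP connect: %{time_connect}s)\n"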

Load Testing

Using k6

Install k6:

# macOS
brew install k6

# Linux
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update
sudo apt-get install k6

Create a test script (load-test.js):

import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up
    { duration: '5m', target: 100 },  // Stay at 100
    { duration: '2m', target: 200 },  // Ramp to 200
    { duration: '5m', target: 200 },  // Stay at 200
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.1'],     // Error rate < 10%
  },
};

export default function() {
  let payload = JSON.stringify({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: 'Hello' }],
  });

  let params = {
    headers: { 'Content-Type': 'application/json' },
  };

  let res = http.post('http://localhost:8080/v1/chat/completions', payload, params);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);
}

Run the test:

k6 run load-test.js

Using Apache Bench (ab)

For simple load testing:

# Install ab (usually comes with Apache)
apt-get install apache2-utils  # Ubuntu
brew install httpd  # macOS

# Simple test
ab -n 1000 -c 100 -p payload.json -T application/json \
  http://localhost:8080/v1/chat/completions
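
ab reads the POST body from the file passed to -p; a payload.json matching the payloads used elsewhere in this guide can be created first:

# Create the POST body referenced by -p above
cat > payload.json <<'EOF'
{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}
EOF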

Monitoring Performance

Prometheus Metrics

The router exposes Prometheus metrics at the /metrics endpoint (when enabled):

metrics:
  enabled: true
  endpoint: "/metrics"
  auth:
    enabled: true
    username: "metrics"
    password: "secure_password"

Key metrics to monitor:

  • http_requests_total: Total number of requests
  • http_request_duration_seconds: Request latency histogram
  • backend_request_duration_seconds: Backend latency
  • backend_health_status: Health status of backends
  • active_connections: Current active connections

Example Prometheus Queries

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Backend health
backend_health_status

Grafana Dashboard

Create a dashboard with these panels:

  1. Request rate (req/s)
  2. Error rate (%)
  3. Latency percentiles (p50, p95, p99)
  4. Active connections
  5. Backend health status
  6. Cache hit rate
  7. Memory usage
  8. CPU usage

Troubleshooting Performance Issues

High Latency

  1. Check Backend Latency

    # Direct backend test
    time curl -X POST http://backend:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"test"}]}'
    

  2. Enable Debug Logging

    RUST_LOG=continuum_router=debug cargo run
    

  3. Check Health Status

    curl http://localhost:8080/admin/health
    

High Memory Usage

  1. Check Cache Size

    # Monitor memory usage
    ps aux | grep continuum-router
    

  2. Reduce Cache TTL

    cache:
      model_cache_ttl: 60  # Reduce from default
    

  3. Profile Memory Usage

    # Use heaptrack for memory profiling
    heaptrack ./target/release/continuum-router
    heaptrack_gui heaptrack.continuum-router.*.gz
    

Low Throughput

  1. Check Connection Pool

    server:
      connection_pool_size: 1000  # Increase if needed
    

  2. Verify System Limits

    # Check file descriptor limits
    ulimit -n
    
    # Check TCP settings
    sysctl net.core.somaxconn
    

  3. Test Backend Directly

    # Bypass router to test backend performance
    ab -n 1000 -c 100 http://backend:11434/v1/models
    

Connection Issues

  1. Check File Descriptor Limits

    # Current limits
    ulimit -n
    
    # Process limits
    cat /proc/$(pgrep -f -o continuum-router)/limits | grep "open files"
    

  2. Monitor Connection States

    # Connection state distribution
    ss -ant | awk '{print $1}' | sort | uniq -c
    
    # TIME_WAIT connections
    ss -ant | grep TIME-WAIT | wc -l
    

Best Practices

Development

  1. Run benchmarks using cargo bench before merging performance-critical changes
  2. Use profiling tools during development
  3. Set up performance regression tests
  4. Monitor resource usage in staging

Production

  1. Start with conservative settings and tune gradually
  2. Monitor key metrics continuously
  3. Set up alerting for performance degradation
  4. Plan capacity based on peak load + 20% buffer
  5. Use horizontal scaling for high availability
  6. Implement graceful degradation under load

Testing

  1. Test with realistic workloads
  2. Include streaming and non-streaming requests
  3. Test with various model configurations
  4. Simulate network issues and backend failures (see the sketch after this list)
  5. Perform regular load testing in staging
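
For item 4, backend failures can be approximated by stopping a backend process, and degraded networks by injecting latency with Linux tc/netem. A sketch, assuming eth0 is the interface facing the backends (adjust the device name, and remove the rule when finished):

# Add 100ms of latency to outbound traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 100ms

# Remove the rule again
sudo tc qdisc del dev eth0 root netem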

Future Improvements

The following features are planned for future releases:

  • Redis-based distributed caching (L2 cache)
  • Advanced cache warming strategies
  • Locality-aware routing
  • WebSocket support for bidirectional streaming
  • Built-in distributed tracing
  • Auto-scaling based on load metrics

See Also