API Reference¶
Continuum Router provides a comprehensive OpenAI-compatible API with additional administrative endpoints for monitoring and management. This reference describes all available endpoints, request/response formats, and error handling.
Table of Contents¶
- Overview
- Authentication
- Core API Endpoints
- Anthropic Native API
- Admin Endpoints
- Configuration Management API
- Error Handling
- Rate Limiting
- Streaming
- Examples
Overview¶
Base URL¶
Content Type¶
All requests and responses use application/json unless otherwise specified.
OpenAI Compatibility¶
Continuum Router is fully compatible with OpenAI API v1, supporting:
- Chat completions with streaming
- Text completions
- Embeddings (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002)
- Image generation (DALL-E, gpt-image-1)
- Image editing/inpainting (DALL-E 2, gpt-image-1)
- Image variations (DALL-E 2)
- Files API (upload, list, retrieve, delete)
- File resolution in chat completions (image_file references)
- Model listing
- Error response formats
Authentication¶
Continuum Router supports API key authentication with configurable enforcement modes.
Authentication Modes¶
The router supports two authentication modes for API endpoints:
| Mode | Behavior |
|---|---|
permissive (default) | Requests without API key are allowed. Requests with valid API keys are authenticated and can access user-specific features. |
blocking | Only authenticated requests are processed. Requests without valid API key receive 401 Unauthorized. |
Configuration¶
api_keys:
  # Authentication mode: "permissive" (default) or "blocking"
  mode: blocking
  # API key definitions
  api_keys:
    - key: "${API_KEY_1}"
      id: "key-production-1"
      user_id: "user-admin"
      organization_id: "org-main"
      scopes: [read, write, files, admin]
Protected Endpoints (when mode is blocking)¶
- /v1/chat/completions
- /v1/completions
- /v1/responses
- /v1/images/generations
- /v1/images/edits
- /v1/images/variations
- /v1/models
- /v1/embeddings
Note: Health endpoints (/health, /healthz) are always accessible without authentication. Admin, Files, and Metrics endpoints have separate authentication mechanisms.
Making Authenticated Requests¶
Include the API key in the Authorization header:
POST /v1/chat/completions HTTP/1.1
Authorization: Bearer sk-your-api-key
Content-Type: application/json
{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello"}]
}
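The same request as a curl command (assuming the router listens on localhost:8080, as in the other examples in this document):

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}]
  }'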
Authentication Errors¶
When authentication fails, the API returns:
{
"error": {
"message": "Missing or invalid Authorization header. Expected: Bearer <api_key>",
"type": "authentication_error",
"code": "invalid_api_key"
}
}
Status Codes:
401 Unauthorized: Missing or invalid API key
Core API Endpoints¶
Health Check¶
Check the health status of the router service.
Response:
Status Codes:
200: Service is healthy
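For example, a quick check against a locally running router (the /health path is the one referenced in the authentication notes above):

curl http://localhost:8080/health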
List Models¶
Retrieve all available models from all healthy backends.
Response:
{
"object": "list",
"data": [
{
"id": "gpt-4",
"object": "model",
"created": 1677610602,
"owned_by": "openai-compatible",
"permission": [],
"root": "gpt-4",
"parent": null
},
{
"id": "llama2:7b",
"object": "model",
"created": 1677610602,
"owned_by": "local-ollama",
"permission": [],
"root": "llama2:7b",
"parent": null
}
]
}
Status Codes:
- 200: Models retrieved successfully
- 503: All backends are unhealthy
Features:
- Model Aggregation: Combines models from all healthy backends
- Deduplication: Removes duplicate models across backends
- Caching: Results cached for 5 minutes by default
- Health Awareness: Only includes models from healthy backends
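Example request (assuming the standard OpenAI-compatible /v1/models path; add the Authorization header when the router runs in blocking mode):

curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer your-api-key"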
Get Single Model¶
Retrieve information about a specific model, including its availability status and optional rich metadata.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Model identifier (e.g., "gpt-4", "llama2:7b") |
Response (Basic):
{
"id": "gpt-4",
"object": "model",
"created": 1677610602,
"owned_by": "openai",
"available": true
}
Response (With Extended Metadata):
When model metadata is available in model-metadata.yaml, the response includes additional fields:
{
"id": "gpt-4o",
"object": "model",
"created": 1704067200,
"owned_by": "openai",
"available": true,
"supported_methods": ["chat.completions"],
"features": ["chat", "vision", "audio", "code"],
"max_tokens": 16384,
"metadata": {
"display_name": "GPT-4o",
"developer": "OpenAI",
"summary": "Multimodal optimized model with text, image, and audio capabilities.",
"knowledge_cutoff": "2023-10",
"relative_speed": 4,
"pricing": {
"input_tokens": 2.50,
"output_tokens": 10.0
},
"limits": {
"context_window": 128000,
"max_output": 16384
}
}
}
Response Fields:
| Field | Type | Description |
|---|---|---|
id | string | Model identifier |
object | string | Object type (always "model") |
created | integer | Unix timestamp when model was created |
owned_by | string | Organization that owns/provides the model |
available | boolean | Whether the model can currently be used (true if at least one healthy backend provides it) |
supported_methods | array | (Optional) API methods this model supports (e.g., ["chat.completions"], ["images.generations"]) |
features | array | (Optional) Model capabilities (e.g., ["chat", "vision", "function_calling"]) |
max_tokens | integer | (Optional) Maximum output tokens for this model |
metadata | object | (Optional) Rich metadata object with detailed model information |
Metadata Object Fields:
| Field | Type | Description |
|---|---|---|
display_name | string | Human-readable display name for the model |
developer | string | Developer or organization that created the model |
summary | string | Brief summary describing the model's capabilities |
knowledge_cutoff | string | Knowledge cutoff date (e.g., "2025-01") |
relative_speed | integer | Relative speed indicator (1-5, where 1 is slowest and 5 is fastest) |
pricing | object | Pricing information (input_tokens, output_tokens per 1M tokens) |
limits | object | Model limits (context_window, max_output in tokens) |
Supported Methods Mapping:
The supported_methods field is derived from model capabilities:
| Capability | API Method |
|---|---|
chat, vision, code, reasoning, audio, video, function_calling, tool | chat.completions |
embedding | embeddings |
image_generation | images.generations |
image_edit | images.edits |
image_variation | images.variations |
moderation | moderations |
Status Codes:
- 200: Model found and information returned
- 404: Model does not exist in any configured backend
Features:
- OpenAI-Compatible: Response format matches OpenAI API with additional extension fields
- Health-Aware: The available field reflects real-time backend health status
- Privacy: Does not expose internal backend information
- Backward Compatible: Extended fields are optional and only present when metadata is available
- Rich Metadata: Provides comprehensive model information for informed model selection
Model Availability Algorithm:
The available field is determined by the following algorithm:
| Condition | Result |
|---|---|
| Health checker enabled + At least one backend providing the model is healthy | true |
| Health checker enabled + All backends are unhealthy | false |
| Health checker disabled + Model's backend list is non-empty | true |
| Health checker disabled + Model's backend list is empty | false |
The algorithm uses short-circuit evaluation: it returns true as soon as the first healthy backend is found, avoiding unnecessary health checks for remaining backends. This optimizes performance when backends are generally healthy.
Performance Optimization:
- Fast Path: If the model cache is valid, lookup is O(n) where n = number of models
- Slow Path: If cache is empty/expired, triggers one-time aggregation with singleflight protection to prevent cache stampede
- Stale-While-Revalidate: Cache serves stale data while refreshing in background
Example Request:
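A representative request, assuming the OpenAI-compatible /v1/models/{model} path:

curl http://localhost:8080/v1/models/gpt-4o \
  -H "Authorization: Bearer your-api-key"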
Example Response (Model Available with Full Metadata):
{
"id": "gpt-4o",
"object": "model",
"created": 1704067200,
"owned_by": "openai",
"available": true,
"supported_methods": ["chat.completions"],
"features": ["chat", "vision", "audio", "code"],
"max_tokens": 16384,
"metadata": {
"display_name": "GPT-4o",
"developer": "OpenAI",
"summary": "Multimodal optimized model with text, image, and audio capabilities; faster and cheaper than GPT-4 Turbo.",
"knowledge_cutoff": "2023-10",
"relative_speed": 4,
"pricing": {
"input_tokens": 2.50,
"output_tokens": 10.0
},
"limits": {
"context_window": 128000,
"max_output": 16384
}
}
}
Example Response (Model Without Extended Metadata):
For models without metadata in model-metadata.yaml, only the basic fields are returned:
{
"id": "custom-model",
"object": "model",
"created": 1677610602,
"owned_by": "local",
"available": true
}
Example Response (Model Exists but Unavailable):
{
"id": "gpt-4o",
"object": "model",
"created": 1704067200,
"owned_by": "openai",
"available": false,
"supported_methods": ["chat.completions"],
"features": ["chat", "vision", "audio", "code"],
"max_tokens": 16384,
"metadata": {
"display_name": "GPT-4o",
"developer": "OpenAI"
}
}
Chat Completions¶
Generate chat completions using the OpenAI Chat API format.
Request Body:
{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain quantum computing in simple terms."
}
],
"temperature": 0.7,
"max_tokens": 150,
"top_p": 1.0,
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"stream": false,
"stop": null,
"logit_bias": {},
"user": "user123"
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Model identifier (must be available on at least one healthy backend) |
messages | array | Yes | Array of message objects with role and content |
temperature | number | No | Sampling temperature (0.0 to 2.0, default: 1.0) |
max_tokens | integer | No | Maximum tokens to generate |
top_p | number | No | Nucleus sampling parameter (0.0 to 1.0) |
frequency_penalty | number | No | Frequency penalty (-2.0 to 2.0) |
presence_penalty | number | No | Presence penalty (-2.0 to 2.0) |
stream | boolean | No | Enable streaming response (default: false) |
stop | string/array | No | Stop sequences |
logit_bias | object | No | Token logit bias |
user | string | No | User identifier for tracking |
reasoning_effort | string | No | Reasoning effort level for reasoning-capable models. Valid values vary by backend (see below). Supported by O-series (o1, o3, o4) and GPT-5.2 thinking models (OpenAI), Gemini thinking models (Gemini). |
reasoning | object | No | Alternative nested format: {"effort": "high"}. Automatically normalized to reasoning_effort. |
Valid reasoning_effort values by backend:
| Backend | Effort Levels | Token Conversion | Notes |
|---|---|---|---|
| OpenAI O-series (o1, o3, o4-mini) | low, medium, high | Native | Standard reasoning models |
| OpenAI GPT-5.2 thinking (gpt-5.2, gpt-5.2-pro) | low, medium, high, xhigh | Native | xhigh only for GPT-5.2 family |
| Anthropic Claude (opus, sonnet-4) | none, minimal, low, medium, high | → budget_tokens (1K-32K) | Extended thinking |
| Gemini (2.x-flash, 2.5-pro, 3-*) | none*, minimal, low, medium, high | Native | *none only for Flash |
| Generic/llama.cpp | Any | Pass-through | Backend handles validation |
For detailed conversion tables and backend-specific behavior, see Reasoning Effort Architecture.
The router automatically validates values and applies intelligent fallbacks. Notably, xhigh is automatically downgraded to high for all backends except GPT-5.2 family models, allowing clients to always request xhigh without compatibility concerns.
Response (Non-streaming):
{
"id": "chatcmpl-123456789",
"object": "chat.completion",
"created": 1677652288,
"model": "gpt-3.5-turbo",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Quantum computing is a revolutionary computing paradigm that harnesses quantum mechanical phenomena..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 150,
"total_tokens": 175
}
}
Response (Streaming): When stream: true, the response uses Server-Sent Events (SSE):
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"delta":{"role":"assistant","content":""},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"delta":{"content":"Quantum"},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"delta":{"content":" computing"},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}
data: [DONE]
Status Codes:
- 200: Completion generated successfully
- 400: Invalid request format or parameters
- 404: Model not found on any healthy backend
- 502: Backend connection error
- 504: Request timeout
- 503: All backends unhealthy
Features:
- Model-Based Routing: Automatically routes to backends serving the requested model
- Load Balancing: Distributes load across healthy backends
- Streaming Support: Real-time response streaming via SSE
- Error Recovery: Automatic retry on transient failures
- Request Deduplication: Prevents duplicate processing of identical requests
- Reasoning Parameter Normalization: Automatically normalizes the nested reasoning format to the flat reasoning_effort format; removes reasoning parameters for models that don't support them
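For example, a request that sets reasoning_effort directly (a sketch; the model name and API key are placeholders):

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "o3",
    "messages": [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    "reasoning_effort": "high"
  }'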
Responses API¶
Generate responses using OpenAI's Responses API format. This endpoint provides an alternative interface to Chat Completions, internally converting requests to the Chat Completions format for backend processing.
Request Body:
{
"model": "gpt-4o",
"input": "Explain quantum computing in simple terms.",
"instructions": "You are a helpful assistant that explains complex topics simply.",
"max_output_tokens": 1000,
"temperature": 0.7,
"reasoning": {
"effort": "high"
}
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Model identifier (must be available on at least one healthy backend) |
input | string/array | Yes | The input text or array of input items |
instructions | string | No | System instructions for the model (converted to system message) |
max_output_tokens | integer | No | Maximum tokens to generate |
temperature | number | No | Sampling temperature (0.0 to 2.0) |
top_p | number | No | Nucleus sampling parameter (0.0 to 1.0) |
stream | boolean | No | Enable streaming response (default: false) |
include_reasoning | boolean | No | Include reasoning content in the response |
reasoning | object | No | Reasoning configuration with nested format (see below) |
tools | array | No | List of tools available for the model (flat format) |
tool_choice | string/object | No | Controls tool usage |
previous_response_id | string | No | Reference to a previous response for multi-turn conversations |
Tool Definition Format:
The Responses API uses a flat tool format where function properties are at the same level as the type field. This differs from the Chat Completions API which uses a nested function object.
{
"type": "function",
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
},
"strict": true
}
| Field | Type | Required | Description |
|---|---|---|---|
type | string | Yes | Must be "function" |
name | string | Yes | Name of the function |
description | string | No | Description of what the function does |
parameters | object | No | JSON Schema for function parameters |
strict | boolean | No | Enable strict parameter validation |
Other supported tool types include code_interpreter, file_search, web_search, and computer_use.
Multi-Modal Input Types:
When input is an array of items, each item can be a message containing multi-modal content parts. The following content part types are supported:
| Type | Description |
|---|---|
text | Plain text content |
input_text | Text content (alternative format for Responses API) |
input_file | File content (PDF, images, or other files) |
input_image | Image content |
image_url | Image from URL or base64 data |
Input File Format:
The input_file content part supports three input methods:
{
"type": "input_file",
"filename": "document.pdf",
"file_data": "data:application/pdf;base64,JVBERi0xLjQ..."
}
| Field | Type | Description |
|---|---|---|
filename | string | Optional filename for the file |
file_data | string | Base64 data URL (e.g., data:application/pdf;base64,...) |
file_url | string | External URL to the file (validated for SSRF) |
file_id | string | Reference to a file uploaded via Files API |
Input Image Format:
| Field | Type | Description |
|---|---|---|
image_url | string | Image URL or base64 data URL |
detail | string | Image detail level: low, high, or auto (default) |
Multi-Modal Request Example:
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"input": [
{
"type": "message",
"role": "user",
"content": [
{"type": "input_text", "text": "What does this document say?"},
{"type": "input_file", "filename": "report.pdf", "file_data": "data:application/pdf;base64,..."}
]
}
]
}'
Security Notes:
- External URLs in file_url are validated to prevent SSRF attacks
- Private IP addresses and localhost URLs are rejected
- Only HTTPS URLs are recommended for external files
- The file_id field references files uploaded via the Files API. Files are resolved and converted to base64 before sending to backends. File ownership is verified and a 10MB size limit applies for file injection
Reasoning Parameter:
The reasoning parameter controls the reasoning effort level for reasoning-capable models. It uses a nested format:
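For example (the same nested form shown in the request body above):

{
  "reasoning": {
    "effort": "high"
  }
}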
Valid effort values:
| Value | Description |
|---|---|
low | Minimal reasoning effort, faster responses |
medium | Balanced reasoning effort |
high | Maximum standard reasoning effort |
xhigh | Extended reasoning (GPT-5.2 family only) |
The router automatically converts this nested format to the flat reasoning_effort format used by Chat Completions backends. Invalid effort levels are rejected with a 400 Bad Request error.
Example Request with Reasoning:
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "o1",
"input": "Solve this complex mathematical problem step by step.",
"reasoning": {
"effort": "high"
}
}'
Response:
{
"id": "resp_abc123",
"object": "response",
"created_at": 1699000000,
"model": "o1",
"output": [
{
"type": "message",
"id": "msg_001",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "Let me solve this step by step..."
}
]
}
],
"usage": {
"input_tokens": 25,
"output_tokens": 150
},
"status": "completed"
}
Status Codes:
- 200: Response generated successfully
- 400: Invalid request format, parameters, or invalid reasoning effort value
- 404: Model not found on any healthy backend
- 502: Backend connection error
- 504: Request timeout
- 503: All backends unhealthy
Features:
- Smart Routing: Routes requests based on backend capabilities - native pass-through for OpenAI/Azure, automatic conversion for others
- Native Pass-through: For OpenAI and Azure OpenAI backends, requests are forwarded directly to the /v1/responses endpoint, preserving all native features
- Automatic Conversion: For other backends (Anthropic, Gemini, vLLM, Ollama, etc.), converts Responses API format to their native format
- Reasoning Support: Full support for the reasoning parameter with type-safe validation
- Multi-Backend Support: Works with OpenAI, Anthropic, Gemini, Ollama, vLLM, and other backends
- Streaming Support: Real-time response streaming via SSE when stream: true
- Session Management: Supports multi-turn conversations via previous_response_id
Routing Strategy:
The router automatically determines the best strategy for each backend:
| Backend Type | Strategy | Description |
|---|---|---|
| OpenAI | Pass-through | Native Responses API support - requests forwarded directly |
| Azure OpenAI | Pass-through | Native Responses API support - requests forwarded directly |
| Anthropic | Native Convert | Converted to native Anthropic Messages API format with full PDF/image support |
| Gemini | Convert | Converted to Gemini generateContent API format |
| vLLM | Convert | Converted to Chat Completions format |
| Ollama | Convert | Converted to Chat Completions format |
| LlamaCpp | Convert | Converted to Chat Completions format |
| Generic | Convert | Converted to Chat Completions format |
Pass-through Benefits:
When using OpenAI or Azure OpenAI backends with pass-through mode:
- Native PDF file support (Chat Completions only supports images)
- Preserved reasoning state between turns for better performance
- Access to built-in tools (web_search, file_search, etc.)
- Better cache utilization (40-80% improvement per OpenAI documentation)
- Full compatibility with the latest OpenAI Responses API features
Anthropic Native Conversion Benefits:
When using Anthropic (Claude) backends with native conversion:
- Native PDF file support via Anthropic's document understanding
- Image file support with automatic format detection
- Extended thinking support for Claude 3+ models
- SSRF protection for external file URLs
- Media type whitelisting for security
Image Generation¶
Generate images using OpenAI's DALL-E, GPT Image models, or Google's Nano Banana (Gemini) models.
Request Body:
{
"model": "dall-e-3",
"prompt": "A serene Japanese garden with cherry blossoms",
"n": 1,
"size": "1024x1024",
"quality": "standard",
"response_format": "url"
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Image model: dall-e-2, dall-e-3, gpt-image-1, gpt-image-1.5, gpt-image-1-mini, nano-banana, or nano-banana-pro |
prompt | string | Yes | Description of the image to generate |
n | integer | No | Number of images (1-10, varies by model) |
size | string | No | Image size (varies by model, see below) |
quality | string | No | Image quality (varies by model, see below) |
style | string | No | Image style: vivid or natural (DALL-E 3 only) |
response_format | string | No | Response format: url or b64_json |
output_format | string | No | Output file format: png, jpeg, webp (GPT Image models only, default: png) |
output_compression | integer | No | Compression level 0-100 for jpeg/webp (GPT Image models only) |
background | string | No | Background: transparent, opaque, auto (GPT Image models only) |
stream | boolean | No | Enable streaming for partial images (GPT Image models only, default: false) |
partial_images | integer | No | Number of partial images 0-3 during streaming (GPT Image models only) |
user | string | No | User identifier for tracking |
Model-specific constraints:
| Model | Sizes | n | Quality | Notes |
|---|---|---|---|---|
dall-e-2 | 256x256, 512x512, 1024x1024 | 1-10 | N/A | Classic DALL-E 2 |
dall-e-3 | 1024x1024, 1792x1024, 1024x1792 | 1 | standard, hd | High quality with prompt revision |
gpt-image-1 | 1024x1024, 1536x1024, 1024x1536, auto | 1 | low, medium, high, auto | Latest GPT Image model, supports streaming |
gpt-image-1.5 | 1024x1024, 1536x1024, 1024x1536, auto | 1 | low, medium, high, auto | 4x faster, better text rendering |
gpt-image-1-mini | 1024x1024, 1536x1024, 1024x1536, auto | 1 | low, medium, high, auto | Cost-effective option |
nano-banana | 256x256 to 1024x1024 | 1-4 | N/A | Gemini 2.5 Flash Image (fast) |
nano-banana-pro | 256x256 to 4096x4096 | 1-4 | N/A | Gemini 2.0 Flash Image (advanced, up to 4K) |
Quality Parameter (GPT Image Models):
For backward compatibility, standard maps to medium and hd maps to high when using GPT Image models.
| Quality | Description |
|---|---|
low | Fast generation with lower quality |
medium | Balanced quality and speed (default) |
high | Best quality, slower generation |
auto | Model selects optimal quality |
Output Format Options (GPT Image Models):
| Format | Description | Supports Transparency |
|---|---|---|
png | Lossless format (default) | Yes |
jpeg | Lossy format, smaller file size | No |
webp | Modern format, good compression | Yes |
Note: Transparent background (background: "transparent") requires png or webp format.
Nano Banana (Gemini) Models:
Nano Banana provides access to Google's Gemini image generation capabilities through an OpenAI-compatible interface:
- nano-banana: Maps to Gemini 2.5 Flash Image - fast, general-purpose image generation
- nano-banana-pro: Maps to Gemini 2.0 Flash Image - advanced model with high-resolution support (up to 4K)
Nano Banana Size Mapping:
The router automatically converts OpenAI-style size parameters to Gemini's aspectRatio and imageSize format:
| OpenAI Size | Gemini aspectRatio | Gemini imageSize | Notes |
|---|---|---|---|
256x256 | 1:1 | 1K | Falls back to Gemini minimum |
512x512 | 1:1 | 1K | Falls back to Gemini minimum |
1024x1024 | 1:1 | 1K | Default |
1536x1024 | 3:2 | 1K | Landscape (new) |
1024x1536 | 2:3 | 1K | Portrait (new) |
1024x1792 | 9:16 | 1K | Tall portrait |
1792x1024 | 16:9 | 1K | Wide landscape |
2048x2048 | 1:1 | 2K | Pro only |
4096x4096 | 1:1 | 4K | Pro only |
auto | 1:1 | 1K | Default fallback |
The conversion sends the following Gemini API structure:
{
"contents": [{"parts": [{"text": "Your prompt"}]}],
"generationConfig": {
"imageConfig": {
"aspectRatio": "3:2",
"imageSize": "1K"
}
}
}
Example Nano Banana Request:
{
"model": "nano-banana",
"prompt": "A white siamese cat with blue eyes, photorealistic",
"n": 1,
"size": "1024x1024",
"response_format": "b64_json"
}
Response:
{
"created": 1677652288,
"data": [
{
"url": "https://oaidalleapiprodscus.blob.core.windows.net/...",
"revised_prompt": "A tranquil Japanese garden featuring..."
}
]
}
Response (with b64_json):
{
"created": 1677652288,
"data": [
{
"b64_json": "/9j/4AAQSkZJRgABAQAA...",
"revised_prompt": "A tranquil Japanese garden featuring..."
}
]
}
Nano Banana Response Notes:
- When using response_format: "url" with Nano Banana, the image is returned as a data URL (data:image/png;base64,...) since Gemini's native API returns inline base64 data
- The revised_prompt field contains any text response from Gemini describing the generated image
Streaming Image Generation (GPT Image Models):
When stream: true is specified for GPT Image models, the response will be streamed as Server-Sent Events (SSE):
Example Streaming Request:
{
"model": "gpt-image-1",
"prompt": "A beautiful sunset over mountains",
"stream": true,
"partial_images": 2,
"response_format": "b64_json"
}
Streaming Response Format:
data: {"type":"image_generation.partial_image","partial_image_index":0,"b64_json":"...","created":1702345678}
data: {"type":"image_generation.partial_image","partial_image_index":1,"b64_json":"...","created":1702345679}
data: {"type":"image_generation.complete","b64_json":"...","created":1702345680}
data: {"type":"image_generation.usage","usage":{"input_tokens":25,"output_tokens":1024}}
data: {"type":"done"}
SSE Event Types:
| Event Type | Description |
|---|---|
image_generation.partial_image | Intermediate image during generation |
image_generation.complete | Final complete image |
image_generation.usage | Token usage information (for cost tracking) |
done | Stream completion marker |
Example GPT Image Request with New Options:
{
"model": "gpt-image-1.5",
"prompt": "A white cat with blue eyes, photorealistic",
"size": "auto",
"quality": "high",
"output_format": "webp",
"output_compression": 85,
"background": "transparent",
"response_format": "b64_json"
}
Status Codes:
- 200: Image(s) generated successfully
- 400: Invalid request (e.g., invalid size for model, n > 1 for DALL-E 3)
- 401: Invalid API key
- 429: Rate limit exceeded
- 500: Backend error
- 503: Gemini backend unavailable (for Nano Banana models)
Timeout Configuration: Image generation requests use a configurable timeout (default: 3 minutes). See timeouts.request.image_generation in configuration.
Image Edit (Inpainting)¶
Edit existing images using OpenAI's inpainting capabilities. This endpoint allows you to modify specific regions of an image based on a text prompt and optional mask. Supports GPT Image models and DALL-E 2.
Request Parameters (multipart/form-data):
| Parameter | Type | Required | Description |
|---|---|---|---|
image | file | Yes | The source image to edit (PNG, < 4MB, square) |
prompt | string | Yes | Description of the desired edit |
mask | file | No | Mask image indicating edit regions (PNG, same dimensions as image) |
model | string | No | Model to use (default: gpt-image-1) |
n | integer | No | Number of images to generate (1-10, default: 1) |
size | string | No | Output size (model-dependent, default: 1024x1024) |
response_format | string | No | Response format: url or b64_json (default: url) |
user | string | No | Unique user identifier for tracking |
Supported Models and Sizes:
| Model | Sizes | Notes |
|---|---|---|
gpt-image-1 | 1024x1024, 1536x1024, 1024x1536, auto | Latest GPT Image model (recommended) |
gpt-image-1-mini | 1024x1024, 1536x1024, 1024x1536, auto | Cost-optimized version |
gpt-image-1.5 | 1024x1024, 1536x1024, 1024x1536, auto | Newest with improved instruction following |
dall-e-2 | 256x256, 512x512, 1024x1024 | Legacy DALL-E 2 model |
Note: DALL-E 3 and Gemini (nano-banana) do NOT support image editing via this endpoint. Gemini uses semantic masking via natural language, which is incompatible with OpenAI's mask-based editing format.
Image Requirements:
- Format: PNG only
- Size: Less than 4MB
- Dimensions: Must be square (width equals height)
Mask Requirements:
- Format: PNG with alpha channel (RGBA)
- Dimensions: Must match the source image exactly
- Transparent areas: Indicate regions to edit/generate
- Opaque areas: Indicate regions to preserve
Example Request:
curl -X POST http://localhost:8080/v1/images/edits \
-F "image=@source_image.png" \
-F "mask=@mask.png" \
-F "prompt=A sunlit indoor lounge area with a pool containing a flamingo" \
-F "n=1" \
-F "size=1024x1024" \
-F "response_format=url"
Example Request (without mask):
curl -X POST http://localhost:8080/v1/images/edits \
-F "image=@source_image.png" \
-F "prompt=Add a sunset in the background" \
-F "n=1" \
-F "size=512x512"
Response:
{
"created": 1677652288,
"data": [
{
"url": "https://oaidalleapiprodscus.blob.core.windows.net/..."
}
]
}
Response (with b64_json):
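The shape mirrors the b64_json response shown for image generation; the value below is illustrative:

{
  "created": 1677652288,
  "data": [
    {
      "b64_json": "iVBORw0KGgoAAAANSUhEUg..."
    }
  ]
}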
Status Codes:
- 200: Image(s) edited successfully
- 400: Invalid request (e.g., non-square image, invalid size, missing required field)
- 401: Invalid API key
- 503: OpenAI backend unavailable
Error Examples:
Non-square image:
{
"error": {
"message": "Image must be square (800x600 is not square)",
"type": "invalid_request_error",
"param": "image",
"code": "image_not_square"
}
}
Mask dimension mismatch:
{
"error": {
"message": "Mask dimensions (256x256) do not match image dimensions (512x512)",
"type": "invalid_request_error",
"param": "mask",
"code": "dimension_mismatch"
}
}
Unsupported model:
{
"error": {
"message": "Model 'dall-e-3' does not support image editing. Supported models: gpt-image-1, gpt-image-1-mini, gpt-image-1.5, dall-e-2. Note: dall-e-3 does NOT support image editing.",
"type": "invalid_request_error",
"param": "model",
"code": "unsupported_model"
}
}
Notes:
- Supported models: gpt-image-1, gpt-image-1-mini, gpt-image-1.5, dall-e-2
- DALL-E 3 does NOT support image editing via API
- Gemini (nano-banana) is NOT supported - uses different editing approach (semantic masking)
- When no mask is provided, the entire image may be modified
- The source image should have transparent regions if editing without a mask
- Request timeout uses the image generation timeout configuration
Image Variations¶
Generate variations of an existing image using OpenAI's DALL-E 2 model.
Form Fields:
| Parameter | Type | Required | Description |
|---|---|---|---|
image | file | Yes | Source image for variations (PNG, < 4MB, must be square) |
model | string | No | Model to use (default: dall-e-2) |
n | integer | No | Number of variations to generate (1-10, default: 1) |
size | string | No | Output size: 256x256, 512x512, 1024x1024 (default: 1024x1024) |
response_format | string | No | Response format: url or b64_json (default: url) |
user | string | No | User identifier for tracking |
Example Request:
curl -X POST http://localhost:8080/v1/images/variations \
-F "image=@source_image.png" \
-F "model=dall-e-2" \
-F "n=2" \
-F "size=512x512" \
-F "response_format=url"
Response:
{
"created": 1677652288,
"data": [
{
"url": "https://oaidalleapiprodscus.blob.core.windows.net/..."
},
{
"url": "https://oaidalleapiprodscus.blob.core.windows.net/..."
}
]
}
Response (with b64_json):
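As with generation, each item carries a b64_json field instead of a URL (illustrative values, two items for n=2):

{
  "created": 1677652288,
  "data": [
    {
      "b64_json": "iVBORw0KGgoAAAANSUhEUg..."
    },
    {
      "b64_json": "iVBORw0KGgoAAAANSUhEUg..."
    }
  ]
}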
Model Support:
| Model | Variations Support | Notes |
|---|---|---|
dall-e-2 | Yes (native) | Full support, 1-10 variations |
dall-e-3 | No | Not supported by OpenAI API |
gpt-image-1 | No | Not supported |
nano-banana | No | Gemini does not support variations API |
nano-banana-pro | No | Gemini does not support variations API |
Image Requirements:
- Format: PNG only
- Size: Less than 4MB
- Dimensions: Must be square (width == height)
- Supported input sizes: Any square dimensions (will be processed by the model)
Error Scenarios:
| Error | Status | Description |
|---|---|---|
| Image not PNG | 400 | Only PNG format is supported |
| Image not square | 400 | Image dimensions must be equal |
| Image too large | 400 | Image exceeds 4MB size limit |
| Model not supported | 400 | Requested model doesn't support variations |
| Missing image | 400 | Image field is required |
| Invalid n value | 400 | n must be between 1 and 10 |
| Invalid size | 400 | Size must be one of the supported values |
Status Codes:
- 200: Variation(s) generated successfully
- 400: Invalid request (invalid format, non-square image, unsupported model)
- 401: Invalid API key
- 429: Rate limit exceeded
- 500: Backend error
- 503: Backend unavailable
Text Completions¶
Generate text completions using the OpenAI Completions API format.
Request Body:
{
"model": "gpt-3.5-turbo-instruct",
"prompt": "Once upon a time in a distant galaxy",
"max_tokens": 100,
"temperature": 0.7,
"top_p": 1.0,
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"stream": false,
"stop": null,
"logit_bias": {},
"user": "user123"
}
Response:
{
"id": "cmpl-123456789",
"object": "text_completion",
"created": 1677652288,
"model": "gpt-3.5-turbo-instruct",
"choices": [
{
"text": ", there lived a young explorer named Zara who dreamed of discovering new worlds...",
"index": 0,
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 90,
"total_tokens": 100
}
}
Status Codes: Same as Chat Completions
Embeddings¶
Generate embeddings for input text using the OpenAI Embeddings API format.
Request Body:
{
"model": "text-embedding-3-small",
"input": "The quick brown fox jumps over the lazy dog",
"encoding_format": "float",
"dimensions": 512,
"user": "user123"
}
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | ID of the model to use (e.g., text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002) |
input | string or array | Yes | Input text to embed. Can be a single string, array of strings, or array of token arrays |
encoding_format | string | No | Format to return the embeddings: float or base64 (default: float) |
dimensions | integer | No | Number of dimensions for the output embeddings (only supported for text-embedding-3-* models) |
user | string | No | Unique identifier for the end-user |
Input Format Examples:
Single string:
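For example, mirroring the request body above:

{
  "model": "text-embedding-3-small",
  "input": "The quick brown fox jumps over the lazy dog"
}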
Array of strings:
{
"model": "text-embedding-3-large",
"input": [
"First text to embed",
"Second text to embed",
"Third text to embed"
]
}
Token arrays:
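Illustrative only; the integers below are placeholder token IDs, not a real tokenization:

{
  "model": "text-embedding-ada-002",
  "input": [[1212, 318, 257, 1332], [3666, 1438, 318, 4466]]
}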
Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0023064255, -0.009327292, 0.0045318254, ...]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
Response Fields:
| Field | Type | Description |
|---|---|---|
object | string | Always "list" |
data | array | Array of embedding objects |
data[].object | string | Always "embedding" |
data[].index | integer | Index of the embedding in the input array |
data[].embedding | array | Embedding vector (array of floats) |
model | string | Model used to generate the embedding |
usage | object | Token usage information |
usage.prompt_tokens | integer | Number of tokens in the input |
usage.total_tokens | integer | Total tokens used (same as prompt_tokens for embeddings) |
Supported Models:
| Backend | Model | Notes |
|---|---|---|
| OpenAI | text-embedding-3-small | 1536 dimensions by default, supports dimensions parameter |
| OpenAI | text-embedding-3-large | 3072 dimensions by default, supports dimensions parameter |
| OpenAI | text-embedding-ada-002 | 1536 dimensions, legacy model |
| Gemini | text-embedding-004 | Via OpenAI-compatible endpoint |
| Self-hosted | bge-m3 | 1024 dimensions, 100+ languages, 8192 context. Supports dense, sparse, and ColBERT retrieval |
| Self-hosted | bge-large-en-v1.5 | 1024 dimensions, English-only, 512 context |
| Self-hosted | multilingual-e5-large | 1024 dimensions, 100+ languages, 514 context |
| vLLM | Deployment-specific | Depends on deployed model |
| llama.cpp | Deployment-specific | Native /v1/embeddings support |
| TEI | Deployment-specific | Hugging Face Text Embeddings Inference server |
| Ollama | Deployment-specific | Via Ollama embedding models |
Status Codes:
- 200: Embeddings generated successfully
- 400: Invalid request (missing model/input, invalid dimensions)
- 401: Invalid API key
- 404: Model not found or doesn't support embeddings
- 429: Rate limit exceeded
- 500: Backend error
- 503: Backend unavailable
Features:
- Multiple Input Formats: Supports single string, array of strings, or token arrays
- Dimension Control: For text-embedding-3 models, specify custom dimensions for reduced vector size
- Backend Agnostic: Routes to appropriate backend based on model
- Load Balancing: Applies configured load balancing strategy
- Error Handling: Provides detailed error messages for invalid requests
Example Request:
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "text-embedding-3-small",
"input": "The quick brown fox jumps over the lazy dog",
"encoding_format": "float"
}'
Example with Multiple Inputs:
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "text-embedding-3-large",
"input": [
"First document text",
"Second document text",
"Third document text"
]
}'
Example with Custom Dimensions:
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "text-embedding-3-small",
"input": "Sample text",
"dimensions": 512
}'
Rerank¶
Rerank documents based on their relevance to a query using the Cohere-compatible Rerank API. This is commonly used as a second-stage retrieval step after initial vector search to improve accuracy.
Request Body:
{
"model": "bge-reranker-v2-m3",
"query": "What is machine learning?",
"documents": ["Document 1 content", "Document 2 content", "Document 3 content"],
"top_n": 3,
"return_documents": false
}
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | ID of the reranking model to use (e.g., bge-reranker-v2-m3, rerank-english-v3.0, jina-reranker-v2-base-multilingual) |
query | string | Yes | The search query to compare documents against |
documents | array | Yes | List of documents to rerank. Can be an array of strings or an array of objects with a text field |
top_n | integer | No | Number of top results to return. If not specified, returns all documents ranked |
return_documents | boolean | No | Whether to return the document text in the response (default: false) |
max_chunks_per_doc | integer | No | Maximum number of chunks to process per document for long document handling |
Document Format Options:
Simple string array:
{
"model": "bge-reranker-v2-m3",
"query": "What is deep learning?",
"documents": [
"Deep learning uses neural networks with multiple layers",
"Machine learning is a subset of artificial intelligence",
"Natural language processing deals with text understanding"
]
}
Structured documents with text field:
{
"model": "rerank-english-v3.0",
"query": "What is deep learning?",
"documents": [
{"text": "Deep learning uses neural networks with multiple layers"},
{"text": "Machine learning is a subset of artificial intelligence"}
]
}
Response:
{
"results": [
{
"index": 0,
"relevance_score": 0.95
},
{
"index": 2,
"relevance_score": 0.72
},
{
"index": 1,
"relevance_score": 0.45
}
],
"model": "bge-reranker-v2-m3",
"id": "rerank-abc123",
"usage": {
"prompt_tokens": 150,
"total_tokens": 150
}
}
Response with Documents (when return_documents: true):
{
"results": [
{
"index": 0,
"relevance_score": 0.95,
"document": {
"text": "Deep learning uses neural networks with multiple layers"
}
}
],
"model": "bge-reranker-v2-m3"
}
Response Fields:
| Field | Type | Description |
|---|---|---|
results | array | List of reranked results ordered by relevance score (highest first) |
results[].index | integer | The index of the document in the original input list |
results[].relevance_score | number | Relevance score (typically 0.0 to 1.0, higher is more relevant) |
results[].document | object | The document text (only present if return_documents was true) |
model | string | The model used for reranking |
id | string | Unique identifier for this request (optional) |
usage | object | Token usage information (optional) |
Supported Backends:
| Backend | Endpoint | Notes |
|---|---|---|
| vLLM | /v1/rerank | Cohere-compatible, supports BGE, Jina rerankers |
| llama.cpp | /v1/rerank | Requires --reranking flag at startup |
| Hugging Face TEI | /rerank | Text Embeddings Inference server |
| Cohere API | /v1/rerank | Native Cohere rerank endpoint |
| Jina AI | /v1/rerank | Native Jina rerank endpoint |
Status Codes:
- 200: Documents reranked successfully
- 400: Invalid request (missing model/query/documents, empty documents array)
- 401: Invalid API key
- 404: Model not found or doesn't support reranking
- 429: Rate limit exceeded
- 500: Backend error
- 503: Backend unavailable
Use Cases:
- Two-stage retrieval: Use vector search to retrieve candidates, then rerank for higher precision
- RAG systems: Improve context quality by reranking retrieved documents before LLM processing
- Search result refinement: Reorder search results based on semantic relevance
Example Request:
curl -X POST http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "bge-reranker-v2-m3",
"query": "What are the benefits of renewable energy?",
"documents": [
"Solar panels convert sunlight into electricity",
"Wind turbines generate power from wind",
"Coal is a fossil fuel used for electricity"
],
"top_n": 2
}'
Sparse Embeddings¶
Generate sparse embeddings for input text using the TEI/Jina-compatible Sparse Embedding API. Sparse embeddings (e.g., SPLADE) preserve lexical information through explicit term weights, complementing dense embeddings for hybrid search.
Request Body:
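For example (mirroring the example request further below):

{
  "model": "naver/splade-v3",
  "input": "What are the benefits of sparse embeddings for search?"
}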
Request Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | ID of the sparse embedding model to use (e.g., naver/splade-v3, naver/splade-cocondenser-ensembledistil) |
input | string or array | Yes | Input text to embed. Can be a single string or an array of strings |
Input Format Examples:
Single string:
Array of strings:
{
"model": "naver/splade-v3",
"input": [
"First text to embed",
"Second text to embed",
"Third text to embed"
]
}
Response:
{
"data": [
{
"index": 0,
"sparse_embedding": {
"indices": [123, 456, 789, 1024, 2048],
"values": [0.5, 0.3, 0.1, 0.8, 0.2]
}
}
],
"model": "naver/splade-v3",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
Response Fields:
| Field | Type | Description |
|---|---|---|
data | array | Array of sparse embedding objects |
data[].index | integer | Index of the embedding in the input array |
data[].sparse_embedding | object | The sparse embedding vector |
data[].sparse_embedding.indices | array | Indices of non-zero elements in the vocabulary |
data[].sparse_embedding.values | array | Values at the corresponding indices |
model | string | Model used to generate the embedding (optional) |
usage | object | Token usage information (optional) |
Understanding Sparse Vectors:
A sparse vector only stores non-zero values along with their vocabulary indices. For example:
- indices: [123, 456, 789] - positions in the vocabulary
- values: [0.5, 0.3, 0.1] - weights for those terms
This is memory-efficient for high-dimensional vectors with few non-zero elements (typically 100-500 non-zero values out of 30,000+ vocabulary size).
Supported Backends:
| Backend | Endpoint | Notes |
|---|---|---|
| vLLM | /embed_sparse | Supports SPLADE models via OpenAI-compatible server |
| Hugging Face TEI | /embed_sparse | Requires --pooling splade flag |
| Jina AI | Native | Native sparse embedding support |
Status Codes:
- 200: Sparse embeddings generated successfully
- 400: Invalid request (missing model/input, empty input)
- 401: Invalid API key
- 404: Model not found or doesn't support sparse embeddings
- 429: Rate limit exceeded
- 500: Backend error
- 503: Backend unavailable
Use Cases:
- Hybrid search: Combine dense (semantic) and sparse (lexical) retrieval for better results
- Keyword matching: Exact term matching with learned weights
- Domain-specific retrieval: Better handling of specialized terminology and rare words
- Cross-lingual retrieval: Some SPLADE models support multilingual sparse retrieval
Example Request:
curl -X POST http://localhost:8080/embed_sparse \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "naver/splade-v3",
"input": "What are the benefits of sparse embeddings for search?"
}'
Example with Multiple Inputs:
curl -X POST http://localhost:8080/embed_sparse \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "naver/splade-v3",
"input": [
"First query text",
"Second query text"
]
}'
Files API¶
The Files API allows you to upload, manage, and use files in chat completions. Uploaded files can be referenced in messages using the image_file content type, and the router automatically resolves these references by injecting the file content.
Upload File¶
Upload a file for use in chat completions.
Form Fields:
| Field | Type | Required | Description |
|---|---|---|---|
file | file | Yes | The file to upload |
purpose | string | Yes | Purpose of the file: vision, assistants, fine-tune, batch, user_data, evals |
Example:
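A representative upload, assuming the standard OpenAI-compatible /v1/files path:

curl -X POST http://localhost:8080/v1/files \
  -H "Authorization: Bearer your-api-key" \
  -F "file=@image.png" \
  -F "purpose=vision"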
Response:
{
"id": "file-abc123def456",
"object": "file",
"bytes": 12345,
"created_at": 1699061776,
"filename": "image.png",
"purpose": "vision"
}
Status Codes:
- 200: File uploaded successfully
- 400: Invalid request (missing file, invalid purpose)
- 413: File too large (exceeds configured max_file_size)
List Files¶
Retrieve a list of uploaded files.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
purpose | string | No | Filter by purpose |
Response:
{
"object": "list",
"data": [
{
"id": "file-abc123def456",
"object": "file",
"bytes": 12345,
"created_at": 1699061776,
"filename": "image.png",
"purpose": "vision"
}
]
}
Get File Metadata¶
Retrieve metadata for a specific file.
Response:
{
"id": "file-abc123def456",
"object": "file",
"bytes": 12345,
"created_at": 1699061776,
"filename": "image.png",
"purpose": "vision"
}
Status Codes:
- 200: File metadata retrieved
- 404: File not found
Download File Content¶
Download the content of an uploaded file.
Response: Binary file content with appropriate Content-Type header.
Status Codes:
- 200: File content returned
- 404: File not found
Delete File¶
Delete an uploaded file.
Response:
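A minimal sketch of the expected shape, following the OpenAI Files API deletion convention (actual fields may differ):

{
  "id": "file-abc123def456",
  "object": "file",
  "deleted": true
}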
Status Codes:
- 200: File deleted successfully
- 404: File not found
File Resolution in Chat Completions¶
The router automatically resolves file references in chat completion requests. When a message contains an image_file content block, the router:
- Validates the file ID format
- Loads the file content from storage
- Converts the file to a base64 data URL
- Replaces the image_file block with an image_url block
Supported File Types¶
File resolution in chat completions supports the following file types:
| File Type | MIME Type | Support |
|---|---|---|
| PNG | image/png | All backends |
| JPEG | image/jpeg | All backends |
| GIF | image/gif | All backends |
| WebP | image/webp | All backends |
| PDF | application/pdf | OpenAI, Anthropic |
| Plain Text | text/plain | Anthropic |
Note: PDF and plain text support is available for Anthropic backends (and PDF for OpenAI). The file transformers automatically convert document files to the appropriate format for each backend (OpenAI uses file blocks for PDF, Anthropic uses document blocks for both PDF and plain text). Non-image and non-document files will return a 400 Bad Request error with a helpful message indicating the supported file types.
Request with File Reference:
{
"model": "gpt-4-vision-preview",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_file", "image_file": {"file_id": "file-abc123def456"}}
]
}
]
}
Transformed Request (sent to backend):
{
"model": "gpt-4-vision-preview",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}
]
}
File Resolution Errors:
| Error | Status | Description |
|---|---|---|
| Invalid file ID format | 400 | File ID must start with file- |
| File not found | 404 | Referenced file does not exist |
| Too many file references | 400 | Request contains more than 20 file references |
| Resolution timeout | 504 | File resolution took longer than 30 seconds |
Supported MIME Types:
- image/png - All backends
- image/jpeg - All backends
- image/gif - All backends
- image/webp - All backends
- application/pdf - OpenAI, Anthropic (max 32MB, 100 pages)
- text/plain - Anthropic (max 32MB)
Anthropic Native API¶
Continuum Router provides native Anthropic API endpoints that allow clients to use Anthropic's API format directly while still benefiting from the router's load balancing, failover, and multi-backend routing capabilities.
Messages API¶
Send messages using Anthropic's native API format.
Headers:
| Header | Required | Description |
|---|---|---|
x-api-key | Yes* | API key for authentication (required in blocking mode) |
anthropic-version | No | API version (e.g., 2023-06-01). Forwarded to native Anthropic backends |
anthropic-beta | No | Beta features (e.g., prompt-caching-2024-07-31). Forwarded to native Anthropic backends |
Request Body:
{
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "Hello, Claude!"
}
],
"stream": false
}
Request Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Model identifier |
max_tokens | integer | Yes | Maximum tokens to generate |
messages | array | Yes | Array of message objects |
system | string/array | No | System prompt (string or array of content blocks) |
stream | boolean | No | Enable streaming (default: false) |
temperature | number | No | Sampling temperature (0-1) |
top_p | number | No | Nucleus sampling parameter |
top_k | integer | No | Top-k sampling parameter |
stop_sequences | array | No | Stop sequences |
metadata | object | No | Request metadata |
tools | array | No | Tool definitions for function calling |
tool_choice | object | No | Tool choice configuration |
Response (non-streaming):
{
"id": "msg_01XFDUDYJgAACzvnptvVoYEL",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello! How can I help you today?"
}
],
"model": "claude-sonnet-4-20250514",
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 12,
"output_tokens": 15,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0
}
}
Backend Routing:
The router automatically transforms requests based on the target backend:
| Backend Type | Transformation |
|---|---|
| Anthropic | Pass-through with native API format |
| Gemini | Direct transformation to Gemini format |
| OpenAI/vLLM/Ollama | Transform to OpenAI format, then transform response back |
Token Counting¶
Count the number of tokens in a message request payload.
Request Body:
{
"model": "claude-sonnet-4-20250514",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
],
"system": "You are a helpful assistant."
}
Response:
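A minimal sketch following Anthropic's count_tokens response shape (the exact count depends on the model's tokenizer):

{
  "input_tokens": 19
}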
Tiered Token Counting¶
Token counting uses different strategies based on the backend type:
| Backend Type | Strategy | Accuracy |
|---|---|---|
| Anthropic | Native /v1/messages/count_tokens API proxy | Exact |
| llama.cpp | Backend /tokenize endpoint proxy | Exact |
| vLLM | Backend /tokenize endpoint proxy | Exact |
| Others | Character-based estimation (~4 chars/token) | Approximate |
Note: For backends that don't support native tokenization, the router uses a character-based estimation of approximately 4 characters per token. This provides a reasonable approximation for planning purposes but may not match the exact token count used by the model.
Models List¶
List available models in Anthropic API format.
Response:
{
"data": [
{
"id": "claude-sonnet-4-20250514",
"type": "model",
"display_name": "Claude Sonnet 4",
"created_at": "2025-05-14T00:00:00Z"
},
{
"id": "claude-opus-4-20250514",
"type": "model",
"display_name": "Claude Opus 4",
"created_at": "2025-05-14T00:00:00Z"
}
],
"has_more": false,
"first_id": "claude-sonnet-4-20250514",
"last_id": "claude-opus-4-20250514"
}
Claude Code Compatibility¶
The Anthropic Native API includes full compatibility with Claude Code and other Anthropic API clients that require advanced features.
Prompt Caching¶
Prompt caching is fully supported through the cache_control field on content blocks:
{
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a helpful coding assistant with extensive knowledge...",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Previous context...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Current question"
}
]
}
]
}
Supported cache_control locations:
- System prompt text blocks
- User message text blocks
- User message image blocks
- Tool definitions
- Tool use blocks
- Tool result blocks
Beta Features Header¶
The anthropic-beta header is forwarded to native Anthropic backends, enabling beta features:
curl -X POST http://localhost:8080/anthropic/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "anthropic-beta: prompt-caching-2024-07-31,interleaved-thinking-2025-05-14" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello"}]
}'
Cache Usage in Streaming¶
When streaming with native Anthropic backends, cache usage information is included in the message_start event:
{
"type": "message_start",
"message": {
"id": "msg_01XFDUDYJgAACzvnptvVoYEL",
"type": "message",
"role": "assistant",
"content": [],
"model": "claude-sonnet-4-20250514",
"usage": {
"input_tokens": 2159,
"cache_creation_input_tokens": 2048,
"cache_read_input_tokens": 0
}
}
}
Interleaved Thinking¶
Extended thinking with interleaved output is supported in streaming mode. When the model produces thinking and text content alternately, the streaming events properly represent this interleaved structure with appropriate content_block_start, content_block_delta, and content_block_stop events for each block.
Admin Endpoints¶
Backend Status¶
Get detailed status information about all configured backends.
Response:
{
"backends": [
{
"name": "local-ollama",
"url": "http://localhost:11434",
"is_healthy": true,
"consecutive_failures": 0,
"consecutive_successes": 15,
"last_check": "2024-01-15T10:30:45Z",
"last_error": null,
"response_time_ms": 45,
"models": ["llama2", "mistral", "codellama"],
"weight": 1,
"total_requests": 150,
"failed_requests": 2
},
{
"name": "openai-compatible",
"url": "https://api.openai.com",
"is_healthy": false,
"consecutive_failures": 3,
"consecutive_successes": 0,
"last_check": "2024-01-15T10:29:30Z",
"last_error": "Connection timeout after 5s",
"response_time_ms": null,
"models": [],
"weight": 1,
"total_requests": 45,
"failed_requests": 8
}
],
"healthy_count": 1,
"total_count": 2,
"summary": {
"total_models": 3,
"total_requests": 195,
"total_failures": 10,
"average_response_time_ms": 45
}
}
Fields:
| Field | Type | Description |
|---|---|---|
name | string | Backend identifier from configuration |
url | string | Backend base URL |
is_healthy | boolean | Current health status |
consecutive_failures | integer | Sequential failed health checks |
consecutive_successes | integer | Sequential successful health checks |
last_check | string | ISO timestamp of last health check |
last_error | string/null | Last error message if unhealthy |
response_time_ms | integer/null | Last health check response time |
models | array | Available models from this backend |
weight | integer | Load balancing weight |
total_requests | integer | Total requests routed to this backend |
failed_requests | integer | Failed requests to this backend |
Status Codes:
200: Backend status retrieved successfully
Service Health¶
Get overall service health and component status.
Response:
{
"status": "healthy",
"version": "1.0.0",
"uptime": "2h 15m 30s",
"timestamp": "2024-01-15T10:30:45Z",
"services": {
"backend_service": {
"status": "healthy",
"message": "All backends operational",
"healthy_backends": 2,
"total_backends": 2
},
"model_service": {
"status": "healthy",
"message": "Model cache operational",
"cached_models": 15,
"cache_hit_rate": 0.95,
"last_refresh": "2024-01-15T10:25:00Z"
},
"proxy_service": {
"status": "healthy",
"message": "Request routing operational",
"total_requests": 1250,
"failed_requests": 12,
"average_latency_ms": 85
},
"health_service": {
"status": "healthy",
"message": "Health monitoring active",
"check_interval": "30s",
"last_check": "2024-01-15T10:30:00Z"
}
},
"metrics": {
"requests_per_second": 5.2,
"error_rate": 0.008,
"memory_usage_mb": 125,
"cpu_usage_percent": 15.5
}
}
Status Values:
healthy: Service operating normally
degraded: Service operating with reduced functionality
unhealthy: Service experiencing issues
Status Codes:
200: Service health retrieved successfully
503: Service is unhealthy
Configuration Summary¶
Get current configuration summary including hot reload status.
Response:
{
"server": {
"bind_address": "0.0.0.0:8080",
"workers": 4,
"connection_pool_size": 100
},
"backends": {
"count": 3,
"names": ["openai", "local-ollama", "gemini"]
},
"health_checks": {
"interval": "30s",
"timeout": "10s",
"unhealthy_threshold": 3,
"healthy_threshold": 2
},
"rate_limiting": {
"enabled": false
},
"circuit_breaker": {
"enabled": true
},
"selection_strategy": "RoundRobin",
"hot_reload": {
"available": true,
"note": "Configuration changes will be automatically detected and applied"
}
}
Fields:
| Field | Type | Description |
|---|---|---|
server | object | Server configuration (bind_address, workers, connection_pool_size) |
backends | object | Backend configuration summary (count, names) |
health_checks | object | Health check settings |
rate_limiting | object | Rate limiting status |
circuit_breaker | object | Circuit breaker status |
selection_strategy | string | Current load balancing strategy |
hot_reload | object | Hot reload availability and status |
Status Codes:
200: Configuration summary retrieved successfully
Note: Sensitive information (API keys, etc.) is automatically redacted from the response.
Hot Reload Status¶
Get detailed information about hot reload functionality and configuration item classification.
Response:
{
"enabled": true,
"description": "Hot reload is enabled. Configuration file changes are automatically detected and applied.",
"capabilities": {
"immediate_update": {
"description": "Changes applied immediately without service interruption",
"items": [
"logging.level",
"rate_limiting.*",
"circuit_breaker.*",
"retry.*",
"global_prompts.*"
]
},
"gradual_update": {
"description": "Existing connections maintained, new connections use new config",
"items": [
"backends.*",
"health_checks.*",
"timeouts.*"
]
},
"requires_restart": {
"description": "Changes logged as warnings, restart required to take effect",
"items": [
"server.bind_address",
"server.workers"
]
}
}
}
Fields:
| Field | Type | Description |
|---|---|---|
enabled | boolean | Whether hot reload is enabled |
description | string | Human-readable description of hot reload status |
capabilities | object | Configuration item classification by hot reload capability |
capabilities.immediate_update | object | Items that update immediately without disruption |
capabilities.gradual_update | object | Items that apply to new connections only |
capabilities.requires_restart | object | Items that require server restart |
Configuration Item Classification:
Immediate Update (no service interruption):
- logging.level - Log level changes apply immediately
- rate_limiting.* - Rate limiting settings update in real-time
- circuit_breaker.* - Circuit breaker thresholds and timeouts
- retry.* - Retry policies and backoff strategies
- global_prompts.* - Global system prompt injection settings
Gradual Update (existing connections maintained):
- backends.* - Backend add/remove/modify (new requests use updated pool)
- health_checks.* - Health check intervals and thresholds
- timeouts.* - Timeout values for new requests
Requires Restart (logged as warnings):
- server.bind_address - TCP bind address
- server.workers - Worker thread count
Status Codes:
200: Hot reload status retrieved successfully
Example Usage:
# Check if hot reload is enabled
curl http://localhost:8080/admin/config/hot-reload-status | jq '.enabled'
# List items that support immediate update
curl http://localhost:8080/admin/config/hot-reload-status | jq '.capabilities.immediate_update.items'
Configuration Management API¶
The Configuration Management API enables viewing and modifying router configuration at runtime without requiring a server restart. This provides operational flexibility for adjusting behavior, adding backends, and fine-tuning settings in production environments.
Overview¶
Key capabilities:
- Runtime Configuration: View and modify configuration without server restart
- Hot Reload Support: Changes to supported settings apply immediately
- Validation: Validate configuration changes before applying
- History & Rollback: Track configuration changes and rollback to previous versions
- Export/Import: Backup and restore configurations across environments
- Security: Sensitive information (API keys, passwords, tokens) is automatically masked
Configuration Query APIs¶
Get Full Configuration¶
Returns the complete current configuration with sensitive information masked for security.
Response:
{
"server": {
"bind_address": "0.0.0.0:8080",
"workers": 4,
"connection_pool_size": 100
},
"backends": [
{
"name": "openai",
"url": "https://api.openai.com",
"api_key": "sk-****...**",
"weight": 1,
"models": ["gpt-4", "gpt-3.5-turbo"]
},
{
"name": "local-ollama",
"url": "http://localhost:11434",
"weight": 1,
"models": []
}
],
"health_checks": {
"interval": "30s",
"timeout": "10s",
"unhealthy_threshold": 3,
"healthy_threshold": 2
},
"logging": {
"level": "info",
"format": "json"
},
"retry": {
"max_attempts": 3,
"backoff": "exponential",
"initial_delay_ms": 100
},
"timeouts": {
"connect": "5s",
"request": "60s"
},
"rate_limiting": {
"enabled": false
},
"circuit_breaker": {
"enabled": true,
"failure_threshold": 5,
"recovery_timeout": "30s"
}
}
Notes:
- API keys, passwords, and tokens are masked (e.g., sk-****...**)
- All configuration sections are included in the response
- Use /admin/config/{section} for individual section details
Status Codes:
200: Configuration retrieved successfully
List Configuration Sections¶
Returns a list of all available configuration sections.
Response:
{
"sections": [
"server",
"backends",
"health_checks",
"logging",
"retry",
"timeouts",
"rate_limiting",
"circuit_breaker",
"global_prompts",
"admin",
"fallback",
"files",
"api_keys",
"metrics",
"routing"
],
"total": 15
}
Status Codes:
200: Section list retrieved successfully
Get Configuration Section¶
Returns the configuration for a specific section with hot reload capability information.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
section | string | Yes | Configuration section name |
Example Request:
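For example, fetching the logging section via the same /admin/config/{section} path used by the update examples later in this document:
# Retrieve the logging section; replace "logging" with any section name
curl http://localhost:8080/admin/config/logging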
Response:
{
"section": "logging",
"config": {
"level": "info",
"format": "json",
"output": "stdout",
"include_timestamps": true
},
"hot_reload_capability": "immediate_update",
"description": "Changes to this section apply immediately without service interruption"
}
Hot Reload Capability Values:
| Value | Description |
|---|---|
immediate_update | Changes apply immediately without service interruption |
gradual_update | Existing connections maintained, new connections use new config |
requires_restart | Server restart required for changes to take effect |
Status Codes:
200: Section configuration retrieved successfully
404: Invalid section name
Error Response:
{
"error": {
"message": "Configuration section 'invalid_section' not found",
"type": "not_found",
"code": 404,
"details": {
"requested_section": "invalid_section",
"available_sections": ["server", "backends", "logging", "..."]
}
}
}
Get Configuration Schema¶
Returns the JSON Schema for configuration validation. Useful for client-side validation before submitting changes.
Response:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"server": {
"type": "object",
"properties": {
"bind_address": {
"type": "string",
"pattern": "^[0-9.]+:[0-9]+$",
"description": "Server bind address in host:port format"
},
"workers": {
"type": "integer",
"minimum": 1,
"maximum": 256,
"description": "Number of worker threads"
},
"connection_pool_size": {
"type": "integer",
"minimum": 1,
"maximum": 10000,
"description": "HTTP connection pool size per backend"
}
},
"required": ["bind_address"]
},
"backends": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"minLength": 1,
"description": "Unique backend identifier"
},
"url": {
"type": "string",
"format": "uri",
"description": "Backend base URL"
},
"weight": {
"type": "integer",
"minimum": 0,
"maximum": 100,
"default": 1,
"description": "Load balancing weight"
},
"models": {
"type": "array",
"items": {"type": "string"},
"description": "Explicit model list (optional)"
}
},
"required": ["name", "url"]
}
},
"logging": {
"type": "object",
"properties": {
"level": {
"type": "string",
"enum": ["trace", "debug", "info", "warn", "error"],
"description": "Log level"
},
"format": {
"type": "string",
"enum": ["json", "text", "pretty"],
"description": "Log output format"
}
}
}
}
}
Status Codes:
200: Schema retrieved successfully
Configuration Modification APIs¶
Replace Configuration Section¶
Replaces an entire configuration section. Triggers validation and hot reload if applicable.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
section | string | Yes | Configuration section name |
Request Body: Complete section configuration object.
Example Request:
curl -X PUT http://localhost:8080/admin/config/logging \
-H "Content-Type: application/json" \
-d '{
"level": "debug",
"format": "json",
"output": "stdout",
"include_timestamps": true
}'
Response:
{
"success": true,
"section": "logging",
"hot_reload_applied": true,
"message": "Configuration updated and applied immediately",
"previous": {
"level": "info",
"format": "json",
"output": "stdout",
"include_timestamps": true
},
"current": {
"level": "debug",
"format": "json",
"output": "stdout",
"include_timestamps": true
},
"version": 15
}
Status Codes:
200: Configuration updated successfully
400: Invalid configuration format or validation error
404: Invalid section name
Partial Update Configuration Section¶
Performs a partial update using JSON merge patch semantics. Only specified fields are updated; unspecified fields retain their current values.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
section | string | Yes | Configuration section name |
Request Body: Partial configuration object with fields to update.
Example Request:
curl -X PATCH http://localhost:8080/admin/config/logging \
-H "Content-Type: application/json" \
-d '{
"level": "warn"
}'
Response:
{
"success": true,
"section": "logging",
"hot_reload_applied": true,
"message": "Configuration partially updated and applied",
"changes": {
"level": {
"from": "info",
"to": "warn"
}
},
"current": {
"level": "warn",
"format": "json",
"output": "stdout",
"include_timestamps": true
},
"version": 16
}
Merge Behavior:
- Scalar values are replaced
- Objects are merged recursively
- Arrays are replaced entirely (not merged)
- null values remove the field (if optional)
Status Codes:
200: Configuration updated successfully
400: Invalid configuration format or validation error
404: Invalid section name
Validate Configuration¶
Validates configuration without applying changes. Supports dry_run mode for testing configuration changes safely.
Request Body:
{
"section": "backends",
"config": {
"name": "new-backend",
"url": "http://localhost:8000",
"weight": 2
},
"dry_run": true
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
section | string | Yes | Configuration section to validate |
config | object | Yes | Configuration to validate |
dry_run | boolean | No | If true, only validate without preparing for apply (default: true) |
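A sketch of a validation call follows; the POST /admin/config/validate path is an assumption and may differ in your deployment.
# Assumed path; submits the section, config, and dry_run flag as JSON
curl -X POST http://localhost:8080/admin/config/validate \
  -H "Content-Type: application/json" \
  -d '{
    "section": "backends",
    "config": {"name": "new-backend", "url": "http://localhost:8000", "weight": 2},
    "dry_run": true
  }'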
Response (Valid):
{
"valid": true,
"section": "backends",
"warnings": [
"Backend 'new-backend' has no explicit model list; models will be auto-discovered"
],
"info": {
"hot_reload_capability": "gradual_update",
"estimated_impact": "New requests may be routed to this backend after apply"
}
}
Response (Invalid):
{
"valid": false,
"section": "backends",
"errors": [
{
"field": "url",
"message": "Invalid URL format: missing scheme",
"value": "localhost:8000"
},
{
"field": "weight",
"message": "Weight must be between 0 and 100",
"value": 150
}
],
"warnings": []
}
Status Codes:
200: Validation completed (check the valid field for the result)
400: Invalid request format
Apply Pending Changes¶
Applies pending configuration changes immediately. Triggers hot reload for applicable settings.
Request Body (optional):
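An illustrative body built from the parameters below; both fields are optional.
{
  "sections": ["logging", "rate_limiting"],
  "force": false
}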
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
sections | array | No | Specific sections to apply (default: all pending) |
force | boolean | No | Force apply even if warnings exist (default: false) |
Response:
{
"success": true,
"applied_sections": ["logging", "rate_limiting"],
"results": {
"logging": {
"status": "applied",
"hot_reload": "immediate_update"
},
"rate_limiting": {
"status": "applied",
"hot_reload": "immediate_update"
}
},
"version": 17,
"timestamp": "2024-01-15T10:45:30Z"
}
Status Codes:
200: Changes applied successfully
400: No pending changes or validation errors
409: Conflict with concurrent modification
Configuration Save/Restore APIs¶
Export Configuration¶
Exports the current configuration in the specified format.
Request Body:
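An illustrative body using the parameters below (all fields are optional; defaults are listed in the table).
{
  "format": "yaml",
  "include_sensitive": false,
  "sections": ["server", "backends", "logging"]
}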
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
format | string | No | Export format: yaml, json, or toml (default: yaml) |
include_sensitive | boolean | No | Include sensitive data unmasked (requires elevated permissions, default: false) |
sections | array | No | Specific sections to export (default: all) |
Response (format: json):
{
"format": "json",
"content": "{\"server\":{\"bind_address\":\"0.0.0.0:8080\",...}}",
"sections_exported": ["server", "backends", "logging"],
"exported_at": "2024-01-15T10:45:30Z",
"version": 17,
"checksum": "sha256:a1b2c3d4..."
}
Response (format: yaml):
{
"format": "yaml",
"content": "server:\n bind_address: \"0.0.0.0:8080\"\n workers: 4\n...",
"sections_exported": ["server", "backends", "logging"],
"exported_at": "2024-01-15T10:45:30Z",
"version": 17,
"checksum": "sha256:a1b2c3d4..."
}
Status Codes:
200: Export successful
400: Invalid format specified
403: Elevated permissions required for include_sensitive: true
Import Configuration¶
Imports configuration from the provided content.
Request Body:
{
"format": "yaml",
"content": "server:\n bind_address: \"0.0.0.0:8080\"\n workers: 8\nlogging:\n level: debug",
"dry_run": true,
"merge": false
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
format | string | Yes | Content format: yaml, json, or toml |
content | string | Yes | Configuration content to import |
dry_run | boolean | No | Validate without applying (default: false) |
merge | boolean | No | Merge with existing config vs replace (default: false) |
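A sketch of an import call; the POST /admin/config/import path is assumed by analogy with /admin/config/export and should be verified for your deployment.
# Assumed path; dry_run=true previews the changes without applying them
curl -X POST http://localhost:8080/admin/config/import \
  -H "Content-Type: application/json" \
  -d '{
    "format": "yaml",
    "content": "logging:\n  level: debug",
    "dry_run": true,
    "merge": true
  }'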
Response (dry_run: true):
{
"valid": true,
"dry_run": true,
"changes_preview": {
"server": {
"workers": {"from": 4, "to": 8}
},
"logging": {
"level": {"from": "info", "to": "debug"}
}
},
"sections_affected": ["server", "logging"],
"warnings": [
"server.workers change requires restart to take effect"
]
}
Response (dry_run: false):
{
"success": true,
"imported_sections": ["server", "logging"],
"hot_reload_results": {
"logging": "applied_immediately",
"server": "requires_restart"
},
"version": 18,
"timestamp": "2024-01-15T10:50:00Z"
}
Status Codes:
200: Import successful (or dry_run validation passed)
400: Invalid format or content parsing error
422: Configuration validation failed
Get Configuration History¶
Retrieves the history of configuration changes.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
limit | integer | No | Maximum entries to return (default: 20, max: 100) |
offset | integer | No | Number of entries to skip (default: 0) |
section | string | No | Filter by section name |
Example Request:
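For example (assuming the history endpoint lives under /admin/config/history; adjust if your deployment differs):
# Assumed path; limit maps to the query parameter documented above
curl "http://localhost:8080/admin/config/history?limit=10"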
Response:
{
"history": [
{
"version": 18,
"timestamp": "2024-01-15T10:50:00Z",
"sections_changed": ["logging"],
"source": "api",
"user": "admin",
"changes": {
"logging": {
"level": {"from": "info", "to": "debug"}
}
}
},
{
"version": 17,
"timestamp": "2024-01-15T09:30:00Z",
"sections_changed": ["backends"],
"source": "file_reload",
"user": null,
"changes": {
"backends": {
"added": ["new-backend"],
"modified": [],
"removed": []
}
}
},
{
"version": 16,
"timestamp": "2024-01-14T15:20:00Z",
"sections_changed": ["rate_limiting"],
"source": "api",
"user": "admin",
"changes": {
"rate_limiting": {
"enabled": {"from": false, "to": true}
}
}
}
],
"total": 18,
"limit": 10,
"offset": 0
}
Source Values:
| Source | Description |
|---|---|
api | Changed via Configuration Management API |
file_reload | Changed via configuration file hot reload |
startup | Initial configuration at server startup |
rollback | Restored from previous version |
Status Codes:
200: History retrieved successfully
Rollback Configuration¶
Rolls back to a previous configuration version.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
version | integer | Yes | Version number to rollback to |
Request Body (optional):
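An illustrative body using the parameters below; omit it entirely to roll back all changed sections immediately.
{
  "dry_run": false,
  "sections": ["logging", "backends"]
}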
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
dry_run | boolean | No | Preview changes without applying (default: false) |
sections | array | No | Specific sections to rollback (default: all changed sections) |
Response:
{
"success": true,
"rolled_back_from": 18,
"rolled_back_to": 15,
"sections_restored": ["logging", "backends"],
"changes": {
"logging": {
"level": {"from": "debug", "to": "info"}
},
"backends": {
"removed": ["new-backend"]
}
},
"new_version": 19,
"timestamp": "2024-01-15T11:00:00Z"
}
Status Codes:
200: Rollback successful
400: Validation error for target configuration
404: Version not found in history
Backend Management APIs¶
These endpoints provide convenient shortcuts for managing backends without modifying the full backends configuration section.
Add Backend¶
Dynamically adds a new backend to the router.
Request Body:
{
"name": "new-ollama",
"url": "http://192.168.1.100:11434",
"weight": 2,
"models": ["llama2", "codellama"],
"api_key": null,
"health_check_path": "/api/tags"
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Unique backend identifier |
url | string | Yes | Backend base URL |
weight | integer | No | Load balancing weight (default: 1) |
models | array | No | Explicit model list (empty for auto-discovery) |
api_key | string | No | API key for authentication |
health_check_path | string | No | Custom health check endpoint |
Response:
{
"success": true,
"backend": {
"name": "new-ollama",
"url": "http://192.168.1.100:11434",
"weight": 2,
"models": ["llama2", "codellama"],
"is_healthy": null,
"status": "pending_health_check"
},
"message": "Backend added successfully. Health check scheduled.",
"config_version": 20
}
Status Codes:
200: Backend added successfully
400: Invalid backend configuration
409: Backend with this name already exists
Get Backend Configuration¶
Retrieves the configuration for a specific backend.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Backend identifier |
Response:
{
"name": "local-ollama",
"url": "http://localhost:11434",
"weight": 1,
"models": ["llama2", "mistral", "codellama"],
"api_key": null,
"health_check_path": "/api/tags",
"is_healthy": true,
"consecutive_failures": 0,
"consecutive_successes": 25,
"last_check": "2024-01-15T10:55:00Z",
"total_requests": 1250,
"failed_requests": 3
}
Status Codes:
200: Backend configuration retrieved
404: Backend not found
Update Backend Configuration¶
Updates the configuration for an existing backend.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Backend identifier |
Request Body:
{
"url": "http://localhost:11434",
"weight": 3,
"models": ["llama2", "mistral", "codellama", "phi"],
"api_key": null
}
Response:
{
"success": true,
"backend": {
"name": "local-ollama",
"url": "http://localhost:11434",
"weight": 3,
"models": ["llama2", "mistral", "codellama", "phi"]
},
"changes": {
"weight": {"from": 1, "to": 3},
"models": {"added": ["phi"], "removed": []}
},
"config_version": 21
}
Status Codes:
200: Backend updated successfully
400: Invalid configuration
404: Backend not found
Delete Backend¶
Removes a backend from the router.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Backend identifier |
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
drain | boolean | No | Wait for active requests to complete (default: true) |
timeout | integer | No | Drain timeout in seconds (default: 30) |
Example Request:
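For example, removing a backend named old-backend with draining enabled; the /admin/backends/{name} path is an assumption that mirrors the add-backend endpoint.
# Assumed path; drain and timeout are the query parameters documented above
curl -X DELETE "http://localhost:8080/admin/backends/old-backend?drain=true&timeout=30"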
Response:
{
"success": true,
"deleted_backend": "old-backend",
"drained": true,
"active_requests_completed": 5,
"config_version": 22,
"message": "Backend removed from rotation"
}
Status Codes:
200: Backend deleted successfully
404: Backend not found
409: Cannot delete last remaining backend
Update Backend Weight¶
Updates only the load balancing weight for a backend.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Backend identifier |
Request Body:
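An illustrative body setting the new weight (matching the response example below):
{
  "weight": 5
}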
Response:
{
"success": true,
"backend": "local-ollama",
"weight": {
"from": 1,
"to": 5
},
"config_version": 23
}
Status Codes:
200: Weight updated successfully
400: Invalid weight value
404: Backend not found
Update Backend Models¶
Updates only the model list for a backend.
Path Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Backend identifier |
Request Body:
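An illustrative body matching the response example below, adding two models to the existing list:
{
  "models": ["phi", "gemma"],
  "mode": "add"
}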
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
models | array | Yes | Model list |
mode | string | No | Update mode: replace, add, or remove (default: replace) |
Response:
{
"success": true,
"backend": "local-ollama",
"models": {
"previous": ["llama2", "mistral", "codellama"],
"current": ["llama2", "mistral", "codellama", "phi", "gemma"],
"added": ["phi", "gemma"],
"removed": []
},
"config_version": 24
}
Status Codes:
200: Models updated successfully
400: Invalid model list
404: Backend not found
Configuration API Examples¶
Get Full Configuration¶
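A minimal example, assuming the full configuration is served at GET /admin/config (the same prefix used by the section endpoints):
# Assumed path; sensitive values are returned masked
curl http://localhost:8080/admin/config | jq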
Update Logging Level¶
curl -X PATCH http://localhost:8080/admin/config/logging \
-H "Content-Type: application/json" \
-d '{"level": "debug"}'
Add a New Backend¶
curl -X POST http://localhost:8080/admin/backends \
-H "Content-Type: application/json" \
-d '{
"name": "remote-ollama",
"url": "http://192.168.1.50:11434",
"weight": 2,
"models": ["llama2", "mistral"]
}'
Export Configuration as JSON¶
curl -X POST http://localhost:8080/admin/config/export \
-H "Content-Type: application/json" \
-d '{"format": "json"}' | jq -r '.content' > config-backup.json
View Configuration History¶
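Assuming the history endpoint is GET /admin/config/history (an assumed path, as noted earlier):
# Assumed path; show which sections changed in the ten most recent versions
curl "http://localhost:8080/admin/config/history?limit=10" | jq '.history[].sections_changed'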
Configuration API Error Responses¶
All Configuration Management API errors follow the standard error format:
{
"error": {
"message": "Human-readable error description",
"type": "error_type_identifier",
"code": 400,
"details": {
"additional": "context information"
}
}
}
Configuration-Specific Error Types:
| Type | HTTP Code | Description |
|---|---|---|
config_validation_error | 400 | Configuration validation failed |
config_section_not_found | 404 | Requested configuration section does not exist |
config_version_not_found | 404 | Requested version not found in history |
config_conflict | 409 | Concurrent modification conflict |
config_permission_denied | 403 | Insufficient permissions for operation |
config_parse_error | 422 | Failed to parse configuration content |
Example Validation Error:
{
"error": {
"message": "Configuration validation failed",
"type": "config_validation_error",
"code": 400,
"details": {
"section": "backends",
"errors": [
{
"field": "url",
"message": "URL must include scheme (http:// or https://)",
"value": "localhost:8000"
}
]
}
}
}
Example Conflict Error:
{
"error": {
"message": "Configuration was modified by another request",
"type": "config_conflict",
"code": 409,
"details": {
"expected_version": 15,
"current_version": 16,
"conflicting_sections": ["backends"]
}
}
}
Error Handling¶
Error Response Format¶
All errors follow a consistent JSON structure:
{
"error": {
"message": "Human-readable error description",
"type": "error_type_identifier",
"code": 404,
"details": {
"additional": "context information"
}
}
}
Error Types¶
| Type | HTTP Code | Description |
|---|---|---|
bad_request | 400 | Invalid request format or parameters |
unauthorized | 401 | Authentication required or API key invalid |
forbidden | 403 | Access denied (insufficient permissions) |
model_not_found | 404 | Requested model not available |
rate_limit_exceeded | 429 | Rate limit exceeded (future feature) |
internal_error | 500 | Router internal error |
bad_gateway | 502 | Backend connection/response error |
service_unavailable | 503 | All backends unhealthy |
gateway_timeout | 504 | Backend request timeout |
Example Error Responses¶
Model Not Found:
{
"error": {
"message": "Model 'invalid-model' not found on any healthy backend",
"type": "model_not_found",
"code": 404,
"details": {
"requested_model": "invalid-model",
"available_models": ["gpt-4", "gpt-3.5-turbo", "llama2"]
}
}
}
Backend Error:
{
"error": {
"message": "Failed to connect to backend 'local-ollama'",
"type": "bad_gateway",
"code": 502,
"details": {
"backend": "local-ollama",
"backend_error": "Connection refused"
}
}
}
Service Unavailable:
{
"error": {
"message": "All backends are currently unhealthy",
"type": "service_unavailable",
"code": 503,
"details": {
"healthy_backends": 0,
"total_backends": 3
}
}
}
Rate Limiting¶
Note: Rate limiting is not currently implemented but is planned for future releases.
Future rate limiting will support:
- Per-IP rate limiting
- Per-API-key rate limiting
- Model-specific rate limiting
- Sliding window algorithms
- Rate limit headers in responses
Streaming¶
Server-Sent Events (SSE)¶
When stream: true is specified, responses are sent as Server-Sent Events with:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
SSE Format¶
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
data: [DONE]
SSE Compatibility¶
The router supports multiple SSE formats for maximum compatibility:
- Standard Format: data: {...}
- Spaced Format: data: {...}
- Mixed Line Endings: Handles \r\n, \n, and \r
- Empty Lines: Properly processes chunk separators
Connection Management¶
- Keep-Alive: Connections are kept open during streaming
- Timeouts: 5-minute timeout for long-running requests
- Error Handling: Partial responses include error information
- Client Disconnection: Gracefully handles client disconnects
Examples¶
Basic Chat Completion¶
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
Streaming Chat Completion¶
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Write a short story"}
],
"stream": true,
"max_tokens": 200
}'
Text Completion with Parameters¶
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo-instruct",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0.8,
"top_p": 0.9
}'
Check Backend Status¶
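A sketch, assuming backend status is exposed at GET /admin/backends (the POST form of this path adds a backend; verify the exact status path for your deployment):
# Assumed path; prints each backend's name and health flag
curl http://localhost:8080/admin/backends | jq '.backends[] | {name, is_healthy}'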
Monitor Service Health¶
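A sketch, assuming the detailed service health report is exposed under the admin prefix (for example /admin/health); adjust the path to your deployment:
# Assumed path for the detailed health report with per-component status
curl http://localhost:8080/admin/health | jq '.status, .services'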
List Available Models¶
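The models endpoint uses the standard OpenAI-compatible path:
# List models aggregated from all healthy backends
curl http://localhost:8080/v1/models | jq '.data[].id'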
Python Client Example¶
import requests
import json
# Configure the client
BASE_URL = "http://localhost:8080"
def chat_completion(messages, model="gpt-3.5-turbo", stream=False):
"""Send a chat completion request"""
response = requests.post(
f"{BASE_URL}/v1/chat/completions",
headers={"Content-Type": "application/json"},
json={
"model": model,
"messages": messages,
"stream": stream,
"temperature": 0.7
},
stream=stream
)
if stream:
# Handle streaming response
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = line[6:] # Remove 'data: ' prefix
if data == '[DONE]':
break
try:
chunk = json.loads(data)
content = chunk['choices'][0]['delta'].get('content', '')
if content:
print(content, end='', flush=True)
except json.JSONDecodeError:
continue
print() # New line after streaming
else:
# Handle non-streaming response
result = response.json()
return result['choices'][0]['message']['content']
# Example usage
messages = [
{"role": "user", "content": "Explain machine learning in simple terms"}
]
print("Streaming response:")
chat_completion(messages, stream=True)
print("\nNon-streaming response:")
response = chat_completion(messages, stream=False)
print(response)
JavaScript/Node.js Client Example¶
// Uses the built-in fetch and Web Streams APIs available in Node.js 18+
const BASE_URL = 'http://localhost:8080';
async function chatCompletion(messages, options = {}) {
const response = await fetch(`${BASE_URL}/v1/chat/completions`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: options.model || 'gpt-3.5-turbo',
messages: messages,
stream: options.stream || false,
temperature: options.temperature || 0.7,
...options
})
});
if (options.stream) {
// Handle streaming response
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = line.slice(6);
if (data === '[DONE]') return;
try {
const parsed = JSON.parse(data);
const content = parsed.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
} catch (e) {
// Ignore JSON parse errors
}
}
}
}
console.log(); // New line
} else {
const result = await response.json();
return result.choices[0].message.content;
}
}
// Example usage (wrapped in an async function so await is valid in CommonJS)
async function main() {
  const messages = [
    { role: 'user', content: 'What is the meaning of life?' }
  ];

  // Streaming
  console.log('Streaming response:');
  await chatCompletion(messages, { stream: true });

  // Non-streaming
  console.log('\nNon-streaming response:');
  const response = await chatCompletion(messages);
  console.log(response);
}

main();
This API reference provides comprehensive documentation for integrating with Continuum Router. The router maintains full OpenAI API compatibility while adding powerful multi-backend routing and management capabilities.