Reasoning Effort Parameter¶
This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.
Overview¶
The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:
- OpenAI: Native `reasoning_effort` parameter for O-series and GPT-5.2 thinking models
- Anthropic: Converted to `thinking.budget_tokens` for Claude models with extended thinking
- Gemini: Native `reasoning_effort` for thinking-capable models via OpenAI-compatible endpoint
- Other backends: Pass-through (no transformation)
Parameter Formats¶
Continuum Router supports two input formats, both normalized internally:
Flat Format (Chat Completions API)¶
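The example under this heading appears to have been dropped during extraction. A representative flat-format request would look like this (the model name and effort value are illustrative):

```json
{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}
```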
Nested Format (Responses API)¶
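The example here also appears to be missing. Assuming the Responses API convention of a nested `reasoning` object, an equivalent request would be:

```json
{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "messages": [...]
}
```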
Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.
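The normalization and precedence rules above can be sketched as follows. This is an illustrative sketch, not the router's actual code; the function name is hypothetical:

```python
def normalize_reasoning_params(request: dict) -> dict:
    """Normalize the nested Responses-API form to the flat
    Chat Completions form. Illustrative sketch only."""
    req = dict(request)
    # Remove the nested object either way; only the flat key survives.
    nested = req.pop("reasoning", None)
    # Flat format takes precedence when both are present.
    if "reasoning_effort" not in req and isinstance(nested, dict):
        effort = nested.get("effort")
        if effort is not None:
            req["reasoning_effort"] = effort
    return req
```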
Direct Token Budget Specification¶
For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:
Anthropic: Direct thinking Parameter¶
{
"model": "claude-sonnet-4-20250514",
"thinking": {
"type": "enabled",
"budget_tokens": 16000
},
"messages": [...]
}
Gemini: Direct thinking_budget via extra_body¶
{
"model": "gemini-2.5-pro",
"extra_body": {
"google": {
"thinking_config": {
"thinking_budget": 10000,
"include_thoughts": true
}
}
},
"messages": [...]
}
Priority
If both reasoning_effort and direct token specification are present, the direct specification takes precedence for Anthropic (thinking parameter). For Gemini, both can coexist but thinking_budget in extra_body is used for fine-grained control.
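For Anthropic, the precedence rule can be expressed as a small check. This sketch (hypothetical function name) only reports which input wins; it does not build the final payload:

```python
def anthropic_thinking_source(request: dict) -> str:
    """Report which input determines the Anthropic thinking budget.
    A direct `thinking` spec takes precedence over `reasoning_effort`."""
    if "thinking" in request:
        return "direct"
    if "reasoning_effort" in request:
        return "reasoning_effort"
    return "none"
```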
Backend-Specific Behavior¶
OpenAI Backend¶
OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.
Supported Effort Levels¶
| Effort Level | Supported Models | Description |
|---|---|---|
| `low` | O-series, GPT-5.2 thinking | Minimal reasoning, faster responses |
| `medium` | O-series, GPT-5.2 thinking | Balanced reasoning effort |
| `high` | O-series, GPT-5.2 thinking | Deep reasoning, slower responses |
| `xhigh` | GPT-5.2 family only | Maximum reasoning effort |
Models Supporting reasoning_effort¶
O-series models (support `low`, `medium`, `high`):

- `o1`, `o1-mini`, `o1-preview`
- `o3`, `o3-mini`, `o3-pro`
- `o4-mini`

GPT-5.2 thinking models (support `low`, `medium`, `high`, `xhigh`):

- `gpt-5.2`, `gpt-5.2-thinking`, `gpt-5.2-latest`
- `gpt-5.2-pro`
xhigh Automatic Fallback
The xhigh effort level is only natively supported by GPT-5.2 family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.
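The fallback described above can be sketched as a single check. The prefix match on the model name is an assumption for illustration; the router's actual model detection may differ:

```python
# Assumption: GPT-5.2 family models are identified by name prefix.
XHIGH_NATIVE_PREFIX = "gpt-5.2"

def resolve_effort(model: str, effort: str) -> str:
    """Downgrade xhigh to high for models outside the GPT-5.2 family.
    Illustrative sketch; the real router also logs the downgrade."""
    if effort == "xhigh" and not model.startswith(XHIGH_NATIVE_PREFIX):
        return "high"
    return effort
```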
Models NOT Supporting reasoning_effort¶
The following models do not support reasoning parameters (parameter is stripped):
- GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
- GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
- GPT-5.1, GPT-5
- GPT-3.5-turbo
- Embedding models, Image models
Anthropic Backend (Claude)¶
Anthropic Claude models use a different mechanism called "extended thinking" with a thinking.budget_tokens parameter. The router automatically converts reasoning_effort to the appropriate token budget.
Conversion Table¶
| reasoning_effort | budget_tokens | Description |
|---|---|---|
| `none` | disabled | Thinking feature disabled |
| `minimal` | 1,024 | Minimum allowed budget |
| `low` | 4,096 | Light reasoning |
| `medium` | 10,240 | Moderate reasoning |
| `high` | 32,768 | Deep reasoning |
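The conversion table above maps directly onto a lookup. This is an illustrative sketch; in particular, representing the disabled state as `{"type": "disabled"}` is an assumption about how the router encodes it:

```python
# Effort-level to Anthropic budget_tokens mapping, from the table above.
EFFORT_TO_BUDGET = {
    "minimal": 1_024,
    "low": 4_096,
    "medium": 10_240,
    "high": 32_768,
}

def to_thinking_param(effort: str) -> dict:
    """Build the Anthropic `thinking` parameter for an effort level.
    `none` disables the feature (encoding assumed for illustration)."""
    if effort == "none":
        return {"type": "disabled"}
    return {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}
```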
Transformation Example¶
Input (OpenAI format):
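The input example appears to have been dropped during extraction; given the transformed output that follows, the input would be along these lines:

```json
{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}
```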
Transformed (Anthropic format):
{
"model": "claude-sonnet-4-20250514",
"thinking": {
"type": "enabled",
"budget_tokens": 32768
},
"messages": [...]
}
Models Supporting Extended Thinking¶
Extended thinking is supported by:
- Claude Opus models: `claude-opus-4-*`, `claude-opus-4-5-*`
- Claude Sonnet 4 models: `claude-sonnet-4-*`, `claude-sonnet-4-5-*`
Temperature Restriction
When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active.
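The temperature restriction can be sketched as a final cleanup pass over the transformed payload. Hypothetical function name; illustrative only:

```python
def apply_thinking_restrictions(payload: dict) -> dict:
    """Strip the unsupported `temperature` parameter when extended
    thinking is active (illustrative sketch)."""
    req = dict(payload)
    thinking = req.get("thinking") or {}
    if thinking.get("type") == "enabled":
        req.pop("temperature", None)
    return req
```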
xhigh Automatic Fallback
When xhigh is requested for Claude models, the router automatically downgrades it to high (32,768 budget_tokens) since Claude doesn't support xhigh.
Gemini Backend¶
Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.
Supported Effort Levels¶
| Effort Level | Supported Models | Description |
|---|---|---|
| `none` | Flash models only | Disable thinking |
| `minimal` | All thinking models | Minimal reasoning |
| `low` | All thinking models | Light reasoning |
| `medium` | All thinking models | Moderate reasoning |
| `high` | All thinking models | Deep reasoning |
Flash Models (support none)¶
- `gemini-2.0-flash`
- `gemini-2.5-flash`
- `gemini-3-flash`
Pro Models (do NOT support none)¶
- `gemini-2.5-pro`
- `gemini-3-pro`
Pro Model Restriction
Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.
xhigh Automatic Fallback
When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.
Additional Features¶
For Gemini thinking models, the router automatically:
- Sets `include_thoughts: true` to expose reasoning content
- Sets a default `max_completion_tokens: 16384` if not specified
- Validates the effort level against model capabilities
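Taken together, the Gemini-specific handling might be sketched like this. The function name and the name-based Flash/Pro detection are assumptions for illustration:

```python
GEMINI_DEFAULT_MAX_TOKENS = 16_384

def prepare_gemini_request(model: str, payload: dict) -> dict:
    """Apply the Gemini defaults and validation described above.
    Illustrative sketch; model detection by substring is an assumption."""
    req = dict(payload)
    # `none` is only valid for Flash models; Pro models reject it.
    if req.get("reasoning_effort") == "none" and "flash" not in model:
        raise ValueError(f"effort 'none' is not supported by {model}")
    req.setdefault("max_completion_tokens", GEMINI_DEFAULT_MAX_TOKENS)
    # Expose reasoning content in the response.
    req.setdefault("extra_body", {}).setdefault("google", {}).setdefault(
        "thinking_config", {}
    )["include_thoughts"] = True
    return req
```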
Generic/HTTP Backend¶
Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.
| Behavior | Description |
|---|---|
| Pass-through | reasoning_effort is forwarded as-is to the backend |
| No validation | The router does not validate effort levels |
| No conversion | No token budget conversion is performed |
Backend Responsibility
For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.
llama.cpp Backend¶
The llama.cpp backend uses pass-through behavior:
| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded unchanged |
| Server-dependent | Support depends on llama-server configuration |
Thinking Model Support in llama.cpp
Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:
- Model's built-in chat template with thinking tags
- Server-side configuration (e.g., `--thinking-budget` if available)
- Sampling parameters like temperature and top-p
vLLM Backend¶
vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:
| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded via generic backend |
| Model-dependent | Support varies by model type |
Thinking Model Support in vLLM
vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:
- Whether the model supports structured thinking
- vLLM server version and configuration
- Model's chat template configuration
For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.
Summary Table¶
| Backend | Effort Levels | Conversion | Notes |
|---|---|---|---|
| OpenAI | low, medium, high, xhigh* | None (native) | *xhigh only for GPT-5.2 |
| Anthropic | none, minimal, low, medium, high | → budget_tokens | Removes temperature when enabled |
| Gemini | none*, minimal, low, medium, high | None (native) | *none only for Flash models |
| vLLM | Model-dependent | Pass-through | DeepSeek-R1, QwQ use implicit thinking |
| llama.cpp | Model-dependent | Pass-through | Uses <think> tag in chat template |
| Generic | Any | Pass-through | Backend handles validation |
Response Format¶
When reasoning/thinking is enabled, the response includes the model's reasoning process:
OpenAI Format (with reasoning_content)¶
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "Let me think through this step by step..."
},
"finish_reason": "stop"
}]
}
Claude Extended Thinking (transformed to OpenAI format)¶
The router transforms Claude's thinking blocks to the reasoning_content field:
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "I need to consider multiple factors..."
},
"finish_reason": "stop"
}]
}
Best Practices¶
- Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
- Check model support: Not all models support reasoning parameters
- Handle xhigh carefully: Only the GPT-5.2 family supports `xhigh`; use `high` for other models
- Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's `budget_tokens`)
- Test with your backend: Generic backends may have varying support levels