Reasoning Effort Parameter¶
This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.
Overview¶
The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:
- OpenAI: Native `reasoning_effort` parameter for O-series and GPT-5.2 thinking models
- Anthropic: Converted to `thinking.budget_tokens` for Claude models with extended thinking
- Gemini: Native `reasoning_effort` for thinking-capable models via OpenAI-compatible endpoint
- Other backends: Pass-through (no transformation)
Parameter Formats¶
Continuum Router supports two input formats, both normalized internally:
Flat Format (Chat Completions API)¶
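The example under this heading appears to have been dropped during extraction. A representative flat-format request would look like this (the model name and effort value are illustrative):

```json
{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}
```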
Nested Format (Responses API)¶
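The example here also appears to be missing. Assuming the Responses API convention of a nested `reasoning` object, an equivalent request would be:

```json
{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "messages": [...]
}
```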
Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.
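The normalization and precedence rules above can be sketched as follows. This is an illustrative sketch, not the router's actual code; the function name is hypothetical:

```python
def normalize_reasoning_params(request: dict) -> dict:
    """Normalize the nested Responses-API form to the flat
    Chat Completions form. Illustrative sketch only."""
    req = dict(request)
    # Remove the nested object either way; only the flat key survives.
    nested = req.pop("reasoning", None)
    # Flat format takes precedence when both are present.
    if "reasoning_effort" not in req and isinstance(nested, dict):
        effort = nested.get("effort")
        if effort is not None:
            req["reasoning_effort"] = effort
    return req
```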
Direct Token Budget Specification¶
For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:
Anthropic: Direct thinking Parameter¶
{
"model": "claude-sonnet-4-20250514",
"thinking": {
"type": "enabled",
"budget_tokens": 16000
},
"messages": [...]
}
Gemini: Direct thinking_budget via extra_body¶
{
"model": "gemini-2.5-pro",
"extra_body": {
"google": {
"thinking_config": {
"thinking_budget": 10000,
"include_thoughts": true
}
}
},
"messages": [...]
}
Priority
If both reasoning_effort and direct token specification are present, the direct specification takes precedence for Anthropic (thinking parameter). For Gemini, both can coexist but thinking_budget in extra_body is used for fine-grained control.
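For Anthropic, the precedence rule can be expressed as a small check. This sketch (hypothetical function name) only reports which input wins; it does not build the final payload:

```python
def anthropic_thinking_source(request: dict) -> str:
    """Report which input determines the Anthropic thinking budget.
    A direct `thinking` spec takes precedence over `reasoning_effort`."""
    if "thinking" in request:
        return "direct"
    if "reasoning_effort" in request:
        return "reasoning_effort"
    return "none"
```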
Backend-Specific Behavior¶
OpenAI Backend¶
OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.
Supported Effort Levels¶
| Effort Level | Supported Models | Description |
|---|---|---|
| `low` | O-series, GPT-5.2 thinking | Minimal reasoning, faster responses |
| `medium` | O-series, GPT-5.2 thinking | Balanced reasoning effort |
| `high` | O-series, GPT-5.2 thinking | Deep reasoning, slower responses |
| `xhigh` | GPT-5.2 family only | Maximum reasoning effort |
Models Supporting reasoning_effort¶
O-series models (support `low`, `medium`, `high`):

- `o1`, `o1-mini`, `o1-preview`
- `o3`, `o3-mini`, `o3-pro`
- `o4-mini`

GPT-5.2 thinking models (support `low`, `medium`, `high`, `xhigh`):

- `gpt-5.2`, `gpt-5.2-thinking`, `gpt-5.2-latest`
- `gpt-5.2-pro`
xhigh Automatic Fallback
The xhigh effort level is only natively supported by GPT-5.2 family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.
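The fallback described above can be sketched as a single check. The prefix match on the model name is an assumption for illustration; the router's actual model detection may differ:

```python
# Assumption: GPT-5.2 family models are identified by name prefix.
XHIGH_NATIVE_PREFIX = "gpt-5.2"

def resolve_effort(model: str, effort: str) -> str:
    """Downgrade xhigh to high for models outside the GPT-5.2 family.
    Illustrative sketch; the real router also logs the downgrade."""
    if effort == "xhigh" and not model.startswith(XHIGH_NATIVE_PREFIX):
        return "high"
    return effort
```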
Models NOT Supporting reasoning_effort¶
The following models do not support reasoning parameters (parameter is stripped):
- GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
- GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
- GPT-5.1, GPT-5
- GPT-3.5-turbo
- Embedding models, Image models
Anthropic Backend (Claude)¶
Anthropic Claude models use a different mechanism called "extended thinking" with a thinking.budget_tokens parameter. The router automatically converts reasoning_effort to the appropriate token budget.
Conversion Table¶
| reasoning_effort | budget_tokens | Description |
|---|---|---|
| `none` | disabled | Thinking feature disabled |
| `minimal` | 1,024 | Minimum allowed budget |
| `low` | 4,096 | Light reasoning |
| `medium` | 10,240 | Moderate reasoning |
| `high` | 32,768 | Deep reasoning |
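The conversion table above maps directly onto a lookup. This is an illustrative sketch; in particular, representing the disabled state as `{"type": "disabled"}` is an assumption about how the router encodes it:

```python
# Effort-level to Anthropic budget_tokens mapping, from the table above.
EFFORT_TO_BUDGET = {
    "minimal": 1_024,
    "low": 4_096,
    "medium": 10_240,
    "high": 32_768,
}

def to_thinking_param(effort: str) -> dict:
    """Build the Anthropic `thinking` parameter for an effort level.
    `none` disables the feature (encoding assumed for illustration)."""
    if effort == "none":
        return {"type": "disabled"}
    return {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}
```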
Transformation Example¶
Input (OpenAI format):
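The input example appears to have been dropped during extraction; given the transformed output that follows, the input would be along these lines:

```json
{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}
```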
Transformed (Anthropic format):
{
"model": "claude-sonnet-4-20250514",
"thinking": {
"type": "enabled",
"budget_tokens": 32768
},
"messages": [...]
}
Models Supporting Extended Thinking¶
Extended thinking is supported by:
- Claude Opus models: `claude-opus-4-*`, `claude-opus-4-5-*`
- Claude Sonnet 4 models: `claude-sonnet-4-*`, `claude-sonnet-4-5-*`
Temperature Restriction
When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active.
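The temperature restriction can be sketched as a final cleanup pass over the transformed payload. Hypothetical function name; illustrative only:

```python
def apply_thinking_restrictions(payload: dict) -> dict:
    """Strip the unsupported `temperature` parameter when extended
    thinking is active (illustrative sketch)."""
    req = dict(payload)
    thinking = req.get("thinking") or {}
    if thinking.get("type") == "enabled":
        req.pop("temperature", None)
    return req
```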
xhigh Automatic Fallback
When xhigh is requested for Claude models, the router automatically downgrades it to high (32,768 budget_tokens) since Claude doesn't support xhigh.
Gemini Backend¶
Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.
Supported Effort Levels¶
| Effort Level | Supported Models | Description |
|---|---|---|
| `none` | Flash models only | Disable thinking |
| `minimal` | All thinking models | Minimal reasoning |
| `low` | All thinking models | Light reasoning |
| `medium` | All thinking models | Moderate reasoning |
| `high` | All thinking models | Deep reasoning |
Flash Models (support none)¶
- `gemini-2.0-flash`
- `gemini-2.5-flash`
- `gemini-3-flash`
Pro Models (do NOT support none)¶
- `gemini-2.5-pro`
- `gemini-3-pro`
Pro Model Restriction
Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.
xhigh Automatic Fallback
When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.
Additional Features¶
For Gemini thinking models, the router automatically:
- Sets `include_thoughts: true` to expose reasoning content
- Sets a default `max_completion_tokens: 16384` if not specified
- Validates the effort level against model capabilities
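Taken together, the Gemini-specific handling might be sketched like this. The function name and the name-based Flash/Pro detection are assumptions for illustration:

```python
GEMINI_DEFAULT_MAX_TOKENS = 16_384

def prepare_gemini_request(model: str, payload: dict) -> dict:
    """Apply the Gemini defaults and validation described above.
    Illustrative sketch; model detection by substring is an assumption."""
    req = dict(payload)
    # `none` is only valid for Flash models; Pro models reject it.
    if req.get("reasoning_effort") == "none" and "flash" not in model:
        raise ValueError(f"effort 'none' is not supported by {model}")
    req.setdefault("max_completion_tokens", GEMINI_DEFAULT_MAX_TOKENS)
    # Expose reasoning content in the response.
    req.setdefault("extra_body", {}).setdefault("google", {}).setdefault(
        "thinking_config", {}
    )["include_thoughts"] = True
    return req
```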
Generic/HTTP Backend¶
Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.
| Behavior | Description |
|---|---|
| Pass-through | reasoning_effort is forwarded as-is to the backend |
| No validation | The router does not validate effort levels |
| No conversion | No token budget conversion is performed |
Backend Responsibility
For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.
llama.cpp Backend¶
The llama.cpp backend uses pass-through behavior:
| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded unchanged |
| Server-dependent | Support depends on llama-server configuration |
Thinking Model Support in llama.cpp
Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:
- Model's built-in chat template with thinking tags
- Server-side configuration (e.g., `--thinking-budget` if available)
- Sampling parameters like temperature and top-p
vLLM Backend¶
vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:
| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded via generic backend |
| Model-dependent | Support varies by model type |
Thinking Model Support in vLLM
vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:
- Whether the model supports structured thinking
- vLLM server version and configuration
- Model's chat template configuration
For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.
Summary Table¶
| Backend | Effort Levels | Conversion | Notes |
|---|---|---|---|
| OpenAI | low, medium, high, xhigh* | None (native) | *xhigh only for GPT-5.2 |
| Anthropic | none, minimal, low, medium, high | → budget_tokens | Removes temperature when enabled |
| Gemini | none*, minimal, low, medium, high | None (native) | *none only for Flash models |
| vLLM | Model-dependent | Pass-through | DeepSeek-R1, QwQ use implicit thinking |
| llama.cpp | Model-dependent | Pass-through | Uses <think> tag in chat template |
| Generic | Any | Pass-through | Backend handles validation |
Response Format¶
When reasoning/thinking is enabled, the response includes the model's reasoning process:
OpenAI Format (with reasoning_content)¶
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "Let me think through this step by step..."
},
"finish_reason": "stop"
}]
}
Claude Extended Thinking (transformed to OpenAI format)¶
The router transforms Claude's thinking blocks to the reasoning_content field:
{
"choices": [{
"message": {
"role": "assistant",
"content": "The answer is 42.",
"reasoning_content": "I need to consider multiple factors..."
},
"finish_reason": "stop"
}]
}
Best Practices¶
- Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
- Check model support: Not all models support reasoning parameters
- Handle xhigh carefully: Only the GPT-5.2 family supports `xhigh`; use `high` for other models
- Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's `budget_tokens`)
- Test with your backend: Generic backends may have varying support levels