Reasoning Effort Parameter¶

This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.

Overview¶

The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:

OpenAI: Native reasoning_effort parameter for O-series and GPT-5.x thinking models
Anthropic: Converted to thinking.budget_tokens for Claude models with extended thinking, or to {"type": "adaptive"} for auto effort
Gemini: Native reasoning_effort for thinking-capable models via OpenAI-compatible endpoint
Other backends: Pass-through (no transformation)

Parameter Formats¶

Continuum Router supports two input formats, both normalized internally:

Flat Format (Chat Completions API)¶

{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}

Nested Format (Responses API)¶

{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "input": "..."
}

Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.

Direct Token Budget Specification¶

For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:

Anthropic: Direct `thinking` Parameter¶

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 16000
  },
  "messages": [...]
}

For adaptive thinking (model decides when and how much to think):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Gemini: Direct `thinking_budget` via `extra_body`¶

{
  "model": "gemini-2.5-pro",
  "extra_body": {
    "google": {
      "thinking_config": {
        "thinking_budget": 10000,
        "include_thoughts": true
      }
    }
  },
  "messages": [...]
}

Priority

If both reasoning_effort and direct token specification are present, the direct specification takes precedence for Anthropic (thinking parameter). For Gemini, both can coexist but thinking_budget in extra_body is used for fine-grained control.

Backend-Specific Behavior¶

OpenAI Backend¶

OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.

Supported Effort Levels¶

Effort Level	Supported Models	Description
`low`	O-series, GPT-5.x thinking	Minimal reasoning, faster responses
`medium`	O-series, GPT-5.x thinking	Balanced reasoning effort
`high`	O-series, GPT-5.x thinking	Deep reasoning, slower responses
`xhigh`	GPT-5.x family only	Maximum reasoning effort

Models Supporting `reasoning_effort`¶

O-series models (support low, medium, high):

o1, o1-mini, o1-preview
o3, o3-mini, o3-pro
o4-mini

GPT-5.x thinking models (support low, medium, high, xhigh):

gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano
gpt-5.2, gpt-5.2-thinking, gpt-5.2-pro

xhigh Automatic Fallback

The xhigh effort level is only natively supported by GPT-5.x family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.

Models NOT Supporting `reasoning_effort`¶

The following models do not support reasoning parameters (parameter is stripped):

GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
GPT-5.1, GPT-5
GPT-3.5-turbo
Embedding models, Image models

Anthropic Backend (Claude)¶

Anthropic Claude models use different mechanisms depending on the model generation:

Claude Fable 5 and Mythos 5 (Mythos-class; Mythos 5 is the safeguards-lifted, limited-release variant of Fable 5): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort, the same surface as Opus 4.8 (legacy budget_tokens rejected with HTTP 400; temperature/top_p/top_k dropped automatically; max effort supported). One additional constraint: an explicit thinking.type == "disabled" is rejected (HTTP 400), so the router omits the thinking parameter entirely instead of forwarding it.
Claude 4.8+ models (Opus 4.8): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort to control thinking depth. The legacy thinking.type == "enabled" + budget_tokens shape is rejected (HTTP 400). Sampling parameters temperature, top_p, and top_k are not accepted and the router drops them automatically. Effort default is high.
Claude 4.6+ adaptive models (Opus 4.6, Sonnet 4.6, Opus 4.7): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort to control thinking depth. Claude Opus 4.7 requires this API; sending the legacy thinking.type == "enabled" + budget_tokens shape to Opus 4.7 produces HTTP 400.
Pre-4.6 models (Opus 4.5, Sonnet 4, etc.): Use thinking.budget_tokens for extended thinking

Conversion Table: Claude 4.6+ Models (Adaptive Thinking)¶

Claude 4.6+ adaptive models (claude-opus-4-6-*, claude-sonnet-4-6-*, claude-opus-4-7-*, claude-opus-4-8-*, claude-fable-5-*, claude-mythos-5-*) use output_config.effort instead of budget_tokens:

`reasoning_effort`	`thinking`	`output_config.effort`	Description
`none`	disabled	(omitted)	Thinking disabled
`minimal`	`{"type": "adaptive"}`	`"low"`	Mapped to `low` (no minimal level in Anthropic API)
`auto`	`{"type": "adaptive"}`	(omitted)	Model uses default effort (high)
`low`	`{"type": "adaptive"}`	`"low"`	Low thinking depth
`medium`	`{"type": "adaptive"}`	`"medium"`	Moderate thinking depth
`high`	`{"type": "adaptive"}`	`"high"`	High thinking depth
`xhigh` (Opus 4.6, Opus 4.7, Opus 4.8, Fable 5, Mythos 5)	`{"type": "adaptive"}`	`"max"`	Maximum thinking depth
`xhigh` (Sonnet 4.6)	`{"type": "adaptive"}`	`"high"`	Downgraded (`max` is Opus/Mythos-class-only)

Conversion Table: Pre-4.6 Claude Models (Budget Tokens)¶

Pre-4.6 models use thinking.budget_tokens:

`reasoning_effort`	Anthropic `thinking`	Description
`none`	disabled	Thinking feature disabled
`minimal`	`{"type": "enabled", "budget_tokens": 1024}`	Minimum allowed budget
`auto`	`{"type": "adaptive"}`	Model decides when and how much to think
`low`	`{"type": "enabled", "budget_tokens": 4096}`	Light reasoning
`medium`	`{"type": "enabled", "budget_tokens": 10240}`	Moderate reasoning
`high`	`{"type": "enabled", "budget_tokens": 32768}`	Deep reasoning
`xhigh`	`{"type": "enabled", "budget_tokens": 32768}`	Falls back to `high` budget

Transformation Example: Claude 4.6+ (Adaptive Thinking + Effort)¶

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "output_config": {
    "effort": "high"
  },
  "messages": [...]
}

For xhigh on Opus 4.6 or Opus 4.7, output_config.effort is set to "max":

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "max"},
  "messages": [...]
}

Transformation Example: Pre-4.6 Claude (Budget Tokens)¶

Input (OpenAI format):

{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768
  },
  "messages": [...]
}

Adaptive Thinking (Auto)¶

When reasoning_effort is set to "auto", the router emits {"type": "adaptive"} without specifying output_config.effort. This lets Anthropic use its default effort level (currently high) and lets the model dynamically decide when and how much reasoning to apply.

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "auto",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Direct `output_config` Pass-Through¶

When a request already contains explicit thinking and output_config parameters, they are passed through directly without conversion:

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "medium"},
  "messages": [...]
}

Adaptive Thinking Availability

Adaptive thinking with output_config.effort is available on Claude 4.6+ adaptive models (claude-opus-4-6-*, claude-sonnet-4-6-*, claude-opus-4-7-*, claude-opus-4-8-*) and the Mythos-class family (claude-fable-5-*, claude-mythos-5-*). Claude Opus 4.7/4.8 and the Mythos-class models require this API; the router normalizes explicit legacy thinking.type == "enabled" requests for these models to avoid upstream HTTP 400 errors. Pre-4.6 models use budget_tokens instead.

max Effort Level

The "max" effort level in output_config.effort is available on Opus models (Opus 4.6, Opus 4.7, Opus 4.8) and the Mythos-class family (Fable 5, Mythos 5). When xhigh is requested for Sonnet 4.6, the router automatically downgrades it to "high" with an info-level log message.

Claude Opus 4.7/4.8 and Mythos-class Sampling Parameter Deprecation

Anthropic deprecated temperature, top_p, and top_k for Claude Opus 4.7, Opus 4.8, Fable 5, and Mythos 5. The router automatically drops all three parameters before forwarding to any claude-opus-4-7-*, claude-opus-4-8-*, claude-fable-5-*, or claude-mythos-5-* model to avoid HTTP 400 errors.

Mythos-class Disabled-Thinking Rejection

Claude Fable 5 and Mythos 5 reject an explicit thinking: {"type": "disabled"} with HTTP 400. When a client sends a disabled-thinking config to a claude-fable-5-* or claude-mythos-5-* model, the router omits the thinking parameter entirely (the documented workaround) instead of forwarding it. Opus and Sonnet families accept an explicit disabled-thinking config unchanged.

Models Supporting Extended Thinking¶

Extended thinking is supported by:

Claude Fable 5 / Mythos 5 (adaptive thinking + effort; max supported; sampling params dropped; explicit disabled-thinking omitted): claude-fable-5-* (alias claude-fable-5-latest), claude-mythos-5-* (alias claude-mythos-5-latest, limited release)
Claude Opus 4.8 (adaptive thinking + effort; sampling params deprecated; effort default high): claude-opus-4-8-*, alias claude-opus-4-8-latest
Claude Opus 4.7 (adaptive thinking + effort; sampling params deprecated): claude-opus-4-7-*
Claude 4.6 family (adaptive thinking + effort): claude-opus-4-6-*, claude-sonnet-4-6-*
Claude Opus models (budget tokens): claude-opus-4-*, claude-opus-4-5-*
Claude Sonnet 4 models (budget tokens): claude-sonnet-4-*, claude-sonnet-4-5-*

Temperature Restriction

When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active. For Claude Opus 4.7, Opus 4.8, Fable 5, and Mythos 5, temperature, top_p, and top_k are always dropped regardless of whether thinking is enabled.

xhigh for Pre-4.6 Claude

When xhigh is requested for pre-4.6 Claude models, the router automatically downgrades it to high (32,768 budget_tokens).

auto on Non-Anthropic Backends

The auto effort level maps to Anthropic's adaptive thinking. When auto is requested for OpenAI or Gemini backends (which do not support adaptive thinking), the router automatically downgrades it to medium with a debug-level log message.

Gemini Backend¶

Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.

Supported Effort Levels¶

Effort Level	Supported Models	Description
`none`	Flash models only	Disable thinking
`minimal`	All thinking models	Minimal reasoning
`low`	All thinking models	Light reasoning
`medium`	All thinking models	Moderate reasoning
`high`	All thinking models	Deep reasoning

Flash Models (support `none`)¶

gemini-2.0-flash
gemini-2.5-flash
gemini-3-flash
gemini-3.1-flash
gemini-3.5-flash

Pro Models (do NOT support `none`)¶

gemini-2.5-pro
gemini-3-pro

Pro Model Restriction

Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.

xhigh Automatic Fallback

When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.

Additional Features¶

For Gemini thinking models, the router automatically:

Sets include_thoughts: true to expose reasoning content
Sets default max_completion_tokens: 16384 if not specified
Validates effort level against model capabilities

Generic/HTTP Backend¶

Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.

Behavior	Description
Pass-through	`reasoning_effort` is forwarded as-is to the backend
No validation	The router does not validate effort levels
No conversion	No token budget conversion is performed

Backend Responsibility

For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.

llama.cpp Backend¶

The llama.cpp backend uses pass-through behavior:

Behavior	Description
Pass-through	Parameters forwarded unchanged
Server-dependent	Support depends on llama-server configuration

Thinking Model Support in llama.cpp

Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:

Model's built-in chat template with thinking tags
Server-side configuration (e.g., --thinking-budget if available)
Sampling parameters like temperature and top-p

vLLM Backend¶

vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:

Behavior	Description
Pass-through	Parameters forwarded via generic backend
Model-dependent	Support varies by model type

Thinking Model Support in vLLM

vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:

Whether the model supports structured thinking
vLLM server version and configuration
Model's chat template configuration

For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.

Response Field Normalization

Newer vLLM versions emit reasoning text as choices[].delta.reasoning (streaming) or choices[].message.reasoning (non-streaming) rather than the canonical reasoning_content. The router normalizes this field on all Chat Completions relay paths, so downstream clients always see reasoning_content regardless of vLLM version. Backends that already use reasoning_content are not modified.

Summary Table¶

Backend	Effort Levels	Conversion	Notes
OpenAI	`auto`*, `low`, `medium`, `high`, `xhigh`	None (native)	`xhigh` only for GPT-5.2; *`auto` downgraded to `medium`
Anthropic (Fable 5 / Mythos 5)	`none`, `minimal`, `auto`, `low`, `medium`, `high`, `xhigh`	→ `adaptive` + `output_config.effort`	`xhigh` → `max`; drops `temperature`/`top_p`/`top_k` always; explicit disabled-thinking omitted
Anthropic (Opus 4.8)	`none`, `minimal`, `auto`, `low`, `medium`, `high`, `xhigh`	→ `adaptive` + `output_config.effort`	`xhigh` → `max`; drops `temperature`/`top_p`/`top_k` always; effort default `high`
Anthropic (Opus 4.7)	`none`, `minimal`, `auto`, `low`, `medium`, `high`, `xhigh`	→ `adaptive` + `output_config.effort`	`xhigh` → `max`; drops `temperature`/`top_p`/`top_k` always
Anthropic (4.6)	`none`, `minimal`, `auto`, `low`, `medium`, `high`, `xhigh`*	→ `adaptive` + `output_config.effort`	*`xhigh` → `max` for Opus 4.6, `high` for Sonnet 4.6; removes `temperature`
Anthropic (pre-4.6)	`none`, `minimal`, `auto`, `low`, `medium`, `high`	→ `budget_tokens` or `adaptive`	`auto` maps to adaptive thinking; removes `temperature`
Gemini	`none`, `auto`*, `minimal`, `low`, `medium`, `high`	None (native)	`none` only for Flash models; *`auto` downgraded to `medium`
vLLM	Model-dependent	Pass-through	DeepSeek-R1, QwQ use implicit thinking
llama.cpp	Model-dependent	Pass-through	Uses `<think>` tag in chat template
Generic	Any	Pass-through	Backend handles validation

Response Format¶

When reasoning/thinking is enabled, the response includes the model's reasoning process:

OpenAI Format (with reasoning_content)¶

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    },
    "finish_reason": "stop"
  }]
}

Claude Extended Thinking (transformed to OpenAI format)¶

The router transforms Claude's thinking blocks to the reasoning_content field:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "I need to consider multiple factors..."
    },
    "finish_reason": "stop"
  }]
}

Best Practices¶

Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
Check model support: Not all models support reasoning parameters
Handle xhigh carefully: Only GPT-5.2 family supports xhigh; use high for other models
Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's budget_tokens)
Test with your backend: Generic backends may have varying support levels

Reasoning Effort Parameter¶

Overview¶

Parameter Formats¶

Flat Format (Chat Completions API)¶

Nested Format (Responses API)¶

Direct Token Budget Specification¶

Anthropic: Direct thinking Parameter¶

Gemini: Direct thinking_budget via extra_body¶

Backend-Specific Behavior¶

OpenAI Backend¶

Supported Effort Levels¶

Models Supporting reasoning_effort¶

Models NOT Supporting reasoning_effort¶

Anthropic Backend (Claude)¶

Conversion Table: Claude 4.6+ Models (Adaptive Thinking)¶

Conversion Table: Pre-4.6 Claude Models (Budget Tokens)¶

Transformation Example: Claude 4.6+ (Adaptive Thinking + Effort)¶

Transformation Example: Pre-4.6 Claude (Budget Tokens)¶

Adaptive Thinking (Auto)¶

Direct output_config Pass-Through¶

Models Supporting Extended Thinking¶

Gemini Backend¶

Supported Effort Levels¶

Flash Models (support none)¶

Pro Models (do NOT support none)¶

Additional Features¶

Generic/HTTP Backend¶

llama.cpp Backend¶

vLLM Backend¶

Summary Table¶

Response Format¶

OpenAI Format (with reasoning_content)¶

Claude Extended Thinking (transformed to OpenAI format)¶

Best Practices¶

Related Documentation¶

Anthropic: Direct `thinking` Parameter¶

Gemini: Direct `thinking_budget` via `extra_body`¶

Models Supporting `reasoning_effort`¶

Models NOT Supporting `reasoning_effort`¶

Direct `output_config` Pass-Through¶

Flash Models (support `none`)¶

Pro Models (do NOT support `none`)¶