Skip to content

Reasoning Effort Parameter

This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.

Overview

The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:

  • OpenAI: Native reasoning_effort parameter for O-series and GPT-5.x thinking models
  • Anthropic: Converted to thinking.budget_tokens for Claude models with extended thinking, or to {"type": "adaptive"} for auto effort
  • Gemini: Native reasoning_effort for thinking-capable models via OpenAI-compatible endpoint
  • Other backends: Pass-through (no transformation)

Parameter Formats

Continuum Router supports two input formats, both normalized internally:

Flat Format (Chat Completions API)

{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}

Nested Format (Responses API)

{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "input": "..."
}

Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.

Direct Token Budget Specification

For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:

Anthropic: Direct thinking Parameter

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 16000
  },
  "messages": [...]
}

For adaptive thinking (model decides when and how much to think):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Gemini: Direct thinking_budget via extra_body

{
  "model": "gemini-2.5-pro",
  "extra_body": {
    "google": {
      "thinking_config": {
        "thinking_budget": 10000,
        "include_thoughts": true
      }
    }
  },
  "messages": [...]
}

Priority

If both reasoning_effort and direct token specification are present, the direct specification takes precedence for Anthropic (thinking parameter). For Gemini, both can coexist but thinking_budget in extra_body is used for fine-grained control.


Backend-Specific Behavior

OpenAI Backend

OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.

Supported Effort Levels

Effort Level Supported Models Description
low O-series, GPT-5.x thinking Minimal reasoning, faster responses
medium O-series, GPT-5.x thinking Balanced reasoning effort
high O-series, GPT-5.x thinking Deep reasoning, slower responses
xhigh GPT-5.x family only Maximum reasoning effort

Models Supporting reasoning_effort

O-series models (support low, medium, high):

  • o1, o1-mini, o1-preview
  • o3, o3-mini, o3-pro
  • o4-mini

GPT-5.x thinking models (support low, medium, high, xhigh):

  • gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano
  • gpt-5.2, gpt-5.2-thinking, gpt-5.2-pro

xhigh Automatic Fallback

The xhigh effort level is only natively supported by GPT-5.x family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.

Models NOT Supporting reasoning_effort

The following models do not support reasoning parameters (parameter is stripped):

  • GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
  • GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
  • GPT-5.1, GPT-5
  • GPT-3.5-turbo
  • Embedding models, Image models

Anthropic Backend (Claude)

Anthropic Claude models use different mechanisms depending on the model generation:

  • Claude Fable 5 and Mythos 5 (Mythos-class; Mythos 5 is the safeguards-lifted, limited-release variant of Fable 5): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort, the same surface as Opus 4.8 (legacy budget_tokens rejected with HTTP 400; temperature/top_p/top_k dropped automatically; max effort supported). One additional constraint: an explicit thinking.type == "disabled" is rejected (HTTP 400), so the router omits the thinking parameter entirely instead of forwarding it.
  • Claude 4.8+ models (Opus 4.8): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort to control thinking depth. The legacy thinking.type == "enabled" + budget_tokens shape is rejected (HTTP 400). Sampling parameters temperature, top_p, and top_k are not accepted and the router drops them automatically. Effort default is high.
  • Claude 4.6+ adaptive models (Opus 4.6, Sonnet 4.6, Opus 4.7): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort to control thinking depth. Claude Opus 4.7 requires this API; sending the legacy thinking.type == "enabled" + budget_tokens shape to Opus 4.7 produces HTTP 400.
  • Pre-4.6 models (Opus 4.5, Sonnet 4, etc.): Use thinking.budget_tokens for extended thinking

Conversion Table: Claude 4.6+ Models (Adaptive Thinking)

Claude 4.6+ adaptive models (claude-opus-4-6-*, claude-sonnet-4-6-*, claude-opus-4-7-*, claude-opus-4-8-*, claude-fable-5-*, claude-mythos-5-*) use output_config.effort instead of budget_tokens:

reasoning_effort thinking output_config.effort Description
none disabled (omitted) Thinking disabled
minimal {"type": "adaptive"} "low" Mapped to low (no minimal level in Anthropic API)
auto {"type": "adaptive"} (omitted) Model uses default effort (high)
low {"type": "adaptive"} "low" Low thinking depth
medium {"type": "adaptive"} "medium" Moderate thinking depth
high {"type": "adaptive"} "high" High thinking depth
xhigh (Opus 4.6, Opus 4.7, Opus 4.8, Fable 5, Mythos 5) {"type": "adaptive"} "max" Maximum thinking depth
xhigh (Sonnet 4.6) {"type": "adaptive"} "high" Downgraded (max is Opus/Mythos-class-only)

Conversion Table: Pre-4.6 Claude Models (Budget Tokens)

Pre-4.6 models use thinking.budget_tokens:

reasoning_effort Anthropic thinking Description
none disabled Thinking feature disabled
minimal {"type": "enabled", "budget_tokens": 1024} Minimum allowed budget
auto {"type": "adaptive"} Model decides when and how much to think
low {"type": "enabled", "budget_tokens": 4096} Light reasoning
medium {"type": "enabled", "budget_tokens": 10240} Moderate reasoning
high {"type": "enabled", "budget_tokens": 32768} Deep reasoning
xhigh {"type": "enabled", "budget_tokens": 32768} Falls back to high budget

Transformation Example: Claude 4.6+ (Adaptive Thinking + Effort)

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "output_config": {
    "effort": "high"
  },
  "messages": [...]
}

For xhigh on Opus 4.6 or Opus 4.7, output_config.effort is set to "max":

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "max"},
  "messages": [...]
}

Transformation Example: Pre-4.6 Claude (Budget Tokens)

Input (OpenAI format):

{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768
  },
  "messages": [...]
}

Adaptive Thinking (Auto)

When reasoning_effort is set to "auto", the router emits {"type": "adaptive"} without specifying output_config.effort. This lets Anthropic use its default effort level (currently high) and lets the model dynamically decide when and how much reasoning to apply.

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "auto",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Direct output_config Pass-Through

When a request already contains explicit thinking and output_config parameters, they are passed through directly without conversion:

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "medium"},
  "messages": [...]
}

Adaptive Thinking Availability

Adaptive thinking with output_config.effort is available on Claude 4.6+ adaptive models (claude-opus-4-6-*, claude-sonnet-4-6-*, claude-opus-4-7-*, claude-opus-4-8-*) and the Mythos-class family (claude-fable-5-*, claude-mythos-5-*). Claude Opus 4.7/4.8 and the Mythos-class models require this API; the router normalizes explicit legacy thinking.type == "enabled" requests for these models to avoid upstream HTTP 400 errors. Pre-4.6 models use budget_tokens instead.

max Effort Level

The "max" effort level in output_config.effort is available on Opus models (Opus 4.6, Opus 4.7, Opus 4.8) and the Mythos-class family (Fable 5, Mythos 5). When xhigh is requested for Sonnet 4.6, the router automatically downgrades it to "high" with an info-level log message.

Claude Opus 4.7/4.8 and Mythos-class Sampling Parameter Deprecation

Anthropic deprecated temperature, top_p, and top_k for Claude Opus 4.7, Opus 4.8, Fable 5, and Mythos 5. The router automatically drops all three parameters before forwarding to any claude-opus-4-7-*, claude-opus-4-8-*, claude-fable-5-*, or claude-mythos-5-* model to avoid HTTP 400 errors.

Mythos-class Disabled-Thinking Rejection

Claude Fable 5 and Mythos 5 reject an explicit thinking: {"type": "disabled"} with HTTP 400. When a client sends a disabled-thinking config to a claude-fable-5-* or claude-mythos-5-* model, the router omits the thinking parameter entirely (the documented workaround) instead of forwarding it. Opus and Sonnet families accept an explicit disabled-thinking config unchanged.

Models Supporting Extended Thinking

Extended thinking is supported by:

  • Claude Fable 5 / Mythos 5 (adaptive thinking + effort; max supported; sampling params dropped; explicit disabled-thinking omitted): claude-fable-5-* (alias claude-fable-5-latest), claude-mythos-5-* (alias claude-mythos-5-latest, limited release)
  • Claude Opus 4.8 (adaptive thinking + effort; sampling params deprecated; effort default high): claude-opus-4-8-*, alias claude-opus-4-8-latest
  • Claude Opus 4.7 (adaptive thinking + effort; sampling params deprecated): claude-opus-4-7-*
  • Claude 4.6 family (adaptive thinking + effort): claude-opus-4-6-*, claude-sonnet-4-6-*
  • Claude Opus models (budget tokens): claude-opus-4-*, claude-opus-4-5-*
  • Claude Sonnet 4 models (budget tokens): claude-sonnet-4-*, claude-sonnet-4-5-*

Temperature Restriction

When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active. For Claude Opus 4.7, Opus 4.8, Fable 5, and Mythos 5, temperature, top_p, and top_k are always dropped regardless of whether thinking is enabled.

xhigh for Pre-4.6 Claude

When xhigh is requested for pre-4.6 Claude models, the router automatically downgrades it to high (32,768 budget_tokens).

auto on Non-Anthropic Backends

The auto effort level maps to Anthropic's adaptive thinking. When auto is requested for OpenAI or Gemini backends (which do not support adaptive thinking), the router automatically downgrades it to medium with a debug-level log message.


Gemini Backend

Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.

Supported Effort Levels

Effort Level Supported Models Description
none Flash models only Disable thinking
minimal All thinking models Minimal reasoning
low All thinking models Light reasoning
medium All thinking models Moderate reasoning
high All thinking models Deep reasoning

Flash Models (support none)

  • gemini-2.0-flash
  • gemini-2.5-flash
  • gemini-3-flash
  • gemini-3.1-flash
  • gemini-3.5-flash

Pro Models (do NOT support none)

  • gemini-2.5-pro
  • gemini-3-pro

Pro Model Restriction

Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.

xhigh Automatic Fallback

When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.

Additional Features

For Gemini thinking models, the router automatically:

  1. Sets include_thoughts: true to expose reasoning content
  2. Sets default max_completion_tokens: 16384 if not specified
  3. Validates effort level against model capabilities

Generic/HTTP Backend

Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.

Behavior Description
Pass-through reasoning_effort is forwarded as-is to the backend
No validation The router does not validate effort levels
No conversion No token budget conversion is performed

Backend Responsibility

For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.


llama.cpp Backend

The llama.cpp backend uses pass-through behavior:

Behavior Description
Pass-through Parameters forwarded unchanged
Server-dependent Support depends on llama-server configuration

Thinking Model Support in llama.cpp

Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:

  • Model's built-in chat template with thinking tags
  • Server-side configuration (e.g., --thinking-budget if available)
  • Sampling parameters like temperature and top-p

vLLM Backend

vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:

Behavior Description
Pass-through Parameters forwarded via generic backend
Model-dependent Support varies by model type

Thinking Model Support in vLLM

vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:

  • Whether the model supports structured thinking
  • vLLM server version and configuration
  • Model's chat template configuration

For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.

Response Field Normalization

Newer vLLM versions emit reasoning text as choices[].delta.reasoning (streaming) or choices[].message.reasoning (non-streaming) rather than the canonical reasoning_content. The router normalizes this field on all Chat Completions relay paths, so downstream clients always see reasoning_content regardless of vLLM version. Backends that already use reasoning_content are not modified.


Summary Table

Backend Effort Levels Conversion Notes
OpenAI auto*, low, medium, high, xhigh None (native) xhigh only for GPT-5.2; *auto downgraded to medium
Anthropic (Fable 5 / Mythos 5) none, minimal, auto, low, medium, high, xhigh adaptive + output_config.effort xhighmax; drops temperature/top_p/top_k always; explicit disabled-thinking omitted
Anthropic (Opus 4.8) none, minimal, auto, low, medium, high, xhigh adaptive + output_config.effort xhighmax; drops temperature/top_p/top_k always; effort default high
Anthropic (Opus 4.7) none, minimal, auto, low, medium, high, xhigh adaptive + output_config.effort xhighmax; drops temperature/top_p/top_k always
Anthropic (4.6) none, minimal, auto, low, medium, high, xhigh* adaptive + output_config.effort *xhighmax for Opus 4.6, high for Sonnet 4.6; removes temperature
Anthropic (pre-4.6) none, minimal, auto, low, medium, high budget_tokens or adaptive auto maps to adaptive thinking; removes temperature
Gemini none, auto*, minimal, low, medium, high None (native) none only for Flash models; *auto downgraded to medium
vLLM Model-dependent Pass-through DeepSeek-R1, QwQ use implicit thinking
llama.cpp Model-dependent Pass-through Uses <think> tag in chat template
Generic Any Pass-through Backend handles validation

Response Format

When reasoning/thinking is enabled, the response includes the model's reasoning process:

OpenAI Format (with reasoning_content)

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    },
    "finish_reason": "stop"
  }]
}

Claude Extended Thinking (transformed to OpenAI format)

The router transforms Claude's thinking blocks to the reasoning_content field:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "I need to consider multiple factors..."
    },
    "finish_reason": "stop"
  }]
}

Best Practices

  1. Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
  2. Check model support: Not all models support reasoning parameters
  3. Handle xhigh carefully: Only GPT-5.2 family supports xhigh; use high for other models
  4. Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's budget_tokens)
  5. Test with your backend: Generic backends may have varying support levels