
Reasoning Effort Parameter

This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.

Overview

The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:

  • OpenAI: Native reasoning_effort parameter for O-series and GPT-5.x thinking models
  • Anthropic: Converted to thinking.budget_tokens for Claude models with extended thinking, or to {"type": "adaptive"} for auto effort
  • Gemini: Native reasoning_effort for thinking-capable models via OpenAI-compatible endpoint
  • Other backends: Pass-through (no transformation)

Parameter Formats

Continuum Router supports two input formats, both normalized internally:

Flat Format (Chat Completions API)

{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}

Nested Format (Responses API)

{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "input": "..."
}

Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.
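The normalization rule above can be sketched as a small helper. This is an illustrative sketch, not the router's actual implementation; the function name is hypothetical.

```python
# Sketch of the normalization described above (hypothetical helper name).
def normalize_reasoning_effort(request: dict) -> dict:
    """Collapse the nested Responses-API form into the flat form."""
    req = dict(request)
    nested = req.pop("reasoning", None)
    # The flat format takes precedence if both are present.
    if "reasoning_effort" not in req and isinstance(nested, dict):
        effort = nested.get("effort")
        if effort is not None:
            req["reasoning_effort"] = effort
    return req
```

After this step, backend-specific conversion only ever needs to look at the flat `reasoning_effort` key.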

Direct Token Budget Specification

For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:

Anthropic: Direct thinking Parameter

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 16000
  },
  "messages": [...]
}

For adaptive thinking (model decides when and how much to think):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Gemini: Direct thinking_budget via extra_body

{
  "model": "gemini-2.5-pro",
  "extra_body": {
    "google": {
      "thinking_config": {
        "thinking_budget": 10000,
        "include_thoughts": true
      }
    }
  },
  "messages": [...]
}

Priority

If both reasoning_effort and a direct token specification are present, the direct specification takes precedence for Anthropic (the thinking parameter). For Gemini, both can coexist; when present, thinking_budget in extra_body takes precedence for fine-grained control.
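For Anthropic, the precedence rule reduces to a simple check. A minimal sketch, with an assumed helper name:

```python
# Hypothetical sketch of the Anthropic precedence rule: an explicit
# "thinking" object suppresses any reasoning_effort conversion.
def pick_thinking_source(request: dict) -> str:
    if "thinking" in request:
        return "direct"        # direct specification wins
    if "reasoning_effort" in request:
        return "effort"        # converted per the backend tables
    return "none"              # no thinking requested
```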


Backend-Specific Behavior

OpenAI Backend

OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.

Supported Effort Levels

| Effort Level | Supported Models | Description |
|---|---|---|
| low | O-series, GPT-5.x thinking | Minimal reasoning, faster responses |
| medium | O-series, GPT-5.x thinking | Balanced reasoning effort |
| high | O-series, GPT-5.x thinking | Deep reasoning, slower responses |
| xhigh | GPT-5.x family only | Maximum reasoning effort |

Models Supporting reasoning_effort

O-series models (support low, medium, high):

  • o1, o1-mini, o1-preview
  • o3, o3-mini, o3-pro
  • o4-mini

GPT-5.x thinking models (support low, medium, high, xhigh):

  • gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano
  • gpt-5.2, gpt-5.2-thinking, gpt-5.2-pro

xhigh Automatic Fallback

The xhigh effort level is only natively supported by GPT-5.x family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.
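The fallback might be implemented along these lines. The prefix-based detection of GPT-5.x thinking models is an assumption for illustration; the router's real model detection may differ.

```python
import logging

log = logging.getLogger("router")

# Assumed heuristic: GPT-5.x thinking models by prefix, excluding the
# non-thinking "chat"/"instant" variants listed later in this document.
GPT5X_THINKING_PREFIXES = ("gpt-5.4", "gpt-5.2")

def effective_effort(model: str, effort: str) -> str:
    supports_xhigh = (model.startswith(GPT5X_THINKING_PREFIXES)
                      and "chat" not in model and "instant" not in model)
    if effort == "xhigh" and not supports_xhigh:
        log.info("xhigh not supported by %s; downgrading to high", model)
        return "high"
    return effort
```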

Models NOT Supporting reasoning_effort

The following models do not support reasoning parameters (parameter is stripped):

  • GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
  • GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
  • GPT-5.1, GPT-5
  • GPT-3.5-turbo
  • Embedding models, Image models

Anthropic Backend (Claude)

Anthropic Claude models use different mechanisms depending on the model generation:

  • Claude 4.6 models (Opus 4.6, Sonnet 4.6): Use adaptive thinking ({"type": "adaptive"}) with output_config.effort to control thinking depth
  • Pre-4.6 models (Opus 4.5, Sonnet 4, etc.): Use thinking.budget_tokens for extended thinking

Conversion Table: Claude 4.6 Models (Adaptive Thinking)

Claude 4.6 models (claude-opus-4-6-*, claude-sonnet-4-6-*) use output_config.effort instead of budget_tokens:

| reasoning_effort | thinking | output_config.effort | Description |
|---|---|---|---|
| none | disabled | (omitted) | Thinking disabled |
| minimal | {"type": "adaptive"} | "low" | Mapped to low (no minimal level in Anthropic API) |
| auto | {"type": "adaptive"} | (omitted) | Model uses default effort (high) |
| low | {"type": "adaptive"} | "low" | Low thinking depth |
| medium | {"type": "adaptive"} | "medium" | Moderate thinking depth |
| high | {"type": "adaptive"} | "high" | High thinking depth |
| xhigh (Opus 4.6) | {"type": "adaptive"} | "max" | Maximum thinking depth |
| xhigh (Sonnet 4.6) | {"type": "adaptive"} | "high" | Downgraded (max is Opus 4.6 only) |
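The table above can be sketched as a small mapping function. The `"opus" in model` check and the function name are assumptions for illustration, not the router's actual code.

```python
# Sketch of the Claude 4.6 conversion table (hypothetical helper).
def claude46_thinking(effort, model):
    """Return (thinking, output_config) for a Claude 4.6 model."""
    if effort == "none":
        return None, None                      # thinking disabled
    if effort == "auto":
        return {"type": "adaptive"}, None      # Anthropic default (high)
    if effort == "xhigh":
        # max is Opus 4.6 only; Sonnet 4.6 is downgraded to high
        level = "max" if "opus" in model else "high"
        return {"type": "adaptive"}, {"effort": level}
    mapping = {"minimal": "low", "low": "low",
               "medium": "medium", "high": "high"}
    return {"type": "adaptive"}, {"effort": mapping[effort]}
```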

Conversion Table: Pre-4.6 Claude Models (Budget Tokens)

Pre-4.6 models use thinking.budget_tokens:

| reasoning_effort | Anthropic thinking | Description |
|---|---|---|
| none | disabled | Thinking feature disabled |
| minimal | {"type": "enabled", "budget_tokens": 1024} | Minimum allowed budget |
| auto | {"type": "adaptive"} | Model decides when and how much to think |
| low | {"type": "enabled", "budget_tokens": 4096} | Light reasoning |
| medium | {"type": "enabled", "budget_tokens": 10240} | Moderate reasoning |
| high | {"type": "enabled", "budget_tokens": 32768} | Deep reasoning |
| xhigh | {"type": "enabled", "budget_tokens": 32768} | Falls back to high budget |
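The pre-4.6 mapping is a straightforward lookup over the budgets listed above. A minimal sketch, with a hypothetical function name:

```python
# Sketch of the pre-4.6 conversion table; budgets taken from the table above.
PRE46_BUDGETS = {"minimal": 1024, "low": 4096, "medium": 10240,
                 "high": 32768, "xhigh": 32768}  # xhigh falls back to high

def pre46_thinking(effort):
    if effort == "none":
        return None                          # thinking disabled
    if effort == "auto":
        return {"type": "adaptive"}          # model decides
    return {"type": "enabled", "budget_tokens": PRE46_BUDGETS[effort]}
```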

Transformation Example: Claude 4.6 (Adaptive Thinking + Effort)

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "output_config": {
    "effort": "high"
  },
  "messages": [...]
}

For xhigh on Opus 4.6, output_config.effort is set to "max":

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "max"},
  "messages": [...]
}

Transformation Example: Pre-4.6 Claude (Budget Tokens)

Input (OpenAI format):

{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768
  },
  "messages": [...]
}

Adaptive Thinking (Auto)

When reasoning_effort is set to "auto", the router emits {"type": "adaptive"} without specifying output_config.effort. This lets Anthropic use its default effort level (currently high) and lets the model dynamically decide when and how much reasoning to apply.

Input (OpenAI format):

{
  "model": "claude-opus-4-6-20260205",
  "reasoning_effort": "auto",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {
    "type": "adaptive"
  },
  "messages": [...]
}

Direct output_config Pass-Through

When a request already contains explicit thinking and output_config parameters, they are passed through directly without conversion:

{
  "model": "claude-opus-4-6-20260205",
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "medium"},
  "messages": [...]
}

Adaptive Thinking Availability

Adaptive thinking with output_config.effort is available on Claude 4.6 models (claude-opus-4-6-*, claude-sonnet-4-6-*). Pre-4.6 models use budget_tokens instead.

max Effort Level

The "max" effort level in output_config.effort is only available on Claude Opus 4.6. When xhigh is requested for Claude Sonnet 4.6, the router automatically downgrades it to "high" with an info-level log message.

Models Supporting Extended Thinking

Extended thinking is supported by:

  • Claude 4.6 family (adaptive thinking + effort): claude-opus-4-6-*, claude-sonnet-4-6-*
  • Claude Opus models (budget tokens): claude-opus-4-*, claude-opus-4-5-*
  • Claude Sonnet 4 models (budget tokens): claude-sonnet-4-*, claude-sonnet-4-5-*

Temperature Restriction

When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active.
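The restriction above amounts to dropping the field whenever a thinking block is active. A sketch under that assumption, with a hypothetical helper name:

```python
# Sketch: remove temperature whenever thinking is active, since Claude
# rejects custom temperature alongside extended/adaptive thinking.
def strip_temperature_if_thinking(request: dict) -> dict:
    req = dict(request)
    thinking = req.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") in ("enabled", "adaptive"):
        req.pop("temperature", None)
    return req
```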

xhigh for Pre-4.6 Claude

When xhigh is requested for pre-4.6 Claude models, the router automatically downgrades it to high (32,768 budget_tokens).

auto on Non-Anthropic Backends

The auto effort level maps to Anthropic's adaptive thinking. When auto is requested for OpenAI or Gemini backends (which do not support adaptive thinking), the router automatically downgrades it to medium with a debug-level log message.


Gemini Backend

Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.

Supported Effort Levels

| Effort Level | Supported Models | Description |
|---|---|---|
| none | Flash models only | Disable thinking |
| minimal | All thinking models | Minimal reasoning |
| low | All thinking models | Light reasoning |
| medium | All thinking models | Moderate reasoning |
| high | All thinking models | Deep reasoning |

Flash Models (support none)

  • gemini-2.0-flash
  • gemini-2.5-flash
  • gemini-3-flash

Pro Models (do NOT support none)

  • gemini-2.5-pro
  • gemini-3-pro

Pro Model Restriction

Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.

xhigh Automatic Fallback

When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.

Additional Features

For Gemini thinking models, the router automatically:

  1. Sets include_thoughts: true to expose reasoning content
  2. Sets default max_completion_tokens: 16384 if not specified
  3. Validates effort level against model capabilities
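The Gemini rules above (Pro-model validation, xhigh fallback, and the automatic defaults) can be sketched together. The `"flash" in model` classification and the function name are assumptions for illustration:

```python
# Hedged sketch combining the Gemini rules above (hypothetical helper).
def prepare_gemini_request(request: dict) -> dict:
    req = dict(request)
    model, effort = req.get("model", ""), req.get("reasoning_effort")
    if effort == "none" and "flash" not in model:
        raise ValueError("none effort is only supported by Flash models")
    if effort == "xhigh":
        req["reasoning_effort"] = "high"        # Gemini has no xhigh
    req.setdefault("max_completion_tokens", 16384)
    google = req.setdefault("extra_body", {}).setdefault("google", {})
    google.setdefault("thinking_config", {})["include_thoughts"] = True
    return req
```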

Generic/HTTP Backend

Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.

| Behavior | Description |
|---|---|
| Pass-through | reasoning_effort is forwarded as-is to the backend |
| No validation | The router does not validate effort levels |
| No conversion | No token budget conversion is performed |

Backend Responsibility

For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.


llama.cpp Backend

The llama.cpp backend uses pass-through behavior:

| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded unchanged |
| Server-dependent | Support depends on llama-server configuration |

Thinking Model Support in llama.cpp

Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:

  • Model's built-in chat template with thinking tags
  • Server-side configuration (e.g., --thinking-budget if available)
  • Sampling parameters like temperature and top-p

vLLM Backend

vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:

| Behavior | Description |
|---|---|
| Pass-through | Parameters forwarded via generic backend |
| Model-dependent | Support varies by model type |

Thinking Model Support in vLLM

vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:

  • Whether the model supports structured thinking
  • vLLM server version and configuration
  • Model's chat template configuration

For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.


Summary Table

| Backend | Effort Levels | Conversion | Notes |
|---|---|---|---|
| OpenAI | auto*, low, medium, high, xhigh | None (native) | xhigh only for GPT-5.x thinking models; *auto downgraded to medium |
| Anthropic (4.6) | none, minimal, auto, low, medium, high, xhigh* | adaptive + output_config.effort | *xhigh → max for Opus 4.6, high for Sonnet 4.6; removes temperature |
| Anthropic (pre-4.6) | none, minimal, auto, low, medium, high | budget_tokens or adaptive | auto maps to adaptive thinking; removes temperature |
| Gemini | none, auto*, minimal, low, medium, high | None (native) | none only for Flash models; *auto downgraded to medium |
| vLLM | Model-dependent | Pass-through | DeepSeek-R1, QwQ use implicit thinking |
| llama.cpp | Model-dependent | Pass-through | Uses <think> tags in chat template |
| Generic | Any | Pass-through | Backend handles validation |

Response Format

When reasoning/thinking is enabled, the response includes the model's reasoning process:

OpenAI Format (with reasoning_content)

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    },
    "finish_reason": "stop"
  }]
}

Claude Extended Thinking (transformed to OpenAI format)

The router transforms Claude's thinking blocks to the reasoning_content field:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "I need to consider multiple factors..."
    },
    "finish_reason": "stop"
  }]
}
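Since both backends surface reasoning in the same reasoning_content field after transformation, client code can read it uniformly. A minimal sketch:

```python
# Sketch: split a transformed response into answer and reasoning text.
def split_reasoning(response: dict):
    msg = response["choices"][0]["message"]
    # reasoning_content is absent when thinking was not enabled.
    return msg["content"], msg.get("reasoning_content")
```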

Best Practices

  1. Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
  2. Check model support: Not all models support reasoning parameters
  3. Handle xhigh carefully: xhigh is native only to GPT-5.x thinking models; elsewhere the router maps it (to max on Claude Opus 4.6) or downgrades it to high
  4. Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's budget_tokens)
  5. Test with your backend: Generic backends may have varying support levels