Reasoning Effort Parameter

This document describes how Continuum Router handles the reasoning_effort parameter across different LLM backends, including the token budget conversions and supported effort levels.

Overview

The reasoning_effort parameter controls the amount of computational effort a model spends on reasoning before generating a response. Different backends implement this feature differently:

  • OpenAI: Native reasoning_effort parameter for O-series and GPT-5.2 thinking models
  • Anthropic: Converted to thinking.budget_tokens for Claude models with extended thinking
  • Gemini: Native reasoning_effort for thinking-capable models via OpenAI-compatible endpoint
  • Other backends: Pass-through (no transformation)

Parameter Formats

Continuum Router supports two input formats, both normalized internally:

Flat Format (Chat Completions API)

{
  "model": "o3-mini",
  "reasoning_effort": "high",
  "messages": [...]
}

Nested Format (Responses API)

{
  "model": "o3-mini",
  "reasoning": {
    "effort": "high"
  },
  "input": "..."
}

Both formats are automatically normalized to the flat reasoning_effort format before processing. If both are present, the flat format takes precedence.
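The normalization rule above can be sketched in Python; `normalize_reasoning_effort` is an illustrative name, not the router's actual implementation:

```python
def normalize_reasoning_effort(request: dict) -> dict:
    """Normalize the nested Responses-API `reasoning.effort` field to the
    flat `reasoning_effort` field. The flat field wins when both appear."""
    req = dict(request)  # shallow copy; don't mutate the caller's dict
    nested = req.pop("reasoning", None)
    if "reasoning_effort" not in req and isinstance(nested, dict):
        effort = nested.get("effort")
        if effort is not None:
            req["reasoning_effort"] = effort
    return req

# Flat format passes through unchanged:
flat = normalize_reasoning_effort({"model": "o3-mini", "reasoning_effort": "high"})
# Nested format is lifted into the flat field:
nested = normalize_reasoning_effort({"model": "o3-mini", "reasoning": {"effort": "low"}})
# When both are present, the flat value takes precedence:
both = normalize_reasoning_effort(
    {"model": "o3-mini", "reasoning_effort": "high", "reasoning": {"effort": "low"}}
)
```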

Direct Token Budget Specification

For advanced use cases, you can specify the token budget directly instead of using reasoning_effort levels:

Anthropic: Direct thinking Parameter

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 16000
  },
  "messages": [...]
}

Gemini: Direct thinking_budget via extra_body

{
  "model": "gemini-2.5-pro",
  "extra_body": {
    "google": {
      "thinking_config": {
        "thinking_budget": 10000,
        "include_thoughts": true
      }
    }
  },
  "messages": [...]
}

Priority

If both reasoning_effort and a direct token specification are present, the direct specification takes precedence for Anthropic (the thinking parameter). For Gemini, the two can coexist; the thinking_budget in extra_body provides fine-grained control.
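For Anthropic-bound requests, the precedence rule can be sketched as follows (hypothetical helper, not the router's actual code):

```python
def pick_thinking_source(request: dict) -> str:
    """Decide which field drives thinking for an Anthropic-bound request:
    a direct `thinking` parameter beats a `reasoning_effort` level."""
    if "thinking" in request:
        return "direct"
    if "reasoning_effort" in request:
        return "effort"
    return "default"

# Both fields present: the direct `thinking` specification wins.
src = pick_thinking_source({
    "model": "claude-sonnet-4-20250514",
    "reasoning_effort": "high",
    "thinking": {"type": "enabled", "budget_tokens": 16000},
})
```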


Backend-Specific Behavior

OpenAI Backend

OpenAI models support the reasoning_effort parameter natively. The router passes the value directly to the OpenAI API.

Supported Effort Levels

Effort Level  Supported Models            Description
low           O-series, GPT-5.2 thinking  Minimal reasoning, faster responses
medium        O-series, GPT-5.2 thinking  Balanced reasoning effort
high          O-series, GPT-5.2 thinking  Deep reasoning, slower responses
xhigh         GPT-5.2 family only         Maximum reasoning effort

Models Supporting reasoning_effort

O-series models (support low, medium, high):

  • o1, o1-mini, o1-preview
  • o3, o3-mini, o3-pro
  • o4-mini

GPT-5.2 thinking models (support low, medium, high, xhigh):

  • gpt-5.2, gpt-5.2-thinking, gpt-5.2-latest
  • gpt-5.2-pro

xhigh Automatic Fallback

The xhigh effort level is only natively supported by GPT-5.2 family thinking models. When xhigh is requested for any other model or backend, Continuum Router automatically downgrades it to high with an info-level log message. This allows clients to always request xhigh without worrying about backend compatibility.
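A minimal sketch of the fallback, assuming a simple prefix check for the GPT-5.2 family (the prefix tuple is illustrative, not the router's real capability table):

```python
import logging

# Illustrative: models whose names start with these prefixes accept xhigh.
XHIGH_CAPABLE_PREFIXES = ("gpt-5.2",)

def apply_xhigh_fallback(model: str, effort: str) -> str:
    """Downgrade `xhigh` to `high` for models that don't support it,
    logging at info level as described above."""
    if effort == "xhigh" and not model.startswith(XHIGH_CAPABLE_PREFIXES):
        logging.info("model %s does not support xhigh; falling back to high", model)
        return "high"
    return effort
```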

Models NOT Supporting reasoning_effort

The following models do not support reasoning parameters (parameter is stripped):

  • GPT-4o, GPT-4o-mini, GPT-4-turbo, GPT-4
  • GPT-5.2-chat-latest, GPT-5.2-instant (non-thinking variants)
  • GPT-5.1, GPT-5
  • GPT-3.5-turbo
  • Embedding models, Image models
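The stripping behavior can be sketched with a prefix allow-list; both lists below are rough illustrations of the model families named above, not the router's actual tables:

```python
# Illustrative approximations of the supported-model families.
REASONING_PREFIXES = ("o1", "o3", "o4-mini", "gpt-5.2")
NON_THINKING_VARIANTS = {"gpt-5.2-chat-latest", "gpt-5.2-instant"}

def strip_unsupported_reasoning(model: str, request: dict) -> dict:
    """Remove `reasoning_effort` from requests bound for models that
    don't support it, leaving supported models untouched."""
    req = dict(request)
    supported = (
        model.startswith(REASONING_PREFIXES)
        and model not in NON_THINKING_VARIANTS
    )
    if not supported:
        req.pop("reasoning_effort", None)
    return req
```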

Anthropic Backend (Claude)

Anthropic Claude models use a different mechanism called "extended thinking" with a thinking.budget_tokens parameter. The router automatically converts reasoning_effort to the appropriate token budget.

Conversion Table

reasoning_effort  budget_tokens  Description
none              disabled       Thinking feature disabled
minimal           1,024          Minimum allowed budget
low               4,096          Light reasoning
medium            10,240         Moderate reasoning
high              32,768         Deep reasoning

Transformation Example

Input (OpenAI format):

{
  "model": "claude-sonnet-4-20250514",
  "reasoning_effort": "high",
  "messages": [...]
}

Transformed (Anthropic format):

{
  "model": "claude-sonnet-4-20250514",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 32768
  },
  "messages": [...]
}
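The full conversion can be sketched in Python using the budget values from the conversion table; the temperature removal and xhigh fallback covered elsewhere in this section are folded in. The function name is illustrative:

```python
EFFORT_TO_BUDGET = {"minimal": 1024, "low": 4096, "medium": 10240, "high": 32768}

def to_anthropic_thinking(request: dict) -> dict:
    """Convert an OpenAI-style `reasoning_effort` into Anthropic's
    `thinking.budget_tokens`, mirroring the conversion table above."""
    req = dict(request)
    effort = req.pop("reasoning_effort", None)
    if effort is None or effort == "none":
        return req                    # thinking disabled / not requested
    if effort == "xhigh":             # Claude has no xhigh: fall back to high
        effort = "high"
    req["thinking"] = {"type": "enabled", "budget_tokens": EFFORT_TO_BUDGET[effort]}
    req.pop("temperature", None)      # custom temperature unsupported with thinking
    return req

out = to_anthropic_thinking({
    "model": "claude-sonnet-4-20250514",
    "reasoning_effort": "high",
    "temperature": 0.7,
    "messages": [],
})
```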

Models Supporting Extended Thinking

Extended thinking is supported by:

  • Claude Opus models: claude-opus-4-*, claude-opus-4-5-*
  • Claude Sonnet 4 models: claude-sonnet-4-*, claude-sonnet-4-5-*

Temperature Restriction

When extended thinking is enabled, Claude does not support custom temperature settings. The router automatically removes the temperature parameter when thinking is active.

xhigh Automatic Fallback

When xhigh is requested for Claude models, the router automatically downgrades it to high (32,768 budget_tokens) since Claude doesn't support xhigh.


Gemini Backend

Gemini models support reasoning_effort natively through their OpenAI-compatible endpoint. The router validates and passes the value directly.

Supported Effort Levels

Effort Level  Supported Models     Description
none          Flash models only    Disable thinking
minimal       All thinking models  Minimal reasoning
low           All thinking models  Light reasoning
medium        All thinking models  Moderate reasoning
high          All thinking models  Deep reasoning

Flash Models (support none)

  • gemini-2.0-flash
  • gemini-2.5-flash
  • gemini-3-flash

Pro Models (do NOT support none)

  • gemini-2.5-pro
  • gemini-3-pro

Pro Model Restriction

Requesting none effort level for Pro models results in a validation error. Only Flash models support disabling thinking.

xhigh Automatic Fallback

When xhigh is requested for Gemini models, the router automatically downgrades it to high since Gemini doesn't support xhigh.

Additional Features

For Gemini thinking models, the router automatically:

  1. Sets include_thoughts: true to expose reasoning content
  2. Sets default max_completion_tokens: 16384 if not specified
  3. Validates effort level against model capabilities
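These validation and default-setting steps can be sketched together as follows (illustrative helper; the error messages and exact nesting are assumptions):

```python
FLASH_PREFIXES = ("gemini-2.0-flash", "gemini-2.5-flash", "gemini-3-flash")
VALID_EFFORTS = {"none", "minimal", "low", "medium", "high"}

def prepare_gemini_request(model: str, request: dict) -> dict:
    """Validate `reasoning_effort` for Gemini and apply the router's
    defaults described above. Raises ValueError on invalid combinations."""
    req = dict(request)
    effort = req.get("reasoning_effort")
    if effort == "xhigh":                     # Gemini has no xhigh
        effort = req["reasoning_effort"] = "high"
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(f"unsupported effort level: {effort}")
    if effort == "none" and not model.startswith(FLASH_PREFIXES):
        raise ValueError("only Flash models support disabling thinking")
    # Expose reasoning content and cap output length by default.
    req.setdefault("extra_body", {}).setdefault("google", {}).setdefault(
        "thinking_config", {}
    ).setdefault("include_thoughts", True)
    req.setdefault("max_completion_tokens", 16384)
    return req
```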

Generic/HTTP Backend

Generic HTTP backends (used for Ollama, vLLM, LocalAI, LM Studio, etc.) pass requests through without transformation.

Behavior       Description
Pass-through   reasoning_effort is forwarded as-is to the backend
No validation  The router does not validate effort levels
No conversion  No token budget conversion is performed

Backend Responsibility

For generic backends, the target LLM server is responsible for handling (or ignoring) the reasoning_effort parameter. If the server doesn't support it, it may return an error or ignore the parameter.


llama.cpp Backend

The llama.cpp backend uses pass-through behavior:

Behavior          Description
Pass-through      Parameters forwarded unchanged
Server-dependent  Support depends on llama-server configuration

Thinking Model Support in llama.cpp

Recent versions of llama-server support thinking models (e.g., DeepSeek-R1) with <think> tag handling. However, there is no standardized API parameter for reasoning_effort or budget_tokens. The thinking behavior is typically controlled by:

  • Model's built-in chat template with thinking tags
  • Server-side configuration (e.g., --thinking-budget if available)
  • Sampling parameters like temperature and top-p

vLLM Backend

vLLM provides an OpenAI-compatible API but reasoning effort support depends on the model:

Behavior         Description
Pass-through     Parameters forwarded via generic backend
Model-dependent  Support varies by model type

Thinking Model Support in vLLM

vLLM can run thinking-capable models (e.g., DeepSeek-R1, QwQ) but the reasoning_effort parameter handling depends on:

  • Whether the model supports structured thinking
  • vLLM server version and configuration
  • Model's chat template configuration

For DeepSeek-R1 and similar models, thinking is often implicit in the model's behavior rather than controlled by an explicit budget parameter.


Summary Table

Backend    Effort Levels                      Conversion     Notes
OpenAI     low, medium, high, xhigh*          None (native)  *xhigh only for GPT-5.2
Anthropic  none, minimal, low, medium, high   budget_tokens  Removes temperature when enabled
Gemini     none*, minimal, low, medium, high  None (native)  *none only for Flash models
vLLM       Model-dependent                    Pass-through   DeepSeek-R1, QwQ use implicit thinking
llama.cpp  Model-dependent                    Pass-through   Uses <think> tag in chat template
Generic    Any                                Pass-through   Backend handles validation

Response Format

When reasoning/thinking is enabled, the response includes the model's reasoning process:

OpenAI Format (with reasoning_content)

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    },
    "finish_reason": "stop"
  }]
}

Claude Extended Thinking (transformed to OpenAI format)

The router transforms Claude's thinking blocks to the reasoning_content field:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "I need to consider multiple factors..."
    },
    "finish_reason": "stop"
  }]
}
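The thinking-block transformation can be sketched as follows, assuming Claude's response contains content blocks of type "thinking" carrying a thinking field (illustrative helper; finish_reason mapping is simplified):

```python
def claude_to_openai_choice(anthropic_response: dict) -> dict:
    """Map Claude content blocks to an OpenAI-style choice, moving
    `thinking` blocks into `reasoning_content` as described above."""
    text_parts, thinking_parts = [], []
    for block in anthropic_response.get("content", []):
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            text_parts.append(block.get("text", ""))
    message = {"role": "assistant", "content": "".join(text_parts)}
    if thinking_parts:
        message["reasoning_content"] = "".join(thinking_parts)
    return {"choices": [{"message": message, "finish_reason": "stop"}]}
```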

Best Practices

  1. Use appropriate effort levels: Higher effort = better reasoning but slower and more expensive
  2. Check model support: Not all models support reasoning parameters
  3. Handle xhigh carefully: Only GPT-5.2 family supports xhigh; use high for other models
  4. Consider cost: Extended thinking consumes additional tokens (especially with Anthropic's budget_tokens)
  5. Test with your backend: Generic backends may have varying support levels