Agent Reasoning Mode Comparison: Extended Thinking vs. Streaming Output. How to Choose?

The Cost, Speed, and Accuracy Trade-off Triangle

Core Difference in One Sentence

| Mode | Thinking Process | Output Speed | Accuracy | Cost | Use Case |
| --- | --- | --- | --- | --- | --- |
| Streaming | Direct output | ⚡ 1-5 sec | 70-80% | $0.15/M | Chat, content generation, quick response |
| Extended Thinking | Deep reasoning, then output | 🐢 10-60 sec | 92-98% | $1.50/M | Programming, math, reasoning, decision-making |

Why Two Modes?

Before 2023, mainstream LLMs answered in a single generation pass: given the input prompt, the model emits the answer token by token until EOS (end of sequence), with no separate reasoning phase beforehand. This resulted in:

✅ Pros: Fast, cheap, streamable

❌ Cons: Often wrong on complex problems, no self-correction

In 2024, two breakthroughs broke this balance:

OpenAI o1's 'Chain-of-Thought': Before generating the final answer, the model internally 'thinks' for 5-20 seconds (invisible to the user), using hidden tokens to represent the thinking process. This:

- Gives the model ample time to verify logic
- SWE-bench jumped from 68% to 85%+
- But cost doubled

Claude's 'Extended Thinking': Similar idea, but directly returns the thinking process (transparent), allowing users to see the agent's chain of thought, making it more trustworthy.

Practical Comparison: Fixing a Complex Bug

Scenario

Production is down; a user's order status is incorrect. The agent needs to analyze database logs, trace the transaction flow, and provide root cause and fix.

Usage 1: Streaming (Direct Output)

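```typescript
import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic();

// Direct call: no thinking budget, the model answers immediately
const response = await anthropic.messages.create({
  model: 'claude-opus-4-1',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: `Here are the database logs... [1000 lines of logs]
Please analyze the root cause and provide a fix`
    }
  ]
});
```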

Output (5 sec):


Based on the logs, order #12345 failed to update at 2024-05-15 14:32:15.
Possible cause: concurrent race condition.
Suggestion: add optimistic locking.

Problem: The analysis is too shallow and never traces the full transaction chain. After the suggested fix was applied, the real cause turned out to be a deadlock. The call cost $0.05, but because the fix failed, the total cost was $0.05 plus 2 hours of manual debugging.

Usage 2: Extended Thinking (Deep Reasoning)

```typescript
// Same request as above, with a thinking budget.
// max_tokens must be larger than budget_tokens.
const response = await anthropic.messages.create({
  model: 'claude-opus-4-1',
  max_tokens: 10000,
  thinking: {
    type: 'enabled',
    budget_tokens: 8000  // Use up to 8000 tokens for thinking
  },
  messages: [
    {
      role: 'user',
      content: `Here are the database logs... [1000 lines of logs]
Please thoroughly analyze the root cause and provide a fix`
    }
  ]
});
```

Thinking Process (returned in separate thinking blocks; the end user sees it only if the application displays it):


Check timestamps: order created 14:30:00, first update 14:32:15
Look at transaction IDs in logs: TX_5672_A and TX_5672_B started in the same millisecond
Trace TX_5672_A: updates customers table (holds write lock on customers) → waits for write lock on orders (BLOCKED)
Trace TX_5672_B: acquires write lock on orders table → waits for read lock on customers table (BLOCKED)
Conclusion: A and B form a deadlock cycle!
Root cause: the two transactions acquire locks on orders and customers in opposite order and hold them while waiting
Fix: modify order update logic to complete all operations within a single transaction that takes locks in a consistent order

Final Output (45 sec):


Deadlock root cause identified:
Transaction TX_5672_A holds write lock on customers table, waiting for write lock on orders table
Transaction TX_5672_B holds write lock on orders table, waiting for read lock on customers table
Fix steps:
1. Refactor order update logic into a single transaction (see code block)
2. Add lock timeout configuration (max_lock_wait = 5s)
3. Recover locked orders: UPDATE orders SET status='PENDING' WHERE locked_at < NOW()-1min
4. Monitoring: add slow query alerts, alert when lock wait > 100ms

Cost Comparison:

| Mode | Cost per Call | Success Rate | Total Cost (success + failure reinvestigation) |
| --- | --- | --- | --- |
| Streaming | $0.05 | 40% | $0.05 + $300 (manual) = $300.05 |
| Extended Thinking | $0.50 | 95% | $0.50 + $20 (verification) = $20.50 |

Conclusion: Although Extended Thinking costs 10x more per call, its total cost in this scenario is roughly 15x lower ($20.50 vs. $300.05).
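The totals in the comparison are simple sums, and they can be reproduced in a few lines. This is only an illustrative sketch: the `Scenario` type and the `totalCost` function are hypothetical names, and the dollar figures are the ones quoted in the comparison above.

```typescript
// Hypothetical helper that reproduces the article's total-cost arithmetic.
interface Scenario {
  mode: string;
  callCostUsd: number;     // model cost for one call
  followUpCostUsd: number; // manual debugging or verification afterwards
}

function totalCost(s: Scenario): number {
  return s.callCostUsd + s.followUpCostUsd;
}

const streaming: Scenario = { mode: 'Streaming', callCostUsd: 0.05, followUpCostUsd: 300 };
const extended: Scenario = { mode: 'Extended Thinking', callCostUsd: 0.5, followUpCostUsd: 20 };

// How many times cheaper the Extended Thinking path is overall.
const ratio = totalCost(streaming) / totalCost(extended);
console.log(`Total cost ratio: ${ratio.toFixed(1)}x`); // prints "Total cost ratio: 14.6x"
```

The exact ratio is about 14.6, which the article rounds to 15x. The follow-up costs dominate both totals, so the result is far more sensitive to the assumed manual-debugging cost than to the per-call price.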

Selection Matrix

| Task Type | Reasoning Difficulty | Which Mode? | Reason |
| --- | --- | --- | --- |
| Chat Q&A ("What is MCP?") | Low | Streaming | Direct query return is enough |
| Content generation (copy, articles) | Low | Streaming | Creative output doesn't need deep verification |
| Simple programming (fix one line bug) | Low | Streaming | Obvious fix, no deep analysis needed |
| Complex algorithm implementation | 🔴 High | Extended Thinking | Needs logic verification, error cost high |
| System design (architecture decisions) | 🔴 High | Extended Thinking | Multi-layer trade-offs, needs deep thought |
| Math problems | 🔴 High | Extended Thinking | Intermediate steps need step-by-step verification |
| Production incident troubleshooting | 🔴 High | Extended Thinking | Misdiagnosis cost extremely high |
| Code review + optimization | Medium | Hybrid (see below) | Quick scan with Streaming, deep review with Extended |
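In code, the matrix reduces to a lookup from reasoning difficulty to mode. The sketch below is one way to encode it. The type names, the `chooseMode` function, and the task keys are all hypothetical; only the difficulty-to-mode mapping comes from the matrix.

```typescript
type Difficulty = 'low' | 'medium' | 'high';
type Mode = 'streaming' | 'extended_thinking' | 'hybrid';

// Difficulty-to-mode mapping taken from the selection matrix.
const MODE_BY_DIFFICULTY: Record<Difficulty, Mode> = {
  low: 'streaming',
  medium: 'hybrid',
  high: 'extended_thinking',
};

// A few task types from the matrix; the keys are illustrative labels.
const DIFFICULTY_BY_TASK: Record<string, Difficulty> = {
  chat_qa: 'low',
  content_generation: 'low',
  one_line_bug_fix: 'low',
  complex_algorithm: 'high',
  system_design: 'high',
  math_problem: 'high',
  production_incident: 'high',
  code_review: 'medium',
};

function chooseMode(task: string): Mode {
  // Assumption: unknown tasks default to the cheap path.
  const difficulty = DIFFICULTY_BY_TASK[task] ?? 'low';
  return MODE_BY_DIFFICULTY[difficulty];
}

console.log(chooseMode('production_incident')); // prints "extended_thinking"
console.log(chooseMode('code_review'));         // prints "hybrid"
```

Defaulting unknown tasks to streaming keeps costs down, but a system where misrouting is expensive might prefer to default to the hybrid path instead.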

Hybrid Strategy: Two-Stage Agent

Best practice is two-stage:

Stage 1: Quick Filter (Streaming)

```typescript
// First use Streaming to quickly assess problem level.
// chatWithStreaming is a helper that wraps the streaming call and parses the reply.
const { problem_level } = await chatWithStreaming(`
Based on this error log, determine the problem level (critical/high/medium/low) and give a preliminary diagnosis
`);
```

Stage 2: Deep Diagnosis (Extended Thinking, only triggered for critical/high)

```typescript
// Escalate only the two most severe levels to the slower, costlier mode
if (problem_level === 'critical' || problem_level === 'high') {
  const { analysis } = await chatWithExtendedThinking(`
  Here is my preliminary diagnosis... [logs]
Please conduct a thorough analysis and verify if my diagnosis is correct
  `);
}
```

Effect:

Saves 70% cost (most problems are judged in stage 1)

Ensures >95% diagnostic accuracy for critical problems
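The two stages can be combined into one routing function. The sketch below is a minimal version under stated assumptions: `diagnose`, `shouldEscalate`, and the `Triage` type are hypothetical names, and the two model calls are passed in as plain async functions so that the routing logic can be exercised without network access.

```typescript
type Level = 'critical' | 'high' | 'medium' | 'low';

interface Triage {
  level: Level;
  diagnosis: string;
}

// Only critical and high problems justify the second stage.
function shouldEscalate(level: Level): boolean {
  return level === 'critical' || level === 'high';
}

async function diagnose(
  log: string,
  quickTriage: (log: string) => Promise<Triage>,                       // stage 1: streaming call
  deepAnalysis: (log: string, preliminary: string) => Promise<string>, // stage 2: extended thinking call
): Promise<{ level: Level; diagnosis: string; escalated: boolean }> {
  const triage = await quickTriage(log);
  if (!shouldEscalate(triage.level)) {
    // The cheap answer is good enough; skip the expensive call.
    return { ...triage, escalated: false };
  }
  // Hand the preliminary diagnosis to stage 2 so it can be verified or corrected.
  const verified = await deepAnalysis(log, triage.diagnosis);
  return { level: triage.level, diagnosis: verified, escalated: true };
}
```

Injecting the two calls keeps the escalation rule in one place and makes it straightforward to unit test with stubbed responses.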

API Call Examples

Streaming (Current Standard API)

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function quickResponse(prompt: string) {
  const stream = client.messages.stream({
    model: 'claude-opus-4-1',
    max_tokens: 1024,
    messages: [{ role: 'user', content: prompt }]
  });
  // Print each text delta as soon as it arrives
  for await (const chunk of stream) {
    if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') {
      process.stdout.write(chunk.delta.text);  // real-time output
    }
  }
}

quickResponse('Explain what MCP is');
```

Extended Thinking (New API)

```typescript
// Reuses the `client` created in the streaming example
async function deepAnalysis(prompt: string) {
  const response = await client.messages.create({
    model: 'claude-opus-4-1',
    max_tokens: 10000,  // must be larger than budget_tokens
    thinking: {
      type: 'enabled',
      budget_tokens: 5000  // Up to 5000 tokens for thinking
    },
    messages: [{ role: 'user', content: prompt }]
  });
  // Extract thinking process and final response
  let thinking = '';
  let response_text = '';
  for (const block of response.content) {
    if (block.type === 'thinking') {
      thinking = block.thinking;
    } else if (block.type === 'text') {
      response_text = block.text;
    }
  }
  console.log('Thinking process:\n', thinking);
  console.log('\nFinal response:\n', response_text);
}

deepAnalysis('This is a complex system failure, please analyze...');
```
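One detail in the extended thinking request is easy to get wrong: the thinking budget has to fit inside `max_tokens`. The sketch below validates the two numbers before a request is built. `buildThinkingRequest` and `MIN_THINKING_BUDGET` are hypothetical names. The limits encoded here (a minimum budget of 1024 tokens, and a budget smaller than `max_tokens`) reflect the API documentation at the time of writing and are worth checking against the current reference.

```typescript
interface ThinkingRequest {
  max_tokens: number;
  thinking: { type: 'enabled'; budget_tokens: number };
}

// Assumed minimum thinking budget; verify against the current API reference.
const MIN_THINKING_BUDGET = 1024;

function buildThinkingRequest(maxTokens: number, budgetTokens: number): ThinkingRequest {
  if (budgetTokens < MIN_THINKING_BUDGET) {
    throw new RangeError(`budget_tokens must be at least ${MIN_THINKING_BUDGET}`);
  }
  if (budgetTokens >= maxTokens) {
    // The budget counts toward max_tokens, so there must be room left for the answer.
    throw new RangeError('budget_tokens must be smaller than max_tokens');
  }
  return {
    max_tokens: maxTokens,
    thinking: { type: 'enabled', budget_tokens: budgetTokens },
  };
}

console.log(buildThinkingRequest(10000, 5000).thinking.budget_tokens); // prints 5000
```

The returned object can be spread into the request, for example `client.messages.create({ model, messages, ...buildThinkingRequest(10000, 5000) })`.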

Cost Calculation

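Streaming (Standard Pricing)

The rates below are illustrative. Actual prices depend on the model, so check the current price list before budgeting.

Input: $3 per 1M tokens ($0.003 per 1K tokens)

Output: $15 per 1M tokens ($0.015 per 1K tokens)

Example: 1000 token input + 500 token output = (1000 × $3 + 500 × $15) / 1M = $0.003 + $0.0075 = $0.0105

Extended Thinking (New Thinking Tokens)

Input: $3 per 1M tokens

Output: $15 per 1M tokens

Thinking tokens: billed as output tokens, so $15 per 1M tokens

Example: 1000 token input + 5000 token thinking + 500 token output = (1000 × $3 + 5500 × $15) / 1M = $0.003 + $0.0825 = $0.0855 per call (about 8x the streaming example, but still under 10 cents)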
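The per-call arithmetic can be wrapped in a small estimator. This sketch assumes rates of $3 (input) and $15 (output) per million tokens, and assumes thinking tokens are billed at the output rate, which is how Anthropic's pricing documentation describes extended thinking. `estimateCostUsd`, `Rates`, and `Usage` are hypothetical names, and the rates are parameters so they can be replaced with the figures for the model actually in use.

```typescript
interface Rates {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

interface Usage {
  inputTokens: number;
  outputTokens: number;
  thinkingTokens?: number; // omitted for calls without extended thinking
}

function estimateCostUsd(usage: Usage, rates: Rates): number {
  // Assumption: thinking tokens are billed exactly like output tokens.
  const billedOutput = usage.outputTokens + (usage.thinkingTokens ?? 0);
  return (usage.inputTokens * rates.inputPerMTok + billedOutput * rates.outputPerMTok) / 1_000_000;
}

const exampleRates: Rates = { inputPerMTok: 3, outputPerMTok: 15 };

const direct = estimateCostUsd({ inputTokens: 1000, outputTokens: 500 }, exampleRates);
const withThinking = estimateCostUsd(
  { inputTokens: 1000, outputTokens: 500, thinkingTokens: 5000 },
  exampleRates,
);

console.log(direct.toFixed(4));       // prints "0.0105"
console.log(withThinking.toFixed(4)); // prints "0.0855"
```

At these rates, thinking tokens account for most of the bill: 5000 thinking tokens cost $0.075 of the $0.0855 total.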

Summary: When to Upgrade to Extended Thinking?

| Metric | When to Upgrade? |
| --- | --- |
| Error tolerance | Acceptable error rate < 10% → upgrade |
| Fix cost | > $100 (manual time) → upgrade |
| Real-time requirement | Can wait up to 1 minute → upgrade |
| User experience | No need for real-time streaming → upgrade |

Simple rule: If a wrong answer would cause financial loss, data loss, or user complaints, use Extended Thinking. Otherwise, use Streaming.
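The thresholds in the summary can be turned into a single predicate. The article does not say how the four criteria combine, so the sketch below makes an explicit assumption: the task must be high-stakes (low error tolerance or a costly fix), and it must also tolerate the latency (able to wait about a minute, no live streaming needed). `TaskProfile` and `shouldUseExtendedThinking` are hypothetical names.

```typescript
interface TaskProfile {
  acceptableErrorRate: number;   // fraction between 0 and 1
  fixCostUsd: number;            // cost of cleaning up after a wrong answer
  maxWaitSeconds: number;        // how long the caller can wait for a reply
  needsRealtimeStreaming: boolean;
}

function shouldUseExtendedThinking(task: TaskProfile): boolean {
  // Assumption: either stakes criterion is enough to make the task high-stakes.
  const highStakes = task.acceptableErrorRate < 0.1 || task.fixCostUsd > 100;
  // Assumption: both latency criteria must hold for the slower mode to be usable.
  const toleratesLatency = task.maxWaitSeconds >= 60 && !task.needsRealtimeStreaming;
  return highStakes && toleratesLatency;
}

const incident: TaskProfile = {
  acceptableErrorRate: 0.05,
  fixCostUsd: 300,
  maxWaitSeconds: 120,
  needsRealtimeStreaming: false,
};
const chat: TaskProfile = {
  acceptableErrorRate: 0.3,
  fixCostUsd: 0,
  maxWaitSeconds: 5,
  needsRealtimeStreaming: true,
};

console.log(shouldUseExtendedThinking(incident)); // prints true
console.log(shouldUseExtendedThinking(chat));     // prints false
```

A task that is high-stakes but cannot tolerate the latency falls through to streaming here. In practice that combination may be a better fit for the two-stage hybrid approach.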
