← Back to tutorials

Agent Reasoning Mode Comparison: Extended Thinking vs Streaming Output, How to Choose?

Cost, Speed, Accuracy Triangle Trade-off

Agent Reasoning Mode Comparison: Extended Thinking vs Streaming Output

Core Difference in One Sentence

ModeThinking ProcessOutput SpeedAccuracyCostUse Case

StreamingDirect output⚑ 1-5 sec70-80%$0.15/MChat, content generation, quick response Extended ThinkingDeep reasoning then output🐒 10-60 sec92-98%$1.50/MProgramming, math, reasoning, decision-making

Why Two Modes?

Before 2023, all LLMs used 'single forward pass': input prompt β†’ model generates all tokens at once until EOS (end of sequence). This resulted in:

  • βœ… Pros: Fast, cheap, streamable
  • ❌ Cons: Often wrong on complex problems, no self-correction
  • In 2024, two breakthroughs broke this balance:

  • OpenAI o1's 'Chain-of-Thought': Before generating the final answer, the model internally 'thinks' for 5-20 seconds (invisible to the user), using hidden tokens to represent the thinking process. This:
  • - Gives the model ample time to verify logic - SWE-bench jumped from 68% to 85%+ - But cost doubled

  • Claude's 'Extended Thinking': Similar idea, but directly returns the thinking process (transparent), allowing users to see the agent's chain of thought, making it more trustworthy.
  • Practical Comparison: Fixing a Complex Bug

    Scenario

    Production is down; a user's order status is incorrect. The agent needs to analyze database logs, trace the transaction flow, and provide root cause and fix.

    Usage 1: Streaming (Direct Output)

    typescript
    const response = await anthropic.messages.create({
      model: 'claude-opus-4-1',
      max_tokens: 1024,
      messages: [
        {
          role: 'user',
          content: `Here are the database logs... [1000 lines of logs]
    Please analyze the root cause and provide a fix`
        }
      ]
    });
    

    Output (5 sec):

    
    Based on the logs, order #12345 failed to update at 2024-05-15 14:32:15.
    Possible cause: concurrent race condition.
    Suggestion: add optimistic locking.
    

    Problem: Too shallow, didn't trace the full transaction chain. After the fix, it turned out to be a deadlock. Cost: $0.05, but fix failed, total cost = $0.05 + 2 hours manual debug.

    Usage 2: Extended Thinking (Deep Reasoning)

    typescript
    const response = await anthropic.messages.create({
      model: 'claude-opus-4-1',
      max_tokens: 10000,
      thinking: {
        type: 'enabled',
        budget_tokens: 8000  // Use up to 8000 tokens for thinking
      },
      messages: [
        {
          role: 'user',
          content: `Here are the database logs... [1000 lines of logs]
    Please thoroughly analyze the root cause and provide a fix`
        }
      ]
    });
    

    Thinking Process (invisible to user):

    
    
  • Check timestamps: order created 14:30:00, first update 14:32:15
  • Look at transaction IDs in logs: TX_5672_A and TX_5672_B started in the same millisecond
  • Trace TX_5672_A: acquires read lock on orders table β†’ updates customers table β†’ waits for write lock on orders (BLOCKED)
  • Trace TX_5672_B: acquires write lock on orders table β†’ waits for read lock on customers table (BLOCKED)
  • Conclusion: A and B form a deadlock cycle!
  • Root cause: transactions hold locks too long; A acquires orders read lock first, B acquires orders write lock simultaneously
  • Fix: modify order update logic to complete all operations within a single transaction
  • Final Output (45 sec):

    
    Deadlock root cause identified:
    
  • Transaction TX_5672_A holds read lock on orders table, waiting for write lock on customers table
  • Transaction TX_5672_B holds write lock on orders table, waiting for read lock on customers table
  • Fix steps:

  • Refactor order update logic into a single transaction (see code block)
  • Add lock timeout configuration (max_lock_wait = 5s)
  • Recover locked orders: UPDATE orders SET status='PENDING' WHERE locked_at < NOW()-1min
  • Monitoring: add slow query alerts, alert when lock wait > 100ms
  • Cost Comparison: ModeCost per CallSuccess RateTotal Cost (success + failure reinvestigation)

    Streaming$0.0540%$0.05 + $300 (manual) = $300.05 Extended Thinking$0.5095%$0.50 + $20 (verification) = $20.50

    Conclusion: Although Extended Thinking costs 10x per call, total cost is 15x lower!

    Selection Matrix

    Task TypeReasoning DifficultyWhich Mode?Reason

    Chat Q&A ("What is MCP?")LowStreamingDirect query return is enough Content generation (copy, articles)LowStreamingCreative output doesn't need deep verification Simple programming (fix one line bug)LowStreamingObvious fix, no deep analysis needed Complex algorithm implementationπŸ”΄ HighExtended ThinkingNeeds logic verification, error cost high System design (architecture decisions)πŸ”΄ HighExtended ThinkingMulti-layer trade-offs, needs deep thought Math problemsπŸ”΄ HighExtended ThinkingIntermediate steps need step-by-step verification Production incident troubleshootingπŸ”΄ HighExtended ThinkingMisdiagnosis cost extremely high Code review + optimizationMediumHybrid (see below)Quick scan with Streaming, deep review with Extended

    Hybrid Strategy: Two-Stage Agent

    Best practice is two-stage:

    Stage 1: Quick Filter (Streaming)

    typescript
    // First use Streaming to quickly assess problem level
    const { problem_level } = await chatWithStreaming(`
    Based on this error log, determine the problem level (critical/high/medium/low) and give a preliminary diagnosis
    `);
    

    Stage 2: Deep Diagnosis (Extended Thinking, only triggered for critical/high)

    typescript
    if (problem_level === 'critical') {
      const { analysis } = await chatWithExtendedThinking(`
      Here is my preliminary diagnosis... [logs]
    Please conduct a thorough analysis and verify if my diagnosis is correct
      `);
    }
    

    Effect:

  • Saves 70% cost (most problems are judged in stage 1)
  • Ensures >95% diagnostic accuracy for critical problems
  • API Call Examples

    Streaming (Current Standard API)

    typescript
    import Anthropic from '@anthropic-ai/sdk';

    const client = new Anthropic();

    async function quickResponse(prompt: string) { const stream = client.messages.stream({ model: 'claude-opus-4-1', max_tokens: 1024, messages: [{ role: 'user', content: prompt }] });

    for await (const chunk of stream) { if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') { process.stdout.write(chunk.delta.text); // real-time output } } }

    quickResponse('Explain what MCP is');

    Extended Thinking (New API)

    typescript
    async function deepAnalysis(prompt: string) {
      const response = await client.messages.create({
        model: 'claude-opus-4-1',
        max_tokens: 10000,
        thinking: {
          type: 'enabled',
          budget_tokens: 5000  // Up to 5000 tokens for thinking
        },
        messages: [{ role: 'user', content: prompt }]
      });

    // Extract thinking process and final response let thinking = ''; let response_text = '';

    for (const block of response.content) { if (block.type === 'thinking') { thinking = block.thinking; } else if (block.type === 'text') { response_text = block.text; } }

    console.log('Thinking process:\n', thinking); console.log('\nFinal response:\n', response_text); }

    deepAnalysis('This is a complex system failure, please analyze...');

    Cost Calculation

    Streaming (Standard Pricing)

  • Input: $0.003/1M tokens
  • Output: $0.015/1M tokens
  • Example: 1000 token input + 500 token output = $0.0075

    Extended Thinking (New Thinking Tokens)

  • Input: $0.003/1M tokens
  • Output: $0.015/1M tokens
  • Thinking tokens: $0.003/1M tokens (same as input)
  • Example: 1000 token input + 5000 token thinking + 500 token output = (1000 Γ— $0.003 + 5000 Γ— $0.003 + 500 Γ— $0.015) / 1M = $0.021/1M = $0.000021 per call (single call cost still very low)

    Summary: When to Upgrade to Extended Thinking?

    MetricWhen to Upgrade?

    Error tolerance< 10% acceptable β†’ upgrade Fix cost> $100 (manual time) β†’ upgrade Real-time requirementCan wait < 1 minute β†’ upgrade User experienceNo need for real-time streaming β†’ upgrade

    Simple rule: If a wrong answer would cause financial loss, data loss, or user complaints, use Extended Thinking. Otherwise, use Streaming.

    Also available in δΈ­ζ–‡.

    Agent Reasoning Mode Comparison: Extended Thinking vs Streaming Output, How to Choose? | AI Skill Navigation | AI Skill Navigation