Agent Reasoning Mode Comparison: Extended Thinking vs Streaming Output, How to Choose?
Cost, Speed, Accuracy Triangle Trade-off
Core Difference in One Sentence
Why Two Modes?
Before 2023, all LLMs used a 'single forward pass': input prompt → model generates all tokens in one go until EOS (end of sequence). This resulted in:
In 2024, two breakthroughs broke this balance:
Practical Comparison: Fixing a Complex Bug
Scenario
Production is down; a user's order status is incorrect. The agent needs to analyze database logs, trace the transaction flow, and provide a root cause and a fix.
Usage 1: Streaming (Direct Output)
typescript
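import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();

// Direct answer: no thinking budget, small output cap
const response = await anthropic.messages.create({
  model: 'claude-opus-4-1',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: `Here are the database logs... [1000 lines of logs]
Please analyze the root cause and provide a fix`
    }
  ]
});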
Output (5 sec):
Based on the logs, order #12345 failed to update at 2024-05-15 14:32:15.
Possible cause: concurrent race condition.
Suggestion: add optimistic locking.
Problem: The answer is too shallow; it never traces the full transaction chain. After the suggested fix was applied, the real cause turned out to be a deadlock. Cost: $0.05 for the call, but because the fix failed, total cost = $0.05 + 2 hours of manual debugging.
Usage 2: Extended Thinking (Deep Reasoning)
typescript
const response = await anthropic.messages.create({
  model: 'claude-opus-4-1',
  max_tokens: 10000, // must be larger than budget_tokens
  thinking: {
    type: 'enabled',
    budget_tokens: 8000 // Use up to 8000 tokens for thinking
  },
  messages: [
    {
      role: 'user',
      content: `Here are the database logs... [1000 lines of logs]
Please thoroughly analyze the root cause and provide a fix`
    }
  ]
});
Thinking Process (invisible to user):
Check timestamps: order created 14:30:00, first update 14:32:15
Look at transaction IDs in logs: TX_5672_A and TX_5672_B started in the same millisecond
Trace TX_5672_A: acquires read lock on orders table → updates customers table → waits for write lock on orders (BLOCKED)
Trace TX_5672_B: acquires write lock on orders table → waits for read lock on customers table (BLOCKED)
Conclusion: A and B form a deadlock cycle!
Root cause: transactions hold locks too long; A acquires orders read lock first, B acquires orders write lock simultaneously
Fix: modify order update logic to complete all operations within a single transaction
Final Output (45 sec):
Deadlock root cause identified:
Transaction TX_5672_A holds read lock on orders table, waiting for write lock on customers table
Transaction TX_5672_B holds write lock on orders table, waiting for read lock on customers table
Fix steps:
Refactor order update logic into a single transaction (see code block)
Add lock timeout configuration (max_lock_wait = 5s)
Recover locked orders: UPDATE orders SET status='PENDING' WHERE locked_at < NOW() - INTERVAL '1' MINUTE
Monitoring: add slow query alerts, alert when lock wait > 100ms
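The deadlock conclusion in the trace above amounts to finding a cycle in a "wait-for" graph, where each transaction points at the transactions blocking it. The sketch below is one way to check that mechanically. It is an illustration only: the `WaitForGraph` type and the `findDeadlockCycle` function are hypothetical names, and a real database would build this graph from its own lock tables rather than from a hand-written object.

```typescript
// Hypothetical wait-for graph: each transaction ID maps to the IDs it is blocked on.
type WaitForGraph = Record<string, string[]>;

// Depth-first search that returns one cycle of transaction IDs, or null if none exists.
function findDeadlockCycle(graph: WaitForGraph): string[] | null {
  const visiting = new Set<string>(); // transactions on the current DFS path
  const done = new Set<string>();     // transactions already fully explored
  const path: string[] = [];

  const visit = (tx: string): string[] | null => {
    if (done.has(tx)) return null;
    if (visiting.has(tx)) {
      // Back edge: the cycle is the part of the path starting at tx.
      return path.slice(path.indexOf(tx));
    }
    visiting.add(tx);
    path.push(tx);
    for (const blocker of graph[tx] ?? []) {
      const cycle = visit(blocker);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(tx);
    done.add(tx);
    return null;
  };

  for (const tx of Object.keys(graph)) {
    const cycle = visit(tx);
    if (cycle) return cycle;
  }
  return null;
}

// The two transactions from the example, each blocked on the other.
const waitFor: WaitForGraph = {
  TX_5672_A: ['TX_5672_B'],
  TX_5672_B: ['TX_5672_A'],
};
console.log(findDeadlockCycle(waitFor)); // [ 'TX_5672_A', 'TX_5672_B' ]
```

A non-null result means at least one transaction in the returned list has to be rolled back or timed out before the others can proceed.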
Cost Comparison:
Conclusion: Although Extended Thinking costs 10x per call, total cost is 15x lower!
Selection Matrix
Hybrid Strategy: Two-Stage Agent
Best practice is two-stage:
Stage 1: Quick Filter (Streaming)
typescript
// First use Streaming to quickly assess problem level
const { problem_level } = await chatWithStreaming(`
Based on this error log, determine the problem level (critical/high/medium/low) and give a preliminary diagnosis
`);
Stage 2: Deep Diagnosis (Extended Thinking, only triggered for critical/high)
typescript
// Escalate both critical and high, matching the stage description
if (problem_level === 'critical' || problem_level === 'high') {
  const { analysis } = await chatWithExtendedThinking(`
Here is my preliminary diagnosis... [logs]
Please conduct a thorough analysis and verify if my diagnosis is correct
`);
}
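The routing decision between the two stages can be isolated into small pure functions, which makes it easy to test without calling the API. The sketch below is a minimal illustration under some assumptions: the names `parseProblemLevel` and `chooseMode` are hypothetical, the triage reply is assumed to mention one of the four level words, and escalating when no level can be parsed is a design choice made here, not something the two-stage approach requires.

```typescript
type ProblemLevel = 'critical' | 'high' | 'medium' | 'low';
type ReasoningMode = 'streaming' | 'extended_thinking';

// Pull the first recognized level word out of free-form triage output.
function parseProblemLevel(text: string): ProblemLevel | null {
  const match = text.toLowerCase().match(/\b(critical|high|medium|low)\b/);
  return match ? (match[1] as ProblemLevel) : null;
}

// Decide which mode handles the second stage.
function chooseMode(level: ProblemLevel | null): ReasoningMode {
  // Assumption: an unparseable triage result is treated as risky and escalated.
  if (level === null) return 'extended_thinking';
  return level === 'critical' || level === 'high' ? 'extended_thinking' : 'streaming';
}

const triageReply = 'Level: HIGH. Preliminary diagnosis: lock contention on orders.';
console.log(chooseMode(parseProblemLevel(triageReply))); // extended_thinking
```

Keeping the decision separate from the API calls also makes it simple to log how often each branch is taken, which is what determines the overall cost of the hybrid setup.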
Effect:
API Call Examples
Streaming (Current Standard API)
typescript
import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

async function quickResponse(prompt: string) {
  const stream = client.messages.stream({
    model: 'claude-opus-4-1',
    max_tokens: 1024,
    messages: [{ role: 'user', content: prompt }]
  });
  // Print each text delta as soon as it arrives
  for await (const chunk of stream) {
    if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') {
      process.stdout.write(chunk.delta.text); // real-time output
    }
  }
}

quickResponse('Explain what MCP is');
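When the streamed text is needed afterwards (for example, to read a severity level out of a triage reply), the same loop can accumulate the deltas instead of only printing them. The sketch below shows the idea against a fake stream so it runs without an API key. The `StreamEvent` type is a simplified local stand-in, not the SDK's own type, and `collectText` and `fakeStream` are hypothetical names.

```typescript
// Simplified local event types; the real SDK emits more event kinds and fields.
type TextDelta = { type: 'text_delta'; text: string };
type OtherDelta = { type: 'other_delta' };
type StreamEvent =
  | { type: 'content_block_delta'; delta: TextDelta | OtherDelta }
  | { type: 'message_start' }
  | { type: 'message_stop' };

// Concatenate every text delta in the stream and ignore all other events.
async function collectText(events: AsyncIterable<StreamEvent>): Promise<string> {
  let text = '';
  for await (const event of events) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      text += event.delta.text;
    }
  }
  return text;
}

// Stand-in for a live stream so the example is runnable offline.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { type: 'message_start' };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Hello, ' } };
  yield { type: 'content_block_delta', delta: { type: 'other_delta' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'world' } };
  yield { type: 'message_stop' };
}

collectText(fakeStream()).then((text) => console.log(text)); // Hello, world
```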
Extended Thinking (New API)
typescript
// Uses the `client` instance created in the streaming example
async function deepAnalysis(prompt: string) {
  const response = await client.messages.create({
    model: 'claude-opus-4-1',
    max_tokens: 10000, // must be larger than budget_tokens
    thinking: {
      type: 'enabled',
      budget_tokens: 5000 // Up to 5000 tokens for thinking
    },
    messages: [{ role: 'user', content: prompt }]
  });

  // Extract thinking process and final response
  let thinking = '';
  let response_text = '';
  for (const block of response.content) {
    if (block.type === 'thinking') {
      thinking += block.thinking;
    } else if (block.type === 'text') {
      // Append so that multiple text blocks are not overwritten
      response_text += block.text;
    }
  }
  console.log('Thinking process:\n', thinking);
  console.log('\nFinal response:\n', response_text);
}

deepAnalysis('This is a complex system failure, please analyze...');
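Two constraints on the thinking parameters are easy to get wrong: to my understanding of Anthropic's extended thinking documentation, `budget_tokens` has a minimum of 1024 and must be smaller than `max_tokens`. The sketch below checks both before a request is built. `buildThinkingParams` and `MIN_THINKING_BUDGET` are hypothetical names introduced for illustration, and the limits should be confirmed against the current documentation.

```typescript
interface ThinkingRequestParams {
  max_tokens: number;
  thinking: { type: 'enabled'; budget_tokens: number };
}

// Assumed minimum thinking budget; verify against the current API documentation.
const MIN_THINKING_BUDGET = 1024;

// Validate the two token limits and return the fragment to spread into a request.
function buildThinkingParams(maxTokens: number, budgetTokens: number): ThinkingRequestParams {
  if (!Number.isInteger(maxTokens) || !Number.isInteger(budgetTokens)) {
    throw new RangeError('Token limits must be integers');
  }
  if (budgetTokens < MIN_THINKING_BUDGET) {
    throw new RangeError(`budget_tokens must be at least ${MIN_THINKING_BUDGET}`);
  }
  if (budgetTokens >= maxTokens) {
    throw new RangeError('budget_tokens must be smaller than max_tokens');
  }
  return {
    max_tokens: maxTokens,
    thinking: { type: 'enabled', budget_tokens: budgetTokens },
  };
}

console.log(buildThinkingParams(10000, 5000));
```

The result can be spread into the request, for example `{ model, messages, ...buildThinkingParams(10000, 5000) }`, so a bad combination fails locally instead of as an API error.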
Cost Calculation
Streaming (Standard Pricing)
Example: 1000 token input + 500 token output = (1000 × $3 + 500 × $15) / 1M = $0.0105 per call
Extended Thinking (New Thinking Tokens)
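Example: 1000 token input + 5000 token thinking + 500 token output = (1000 × $3 + 5000 × $3 + 500 × $15) / 1M = $0.0255 per call (single call cost still very low). Note that Anthropic bills thinking tokens at the output rate; at $15 per 1M tokens the same call costs (1000 × $3 + 5500 × $15) / 1M = $0.0855.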
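The per-call arithmetic is easy to get wrong by a unit, so it helps to put it in a small function. The sketch below is illustrative only: `estimateCostUsd` and its types are hypothetical names, and the rates ($3 input and $15 output per million tokens) are taken from the example figures above, not from a current price list. The thinking rate is a separate parameter because the result depends heavily on it; Anthropic's pricing documentation bills thinking tokens as output tokens.

```typescript
interface Rates {
  inputPerMTok: number;    // USD per 1M input tokens
  thinkingPerMTok: number; // USD per 1M thinking tokens
  outputPerMTok: number;   // USD per 1M output tokens
}

interface Usage {
  inputTokens: number;
  thinkingTokens: number;
  outputTokens: number;
}

// Sum token counts times rates first, then divide once to keep rounding error small.
function estimateCostUsd(usage: Usage, rates: Rates): number {
  const microDollars =
    usage.inputTokens * rates.inputPerMTok +
    usage.thinkingTokens * rates.thinkingPerMTok +
    usage.outputTokens * rates.outputPerMTok;
  return microDollars / 1_000_000;
}

// Illustrative rates; thinking is billed at the output rate here.
const rates: Rates = { inputPerMTok: 3, thinkingPerMTok: 15, outputPerMTok: 15 };

const streamingCall: Usage = { inputTokens: 1000, thinkingTokens: 0, outputTokens: 500 };
const thinkingCall: Usage = { inputTokens: 1000, thinkingTokens: 5000, outputTokens: 500 };

console.log(estimateCostUsd(streamingCall, rates).toFixed(4)); // 0.0105
console.log(estimateCostUsd(thinkingCall, rates).toFixed(4));  // 0.0855
```

With these assumed rates the thinking call costs roughly eight times the streaming call, so the thinking budget is the main cost lever.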
Summary: When to Upgrade to Extended Thinking?
Simple rule: If a wrong answer would cause financial loss, data loss, or user complaints, use Extended Thinking. Otherwise, use Streaming.