Recursive AI Systems: Advanced Guide

AI systems that improve themselves iteratively

A recursive AI system feeds a model's output back into itself — to decompose a problem, refine an answer, or spawn sub-agents that spawn sub-agents. Recursion is the natural shape for problems whose structure you can't know upfront ("research this topic": you don't know the subtopics until you look). It is also the easiest way to build a system that burns $200 of tokens producing nothing. This guide covers the three patterns that work and the control machinery that keeps them safe.

Pattern 1: Recursive task decomposition

The model splits a task into subtasks; each subtask either gets solved directly or split again.

python
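import json
from typing import Optional
MAX_DEPTH = 3
# llm(prompt: str) -> str is assumed to be defined elsewhere: any function that
# sends a prompt to your model and returns its text reply.
def solve(task: str, depth: int = 0, budget: Optional[dict] = None) -> str:
    # One budget dict is shared by the whole call tree, so the cap is global
    budget = budget if budget is not None else {'calls': 30}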
    if budget['calls'] <= 0:
        return f'[budget exhausted before solving: {task[:60]}]'
    budget['calls'] -= 1
    # Force the leaf path at max depth — depth control lives in YOUR code, not the prompt
    if depth >= MAX_DEPTH:
        return llm(f'Solve directly and concisely: {task}')
    plan = json.loads(llm(
        f'Task: {task}\n'
        'If this is directly answerable in one step, return {"atomic": true}.\n'
        'Otherwise return {"atomic": false, "subtasks": ["...", "..."]} — at most 4, '
        'each strictly smaller and non-overlapping. JSON only.'
    ))
    if plan['atomic']:
        return llm(f'Solve directly and concisely: {task}')
    # Recurse on each subtask, passing the same budget dict down
    results = [solve(s, depth + 1, budget) for s in plan['subtasks']]
    joined = '\n\n'.join(f'## {s}\n{r}' for s, r in zip(plan['subtasks'], results))
    return llm(f'Synthesize these subtask results into one answer for: {task}\n\n{joined}')

The failure mode this code defends against: models are bad at deciding when to *stop* decomposing. Left unguarded, they split "write a haiku" into three subtasks. Hence the two hard limits (depth and call budget) enforced in code, plus prompt-side pressure ("strictly smaller", "at most 4"). Note that budget['calls'] counts solve() invocations, each of which makes up to two model calls (plan, then solve or synthesize), so size it accordingly. Independent subtasks parallelize naturally: once solve and llm are async, swap the list comprehension for asyncio.gather (see async patterns).
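A minimal sketch of that parallel variant, assuming an async model client. `llm_async` here is a hypothetical stub that fakes a planner so the example runs offline, and the shortened planner prompt is illustrative only; swap in your real client and the full prompt from above.

```python
import asyncio
import json
from typing import Optional

MAX_DEPTH = 3

async def llm_async(prompt: str) -> str:
    """Hypothetical stub standing in for a real async model client."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    if "JSON only" in prompt:
        # Fake planner: split only the task named "root", treat the rest as atomic.
        if prompt.startswith("Task: root\n"):
            return json.dumps({"atomic": False, "subtasks": ["part a", "part b"]})
        return json.dumps({"atomic": True})
    return f"answer<{prompt[:40]}>"

async def solve_async(task: str, depth: int = 0, budget: Optional[dict] = None) -> str:
    budget = budget if budget is not None else {"calls": 30}
    # No await sits between the check and the decrement, so concurrent
    # siblings on one event loop cannot both claim the last call.
    if budget["calls"] <= 0:
        return f"[budget exhausted before solving: {task[:60]}]"
    budget["calls"] -= 1
    if depth >= MAX_DEPTH:
        return await llm_async(f"Solve directly and concisely: {task}")
    plan = json.loads(await llm_async(f"Task: {task}\nReturn a decomposition plan. JSON only."))
    if plan["atomic"]:
        return await llm_async(f"Solve directly and concisely: {task}")
    # Sibling subtasks run concurrently; gather returns results in input order.
    results = await asyncio.gather(
        *(solve_async(s, depth + 1, budget) for s in plan["subtasks"])
    )
    joined = "\n\n".join(f"## {s}\n{r}" for s, r in zip(plan["subtasks"], results))
    return await llm_async(f"Synthesize these subtask results into one answer for: {task}\n\n{joined}")

if __name__ == "__main__":
    print(asyncio.run(solve_async("root")))
```

Concurrency changes latency, not cost: the tree makes the same number of calls, so the budget matters just as much. With a real provider you would likely also want a semaphore to stay under rate limits.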

Pattern 2: Recursive refinement (generate → critique → revise)

Loop a draft through critique-and-revision until it passes a bar:

python
def refine(task: str, max_rounds: int = 3) -> str:
    draft = llm(f'Complete this task: {task}')
    for _ in range(max_rounds):
        verdict = json.loads(llm(
            f'Score this against the task (1-10) and list concrete defects. '
            f'JSON: {{"score": n, "defects": [...]}}\nTask: {task}\nDraft:\n{draft}'
        ))
        if verdict['score'] >= 8 or not verdict['defects']:
            break
        draft = llm(f'Fix exactly these defects, change nothing else:\n'
                    f'{verdict["defects"]}\n\nDraft:\n{draft}')
    return draft

Two things make this converge instead of oscillate: the critic must produce concrete defects (a bare score gives revisions nothing to act on), and revisions are scoped to fixing those defects only. Returns diminish fast — rounds 1–2 capture most of the gain, and a model critiquing its own output plateaus on its own blind spots (using a different model as critic measurably helps). This is the same generation/evaluation gap that Constitutional AI exploits at training time, applied at inference.
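Because returns diminish this fast, a fixed round count is a blunt stop rule. One possible convergence check, sketched under assumed thresholds (`target` and `min_gain` are illustrative values, not tuned ones), stops when the score clears the bar, stops improving, or the draft stops changing:

```python
from typing import List

def should_stop(scores: List[float], drafts: List[str],
                target: float = 8.0, min_gain: float = 0.5) -> bool:
    """Decide whether another critique/revise round is worth paying for.

    scores[i] is the critic's score for drafts[i]; both lists grow by one per round.
    """
    if not scores:
        return False
    if scores[-1] >= target:
        return True  # good enough
    if len(drafts) >= 2 and drafts[-1].strip() == drafts[-2].strip():
        return True  # the revision changed nothing
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return True  # plateau or regression
    return False

def best_draft(scores: List[float], drafts: List[str]) -> str:
    """Return the highest-scoring draft, since the last one is not always the best."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return drafts[best_index]
```

Keeping every draft and returning the best-scoring one guards against a final revision that made things worse.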

For code tasks, replace the LLM critic with ground truth: run the tests, feed failures back. Objective signal beats self-assessment every time it's available.
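A rough sketch of that test-driven loop, with everything except the loop shape assumed for illustration: the task is a toy `add` function, the test cases are hard-coded, and `fake_llm` is a stub that returns a buggy draft first and a fix when asked. It also uses `exec` for brevity, which is not safe for untrusted model output.

```python
from typing import Callable, List

# Hypothetical test cases for a toy task: implement add(a, b).
CASES = [((2, 3), 5), ((-1, 1), 0)]

def run_tests(source: str) -> List[str]:
    """Load candidate source and return failure messages (empty list = all pass)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: sandbox this in real use
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for args, expected in CASES:
        try:
            got = namespace["add"](*args)
        except Exception as exc:
            failures.append(f"add{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"add{args} returned {got!r}, expected {expected!r}")
    return failures

def refine_with_tests(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    draft = llm(f"Write Python code for this task: {task}")
    for _ in range(max_rounds):
        failures = run_tests(draft)
        if not failures:
            break  # objective signal says done; no critic call needed
        draft = llm("Fix exactly these test failures, change nothing else:\n"
                    + "\n".join(failures) + f"\n\nCode:\n{draft}")
    return draft

def fake_llm(prompt: str) -> str:
    """Stub model: buggy first draft, correct code once shown failures."""
    if prompt.startswith("Fix"):
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"

if __name__ == "__main__":
    print(refine_with_tests("add two numbers", fake_llm))
```

In a real setup the test runner would more likely be a subprocess call to your test suite inside a sandbox, with its failure output passed back verbatim.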

Pattern 3: Recursive agents (agents spawning agents)

An orchestrator delegates to sub-agents, each with its own context window. This is how agent systems scale past one context: the orchestrator holds the plan, workers hold the details, and only summaries flow up. Frameworks such as CrewAI, AutoGen, and LangGraph give you the supervisor/worker machinery. The recursion-specific rules:

One level is usually enough. Orchestrator → workers covers most real tasks; depth ≥ 3 mostly adds cost and coordination failure, and mature agent platforms cap delegation depth deliberately.

Sub-agents return summaries, not transcripts — the whole point is context isolation; piping a worker's full trace upward defeats it.

Pass a shared budget down, exactly like budget in Pattern 1, so a runaway worker can't spend the whole allocation.
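The last two rules can be sketched together without any framework. The names below (`Budget`, `WorkerResult`, `run_worker`, `orchestrate`) are hypothetical, `llm` is any prompt-to-text callable, and the per-worker `max_steps` cap is an added assumption on top of the shared budget:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Budget:
    """Shared, mutable call counter handed to every worker."""
    calls: int

    def spend(self) -> bool:
        """Consume one call if any remain; return False when exhausted."""
        if self.calls <= 0:
            return False
        self.calls -= 1
        return True

@dataclass
class WorkerResult:
    summary: str                                         # the only thing sent upward
    transcript: List[str] = field(default_factory=list)  # kept locally for tracing

def run_worker(subtask: str, llm: Callable[[str], str],
               budget: Budget, max_steps: int = 3) -> WorkerResult:
    transcript: List[str] = []
    for step in range(max_steps):  # per-worker cap, so one worker cannot loop forever
        if not budget.spend():
            transcript.append("[budget exhausted]")
            break
        transcript.append(llm(f"Step {step + 1} of subtask: {subtask}"))
    if budget.spend():
        summary = llm("Summarize the findings in two sentences:\n" + "\n".join(transcript))
    else:
        summary = f"[no budget left to summarize: {subtask[:60]}]"
    return WorkerResult(summary, transcript)

def orchestrate(task: str, subtasks: List[str],
                llm: Callable[[str], str], budget: Budget) -> str:
    # Only summaries cross the boundary; transcripts stay with the workers.
    summaries = [run_worker(s, llm, budget).summary for s in subtasks]
    if not budget.spend():
        return "[budget exhausted before synthesis]"
    return llm(f"Combine these worker summaries into one answer for: {task}\n\n"
               + "\n".join(summaries))
```

The shared counter bounds total spend across all workers; the step cap is what bounds any single worker. Using both means a stuck worker fails cheaply instead of starving its siblings.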

The control plane (non-negotiable)

Every production recursive system needs all four, in code rather than in prompts:

| Control | Implementation |
|---|---|
| Depth limit | depth parameter, hard cutoff to the leaf path |
| Budget | Shared counter of calls/tokens/dollars passed through every call |
| Convergence check | Stop when score plateaus or outputs stop changing, not just max-iterations |
| Tracing | Log the full call tree (parent → children, tokens per node); untraced recursion is undebuggable. LangSmith/Langfuse render these trees well (see LLM evaluation workflow) |
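Tracing does not require a vendor SDK to get started. A minimal in-memory call tree might look like the sketch below; `TraceNode` is a hypothetical helper, and the token counts are whatever your client reports (here they are made-up numbers).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceNode:
    """One node in the recursive call tree."""
    label: str
    tokens: int = 0
    children: List["TraceNode"] = field(default_factory=list)

    def child(self, label: str) -> "TraceNode":
        """Create a child node, attach it, and return it."""
        node = TraceNode(label)
        self.children.append(node)
        return node

    def total_tokens(self) -> int:
        """Tokens used by this node plus its whole subtree."""
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def render(self, indent: int = 0) -> str:
        """Indented text view: one line per node, children below their parent."""
        line = f"{'  ' * indent}{self.label} (tokens={self.tokens}, subtree={self.total_tokens()})"
        return "\n".join([line] + [c.render(indent + 1) for c in self.children])

if __name__ == "__main__":
    root = TraceNode("solve: research topic", tokens=120)
    a = root.child("solve: subtopic A")
    a.tokens = 300
    b = root.child("solve: subtopic B")
    b.tokens = 250
    print(root.render())
```

Threading a node through each recursive call (pass `parent.child(task)` alongside `depth` and `budget`) gives you the parent → children structure for free, and the subtree totals show which branch burned the budget.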

And one quality warning: recursion amplifies errors as readily as quality. A subtly wrong subtask answer gets synthesized upward as fact; a hallucinated critique steers revisions off course. Validate at the boundaries (schema-check every JSON the model returns, for example with Zod or Pydantic) and ground loops in objective signals (tests, retrieval, calculators) wherever one exists.
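As a dependency-free illustration of boundary validation, here is one way the plan JSON from Pattern 1 could be checked before anything recurses on it. `parse_plan` and `PlanError` are hypothetical names; a schema library would express the same rules more compactly.

```python
import json
from typing import List

class PlanError(ValueError):
    """Raised when the model's plan does not match the expected shape."""

def parse_plan(raw: str, max_subtasks: int = 4) -> dict:
    """Parse and validate a decomposition plan; never trust the raw reply."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise PlanError(f"not valid JSON: {exc}") from exc
    if not isinstance(plan, dict) or not isinstance(plan.get("atomic"), bool):
        raise PlanError('missing boolean "atomic" field')
    if plan["atomic"]:
        return {"atomic": True, "subtasks": []}
    subtasks = plan.get("subtasks")
    if not isinstance(subtasks, list) or not subtasks:
        raise PlanError('non-atomic plan needs a non-empty "subtasks" list')
    if len(subtasks) > max_subtasks:
        raise PlanError(f"too many subtasks: {len(subtasks)} > {max_subtasks}")
    if not all(isinstance(s, str) and s.strip() for s in subtasks):
        raise PlanError("every subtask must be a non-empty string")
    cleaned: List[str] = [s.strip() for s in subtasks]
    return {"atomic": False, "subtasks": cleaned}
```

On a `PlanError`, a reasonable fallback is to treat the task as atomic and solve it directly, which fails toward less recursion rather than more.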

FAQ

Is this AGI-style recursive self-improvement? No — these systems improve *outputs* within fixed model weights. Nothing here changes the model itself.

When is recursion the wrong tool? When the workflow's structure is known upfront — a fixed pipeline (extract → transform → summarize) is cheaper, faster, and more debuggable than letting a model rediscover that structure per request. Recursion pays only for unknown structure.

Total cost intuition? A depth-3, branching-4 decomposition is up to 4³ = 64 leaf calls, plus a planning call and a synthesis call at each of the 1 + 4 + 16 = 21 nodes above the leaves: about 106 model calls in the worst case. Always estimate the worst-case tree before shipping the loop; that's what the budget cap is for.
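To put numbers on the worst case, here is a short estimator for a full tree shaped like Pattern 1's solve. It assumes every node above the depth limit splits into exactly `branching` subtasks and makes one planning call plus one synthesis call; `worst_case_calls` is a hypothetical helper name.

```python
def worst_case_calls(depth: int, branching: int) -> dict:
    """Count model calls for a full decomposition tree of the given shape."""
    leaves = branching ** depth                             # nodes forced to solve directly
    internal = sum(branching ** d for d in range(depth))    # nodes that plan and synthesize
    return {
        "leaf": leaves,
        "plan": internal,
        "synthesis": internal,
        "total_model_calls": leaves + 2 * internal,
        "solve_invocations": leaves + internal,             # what budget['calls'] counts
    }

if __name__ == "__main__":
    print(worst_case_calls(depth=3, branching=4))
    # {'leaf': 64, 'plan': 21, 'synthesis': 21, 'total_model_calls': 106, 'solve_invocations': 85}
```

Multiply the total by your average tokens per call and your price per token for a dollar ceiling. With the default budget of 30 solve invocations from Pattern 1, this full tree would be cut off well before completion, which is the point of the cap.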

*Last updated: June 2026.*
