
Recursive AI Systems: Advanced Guide

AI systems that improve themselves iteratively

A recursive AI system feeds a model's output back into itself — to decompose a problem, refine an answer, or spawn sub-agents that spawn sub-agents. Recursion is the natural shape for problems whose structure you can't know upfront ("research this topic": you don't know the subtopics until you look). It is also the easiest way to build a system that burns $200 of tokens producing nothing. This guide covers the three patterns that work and the control machinery that keeps them safe.

Pattern 1: Recursive task decomposition

The model splits a task into subtasks; each subtask either gets solved directly or split again.

python
import json

MAX_DEPTH = 3

# llm(prompt) -> str is assumed to be your own model-call wrapper; it is not defined here.
def solve(task: str, depth: int = 0, budget: dict | None = None) -> str:
    budget = budget if budget is not None else {'calls': 30}
    if budget['calls'] <= 0:
        return f'[budget exhausted before solving: {task[:60]}]'
    budget['calls'] -= 1

    # Force the leaf path at max depth: depth control lives in YOUR code, not the prompt
    if depth >= MAX_DEPTH:
        return llm(f'Solve directly and concisely: {task}')

    plan = json.loads(llm(
        f'Task: {task}\n'
        'If this is directly answerable in one step, return {"atomic": true}.\n'
        'Otherwise return {"atomic": false, "subtasks": ["...", "..."]} with at most 4 subtasks, '
        'each strictly smaller and non-overlapping. JSON only.'
    ))
    if plan['atomic']:
        return llm(f'Solve directly and concisely: {task}')

    # Recurse on each subtask, then synthesize the results in subtask order
    results = [solve(s, depth + 1, budget) for s in plan['subtasks']]
    joined = '\n\n'.join(f'## {s}\n{r}' for s, r in zip(plan['subtasks'], results))
    return llm(f'Synthesize these subtask results into one answer for: {task}\n\n{joined}')

The failure mode this code defends against: models are bad at deciding when to *stop* decomposing. Left unguarded, they split "write a haiku" into three subtasks. Hence the two hard limits (depth and call budget) enforced in code, plus prompt-side pressure ("strictly smaller", "at most 4"). Note that the budget counts solve() invocations rather than model calls: a node that plans and then solves or synthesizes spends two model calls per unit of budget. Independent subtasks parallelize naturally: swap the list comprehension for asyncio.gather (see async patterns).

Pattern 2: Recursive refinement (generate → critique → revise)

Loop a draft through critique-and-revision until it passes a bar:

python
def refine(task: str, max_rounds: int = 3) -> str:
    draft = llm(f'Complete this task: {task}')
    for _ in range(max_rounds):
        verdict = json.loads(llm(
            f'Score this against the task (1-10) and list concrete defects. '
            f'JSON: {{"score": n, "defects": [...]}}\nTask: {task}\nDraft:\n{draft}'
        ))
        if verdict['score'] >= 8 or not verdict['defects']:
            break
        draft = llm(f'Fix exactly these defects, change nothing else:\n'
                    f'{verdict["defects"]}\n\nDraft:\n{draft}')
    return draft

Two things make this converge instead of oscillate: the critic must produce concrete defects (a bare score gives revisions nothing to act on), and revisions are scoped to fixing those defects only. Returns diminish fast — rounds 1–2 capture most of the gain, and a model critiquing its own output plateaus on its own blind spots (using a different model as critic measurably helps). This is the same generation/evaluation gap that Constitutional AI exploits at training time, applied at inference.

For code tasks, replace the LLM critic with ground truth: run the tests, feed failures back. Objective signal beats self-assessment every time it's available.
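One way this could look, as a sketch rather than a hardened harness. `run_tests` and `refine_code` are hypothetical names, the tests are assumed to be plain assert statements appended to the candidate code, and `llm` is passed in as any prompt-to-string callable. It executes model-written code in a subprocess with no sandboxing, which is only reasonable in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

def run_tests(code: str, tests: str, timeout: float = 30.0) -> str:
    """Run tests against code in a fresh interpreter.

    Returns '' when the script exits cleanly, otherwise the tail of the
    failure output. A hung script raises subprocess.TimeoutExpired.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests)
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout,
        )
    if proc.returncode == 0:
        return ""
    return (proc.stderr or proc.stdout or f"exit code {proc.returncode}")[-2000:]

def refine_code(task: str, tests: str, llm: Callable[[str], str],
                max_rounds: int = 3) -> str:
    code = llm(f"Write Python code for this task. Code only.\nTask: {task}")
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            break  # objective pass: no self-assessment involved
        code = llm(
            "These tests failed. Fix the code and change nothing else.\n"
            f"Failures:\n{failures}\n\nCode:\n{code}"
        )
    return code
```

As with refine, the final revision is returned without a last check, so a caller that needs a guarantee should run `run_tests` once more on the result.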

Pattern 3: Recursive agents (agents spawning agents)

An orchestrator delegates to sub-agents, each with its own context window. This is how agent systems scale past one context: the orchestrator holds the plan, workers hold the details, and only summaries flow up. Frameworks such as CrewAI, AutoGen, and LangGraph give you the supervisor/worker machinery. The recursion-specific rules:

  • One level is usually enough. Orchestrator → workers covers most real tasks; depth ≥ 3 mostly adds cost and coordination failure, and mature agent platforms cap delegation depth deliberately.
  • Sub-agents return summaries, not transcripts — the whole point is context isolation; piping a worker's full trace upward defeats it.
  • Pass a shared budget down, exactly like budget in Pattern 1, so a runaway worker can't spend the whole allocation.
The control plane (non-negotiable)

Every production recursive system needs all four, in code rather than in prompts:

| Control | Implementation |
|---|---|
| Depth limit | depth parameter, hard cutoff to the leaf path |
| Budget | Shared counter of calls/tokens/dollars passed through every call |
| Convergence check | Stop when score plateaus or outputs stop changing, not just max-iterations |
| Tracing | Log the full call tree (parent → children, tokens per node); untraced recursion is undebuggable. LangSmith/Langfuse render these trees well (see LLM evaluation workflow) |
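The budget, convergence, and tracing controls could be sketched in plain Python as follows (the depth limit already appears in Pattern 1). All names here (`Budget`, `TraceNode`, `has_converged`) and the default thresholds are illustrative assumptions, not part of any framework.

```python
from dataclasses import dataclass, field

class BudgetExceeded(RuntimeError):
    """Raised when a call would exceed the shared allocation."""

@dataclass
class Budget:
    calls: int
    tokens: int

    def charge(self, tokens: int) -> None:
        # Check before spending, so a refused call costs nothing.
        if self.calls <= 0 or self.tokens < tokens:
            raise BudgetExceeded(f"left: {self.calls} calls, {self.tokens} tokens")
        self.calls -= 1
        self.tokens -= tokens

@dataclass
class TraceNode:
    label: str
    tokens: int = 0
    children: list["TraceNode"] = field(default_factory=list)

    def child(self, label: str) -> "TraceNode":
        node = TraceNode(label)
        self.children.append(node)
        return node

    def total_tokens(self) -> int:
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def render(self, indent: int = 0) -> str:
        lines = [f"{'  ' * indent}{self.label} ({self.tokens} tok)"]
        lines.extend(c.render(indent + 1) for c in self.children)
        return "\n".join(lines)

def has_converged(scores: list[float], patience: int = 2,
                  min_gain: float = 0.5) -> bool:
    """True when the last `patience` scores beat the best earlier score
    by less than `min_gain`, i.e. the loop has plateaued."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) - best_before < min_gain
```

Passing one `Budget` instance and one root `TraceNode` down through every recursive call gives each node a hard spending limit and leaves a tree that can be printed with `render()` after the run.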

And one quality warning: recursion amplifies errors as readily as quality. A subtly wrong subtask answer gets synthesized upward as fact; a hallucinated critique steers revisions off course. Validate at the boundaries (schema-check every JSON the model returns, for example with Zod or Pydantic) and ground loops in objective signals (tests, retrieval, calculators) wherever one exists.
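As an illustration of boundary validation, here is a standard-library-only check for the plan JSON from Pattern 1. `parse_plan` is a hypothetical helper; a schema library such as Pydantic would express the same constraints more declaratively.

```python
import json

def parse_plan(raw: str, max_subtasks: int = 4) -> dict:
    """Parse and validate a decomposition plan; raise ValueError if malformed."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    if not isinstance(plan, dict) or not isinstance(plan.get("atomic"), bool):
        raise ValueError('plan must be an object with a boolean "atomic" field')
    if plan["atomic"]:
        return {"atomic": True, "subtasks": []}

    subtasks = plan.get("subtasks")
    if not isinstance(subtasks, list) or not subtasks:
        raise ValueError('non-atomic plan needs a non-empty "subtasks" list')
    if len(subtasks) > max_subtasks:
        raise ValueError(f"too many subtasks: {len(subtasks)} > {max_subtasks}")
    if not all(isinstance(s, str) and s.strip() for s in subtasks):
        raise ValueError("every subtask must be a non-empty string")
    return {"atomic": False, "subtasks": subtasks}
```

A caller might catch the ValueError and either retry the planning prompt once or fall back to the leaf path, so that a malformed plan never propagates into the recursion.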

FAQ

Is this AGI-style recursive self-improvement? No. These systems improve *outputs* within fixed model weights. Nothing here changes the model itself.

When is recursion the wrong tool? When the workflow's structure is known upfront: a fixed pipeline (extract → transform → summarize) is cheaper, faster, and more debuggable than letting a model rediscover that structure per request. Recursion pays only for unknown structure.

Total cost intuition? A depth-3, branching-4 decomposition is up to 4³ = 64 leaf calls plus synthesis at every level. Always estimate the worst-case tree before shipping the loop; that's what the budget cap is for.
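The worst case can be worked out mechanically. This sketch assumes the call structure of the Pattern 1 code: every node above the depth limit makes one planning call and one synthesis call, and every node at the depth limit makes a single solve call. `worst_case` is a hypothetical helper name.

```python
def worst_case(branching: int, max_depth: int) -> dict:
    """Count model calls for a fully expanded decomposition tree."""
    leaves = branching ** max_depth
    # Nodes at depths 0 .. max_depth - 1 each plan and then synthesize.
    internal = sum(branching ** d for d in range(max_depth))
    return {
        "leaf_calls": leaves,
        "plan_calls": internal,
        "synthesis_calls": internal,
        "total_llm_calls": leaves + 2 * internal,
        "solve_invocations": leaves + internal,
    }

print(worst_case(branching=4, max_depth=3))
# leaf_calls=64, plan_calls=21, synthesis_calls=21,
# total_llm_calls=106, solve_invocations=85
```

Under these assumptions the full tree needs 85 solve invocations, so the default budget of 30 in Pattern 1 would cut it off well before completion.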


    *Last updated: June 2026.*
