From Demo to Production: A Practical Guide to Agent Harness Engineering

Systematically covering Harness concepts, design principles, and real-world experience to help you build a production-grade Agent runtime environment

By AI Skill Navigation Editorial Team

Introduction: Why Can Your Agent Only Write Demos?

Over the past year, Vibe Coding has allowed countless developers to experience the joy of quickly building demos with intuition and AI. However, when these demos attempt to move into enterprise production environments, teams often encounter a "hard landing": the AI can't remember project specifications, suffers from "attention drift" in long conversations, and generates code that is rejected in PR review because of hidden defects.

These pain points reveal a core truth: A stunning demo does not equal production readiness. The bottleneck for AI programming is no longer the model's intelligence itself, but engineering capability. In 2025, we witnessed the boom of Vibe Coding, but in 2026, what enterprises truly need is Harness Engineering.

Agent = Model + Harness

Before diving into technical details, we need to clarify a core formula:

Agent = Model + Harness

The original meaning of "Harness" is horse tack—a horse is incredibly strong, but without the control and guidance of reins, it cannot pull a cart. The same applies to large language models; they are essentially an "intelligence engine" with understanding and generation capabilities, while the Harness is all the engineering infrastructure wrapped around the model: context management, tool orchestration, event interception, state persistence, security governance, etc.

Industry benchmarks have confirmed a key pattern:

The performance difference of the same model under different Harnesses is far greater than the gap between different models under the same Harness.

In the TerminalBench benchmark, optimizing only the Harness layer elevated the same model from below baseline to the Top 5.

The Vercel team found that proactively removing 80% of Agent tools streamlined the process, drastically reduced token consumption, and actually improved response speed.

Therefore, tuning the Harness is the real variable for unleashing AI's true engineering efficiency.

Core Architecture of Harness: The ETCLOVG Seven-Layer Framework

Both academia and industry have deepened their understanding of Harness. A survey titled "Agent Harness Engineering: A Survey" jointly published by CMU, Yale, Amazon, and others proposed a practical seven-layer classification framework—ETCLOVG—covering all dimensions required for production Agent operation:

| Layer | Short name | Core Responsibility |
| --- | --- | --- |
| Execution | Execution | Where does the Agent run? Local, container, browser, remote sandbox? What are the boundaries? |
| Tooling | Tooling | How are tools described, discovered, and invoked? How to prevent the model from randomly selecting tools? |
| Context & Memory | Context | How to manage short-term context, session state, and long-term memory? |
| Lifecycle Orchestration | Lifecycle | Single-turn or multi-turn? Single Agent or multi-role division of labor? |
| Observability | Observability | Every call, tool execution, error, retry, and cost must be traceable. |
| Verification & Evaluation | Verification | Is the result correct? Is the failure reason the model, tool, context, or environment? |
| Governance & Security | Governance | What permissions does the Agent have? Which operations require human approval? How to audit? |

Together, these seven layers form an Agent system capable of running long tasks. Many people still understand Agent as just "model + tool calling," but tool calling is only one layer. A real Agent product needs execution environment, context, orchestration, monitoring, verification, and governance—otherwise, it can easily become a "moving demo."

First Challenge: How to Make AI Understand a Giant Codebase?

Pain Point: AI Can't Remember Project Specifications, Can't Read Large Repositories

Every new session requires re-explaining the project background to AI, and due to context window limitations, when facing a million-line codebase, AI often "reads the front and forgets the back."

Solution: Five-Tier Memory System + Context Triage

#### 1. Build a Layered Memory Architecture

You can't cram all specifications into a single configuration file. Instead, build a five-tier memory system:

Enterprise Level: Global enterprise rules, non-bypassable security and compliance policies (e.g., prohibit sending code to external APIs, forbid hardcoded keys).

User Level: Personal coding preferences (e.g., communication language, shortcut command mappings).

Project Level: Team-shared project specifications (e.g., explicitly use the Fastify framework and the pnpm package manager), kept in a single project file such as CLAUDE.md in Claude Code. Anthropic officially recommends keeping this file within 200–300 lines, each line being a golden rule.

Rules Level: Break down domain-specific specifications (e.g., frontend component specs, database migration specs) into separate files, using YAML Frontmatter's paths field with Glob patterns for conditional loading. For example, only activate test specifications when AI operates on tests/** paths.

Local Level: Personal temporary notes, automatically included in .gitignore, not committed to the codebase.
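The five tiers and the path-conditional loading of the Rules level can be sketched as follows. This is a minimal illustration, not Claude Code's actual loader: `MemoryEntry`, `TIER_ORDER`, and `active_rules` are hypothetical names, and simple glob matching from Python's standard library stands in for the real frontmatter handling.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Tier order from broadest to most specific (hypothetical identifiers
# for the five tiers described above).
TIER_ORDER = ["enterprise", "user", "project", "rules", "local"]


@dataclass
class MemoryEntry:
    tier: str                                        # one of TIER_ORDER
    text: str                                        # the rule itself
    paths: list[str] = field(default_factory=list)   # glob patterns; empty = always load


def active_rules(entries: list[MemoryEntry], touched_file: str) -> list[str]:
    """Return the rule texts that apply to touched_file, ordered by tier."""
    selected = [
        e for e in entries
        if not e.paths or any(fnmatchcase(touched_file, p) for p in e.paths)
    ]
    # sorted() is stable, so entries within one tier keep their original order.
    selected = sorted(selected, key=lambda e: TIER_ORDER.index(e.tier))
    return [e.text for e in selected]
```

With a rule scoped to `tests/**`, the test specification is loaded only when the touched file lives under `tests/`; unscoped enterprise and project rules are always included.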

#### 2. Context Triage: Analogy to OS Scheduling

Think of the LLM as the CPU, the context window as memory, and the file system as disk. We can't stuff the entire disk into memory; we need a "context triage" mechanism similar to an OS virtual memory manager, categorizing candidate information into four levels (P0 ~ P3):

P0 (Always Online): Core project rules, security policies.

P1 (High-Frequency Access): API documentation related to the current task, key module structures.

P2 (On-Demand Loading): Implementation details of specific features, historical conversation summaries.

P3 (Archival Retrieval): Completed old task records, low-frequency references.

Through this triage scheduling, when troubleshooting an "order deduction failure" issue, the AI loads only 3 core logs (P0/P1) and 5 historical ticket handles (P3), compressing the context volume from 18K to 2K tokens. The signal-to-noise ratio improves greatly, and fault-localization accuracy actually increases.
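One way to picture the P0 ~ P3 triage is as greedy packing under a token budget. The sketch below is an assumption-laden illustration: `Candidate`, `triage`, the item names, and the token estimates are all hypothetical, and a real scheduler would also score relevance to the current task.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    priority: int   # 0 = P0 (always online) ... 3 = P3 (archival retrieval)
    tokens: int     # estimated token cost of including this item


def triage(candidates: list[Candidate], budget: int) -> list[str]:
    """Pack candidates into the budget, lowest priority number first.

    P0 items are always included, even if they alone exceed the budget;
    any other item is skipped when it no longer fits.
    """
    chosen: list[str] = []
    used = 0
    for c in sorted(candidates, key=lambda c: (c.priority, c.tokens)):
        if c.priority == 0 or used + c.tokens <= budget:
            chosen.append(c.name)
            used += c.tokens
    return chosen
```

Note that a small P3 item can still be loaded after a large P2 item has been skipped, which matches the troubleshooting example above, where archived tickets are pulled in while bulky material stays on "disk".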

Second Challenge: How to Control AI Hallucinations?

Pain Point: AI Produces Code That Looks Right but Is Actually Wrong

In long conversations, Claude Code automatically triggers context compression at 95% capacity. If it compresses a 487-token error stack trace for "connection pool exhausted" into a single sentence such as "a database error occurred", the AI loses the feedback loop and may spend hours retrying fixes that the stack trace had already ruled out.

Solution: Structured Context + Hooks Quality Gates

#### 1. Structured Input: Inject Rather Than Generate

The key to reducing hallucinations is to make AI perform "injection modifications" based on existing code, rather than "creating from scratch." When assigning tasks to AI, provide structured information:

Bad example: Optimize this function.

Good example: Optimize the parseConfig function in src/utils/parser.ts; the bottleneck is the loop on line 42.
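Teams that want this discipline enforced rather than remembered can capture the task as a small structured record and render the prompt from it. The sketch below assumes a hypothetical `TaskSpec` type; the field names are illustrative, not part of any tool's API.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    action: str        # what to do, e.g. "Optimize"
    symbol: str        # function to change
    file: str          # path of the file that contains it
    hint: str = ""     # known bottleneck or constraint, if any

    def to_prompt(self) -> str:
        """Render a prompt that always names the target symbol and file."""
        prompt = f"{self.action} the {self.symbol} function in {self.file}"
        return f"{prompt}; {self.hint}." if self.hint else f"{prompt}."
```

Because `symbol` and `file` are required fields, a request as vague as "Optimize this function" cannot be constructed at all.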

#### 2. Stop Hook as a Contract: Return Control to Deterministic Engineering

"Prompt is a request; Hook is a contract." We don't need to beg AI in the prompt "please don't make things up"; instead, use deterministic Hook gates to block unreliable outputs.

By configuring a Stop Hook in the extension layer (triggered after the AI completes its response and generates code, but before delivery), the system automatically and silently runs static code checks and unit tests. In Claude Code's settings format, a Stop hook takes no matcher, and a command hook blocks by exiting with code 2, in which case its stderr is fed back to the model:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "{ pnpm lint && pnpm test; } >&2 || exit 2"
          }
        ]
      }
    ]
  }
}
```

If lint or tests fail, the hook blocks delivery and feeds the error report back to the AI, which repairs its own output and retries until the checks pass.
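Stripped of any particular tool, the gate-and-retry behavior is a short loop. The sketch below is a minimal illustration under stated assumptions: `run_with_gate`, `generate`, and `check` are hypothetical names, with `check` standing in for whatever deterministic command (lint, tests) the hook runs.

```python
from typing import Callable, Optional


def run_with_gate(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Generate output, run the deterministic check, and retry on failure.

    generate(feedback) returns a candidate; feedback is None on the first round.
    check(candidate) returns None on success or an error report on failure.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate      # gate passed: deliver
    # Bounded retries keep a stubborn failure from looping forever.
    raise RuntimeError(f"gate still failing after {max_rounds} rounds: {feedback}")
```

The important design choice is that the pass/fail decision lives in `check`, which is ordinary deterministic code, while the model only ever sees the resulting error report.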

Third Challenge: How to Reuse Experience?

Pain Point: Good Prompts Are Locked in Individual Minds, Not Shareable Across Teams

Every developer repeatedly writes similar prompts for code review, test generation, etc., in their own terminal. Newcomers are slow to ramp up, and the whole team reinvents the wheel.

Solution: From Prompt to Declarative Skill

In Claude Code's design, useful prompts can be encapsulated as Skill assets in the .claude/skills/ directory and version-controlled via Git. When a new developer clones the codebase, they instantly inherit the entire team's accumulated AI programming capabilities.

A Skill is essentially a directory containing a SKILL.md file. To save tokens, Claude uses a progressive disclosure design:

Startup Phase: Only loads the name and description at the top of each Skill (about 100 tokens of metadata).

Matching Phase: When the user's input matches the Skill's semantics (e.g., mentioning "review code"), the system expands the full SKILL.md main file.

Execution Phase: Only dynamically calls mounted bundled scripts/external resources when action is actually needed.

By "only reading the corresponding chapter when you open the book," this design saves up to ~98% of token space when running multi-Skill systems.
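The first two phases of progressive disclosure can be sketched in a few lines. This is an illustration only, not Claude's actual matching logic: `Skill`, `startup_context`, and `expand_matching` are hypothetical names, and crude keyword overlap stands in for semantic matching.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # short metadata, always in context
    body: str          # full SKILL.md text, loaded only on a match


def startup_context(skills: list[Skill]) -> str:
    """Startup phase: expose only the name and description of every skill."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def expand_matching(skills: list[Skill], user_input: str) -> list[str]:
    """Matching phase: return full bodies only for skills the input touches.

    Word overlap with the description is a stand-in for semantic matching.
    """
    words = set(user_input.lower().split())
    return [s.body for s in skills if words & set(s.description.lower().split())]
```

The token saving comes from the asymmetry between the two functions: every skill pays for its one-line description at startup, but only matched skills pay for their full body.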

Fourth Challenge: High Compute Costs and Opaque Usage? Exploring Token Economics

Pain Point: Can't Tell How Much a Single Task Costs; Long Conversations Get More Expensive Over Time

Solution: Reverse Selection, Multi-Layer Routing, and Talker-Reasoner Architecture

#### 1. Establish a Model Selection Matrix

In enterprise deployments, running all tasks on expensive Opus often wastes significant money. Statistics on real business complexity distribution show that up to 41% of queries are simple SQL template fill-ins, which only require the cheapest Haiku model.

By configuring a three-layer routing mechanism in the Harness:

Haiku (60%) → Sonnet (30%) → Opus (10%)

While maintaining output quality, monthly bills can drop from 480,000 to 120,000 in this example (a 75% reduction); overall cost reductions fall in the 65%–75% range.
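The arithmetic behind such a routing split is easy to check. In the sketch below, the relative prices (Haiku at 5% and Sonnet at 40% of the Opus price) and the complexity thresholds are hypothetical values chosen so that the 60/30/10 split reproduces the 480,000 to 120,000 figure above; they are not real pricing, and `route` and `blended_bill` are illustrative names.

```python
# Hypothetical price of each tier as a percentage of the top tier's price.
RELATIVE_PRICE = {"haiku": 5, "sonnet": 40, "opus": 100}


def route(complexity: int) -> str:
    """Map a 1-10 complexity score (from some upstream classifier) to a tier."""
    if complexity <= 3:
        return "haiku"
    if complexity <= 7:
        return "sonnet"
    return "opus"


def blended_bill(all_opus_bill: int, share_percent: dict[str, int]) -> int:
    """Estimate the bill after routing, given each tier's share of traffic."""
    if sum(share_percent.values()) != 100:
        raise ValueError("traffic shares must sum to 100")
    # Integer percentages avoid floating-point rounding in the estimate.
    weighted = sum(share * RELATIVE_PRICE[tier] for tier, share in share_percent.items())
    return all_opus_bill * weighted // 10_000
```

Under these assumed ratios the weighted price is 60×5 + 30×40 + 10×100 = 2,500 out of 10,000, i.e. a quarter of the all-Opus bill. The result is sensitive to the price ratios and to how accurately the classifier routes, so the same split can land anywhere in the quoted range.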

#### 2. Reverse Selection: Choose "Patterns" Under Constrained Models

When budget and deployment environment are hard constraints, and you can only deploy local open-source cheap models (e.g., Qwen-32B), how do you improve accuracy? At this point, pattern selection becomes the core of design:

Single call to Opus: Expensive, and may still make mistakes on edge cases.

Cheap Haiku model + iterative self-healing: Let one Haiku instance write the code and another review it, iterating for 2 rounds. The total compute cost is still far lower than a single call to a top-tier model, and the final output quality can surpass it.

#### 3. Talker-Reasoner Dual System

For high-frequency interaction scenarios such as real-time chat or voice, long thinking delays (e.g., reasoning models taking 24 seconds) make users think the system is stuck. Drawing on Kahneman's dual-system theory, the architecture can be restructured as:

Talker: Uses a super-fast cheap model (e.g., Haiku) with 200ms response time, responsible for replying to users immediately and chatting while waiting.

Reasoner: Uses a slow but smart model (e.g., Opus/reasoning) for deep reasoning in the background, continuously supplying the inferred belief state to the Talker.

This successfully "hides" the thinking delay from the user's perception.
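The split can be sketched with two coroutines sharing a belief state. This is a toy illustration under stated assumptions: `talker` and `reasoner` are hypothetical functions, `asyncio.sleep` stands in for model calls, and the delays are shortened so the example runs quickly.

```python
import asyncio


async def reasoner(question: str, belief: dict) -> None:
    """Slow path: deep reasoning that writes its conclusion into shared state."""
    await asyncio.sleep(0.05)                  # stands in for a long model call
    belief["answer"] = f"detailed answer to: {question}"


async def talker(question: str) -> list[str]:
    """Fast path: acknowledge at once, then relay the reasoner's result."""
    belief: dict = {}
    replies = ["Got it, looking into that now."]      # immediate reply
    task = asyncio.create_task(reasoner(question, belief))
    while not task.done():
        await asyncio.sleep(0.01)              # a real talker would keep chatting here
    await task                                 # surface any reasoner exception
    replies.append(belief["answer"])
    return replies
```

Running `asyncio.run(talker("why did the order fail?"))` returns the acknowledgement first and the reasoned answer second; the user-facing latency is that of the first reply, not of the reasoning.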

Fifth Challenge: Constraint vs. Freedom

Pain Point: AI Fixes a Bug but Also Changes Three Security-Related Logic Pieces It Shouldn't Have

Solution: Constrain Actions, Not Thinking; Introduce HITL Human Review

When governing AI's action boundaries, many technical leaders fall into a trap: trying to refine every thinking step of AI in the prompt. This actually restricts the model's reasoning freedom.

"Constraints define the boundaries of action, not the freedom of thought. Constraints are not a guarantee of capability, but a container for capability."

Reasonable engineering constraints should be placed where actions occur and side effects happen:

Read-only / low blast radius operations (e.g., looking up code, reading docs): Auto-approve, no interruption.

Writable / medium impact operations: Approve with trace; record keyed, end-to-end logs that support complete replay for post-hoc tracing.

High blast radius / irreversible operations: Force a blocking pause and open a HITL (human-in-the-loop) review panel in the console; the AI proceeds only after a human clicks confirm.
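The three tiers translate directly into a small authorization function placed in front of every tool call. The sketch below is a minimal illustration: `Risk`, `authorize`, and the `ask_human` callback are hypothetical names, and classifying an action's risk level is assumed to happen elsewhere.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    READ_ONLY = 1      # low blast radius
    WRITE = 2          # medium impact
    IRREVERSIBLE = 3   # high blast radius


def authorize(
    action: str,
    risk: Risk,
    audit_log: list[str],
    ask_human: Callable[[str], bool],
) -> bool:
    """Decide whether an action may run, based on where its side effects land."""
    if risk is Risk.READ_ONLY:
        return True                                   # auto-approve, no interruption
    if risk is Risk.WRITE:
        audit_log.append(f"approved-with-trace: {action}")
        return True
    approved = ask_human(action)                      # blocks until a human decides
    audit_log.append(f"human-{'approved' if approved else 'rejected'}: {action}")
    return approved
```

Nothing in this function inspects or restricts the model's reasoning; the constraint applies only at the point where an action would take effect.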

Sixth Challenge: How to Choose Among Complex Orchestration Carriers?

Pain Point: Confusion Between SubAgent, Skill, Workflow, Agent Team; Don't Know How to Organize

Solution: A Four-Quadrant Diagram to Clarify Boundaries

In Harness design, these four orchestration carriers are not competing; they map to four types of work entities in the real world:

Skill = Job Operating Manual: Static, cross-task reusable knowledge packages and SOP templates, representing the Agent's professional capability.

SubAgent = Dedicated Employee: Has independent, isolated context space; executes a specific task (e.g., run a test, search a keyword) and is immediately destroyed afterward, preventing contamination.

Workflow = SOP Flowchart: Freezes control flow explicitly and deterministically in code or scripts, suitable for multi-step, long-term automated processes with clear goals (e.g., nightly build auto-fix).

Agent Team = Continuously Collaborating Virtual Team: Maintains long-term, multi-person conversational interactions, with each Mate role having a persistent session.

In mature enterprise projects, these four are typically complementary and nested, combined into a business pipeline.

Seventh Challenge: How to Prevent Long-Task State Drift?

Pain Point: Complex long tasks gradually deviate from the goal

Solution: Three-Plane Separation Architecture + Scratchpad Board

To address goal drift in long tasks, introduce a three-plane separation architecture:

Execution Plane: The space where the Agent actually executes tasks, including code modifications, tool calls, etc.

Monitoring Plane: Continuously tracks execution progress and goal deviation, records key decision points.

Intervention Plane: When significant deviation is detected, triggers rollback or human intervention.

Additionally, introduce a "scratchpad board" mechanism—let the Agent record intermediate reasoning processes in an independent scratchpad instead of directly modifying the main context. This way, even if a step fails, recovery is possible from the board, avoiding contamination of the entire context.
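A compact way to see how the planes and the scratchpad interact is a loop in which every step result lands in the scratchpad first and is promoted to the main context only if the monitor accepts it. The sketch below is an assumption-heavy illustration: `run_long_task` is a hypothetical name, and `drift` is a placeholder for whatever goal-deviation score a real monitoring plane would compute.

```python
from typing import Callable


def run_long_task(
    steps: list[Callable[[list[str]], str]],
    drift: Callable[[str], float],
    threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Run steps, monitor each result, and keep drifting results out of main context.

    Returns (main_context, scratchpad).
    """
    main_context: list[str] = []
    scratchpad: list[str] = []
    for step in steps:
        result = step(main_context)                 # execution plane
        scratchpad.append(result)                   # intermediate work lands here first
        if drift(result) > threshold:               # monitoring plane
            scratchpad.append(f"rolled back: {result}")   # intervention plane
            continue                                # main context stays clean
        main_context.append(result)
    return main_context, scratchpad
```

In a production system the intervention branch would more likely restore a checkpoint or escalate to a human than simply skip the step, but the separation of concerns is the same.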

Dynamic Workflows: Let the Harness Self-Evolve

Traditional static workflows must be written in advance; they are highly general but cannot be tailored to a specific task. Anthropic has launched dynamic workflows in Claude Code: Claude can build a dedicated runtime framework on the fly for the task at hand, and the result can be saved, reused, and shared.

Core Patterns of Dynamic Workflows

A dynamic workflow executes a JavaScript file containing special functions that help generate and coordinate sub-agents. Common patterns include:

Classify and Act: Use a classifier agent to decide the task type, then route to different agents or actions.

Distribute and Synthesize: Break a task into multiple small steps, run an agent on each step, then synthesize the results.

Adversarial Verification: For each generator agent, run an independent verifier agent that adversarially checks its output.

Generate and Filter: Generate multiple ideas, filter through scoring criteria, only return the highest quality.

Tournament: Have multiple agents try the same task with different methods, then pairwise evaluate until a winner emerges.

Loop Until Done: Keep spawning agents in a loop until a stopping condition is met.
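Two of these patterns, Distribute and Synthesize and Tournament, are sketched below in Python purely to show their control flow. This does not reproduce the JavaScript workflow files or any Claude Code function: `Agent` is a hypothetical type alias for "a callable that maps a prompt to a result", and the judge is passed in as an ordinary function.

```python
from typing import Callable

Agent = Callable[[str], str]   # hypothetical: an agent maps a prompt to a result


def distribute_and_synthesize(
    task_parts: list[str],
    worker: Agent,
    synthesize: Callable[[list[str]], str],
) -> str:
    """Run one worker per sub-task, then merge the partial results."""
    return synthesize([worker(part) for part in task_parts])


def tournament(
    task: str,
    contestants: list[Agent],
    better: Callable[[str, str], str],
) -> str:
    """Each contestant attempts the task; pairwise judging picks a winner."""
    results = [agent(task) for agent in contestants]
    winner = results[0]
    for challenger in results[1:]:
        winner = better(winner, challenger)   # judge returns the preferred result
    return winner
```

Both functions call their agents sequentially for clarity; in practice the per-part and per-contestant calls are independent and would usually run in parallel, which is also where the extra token cost comes from.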

When to Use Dynamic Workflows

Dynamic workflows are particularly suitable for:

Tasks requiring deep research (e.g., /deep-research)

Large-scale code refactoring and migration

Multi-dimensional hypothesis troubleshooting for root cause

Sorting massive resumes/tickets

Multi-round evaluation and selection for product naming, solution design, etc.

But note that dynamic workflows typically consume more tokens, so enable them on demand. For simple programming tasks, the default Claude Code framework is sufficient.

From Agent Framework to Agent Platform

Looking at the tool ecosystem, a clear trend is that Agents are moving from Framework to Platform.

Framework solves local abstractions: agent, tool, memory, loop.

Platform must solve the complete production system: durable workspace, managed sandbox, identity, billing, observability, evaluation, governance, human handoff.

Early on, the competition was: who can build an Agent loop the fastest. Now it's: who can make that loop run reliably over the long term.

So the competition for Agent platforms won't just happen at the model layer or the development framework layer, but across the entire Harness capability. Whose execution environment is more stable, whose tool protocol is clearer, whose context is less prone to drift, whose traces are more usable, whose verification is closer to real tasks, whose permissions and auditing are more controllable—that's who is more likely to bring Agents into real production workflows.

Conclusion: Less Is More

Finally, an important reminder: Agent engineering is not about making things as complex as possible.

As models become stronger, the Harness must also be re-evaluated. Every wrapper, reset, verifier, planner, memory rule, and permission gate essentially represents an assumption: the model can't do it well on its own, so I add a layer of control outside. But if the model's capabilities change, these controls may no longer be necessary and may even become a hindrance.

An example from Anthropic: In long application development tasks, certain context resets were useful for older models but could be removed for stronger models. Removing them reduced costs without degrading quality.

A good Harness doesn't just know how to add controls; it also knows when to remove them.

FAQ

What is the difference between Agent Harness and Agent Framework? Harness is the execution layer inside the Agent, responsible for calling the model, handling tool calls, and deciding when to stop; Framework is a higher-level abstraction that provides definitions and assembly methods for components like agent, tool, memory. Harness focuses more on runtime, Framework focuses more on development time.

Which is better: dynamic workflow or static workflow? There is no absolute winner. Static workflows are pre-written, highly general, and suitable for deterministic multi-step processes; dynamic workflows are customized on the fly by AI, more flexible, and suitable for exploratory, unstructured tasks. Dynamic workflows typically consume more tokens, so enable them on demand.

How to evaluate the quality of an Agent Harness? Don't just look at the final success rate. Evaluate from multiple dimensions: result correctness, execution path reasonableness, token cost, latency, traceability of failure reasons, security and compliance. It is recommended to use trace-native evaluation, recording the complete execution trajectory and analyzing layer by layer.
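As a rough illustration of trace-native evaluation, the sketch below aggregates several of these dimensions from recorded traces instead of reporting a single success rate. The `Trace` fields and the `summarize` function are hypothetical; a real trace would carry the full execution trajectory, not just these summary fields.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    success: bool
    tokens: int
    latency_s: float
    failure_layer: Optional[str] = None   # e.g. "model", "tool", "context", "environment"


def summarize(traces: list[Trace]) -> dict:
    """Aggregate a non-empty list of traces into multi-dimensional metrics."""
    n = len(traces)
    return {
        "success_rate": sum(t.success for t in traces) / n,
        "avg_tokens": sum(t.tokens for t in traces) / n,
        "max_latency_s": max(t.latency_s for t in traces),
        # Attribute each failure to the layer that caused it.
        "failures_by_layer": dict(
            Counter(t.failure_layer for t in traces if not t.success)
        ),
    }
```

The `failures_by_layer` breakdown is what makes the report actionable: a 50% success rate caused entirely by tool errors calls for a different fix than one caused by context loss.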

What is the most overlooked aspect in Harness engineering? Observability and governance. Many teams first get the Agent running, then add logging and permissions. But in production, without observability, you don't know why the Agent failed; without governance, you may not dare to use it even if it succeeds. These two aspects should be included in the design from the start.

How can small teams quickly get started with Harness engineering? It is recommended to start with mature tools like Claude Code, first mastering basic features like CLAUDE.md configuration, Hooks, and Skills. Then gradually introduce SubAgent for context isolation, and finally customize dynamic workflows based on business needs. Don't aim for a complete seven-layer architecture from the beginning.
