From Demo to Production: A Practical Guide to Agent Harness Engineering
Systematically covering Harness concepts, design principles, and real-world experience to help you build a production-grade Agent runtime environment
Introduction: Why Can Your Agent Only Write Demos?
Over the past year, Vibe Coding has allowed countless developers to experience the joy of quickly building demos with intuition and AI. However, when these demos attempt to move into enterprise production environments, teams often encounter a "hard landing"—AI can't remember project specifications, suffers from "attention drift" in long conversations, and generated code gets rejected in PRs due to hidden defects.
These pain points reveal a core truth: A stunning demo does not equal production readiness. The bottleneck for AI programming is no longer the model's intelligence itself, but engineering capability. In 2025, we witnessed the boom of Vibe Coding, but in 2026, what enterprises truly need is Harness Engineering.
Agent = Model + Harness
Before diving into technical details, we need to clarify a core formula:
Agent = Model + Harness
The original meaning of "Harness" is horse tack—a horse is incredibly strong, but without the control and guidance of reins, it cannot pull a cart. The same applies to large language models; they are essentially an "intelligence engine" with understanding and generation capabilities, while the Harness is all the engineering infrastructure wrapped around the model: context management, tool orchestration, event interception, state persistence, security governance, etc.
Industry benchmarks have confirmed a key pattern:
Therefore, tuning the Harness is the real variable for unleashing AI's true engineering efficiency.
Core Architecture of Harness: The ETCLOVG Seven-Layer Framework
Both academia and industry have deepened their understanding of Harness. A survey titled "Agent Harness Engineering: A Survey" jointly published by CMU, Yale, Amazon, and others proposed a practical seven-layer classification framework—ETCLOVG—covering all dimensions required for production Agent operation:
Together, these seven layers form an Agent system capable of running long tasks. Many people still understand Agent as just "model + tool calling," but tool calling is only one layer. A real Agent product needs execution environment, context, orchestration, monitoring, verification, and governance—otherwise, it can easily become a "moving demo."
First Challenge: How to Make AI Understand a Giant Codebase?
Pain Point: AI Can't Remember Project Specifications, Can't Read Large Repositories
Every new session requires re-explaining the project background to the AI, and because the context window is limited, on a million-line codebase the AI has often lost track of the earlier files by the time it reads the later ones.
Solution: Five-Tier Memory System + Context Triage
#### 1. Build a Layered Memory Architecture
You can't cram all specifications into a single configuration file. Instead, build a five-tier memory system:
paths field with Glob patterns for conditional loading. For example, only activate test specifications when AI operates on tests/** paths.
.gitignore, not committed to the codebase.
#### 2. Context Triage: Analogy to OS Scheduling
LLM is the CPU, Context is memory, and the file system is disk. We can't stuff the entire disk into memory; we need a "context triage" mechanism similar to an OS virtual memory manager, categorizing candidate information into four levels (P0 ~ P3):
Through this triage scheduling, when troubleshooting an "order deduction failure" issue, AI only loads 3 core logs (P0/P1) and 5 historical ticket handles (P3), compressing the context volume from 18K to 2K tokens. The signal-to-noise ratio greatly improves, and localization accuracy actually increases.
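The triage idea can be sketched as a priority-ordered packing problem. This is a minimal illustration, not the scheduler described above: the `Candidate` type, the token estimates, and the greedy rule (lower P-level first, cheaper items first within a level) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    priority: int  # 0 = P0 (most important) ... 3 = P3
    tokens: int    # estimated token cost of including this item


def triage(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily admit candidates in priority order while they fit the budget."""
    selected: list[Candidate] = []
    used = 0
    # Lower priority number first; within a level, cheaper items first.
    for item in sorted(candidates, key=lambda c: (c.priority, c.tokens)):
        if used + item.tokens <= budget:
            selected.append(item)
            used += item.tokens
    return selected


# Hypothetical candidates for the "order deduction failure" investigation.
candidates = [
    Candidate("error_log_1", 0, 400),
    Candidate("error_log_2", 0, 350),
    Candidate("service_log", 1, 500),
    Candidate("full_request_dump", 2, 9000),
    Candidate("ticket_handle_a", 3, 40),
    Candidate("ticket_handle_b", 3, 40),
]
chosen = triage(candidates, budget=2000)
total_tokens = sum(c.tokens for c in chosen)
```

With a 2,000-token budget the oversized dump is skipped while the small ticket handles still fit, which is the behavior the triage levels are meant to produce.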
Second Challenge: How to Control AI Hallucinations?
Pain Point: AI Produces Code That Looks Right but Is Actually Wrong
In long conversations, Claude Code automatically triggers context compression at 95% capacity. If compression reduces a 487-token "connection pool exhausted" stack trace to a single sentence such as "a database error occurred," the AI loses its feedback loop and may spend hours retrying fixes that the stack trace had already ruled out.
Solution: Structured Context + Hooks Quality Gates
#### 1. Structured Input: Inject Rather Than Generate
The key to reducing hallucinations is to make AI perform "injection modifications" based on existing code, rather than "creating from scratch." When assigning tasks to AI, provide structured information:
parseConfig function in src/utils/parser.ts; the bottleneck is the loop on line 42.
#### 2. Stop Hook as a Contract: Return Control to Deterministic Engineering
"Prompt is a request; Hook is a contract." We don't need to beg AI in the prompt "please don't make things up"; instead, use deterministic Hook gates to block unreliable outputs.
By configuring a Stop Hook in the extension layer (triggered after AI completes its response and generates code, but before delivery), the system automatically and silently runs unit tests and static code checks:
```json
{
  "hooks": {
    "Stop": [
      {
        "matcher": "All",
        "command": "pnpm lint && pnpm test",
        "blocking": true
      }
    ]
  }
}
```
If a check fails, the hook blocks delivery and feeds the error output back to the AI, which revises the code and retries until the checks pass.
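The block-and-retry loop can be sketched outside any particular tool. This is a generic illustration under stated assumptions, not Claude Code's hook implementation: `run_gate`, `deliver_with_gate`, and the `generate` callback are hypothetical names, and the demo commands are tiny Python one-liners standing in for `pnpm lint` and `pnpm test`.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    feedback: str  # text handed back to the model when the gate fails


def run_gate(commands: list[list[str]]) -> GateResult:
    """Run each check in order; stop at the first failure and capture its output."""
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            output = (proc.stdout + proc.stderr).strip()
            return GateResult(False, f"{cmd[0]} failed (exit {proc.returncode}):\n{output}")
    return GateResult(True, "")


def deliver_with_gate(generate, commands, max_attempts: int = 3) -> bool:
    """Call generate(feedback) until the gate passes or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        generate(feedback)  # hypothetical: asks the model to (re)write the code
        result = run_gate(commands)
        if result.passed:
            return True
        feedback = result.feedback  # the failure becomes the next prompt's input
    return False


# Stand-ins for real checks such as ["pnpm", "lint"] and ["pnpm", "test"].
ok = run_gate([[sys.executable, "-c", "print('lint ok')"]])
bad = run_gate([[sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"]])
```

The point of the design is that the pass/fail decision comes from the exit code of a deterministic command, not from anything the model says about its own output.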
Third Challenge: How to Reuse Experience?
Pain Point: Good Prompts Are Locked in Individual Minds, Not Shareable Across Teams
Every developer repeatedly writes similar prompts for code review, test generation, etc., in their own terminal. Newcomers are slow to ramp up, and the whole team reinvents the wheel.
Solution: From Prompt to Declarative Skill
In Claude Code's design, useful prompts can be encapsulated as Skill assets in the .claude/skills/ directory and version-controlled via Git. When a new developer clones the codebase, they instantly inherit the entire team's accumulated AI programming capabilities.
A Skill is essentially a directory containing a SKILL.md file. To save tokens, Claude uses a progressive disclosure design:
name and description at the top of each Skill (about 100 tokens of metadata).
SKILL.md main file.
By "only reading the corresponding chapter when you open the book," this design saves up to ~98% of token space when running multi-Skill systems.
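Progressive disclosure can be illustrated with a small loader that indexes only the metadata and reads a Skill's body on demand. This is a sketch, not Claude's loader: it assumes each SKILL.md starts with a `key: value` header delimited by `---` lines, and `read_metadata`, `build_index`, and `load_body` are hypothetical helper names.

```python
import tempfile
from pathlib import Path


def read_metadata(skill_md: Path) -> dict[str, str]:
    """Read only the leading key/value header delimited by '---' lines."""
    meta: dict[str, str] = {}
    with skill_md.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta  # no metadata header present
        for line in fh:
            if line.strip() == "---":
                break  # stop before the body, which is not read here
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def build_index(skills_dir: Path) -> dict[str, dict[str, str]]:
    """Map each skill directory name to its metadata; bodies stay on disk."""
    return {p.parent.name: read_metadata(p) for p in sorted(skills_dir.glob("*/SKILL.md"))}


def load_body(skills_dir: Path, skill: str) -> str:
    """Load the full SKILL.md only once the skill has been selected."""
    return (skills_dir / skill / "SKILL.md").read_text(encoding="utf-8")


# Demo with a throwaway directory and an invented skill.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "code-review").mkdir()
    (root / "code-review" / "SKILL.md").write_text(
        "---\nname: code-review\ndescription: Review a diff for defects\n---\n# Steps\nLong instructions...\n",
        encoding="utf-8",
    )
    index = build_index(root)
    body = load_body(root, "code-review")
```

Only `index` needs to sit in the context at all times; `body` is fetched for the one Skill that matches the task.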
Fourth Challenge: High Compute Costs and Opaque Usage? Exploring Token Economics
Pain Point: Can't Tell How Much a Single Task Costs; Long Conversations Get More Expensive Over Time
Solution: Reverse Selection, Multi-Layer Routing, and Talker-Reasoner Architecture
#### 1. Establish a Model Selection Matrix
In enterprise deployments, running all tasks on expensive Opus often wastes significant money. Statistics on real business complexity distribution show that up to 41% of queries are simple SQL template fill-ins, which only require the cheapest Haiku model.
By configuring a three-layer routing mechanism in the Harness:
While maintaining output quality, monthly bills can drop from 480,000 to 120,000, a 75% reduction at the top of the 65%–75% savings range.
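A rough sketch of tiered routing and its effect on cost follows. Everything specific here is an assumption for illustration: the tier names, the relative per-query costs, the keyword rules in `route`, and the sample workload (only the 41% share of template fill-ins comes from the text above).

```python
from collections import Counter

# Hypothetical relative cost per query for each tier (not real pricing).
COST_PER_QUERY = {"small": 1.0, "medium": 5.0, "large": 25.0}


def route(query: str) -> str:
    """Pick a tier from crude, illustrative signals in the query text."""
    text = query.lower()
    if text.startswith("fill template:"):
        return "small"   # template fill-ins need no deep reasoning
    if "design" in text or "refactor" in text:
        return "large"   # open-ended work goes to the strongest model
    return "medium"


def monthly_cost(queries: list[str], routed: bool) -> float:
    """Total cost with routing, or with every query sent to the large tier."""
    tiers = Counter(route(q) if routed else "large" for q in queries)
    return sum(COST_PER_QUERY[tier] * count for tier, count in tiers.items())


# Invented workload: 41% template fill-ins, 39% mid-complexity, 20% hard.
queries = (
    ["fill template: orders by day"] * 41
    + ["summarize this stack trace"] * 39
    + ["refactor the billing module"] * 20
)
baseline = monthly_cost(queries, routed=False)
routed_cost = monthly_cost(queries, routed=True)
saving = 1 - routed_cost / baseline

# The article's own figures: 480,000 down to 120,000.
article_saving = 1 - 120_000 / 480_000
```

Under these made-up costs the routed workload is about 71% cheaper, and the article's bill figures work out to exactly 75%. A production router would classify with something sturdier than keywords, but the cost arithmetic stays the same.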
#### 2. Reverse Selection: Choose "Patterns" Under Constrained Models
When budget and deployment environment are hard constraints, and you can only deploy local open-source cheap models (e.g., Qwen-32B), how do you improve accuracy? At this point, pattern selection becomes the core of design:
#### 3. Talker-Reasoner Dual System
For high-frequency interactive scenarios such as real-time chat or voice, long thinking delays (for example, a reasoning model taking 24 seconds) make users think the system is stuck. Drawing on Kahneman's dual-system theory, the architecture can be restructured as:
This successfully "hides" the thinking delay from the user's perception.
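The latency-hiding effect can be sketched with two coroutines: a fast one that answers at once and a slow one that runs in the background. This is a toy illustration under stated assumptions: `talker`, `reasoner`, and `handle` are hypothetical stand-ins for model calls, and the delay is a short sleep rather than real inference.

```python
import asyncio


async def reasoner(question: str, delay: float) -> str:
    """Stand-in for a slow reasoning model."""
    await asyncio.sleep(delay)
    return f"Detailed answer to: {question}"


async def talker(question: str) -> str:
    """Stand-in for a fast model that responds immediately."""
    return f"Looking into '{question}' now."


async def handle(question: str, reasoner_delay: float = 0.05) -> list[str]:
    """Return the messages in the order the user would see them."""
    messages: list[str] = []
    # Start the slow path first so it runs while the fast reply is produced.
    slow = asyncio.create_task(reasoner(question, reasoner_delay))
    messages.append(await talker(question))  # user sees this right away
    messages.append(await slow)              # full answer arrives later
    return messages


messages = asyncio.run(handle("why did the deduction fail?"))
```

The user gets an acknowledgement without waiting for the reasoning step, which continues concurrently and delivers its result as a follow-up message.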
Fifth Challenge: Constraint vs. Freedom
Pain Point: AI Fixes a Bug but Also Changes Three Security-Related Logic Pieces It Shouldn't Have
Solution: Constrain Actions, Not Thinking; Introduce HITL Human Review
When governing AI's action boundaries, many technical leaders fall into a trap: trying to refine every thinking step of AI in the prompt. This actually restricts the model's reasoning freedom.
"Constraints define the boundaries of action, not the freedom of thought. Constraints are not a guarantee of capability, but a container for capability."
Reasonable engineering constraints should be placed where actions occur and side effects happen:
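One way to picture a constraint placed at the action boundary is a write guard that leaves the model's reasoning alone but routes edits to sensitive paths through a human approval callback. This is a minimal sketch with assumed names: the `PROTECTED` globs, `needs_review`, and `apply_edits` are hypothetical and not part of any specific tool.

```python
from fnmatch import fnmatchcase
from typing import Callable

# Hypothetical policy: globs for paths that need human sign-off before writes.
PROTECTED = ["src/auth/*", "src/payments/*", "*.pem"]


def needs_review(path: str) -> bool:
    """True when the path matches any protected glob."""
    return any(fnmatchcase(path, pattern) for pattern in PROTECTED)


def apply_edits(
    edits: dict[str, str], approve: Callable[[str], bool]
) -> tuple[list[str], list[str]]:
    """Apply edits to unprotected paths; protected ones go through approve()."""
    applied: list[str] = []
    blocked: list[str] = []
    for path in edits:
        if needs_review(path) and not approve(path):
            blocked.append(path)
        else:
            applied.append(path)  # a real harness would write the file here
    return applied, blocked


# The AI fixed a bug in the cart but also touched auth logic.
edits = {"src/cart/total.py": "...", "src/auth/session.py": "..."}
applied, blocked = apply_edits(edits, approve=lambda path: False)
```

The bug fix goes through untouched, while the security-related change is held until a reviewer approves it.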
Sixth Challenge: How to Choose Among Complex Orchestration Carriers?
Pain Point: Confusion Between SubAgent, Skill, Workflow, Agent Team; Don't Know How to Organize
Solution: A Four-Quadrant Diagram to Clarify Boundaries
In Harness design, these four orchestration carriers are not competing; they map to four types of work entities in the real world:
In mature enterprise projects, these four are typically complementary and nested, combined into a business pipeline.
Seventh Challenge: How to Prevent Long-Task State Drift?
Pain Point: Complex long tasks gradually deviate from the goal
Solution: Three-Plane Separation Architecture + Scratchpad Board
To address goal drift in long tasks, introduce a three-plane separation architecture:
Additionally, introduce a "scratchpad board" mechanism—let the Agent record intermediate reasoning processes in an independent scratchpad instead of directly modifying the main context. This way, even if a step fails, recovery is possible from the board, avoiding contamination of the entire context.
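The scratchpad mechanism can be sketched as a small structure with checkpoints and rollback. This is an illustrative assumption about how such a board might look, not a description of an existing implementation: the `Scratchpad` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    goal: str
    notes: list[str] = field(default_factory=list)
    checkpoints: list[int] = field(default_factory=list)

    def record(self, note: str) -> None:
        """Append an intermediate finding without touching the main context."""
        self.notes.append(note)

    def checkpoint(self) -> None:
        """Mark the current notes as a known-good state."""
        self.checkpoints.append(len(self.notes))

    def rollback(self) -> None:
        """Drop everything recorded since the last checkpoint."""
        keep = self.checkpoints[-1] if self.checkpoints else 0
        del self.notes[keep:]

    def summary(self) -> str:
        """Compact view that can be copied into the main context."""
        return f"Goal: {self.goal}\n" + "\n".join(f"- {n}" for n in self.notes)


pad = Scratchpad("fix order deduction failure")
pad.record("reproduced failure with order fixture")
pad.checkpoint()
pad.record("tried raising pool size: no effect")  # a failed step
pad.rollback()
summary = pad.summary()
```

Because the goal is stored with the board and restated in every summary, a failed step can be discarded without losing the task's original objective.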
Dynamic Workflows: Let the Harness Self-Evolve
Traditional static workflows must be written in advance; they are highly general but cannot be tailored to an individual task. Anthropic has launched dynamic workflows in Claude Code: Claude can customize a dedicated runtime framework on the fly based on the task, and the result can be saved, reused, and shared.
Core Patterns of Dynamic Workflows
A dynamic workflow executes a JavaScript file containing special functions that help generate and coordinate sub-agents. Common patterns include:
When to Use Dynamic Workflows
Dynamic workflows are particularly suitable for:
/deep-research)
But note that dynamic workflows typically consume more tokens, so enable them on demand. For simple programming tasks, the default Claude Code framework is sufficient.
From Agent Framework to Agent Platform
Looking at the tool ecosystem, a clear trend is that Agents are moving from Framework to Platform.
Early on, the competition was: who can build an Agent loop the fastest. Now it's: who can make that loop run reliably over the long term.
So the competition for Agent platforms won't just happen at the model layer or the development framework layer, but across the entire Harness capability. Whose execution environment is more stable, whose tool protocol is clearer, whose context is less prone to drift, whose traces are more usable, whose verification is closer to real tasks, whose permissions and auditing are more controllable—that's who is more likely to bring Agents into real production workflows.
Conclusion: Less Is More
Finally, an important reminder: Agent engineering is not about making things as complex as possible.
As models become stronger, the Harness must also be re-evaluated. Every wrapper, reset, verifier, planner, memory rule, and permission gate essentially represents an assumption: the model can't do it well on its own, so I add a layer of control outside. But if the model's capabilities change, these controls may no longer be necessary and may even become a hindrance.
An example from Anthropic: In long application development tasks, certain context resets were useful for older models but could be removed for stronger models. Removing them reduced costs without degrading quality.
A good Harness doesn't just know how to add controls; it also knows when to remove them.
FAQ
What is the difference between Agent Harness and Agent Framework? Harness is the execution layer inside the Agent, responsible for calling the model, handling tool calls, and deciding when to stop; Framework is a higher-level abstraction that provides definitions and assembly methods for components like agent, tool, memory. Harness focuses more on runtime, Framework focuses more on development time.
Which is better: dynamic workflow or static workflow? There is no absolute winner. Static workflows are pre-written, highly general, and suitable for deterministic multi-step processes; dynamic workflows are customized on the fly by AI, more flexible, and suitable for exploratory, unstructured tasks. Dynamic workflows typically consume more tokens, so enable them on demand.
How to evaluate the quality of an Agent Harness? Don't just look at the final success rate. Evaluate from multiple dimensions: result correctness, execution path reasonableness, token cost, latency, traceability of failure reasons, security and compliance. It is recommended to use trace-native evaluation, recording the complete execution trajectory and analyzing layer by layer.
What is the most overlooked aspect in Harness engineering? Observability and governance. Many teams first get the Agent running, then add logging and permissions. But in production, without observability, you don't know why the Agent failed; without governance, you may not dare to use it even if it succeeds. These two aspects should be included in the design from the start.
How can small teams quickly get started with Harness engineering? It is recommended to start with mature tools like Claude Code, first mastering basic features like CLAUDE.md configuration, Hooks, and Skills. Then gradually introduce SubAgent for context isolation, and finally customize dynamic workflows based on business needs. Don't aim for a complete seven-layer architecture from the beginning.