Multi-Agent System Performance Optimization: A Comprehensive Guide from Topology to Training

Covering topology optimization, pipeline parallelism, RL training frameworks, and market mechanisms to build efficient collaborative multi-agent systems

Introduction: The Optimization Dilemma of Multi-Agent Systems

Multi-agent systems (MAS) decompose complex tasks into collaborations among multiple specialized agents and can outperform single models in code generation, mathematical reasoning, question answering, and other tasks. As system scale grows, however, performance optimization runs into several obstacles: workflow topologies are often frozen once they pass safety validation and compliance review; serial communication between agents makes latency grow linearly with pipeline depth; existing reinforcement learning frameworks focus on single-policy optimization and cannot directly optimize multi-agent workflows; and centralized coordination mechanisms become performance bottlenecks.

This article systematically explores MAS optimization strategies from four cutting-edge directions:

  • Prompt Optimization under Fixed Topology: When the workflow cannot be modified, how to efficiently search for prompt configurations of each agent to improve performance.
  • Streaming Communication Acceleration: Introducing streaming output into agent collaboration to achieve pipeline parallelism, reducing latency and improving reasoning quality.
  • Multi-Agent Reinforcement Learning Framework: Building a general RL training framework for workflows, supporting role decoupling and heterogeneous training.
  • Decentralized Market Mechanism: Using economic incentives to let agents spontaneously form specialization and collaboration, avoiding centralized bottlenecks.
    These methods are not mutually exclusive and can be combined. For example, in a fixed topology scenario, MASPOB can first optimize prompts, then StreamMA can be introduced to accelerate communication; if further training is needed, UnityMAS-O can be used for RL optimization.

    Prompt Optimization under Fixed Topology: MASPOB

    Problem Background

    In real deployments, MAS workflow topologies such as medical diagnosis SOPs and financial audit processes are often designed by experts, validated for safety, and reviewed for compliance. Once deployed, they are difficult to modify. At this point, adjusting each agent's prompt becomes a key means to improve system performance. However, prompt optimization for MAS faces three major challenges:

  • High Evaluation Cost: Each evaluation requires a full execution of the MAS workflow, involving multiple LLM calls.
  • Topology-Induced Coupling: Changes in upstream agent prompts affect the input distribution of downstream agents; agents are not independent.
  • Combinatorial Search Space Explosion: The joint search space grows exponentially with the number of agents.

    Core Algorithm of MASPOB

    The MASPOB framework, proposed by teams including The Chinese University of Hong Kong (Shenzhen), models prompt optimization as a budget-constrained combinatorial black-box optimization problem. It comprises three core components:

  • Topology-Aware Performance Surrogate Model: The MAS workflow is modeled as a directed acyclic graph (DAG), with prompt embeddings of each agent as node features. A graph attention network (GAT) performs message passing, explicitly modeling the impact of upstream prompt changes on downstream agents.
  • Bandit-Based Exploration-Exploitation Trade-off: A linear upper confidence bound (LinUCB) is used to construct an acquisition function that favors high predicted performance while assigning higher scores to underexplored regions.
  • Coordinate Ascent Search: The joint optimization is decomposed into univariate optimization for each agent individually, significantly reducing search complexity.
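
    The interplay between the bandit-style acquisition function and coordinate ascent can be illustrated with a small sketch. This is not the MASPOB implementation: the one-hot features stand in for the GAT-based surrogate, and `optimize_prompts`, `toy_evaluate`, and `TARGET` are hypothetical names chosen for illustration. It only shows the general pattern of scoring candidates with a LinUCB-style bound and tuning one agent's prompt at a time while the others stay fixed.

```python
import math


def one_hot_features(config, n_candidates):
    """Concatenate one one-hot block per agent (a stand-in for learned features)."""
    x = []
    for choice in config:
        x.extend(1.0 if k == choice else 0.0 for k in range(n_candidates))
    return x


class LinUCB:
    """Ridge-regression reward model with an upper-confidence exploration bonus."""

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha
        # Inverse of A = lam * I + sum(x x^T), maintained incrementally.
        self.a_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _a_inv_dot(self, v):
        return [sum(r * vj for r, vj in zip(row, v)) for row in self.a_inv]

    def score(self, x):
        theta = self._a_inv_dot(self.b)  # ridge estimate of the reward weights
        mean = sum(t * xi for t, xi in zip(theta, x))
        var = sum(a * xi for a, xi in zip(self._a_inv_dot(x), x))
        return mean + self.alpha * math.sqrt(max(var, 0.0))

    def update(self, x, reward):
        ax = self._a_inv_dot(x)
        denom = 1.0 + sum(a * xi for a, xi in zip(ax, x))
        for i, ai in enumerate(ax):  # Sherman-Morrison rank-1 update of A^-1
            for j, aj in enumerate(ax):
                self.a_inv[i][j] -= ai * aj / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def optimize_prompts(evaluate, n_agents, n_candidates, budget, sweeps=2):
    """Spend `budget` full evaluations; pick each one by coordinate ascent on UCB."""
    model = LinUCB(dim=n_agents * n_candidates)
    config = [0] * n_agents
    best_config, best_reward = None, float("-inf")
    for _ in range(budget):
        # Coordinate ascent on the acquisition score: tune one agent at a time.
        for _ in range(sweeps):
            for agent in range(n_agents):
                def acquisition(k, agent=agent):
                    trial = config[:agent] + [k] + config[agent + 1:]
                    return model.score(one_hot_features(trial, n_candidates))
                config[agent] = max(range(n_candidates), key=acquisition)
        reward = evaluate(list(config))  # one full (expensive) workflow run
        model.update(one_hot_features(config, n_candidates), reward)
        if reward > best_reward:
            best_config, best_reward = list(config), reward
    return best_config, best_reward


TARGET = [2, 0, 3]  # hypothetical "best" prompt index per agent


def toy_evaluate(config):
    """Toy stand-in for running the workflow: fraction of agents on the target prompt."""
    return sum(c == t for c, t in zip(config, TARGET)) / len(TARGET)
```

    Calling `optimize_prompts(toy_evaluate, n_agents=3, n_candidates=4, budget=30)` returns the best configuration seen and its score. Each sweep scores only `n_agents * n_candidates` configurations instead of the full `n_candidates ** n_agents` joint space, which is the source of the complexity reduction.
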
    Experimental Results

    On six benchmarks covering question answering (HotpotQA, DROP), code generation (HumanEval, MBPP), and mathematical reasoning (GSM8K, MATH), MASPOB achieved an average score of 80.58% under a budget of 50 evaluations, improving over the IO baseline, AFlow, and MIPRO by 12.02%, 2.06%, and 1.71%, respectively. Ablation studies show that the GNN module contributes an average improvement of 2.31%, and that coordinate ascent reduces runtime by over 98% at a performance loss of less than 0.5%.

    Streaming Communication Acceleration: StreamMA

    The Cost of Serial Communication

    Existing MAS frameworks commonly use a "generate first, then transmit" serial communication method: the upstream agent must generate a complete response before passing it to the downstream. This leads to two problems:

  • Linear Latency Growth: Downstream agents must wait for upstream completion, causing end-to-end latency to grow linearly with pipeline depth.
  • Error Inheritance: Downstream agents are forced to read the entire upstream response, including low-quality reasoning steps, and errors accumulate along the chain.

    Research shows that in long-chain reasoning, early steps are usually reliable while later steps are more prone to drift, and chain-of-thought (CoT) accuracy degrades once the chain exceeds an optimal length.
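
    The linear latency growth can be made concrete with a back-of-envelope model. The sketch below is an idealization, not a measurement: it assumes a simple chain of agents, a constant per-step time for each agent, and no network or scheduling overhead. The function name `end_to_end_latency` is hypothetical. For contrast, it also computes the latency when each step is forwarded as soon as it is complete.

```python
def end_to_end_latency(step_times, n_steps, streaming):
    """Time at which the last agent in a chain emits its final reasoning step.

    step_times[d] is the assumed constant time agent d needs per step.
    With streaming=False an agent starts only after its upstream has finished
    all steps; with streaming=True it can consume step s as soon as it exists.
    """
    depth = len(step_times)
    # done[d][s] = time at which agent d has finished its step s
    done = [[0.0] * n_steps for _ in range(depth)]
    for d in range(depth):
        for s in range(n_steps):
            own_prev = done[d][s - 1] if s > 0 else 0.0
            if d == 0:
                ready = 0.0                       # first agent has its input at t=0
            elif streaming:
                ready = done[d - 1][s]            # wait only for the matching step
            else:
                ready = done[d - 1][n_steps - 1]  # wait for the full response
            done[d][s] = max(own_prev, ready) + step_times[d]
    return done[-1][-1]


serial = end_to_end_latency([1.0, 1.0, 1.0], n_steps=8, streaming=False)
pipelined = end_to_end_latency([1.0, 1.0, 1.0], n_steps=8, streaming=True)
print(serial, pipelined)  # 24.0 10.0
```

    Under these assumptions the serial chain costs depth × steps time units, while step-level forwarding costs depth + steps − 1, so adding an agent adds one step time rather than a full response time.
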

    StreamMA Solution

    StreamMA, proposed by teams including The Hong Kong University of Science and Technology (Guangzhou), leverages the model's own streaming output mechanism: each upstream agent forwards a reasoning step to the downstream as soon as it is produced, achieving pipeline parallelism. Core design:

  • All agents start concurrently, each maintaining an input queue.
  • Each agent makes streaming calls; as soon as a complete step is produced, it is immediately pushed to the downstream queue.
  • While the downstream processes step s, the upstream is still generating step s+1.
  • The downstream agent is invoked S times; previous steps form a shared prefix, reducing cost through cache hits.

    Key insight: Reliable early steps reach the downstream first, allowing the downstream to build independent reasoning trajectories, diluting the impact of erroneous later steps.
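
    The queue-based design above can be sketched with Python's `asyncio`. This is a minimal illustration under stated assumptions, not StreamMA's code: `fake_llm_step` is a hypothetical stand-in for one streamed reasoning step from a model, the topology is a three-agent chain, and prefix caching is only hinted at by passing the accumulated steps.

```python
import asyncio

_END = object()  # sentinel: the upstream agent has finished


async def fake_llm_step(name, received, own_steps):
    """Hypothetical stand-in for one streamed reasoning step from an LLM call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{name}:step{len(own_steps) + 1}"


async def source_agent(name, n_steps, outbox):
    own_steps = []
    for _ in range(n_steps):
        step = await fake_llm_step(name, (), own_steps)
        own_steps.append(step)
        await outbox.put(step)  # forward each step as soon as it is complete
    await outbox.put(_END)
    return own_steps


async def downstream_agent(name, inbox, outbox, log):
    received, own_steps = [], []
    while True:
        step = await inbox.get()
        if step is _END:
            break
        received.append(step)
        # One invocation per received step; earlier steps form a shared prefix.
        log.append((name, tuple(received)))
        out = await fake_llm_step(name, received, own_steps)
        own_steps.append(out)
        if outbox is not None:
            await outbox.put(out)
    if outbox is not None:
        await outbox.put(_END)
    return own_steps


async def run_pipeline(n_steps=3):
    q_ab, q_bc = asyncio.Queue(), asyncio.Queue()
    log = []
    # All agents start concurrently; each blocks only on its own input queue.
    _, _, final_steps = await asyncio.gather(
        source_agent("A", n_steps, q_ab),
        downstream_agent("B", q_ab, q_bc, log),
        downstream_agent("C", q_bc, None, log),
    )
    return final_steps, log


if __name__ == "__main__":
    steps, log = asyncio.run(run_pipeline())
    print(steps)
    for entry in log:
        print(entry)
```

    Each log entry shows the prefix of upstream steps an agent had seen at the moment it was invoked, which is the part a provider-side prompt cache could reuse between consecutive invocations.
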

    Experimental Results

    On eight benchmarks (including AIME 2025/2026, HMMT 2026, GPQA-Diamond, HLE, and LiveCodeBench) using Claude Opus 4.6 and GPT-5.4, StreamMA outperformed serial and single-model approaches across three DAG topologies, with an average improvement of 7.3 percentage points on Claude and 1.5 percentage points on GPT. Cost analysis shows that, thanks to cache reuse, the total cost of the streaming approach is even lower than that of the serial approach. Additionally, increasing the number of reasoning steps S per agent improves both effectiveness and speed, forming a new scaling law orthogonal to "stacking more agents."

    Multi-Agent Reinforcement Learning Framework: UnityMAS-O

    Limitations of Existing Frameworks

    Most LLM-based MAS cannot be trained: workflows are patched together with prompts, routing rules, and hand-crafted interaction protocols. Even when training is introduced, it often only trains one model or role. Existing RL frameworks (TRL, OpenRLHF, verl, etc.) focus on single-policy optimization and cannot directly express role division, topology structure, and reward distribution in multi-agent workflows.

    UnityMAS-O Design

    UnityMAS-O, proposed by Renmin University of China and Xiaohongshu, extends verl to elevate the optimization target from "single policy" to "multi-agent workflow." Core abstractions include:

  • Logical Role: Describes a node's responsibility in the workflow (e.g., planner, retriever, coder), with prompt templates, input/output formats, available tools, etc. Roles are workflow-level objects, not bound to specific parameters.
  • Role-to-Model Mapping: Supports full sharing (all roles share the same model), partial sharing (roles grouped to share parameters), and full separation (each role has its own model).
  • Workflow Graph: A user-defined directed graph supporting sequential pipelines, parallel branches, iterative loops, etc.
  • Reward Function: Each role can define its own reward, combining node-level, round-level, and trajectory-level information, covering rule-based format rewards, environment rewards, and model-based rewards.
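
    One way to picture how these abstractions fit together is a few plain data classes. The sketch below is illustrative only and does not reflect UnityMAS-O's or verl's actual API: `Role`, `Workflow`, `sharing_mode`, and the model identifiers are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Role:
    """A logical role: a workflow-level object, not tied to any model parameters."""
    name: str
    prompt_template: str
    tools: Tuple[str, ...] = ()


@dataclass
class Workflow:
    roles: Dict[str, Role]
    edges: List[Tuple[str, str]]   # directed (upstream, downstream) pairs
    role_to_model: Dict[str, str]  # role name -> physical model id
    rewards: Dict[str, Callable[[str], float]] = field(default_factory=dict)

    def model_groups(self):
        """Group role names by the physical model that serves them."""
        groups = {}
        for role_name, model_id in self.role_to_model.items():
            groups.setdefault(model_id, []).append(role_name)
        return groups

    def sharing_mode(self):
        n_models = len(self.model_groups())
        if n_models == 1:
            return "full sharing"
        if n_models == len(self.roles):
            return "full separation"
        return "partial sharing"

    def role_reward(self, role_name, output):
        """Apply the role's own reward function; roles without one score 0.0."""
        reward_fn = self.rewards.get(role_name, lambda _output: 0.0)
        return reward_fn(output)


def build_example():
    roles = {
        "planner": Role("planner", "Plan the task: {task}"),
        "retriever": Role("retriever", "Find evidence for: {plan}", tools=("search",)),
        "coder": Role("coder", "Write code for: {plan}\n{evidence}"),
    }
    return Workflow(
        roles=roles,
        edges=[("planner", "retriever"), ("retriever", "coder")],
        # Planner and retriever share one model; the coder has its own.
        role_to_model={"planner": "model-a", "retriever": "model-a", "coder": "model-b"},
        # A rule-based format reward for the coder (hypothetical rule).
        rewards={"coder": lambda out: 1.0 if out.startswith("def ") else 0.0},
    )
```

    Because the mapping from roles to models lives in one dictionary, switching between full sharing, partial sharing, and full separation changes only `role_to_model`; the roles, graph, and rewards stay untouched.
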
    System Implementation and Training Process

    The system uses a star-topology runtime: a central controller maintains the global training loop and schedules workflow states; the Ray execution layer provides remote calls and GPU management; LLM worker groups are bound to physical model instances. During training, the controller only transfers lightweight metadata (role identity, routing identifier, output, reward), while heavy tensors (token probabilities, attention masks) remain local to the worker groups.

    Experimental Results

    On retrieval and code tasks, all workflows and model scales showed improvement after training. Small models benefited significantly: QD-Retrieve-Answer's F1 on NQ rose from 0.022 to 0.445, and on HotpotQA from 0.032 to 0.397. In code tasks, the pass rate after training increased substantially, while the average number of validation rounds decreased, indicating that training improved both accuracy and efficiency. Parameter sharing experiments show that multi-role sharing of physical models can still be effectively trained, reducing the number of model groups in practice.

    Decentralized Market Mechanism: EoM

    The Drawbacks of Centralized Coordination

    Mainstream MAS frameworks (e.g., MetaGPT, AutoGen) use centralized orchestration, which has structural drawbacks: planning is bottlenecked at a single point, and coordination costs grow linearly with scale. EoM, proposed by teams from Harvard University and MIT, draws inspiration from Hayek's market economy theory and designs a set of economic incentives that let agents spontaneously form specialization and collaboration without central control.

    Core Mechanism

    EoM models a group of LLM agents as a "society" with economic interactions. Each agent is defined by its wake condition, action strategy, fixed bid, and current wealth. The system includes two processes:

  • Planning (Auction and Trading): At each step, all agents check their wake conditions. Among those eligible, the agent with the highest bid wins the right to act. The winner pays its bid to the agent that acted in the previous step and receives the environment reward. This achieves decentralized credit assignment: agents that pave the way for high-value future actions accumulate wealth, while those that lead the system into dead ends lose wealth.
  • Adaptation (Evolution): Agents pay rent, bankrupt agents are eliminated, and new agents are injected. For exploitation, wealthy agents mutate and reproduce; for exploration, bankrupt agents serve as negative examples from which corrected versions are generated. As novice protection, a new agent's first bid is set to the highest bid in the field.
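
    The auction-and-payment rule in the planning process can be sketched in a few lines. This is a toy illustration, not EoM's implementation: the `Agent` fields, the string-valued state, and the choice to charge nothing when there is no previous actor are assumptions made for the example. Rent, elimination, and mutation are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Agent:
    name: str
    wakes_on: Callable[[str], bool]          # wake condition over the current state
    act: Callable[[str], Tuple[str, float]]  # state -> (new state, environment reward)
    bid: float                               # fixed bid
    wealth: float


def auction_step(agents, state, previous_winner: Optional[Agent]):
    """Run one planning step: the highest eligible bidder acts and pays its predecessor."""
    eligible = [a for a in agents if a.wakes_on(state)]
    if not eligible:
        return state, previous_winner        # nobody wakes; nothing changes
    winner = max(eligible, key=lambda a: a.bid)
    if previous_winner is not None:
        # Assumption: the bid is transferred only when a previous actor exists.
        winner.wealth -= winner.bid
        previous_winner.wealth += winner.bid
    new_state, reward = winner.act(state)
    winner.wealth += reward                  # the winner collects the environment reward
    return new_state, winner


def run_example():
    planner = Agent("planner", lambda s: s == "start",
                    lambda s: ("planned", 0.0), bid=1.0, wealth=5.0)
    solver = Agent("solver", lambda s: s == "planned",
                   lambda s: ("solved", 10.0), bid=2.0, wealth=5.0)
    state, last = "start", None
    for _ in range(3):
        state, last = auction_step([planner, solver], state, last)
    return state, planner.wealth, solver.wealth


print(run_example())  # ('solved', 7.0, 13.0)
```

    The planner earns no environment reward itself, yet ends up richer because the solver's bid flows back to it. That backward flow of payments is what lets agents that set up valuable later actions accumulate wealth.
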
    Experimental Results

    In five domains—mathematical reasoning, accelerator design, financial research, scientific research, and distributed system optimization—EoM allowed "crippled" agents (deliberately weakened, e.g., output limited to 128 tokens, only one tool) to band together and outperform fully functional strong agents. Mathematical reasoning accuracy rose from 15.9% to 57.0%, surpassing the complete baseline of 51.9%; accelerator design EDP dropped to 39.3, better than the complete ReAct's 43.1. Ablation studies show that removing economic parameters (e.g., rent, reward) or components like auction, exploitation, and exploration significantly degrades performance, confirming that the economic mechanism is the core engine.

    Conclusion and Outlook

    Multi-agent system optimization is advancing from multiple dimensions:

  • Prompt Level: MASPOB achieves sample-efficient joint optimization under fixed topology.
  • Communication Level: StreamMA breaks serial bottlenecks through streaming communication.
  • Training Level: UnityMAS-O provides a general RL framework for workflows.
  • Organization Level: EoM uses market mechanisms for decentralized coordination.

    These methods collectively point to a trend: future MAS optimization will become more systematic and automated, with less manual intervention. For developers, understanding these techniques helps in selecting an optimization strategy that fits the actual scenario. For example, if the workflow is fixed but performance is insufficient, try MASPOB; if latency is the bottleneck, introduce StreamMA; if the goal is to keep raising the system's performance ceiling through training or self-organization, consider UnityMAS-O or EoM.

    For a deeper understanding of basic multi-agent system concepts, refer to AI Agent and Multi-Agent; if focusing on workflow design, read Workflow and Orchestration; for reinforcement learning training, explore Fine-tuning and RL.

    FAQ

    Is MASPOB applicable to workflows with non-DAG topologies? MASPOB models workflows as directed acyclic graphs (DAGs), which is common for most MAS. For topologies with loops, it can theoretically be adapted by unrolling loops or introducing time steps, but the current version is primarily designed for DAGs.

    What task types does StreamMA require? StreamMA is suitable for tasks that can be decomposed into steps, such as mathematical reasoning, code generation, and scientific analysis. For open-ended creative writing tasks that are difficult to break into steps, the advantages of streaming communication are less pronounced.

    Which RL algorithms does UnityMAS-O support? The current version is based on verl and primarily supports the PPO algorithm. Future extensions could support GRPO, REINFORCE, etc., but the core abstractions (role-model decoupling, workflow graph, role-level rewards) are algorithm-agnostic.

    How should economic parameters be set in EoM? Experiments in the paper show that parameters such as rent, reward scaling, and agent count need to be balanced. It is recommended to start with the default parameters and adjust the rent multiplier and reward scaling factor for the task, so that agents are neither eliminated prematurely nor protected excessively.

    Can these methods be combined? Yes. For example, first use MASPOB to optimize prompts under fixed topology, then introduce StreamMA to accelerate communication; if further training is needed, use UnityMAS-O for RL optimization. EoM provides an alternative decentralized organization method that can complement other approaches.

    Which method is best for my scenario? It depends on constraints: if the workflow is fixed and evaluation budget is limited, choose MASPOB; if latency-sensitive and tasks are decomposable, choose StreamMA; if continuous training improvement is desired, choose UnityMAS-O; if pursuing decentralization and robustness, choose EoM.
