How RecursiveMAS speeds up multi-agent inference by up to 2.4x and reduces token usage by 75%



One of the key challenges of today’s multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, increases token costs, and makes it difficult to train the entire system as a cohesive unit.

To overcome this challenge, researchers at the University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that allows agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains.

Experiments show that RecursiveMAS improves accuracy in complex domains such as code generation, medical reasoning, and search, while increasing inference speed and dramatically reducing token usage.

RecursiveMAS is also significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective approach for building custom multi-agent systems.

The challenges of improving multi-agent systems

Multi-agent systems can help address complex tasks that single-agent systems have difficulty handling. When scaling multi-agent systems for real-world applications, a major challenge is allowing the system to evolve, improve, and adapt to different scenarios over time.

Prompt-based adaptation improves agent interactions by iteratively refining the shared context provided to agents. By updating prompts, the system acts as a director, guiding agents to generate responses that are more aligned with the overall goal. The fundamental limitation is that the capabilities of the models underlying each agent remain static.

A more sophisticated approach is to train agents by updating the weights of the underlying models. Training an entire system of agents is difficult because updating all parameters across multiple models is computationally expensive.

Even if an engineering team commits to training its models, the standard approach of agents communicating through text creates significant bottlenecks. Because agents rely on sequential text generation, each model must wait for the previous one to finish generating its text before it can begin its own processing, which adds latency.

Forcing models to explain their intermediate reasoning token by token just so the next model can read it is highly inefficient. It severely inflates token usage, increases computing costs, and makes system-wide iterative learning painfully slow to scale.

How RecursiveMAS works

Instead of trying to improve each agent as an isolated, independent component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as an integrated whole.

The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. In contrast, a recursive language model reuses a set of shared layers that process data and feed the result back into themselves. By looping the computation, the model can deepen its reasoning without adding parameters.
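To make the weight-sharing idea concrete, here is a minimal sketch in plain Python. It is not the architecture of any particular recursive language model: the single tanh layer, the 3x3 weight matrix, and the names `shared_block` and `recursive_forward` are all assumptions chosen for illustration. The point it shows is that looping the same block more times adds computation depth while the parameter count stays fixed.

```python
import math

def shared_block(state, weights):
    """One shared layer: a tanh-activated linear map (toy stand-in for a real block)."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, state)))
        for row in weights
    ]

def recursive_forward(state, weights, num_loops):
    """Apply the SAME block num_loops times, feeding its output back as input."""
    for _ in range(num_loops):
        state = shared_block(state, weights)
    return state

# Hypothetical weights, chosen only for illustration.
WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
    [-0.4, 0.1, 0.9],
]
NUM_PARAMS = sum(len(row) for row in WEIGHTS)  # fixed, whatever the loop depth

x = [1.0, 0.0, -1.0]
shallow = recursive_forward(x, WEIGHTS, num_loops=1)
deep = recursive_forward(x, WEIGHTS, num_loops=4)
print(NUM_PARAMS, shallow, deep)
```

A standard stacked model would need four separate weight matrices to reach the same depth as `num_loops=4`; here the same nine numbers are reused on every pass.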

RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this configuration, each agent functions as a layer in a recursive language model. Instead of generating text, agents iteratively pass their continuous latent representations to the next agent in the sequence, creating a hidden, looping stream of information that flows through the system.

This latent transfer continues through all agents. When the final agent finishes its processing, its latent results are sent directly to the first agent, starting a new round of recursion.

This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in the latent space, with only the last agent producing a textual output in the final round. It is as if the agents communicate telepathically as a unified whole and the last agent provides the final response in text form.
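The control flow described above can be sketched as a simple loop. This is a toy illustration, not the researchers' code: the agents are stand-in functions over small vectors, all agents share one latent dimension for simplicity, and `make_agent`, `decode_to_text`, and `run_latent_loop` are hypothetical names. What it shows is the ordering: latents pass from agent to agent, the last agent's output becomes the first agent's input in the next round, and text is decoded exactly once at the end.

```python
import math

def make_agent(gain):
    """Return a toy 'agent': a function from latent vector to latent vector."""
    def agent(latent):
        return [math.tanh(gain * v) for v in latent]
    return agent

def decode_to_text(latent):
    """Hypothetical decoder, standing in for the final agent's text generation."""
    return " ".join(f"{v:+.3f}" for v in latent)

def run_latent_loop(agents, latent, num_rounds):
    """Pass a latent state around the ring of agents for num_rounds rounds."""
    hops = 0
    for _ in range(num_rounds):
        # When this inner loop ends, the last agent's latent is carried
        # into the next round and handed to the first agent again.
        for agent in agents:
            latent = agent(latent)
            hops += 1
    # Text is produced only here: once, after the final round.
    return decode_to_text(latent), hops

agents = [make_agent(0.9), make_agent(1.1), make_agent(1.3)]
answer, latent_hops = run_latent_loop(agents, [0.2, -0.4, 0.6], num_rounds=3)
print(answer, latent_hops)
```

With three agents and three rounds there are nine latent hand-offs but a single decode. A text-based pipeline with the same loop structure would have to generate text at each of those nine hand-offs.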

The architecture of latent collaboration

To enable continuous collaboration in latent space, the authors introduce a specialized architectural component called RecursiveLink. This is a lightweight two-layer module designed to convey and refine the latent states of a model rather than forcing it to decode text.

The hidden states of the last layer of a language model contain the rich semantic representation of its reasoning process. RecursiveLink is designed to preserve and transmit this high-dimensional information from one embedding space to another.

To avoid the cost of updating each parameter in multiple large language models, the framework keeps the models’ parameters frozen. Instead, it optimizes the system by training only the parameters of the RecursiveLink modules.

To handle both internal reasoning and external communication, the system uses two variations of the module. The internal RecursiveLink operates within an agent during its reasoning phase. It takes the newly generated embeddings from the model and maps them directly to its own input embedding space. This allows the agent to continuously generate a stream of latent thoughts without generating discrete text tokens.

The external RecursiveLink serves as a bridge between agents. Because agents in a real-world system may use different architectures and model sizes, their internal embedding spaces can have different dimensions. The external RecursiveLink includes an additional layer that maps embeddings from one agent's hidden dimension into the embedding space of the next agent.
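The difference between the two link types comes down to shapes. The sketch below is an assumption-heavy illustration rather than the published module: the article only says the link has two layers and that the external variant adds a dimension-matching layer, so the tanh activation, the absence of biases, the order of the projection, and the class names `InnerLink` and `OuterLink` are all hypothetical.

```python
import math
import random

def linear(in_dim, out_dim, rng):
    """Random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def count_params(*matrices):
    return sum(len(m) * len(m[0]) for m in matrices)

class InnerLink:
    """Two layers mapping an agent's hidden state back into its own input space."""
    def __init__(self, dim, rng):
        self.w1 = linear(dim, dim, rng)
        self.w2 = linear(dim, dim, rng)

    def __call__(self, hidden):
        mid = [math.tanh(v) for v in matvec(self.w1, hidden)]
        return matvec(self.w2, mid)

    def num_params(self):
        return count_params(self.w1, self.w2)

class OuterLink(InnerLink):
    """Same two layers plus one projection from the source dim to the target dim."""
    def __init__(self, src_dim, dst_dim, rng):
        super().__init__(src_dim, rng)
        self.proj = linear(src_dim, dst_dim, rng)

    def __call__(self, hidden):
        return matvec(self.proj, super().__call__(hidden))

    def num_params(self):
        return super().num_params() + count_params(self.proj)

rng = random.Random(0)
hidden = [0.1, -0.2, 0.3, 0.4]              # agent A's last hidden state, dim 4
inner = InnerLink(4, rng)                   # stays in agent A's space
outer = OuterLink(4, 6, rng)                # bridges to agent B, dim 6
print(len(inner(hidden)), len(outer(hidden)))
```

The internal link returns a vector the same agent can consume again, while the external link returns a vector sized for the next agent's embedding space.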

Training proceeds in two stages. First, the internal links are trained independently to warm up each agent's ability to think in continuous latent embeddings. The system then enters outer-loop training, where the various frozen models are chained together in a loop and the system is evaluated based on the final textual output of the last agent.

Only the RecursiveLink parameters are updated during training; the weights of the original models remain unchanged, similar to low-rank adaptation (LoRA). Another advantage of this design comes into play when several agents run on the same backbone model.

If you have a multi-agent system where two agents are built on exactly the same base model and act in different roles, you don't need to load two copies of the model into your GPU memory or train them separately. The agents share the same backbone as their brain and use RecursiveLink as connective tissue.
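A toy example can show both ideas at once: a frozen backbone that is loaded a single time, and per-role links that are the only things trained. Everything here is a deliberately tiny stand-in and an assumption: the backbone is one scalar weight, each link is an affine map, the role names `planner` and `critic` are invented, and the training loop is plain gradient descent with hand-derived gradients rather than the researchers' procedure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenBackbone:
    """Toy stand-in for a frozen LLM: one fixed scalar weight."""
    weight: float

    def forward(self, x):
        return self.weight * x

@dataclass
class Link:
    """Trainable adapter: an affine map applied to the backbone output."""
    a: float = 1.0
    b: float = 0.0

    def forward(self, h):
        return self.a * h + self.b

@dataclass
class Agent:
    backbone: FrozenBackbone
    link: Link

    def forward(self, x):
        return self.link.forward(self.backbone.forward(x))

def mse(agent, xs, ys):
    return sum((agent.forward(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_link(agent, xs, ys, lr=0.1, steps=200):
    """Full-batch gradient descent that touches the link parameters only."""
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            h = agent.backbone.forward(x)      # backbone output, never updated
            err = agent.link.forward(h) - y
            grad_a += 2 * err * h / n
            grad_b += 2 * err / n
        agent.link.a -= lr * grad_a
        agent.link.b -= lr * grad_b

backbone = FrozenBackbone(weight=2.0)          # loaded once
planner = Agent(backbone, Link())              # hypothetical role names
critic = Agent(backbone, Link())

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
planner_ys = [3 * x + 1 for x in xs]           # made-up target for role 1
critic_ys = [-x for x in xs]                   # made-up target for role 2
train_link(planner, xs, planner_ys)
train_link(critic, xs, critic_ys)
print(planner.link, critic.link, backbone)
```

After training, the two agents behave differently because their links differ, yet both still point at the same unchanged `backbone` object, which is the memory saving the paragraph above describes.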

RecursiveMAS in action

The researchers evaluated RecursiveMAS on nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They built multi-agent systems from open-weight models, including Qwen, Llama-3, Gemma3, and Mistral. These models were assigned roles to form different agent collaboration patterns, such as sequential reasoning and expert collaboration.

RecursiveMAS was compared to baselines with identical training budgets, including standalone models enhanced with LoRA or full supervised tuning, alternative multi-agent frameworks such as Mixture-of-Agents and TextGrad, and recursive baselines such as LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces agents to communicate explicitly via text.

RecursiveMAS achieved an average accuracy improvement of 8.3% over the strongest baselines across all benchmarks. It particularly excelled at reasoning-intensive tasks, outperforming text-based optimization methods such as TextGrad by 18.1% on AIME2025 and 13% on AIME2026.

Because it avoids generating text at every step, RecursiveMAS achieved an end-to-end inference speedup of 1.2x to 2.4x. It is also much more token-efficient than the alternative. Compared with the text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first round of recursion and by 75.6% in the third round. RecursiveMAS was also remarkably cheap to train. Because it updates only the lightweight RecursiveLink modules, which contain about 13 million parameters, or about 0.31% of the parameters of the frozen models, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.
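A quick back-of-the-envelope calculation helps put these percentages in perspective. The inputs below are the figures reported above; the derived values are simple arithmetic, and the implied backbone size in particular assumes that the 0.31% share is measured against the combined parameter count of the frozen models, which the article does not spell out.

```python
# Figures reported in the article.
LINK_PARAMS = 13_000_000        # approximate RecursiveLink parameter count
LINK_SHARE = 0.0031             # 0.31% of the frozen models' parameters
TOKEN_REDUCTION_ROUND_1 = 34.6  # percent, versus Recursive-TextMAS
TOKEN_REDUCTION_ROUND_3 = 75.6  # percent, versus Recursive-TextMAS
MAX_SPEEDUP = 2.4

def remaining_fraction(reduction_pct):
    """Fraction of the baseline that is left after a percentage reduction."""
    return 1 - reduction_pct / 100

# Derived values (assumption: the share is relative to total frozen parameters).
implied_backbone_params = LINK_PARAMS / LINK_SHARE
tokens_left_round_1 = remaining_fraction(TOKEN_REDUCTION_ROUND_1)
tokens_left_round_3 = remaining_fraction(TOKEN_REDUCTION_ROUND_3)
token_ratio_round_3 = 1 / tokens_left_round_3
latency_left_at_max_speedup = 1 / MAX_SPEEDUP

print(f"implied frozen parameters: {implied_backbone_params / 1e9:.2f}B")
print(f"tokens remaining, round 1: {tokens_left_round_1:.1%}")
print(f"tokens remaining, round 3: {tokens_left_round_3:.1%}")
print(f"text-to-latent token ratio, round 3: {token_ratio_round_3:.1f}x")
print(f"latency remaining at 2.4x: {latency_left_at_max_speedup:.1%}")
```

In other words, by the third round the text-based variant uses roughly four times as many tokens as the latent one, and the best-case speedup leaves a little over 40% of the original latency.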

Business adoption

The efficiency gains (lower token consumption, lower GPU memory requirements, and faster inference) are intended to make complex, multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agent deployments. The researchers have published the code and trained model weights under the Apache 2.0 license.


