Alibaba’s Metis agent cuts redundant AI tool calls from 98% to 2% and becomes more accurate in the process



One of the key challenges in building effective AI agents is teaching them when to use external tools and when to rely on their internal knowledge. Large language models are often trained to invoke tools blindly, leading to latency bottlenecks, unnecessary API costs, and degraded reasoning caused by noise injected into the model’s context.

To overcome this challenge, Alibaba researchers introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance execution efficiency and task accuracy.

Metis, a multimodal model trained with this framework, reduces redundant tool invocations from 98% to just 2% while establishing new state-of-the-art reasoning accuracy on key industry benchmarks. The framework helps create AI agents that are not trigger-happy and know when to refrain from using tools, enabling responsive and cost-effective agent systems.

The metacognitive deficit

Current agent models suffer from what the researchers call a “profound metacognitive deficit”: they have difficulty deciding when to rely on their internal parametric knowledge and when to consult an external tool. As a result, they blindly invoke tools and APIs such as web search or code execution, even when the user’s message already contains all the information needed to solve the task.

This trigger-happy tool-calling behavior creates serious operational problems for real-world applications. Because the models are trained to focus almost entirely on task completion, they are indifferent to latency, and these agents frequently reach exorbitant tool-call rates. Every unnecessary external API call adds a serial processing bottleneck, turning a technically capable AI into a slow system that frustrates users and eats into tooling budgets.

At the same time, burning computational resources on tool overuse does not translate into better reasoning. Redundant tool interactions inject noise into the model’s context, which can distract the model, derail an otherwise sound chain of reasoning, and actively degrade the final result.

To address the latency and cost problems of blind tool invocation, previous reinforcement learning methods tried to penalize excessive tool use by folding task accuracy and execution efficiency into a single reward signal. However, this coupled design creates an optimization dilemma. If the efficiency penalty is too aggressive, the model becomes overly conservative and suppresses essential tool calls, sacrificing correctness on hard tasks. Conversely, if the penalty is too weak, the signal fails to curb tool overuse on simpler tasks.

Furthermore, this shared reward creates semantic ambiguity: an incorrect trajectory with no tool calls can earn the same reward as a correct trajectory with excessive tool use. Because the training signals for accuracy and efficiency are entangled, the model cannot learn to control its tool use without degrading its core reasoning capabilities.
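To see the ambiguity concretely, here is a toy coupled reward; the scoring scheme and the 0.2 penalty weight are illustrative assumptions, not values from any prior method:

```python
# Illustrative coupled reward: accuracy bonus minus a per-call efficiency
# penalty. The 0.2 weight is arbitrary, chosen only for this example.
def coupled_reward(is_correct: bool, num_tool_calls: int, penalty: float = 0.2) -> float:
    return (1.0 if is_correct else 0.0) - penalty * num_tool_calls

# An incorrect answer with zero tool calls...
print(coupled_reward(is_correct=False, num_tool_calls=0))  # 0.0
# ...earns exactly the same reward as a correct answer with five redundant calls.
print(coupled_reward(is_correct=True, num_tool_calls=5))   # 0.0
```

With a single scalar like this, the optimizer cannot tell whether a low reward means "wrong answer" or "too many tool calls," which is precisely the entanglement HDPO is designed to break.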

Hierarchical optimization of decoupled policies

To solve the optimization dilemma of coupled rewards, the researchers introduced HDPO, which separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on maximizing task correctness across all model rollouts, while the efficiency channel optimizes execution economy.

HDPO computes the training signals for the two channels independently and only combines them at the final loss-calculation stage, with the efficiency signal conditioned on the accuracy channel. This means an incorrect answer is never rewarded simply for being fast or using fewer tools. The decoupling avoids situations where accuracy and efficiency gradients cancel each other out, giving the model clear learning signals for both objectives.
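A minimal sketch of this decoupling, assuming GRPO-style group-relative advantages over a batch of rollouts for the same prompt (the exact HDPO loss and gating details are specified in the paper and may differ):

```python
import numpy as np

def decoupled_advantages(correct: np.ndarray, tool_calls: np.ndarray,
                         eff_weight: float = 0.3) -> np.ndarray:
    """Per-rollout training signal with accuracy and efficiency decoupled.

    correct:    boolean array, whether each rollout answered correctly
    tool_calls: integer array, tool invocations per rollout
    """
    # Accuracy channel: group-relative advantage over correctness alone.
    acc = correct.astype(float)
    acc_adv = (acc - acc.mean()) / (acc.std() + 1e-8)

    # Efficiency channel: fewer tool calls is better, but the signal is
    # gated on correctness -- an incorrect answer is never rewarded for
    # being fast or tool-free.
    eff_adv = np.zeros_like(acc)
    if correct.sum() > 1:
        eff = -tool_calls[correct].astype(float)
        eff_adv[correct] = (eff - eff.mean()) / (eff.std() + 1e-8)

    # The two channels are combined only here, at the final stage;
    # eff_weight is an illustrative mixing coefficient, not from the paper.
    return acc_adv + eff_weight * eff_adv
```

Note how the gating naturally produces the curriculum effect described next: early in training, few rollouts are correct, so the efficiency channel is mostly silent, and it gains influence as the model’s success rate rises.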

The most powerful emergent property of this decoupled design is that it creates an implicit cognitive curriculum. Early in training, when the model is still struggling with the task, optimization is dominated by the accuracy objective, forcing the model to prioritize learning correct reasoning and knowledge. As the model’s reasoning matures and it consistently arrives at correct answers, the efficiency signal smoothly gains influence. This mechanism makes the model first master solving the task, and only then refine its self-sufficiency by avoiding costly, redundant API calls.

To complement HDPO, the researchers developed a rigorous, multi-stage data curation regimen that addresses serious flaws in existing tool-augmented datasets. The curation process covers both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages.

For the SFT stage, they sourced publicly available tool-augmented multimodal trajectory data and filtered out low-quality examples containing execution failures or inconsistent feedback. They also aggressively removed any training samples the base model could solve directly without tools. Finally, using Google Gemini 3.1 Pro as an automated judge, they filtered the SFT corpus to retain only examples demonstrating strategic tool use.
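A sketch of that three-stage filter; the predicate names are hypothetical placeholders supplied by the caller, not functions from the paper or its code release:

```python
from typing import Callable, Iterable

def filter_sft_corpus(
    samples: Iterable[dict],
    is_clean: Callable[[dict], bool],      # no execution failures or inconsistent feedback
    needs_tools: Callable[[dict], bool],   # base model cannot solve the prompt without tools
    is_strategic: Callable[[dict], bool],  # LLM judge rates the tool use as strategic
) -> list[dict]:
    """Apply the three SFT curation stages in sequence; a sample survives
    only if it passes every filter."""
    return [s for s in samples
            if is_clean(s) and needs_tools(s) and is_strategic(s)]
```

The second filter is the key one for the metacognitive goal: by discarding anything the base model can already answer unaided, the corpus only teaches tool use where tools are genuinely needed.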

For the RL stage, curation focused on ensuring a stable optimization signal. The team filtered out prompts with corrupted images or semantic ambiguity. Because the HDPO algorithm relies on comparing correct and incorrect rollouts, a task that is trivially easy (the model always gets it right) or prohibitively hard (the model always gets it wrong) provides no meaningful variance to learn from. The team therefore retained only prompts showing a non-trivial mix of successes and failures, guaranteeing an actionable gradient signal.
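A minimal sketch of this pass-rate filter, assuming success rates estimated over a fixed number of rollouts per prompt (an illustration of the idea, not the paper’s exact procedure):

```python
def filter_rl_prompts(pass_rates: dict[str, float]) -> list[str]:
    """Keep only prompts with a non-trivial mix of successes and failures.

    pass_rates maps each prompt to its empirical success rate over a batch
    of rollouts. At a rate of 0.0 or 1.0 every rollout receives the same
    reward, so a group-relative advantage collapses to zero and the prompt
    contributes no gradient.
    """
    return [p for p, rate in pass_rates.items() if 0.0 < rate < 1.0]
```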

The Metis agent: HDPO in action

To test HDPO in action, the researchers used the framework to build Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on the Qwen3-VL-8B-Instruct vision-language model and was trained in two stages. First, the researchers applied SFT on their curated data to provide a cold-start initialization. They then applied RL with the HDPO framework, exposing the model to multi-turn interactions in which it could invoke tools such as Python code execution, text search, and image search.
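For orientation, here is a minimal sketch of the kind of multi-turn tool loop described here; the message format, tool registry, and stop condition are assumptions for illustration, not Metis’s actual interface:

```python
from typing import Callable

# Placeholder executors for the three tool types named above.
TOOLS: dict[str, Callable[[str], str]] = {
    "python": lambda code: "<execution output>",
    "text_search": lambda query: "<search results>",
    "image_search": lambda query: "<retrieved images>",
}

def run_agent(model_step: Callable[[list[dict]], dict],
              prompt: str, max_turns: int = 8) -> str:
    """Run a multi-turn episode: at each step the model either calls a
    tool or commits to a final answer from its own knowledge."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = model_step(messages)  # {"tool": ..., "input": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]    # the model chose to answer directly
        result = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "tool", "content": result})
    return ""  # episode exhausted without a final answer
```

In this framing, the metacognitive skill HDPO trains is exactly the branch at the top of the loop: whether to emit a tool call or a direct answer at each turn.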

The researchers pitted Metis against standard open-source vision models such as LLaVA-OneVision, text-only reasoners, and state-of-the-art agent models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. The evaluation covered two main areas: visual perception and document understanding on datasets such as HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks such as WeMath and MathVista.

Across all tasks, Metis achieved state-of-the-art or highly competitive performance, outperforming existing agent models, including the much larger 30-billion-parameter Skywork-R1V4, on both visual perception and reasoning tasks.

Equally important is the qualitative behavior Metis displayed in the experiments. For example, when presented with an image of a museum sign and asked what the central text says, standard agent models waste time blindly writing Python scripts to crop the image just to read it. Metis, however, recognizes that the text is clearly legible in the raw image, skips the tools entirely, and answers in a single inference pass.

In another experiment, the model was given a complex chart and asked to identify the second-highest line at a specific data point within a small subplot. Metis recognized that the required visual analysis exceeded its native resolution and that it could not accurately distinguish the overlapping lines. Instead of guessing from the full image, it invoked Python to crop and zoom in on that specific subplot region, allowing it to correctly identify the line. Metis treats code as a precision instrument, deployed only when the visual evidence is genuinely ambiguous, not as a default fallback.
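As a rough illustration of that crop-and-zoom operation (using Pillow; the file path and bounding box are hypothetical, and in the actual experiment the model writes this code itself):

```python
from PIL import Image

def crop_and_zoom(path: str, box: tuple[int, int, int, int],
                  scale: int = 4) -> Image.Image:
    """Crop a (left, upper, right, lower) region and upscale it so fine
    details, such as overlapping lines in a small subplot, become legible."""
    region = Image.open(path).crop(box)
    return region.resize((region.width * scale, region.height * scale),
                         Image.Resampling.LANCZOS)

# Hypothetical usage: zoom into one subplot of a complex chart.
crop_and_zoom("chart.png", box=(640, 480, 840, 620)).save("subplot_zoomed.png")
```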

The researchers released Metis, along with the code for HDPO, under the permissive Apache 2.0 license.

“Our results demonstrate that strategic tool use and strong reasoning performance are not a trade-off; rather, eliminating noisy and redundant tool calls directly contributes to superior accuracy,” the researchers conclude. “More broadly, our work suggests a paradigm shift in tool-augmented learning: from simply teaching models to execute tools to cultivating metacognitive wisdom about when to refrain from them.”


