Alibaba’s new AI framework bypasses the loading of all tools and reduces agent token usage by 99%



As enterprise AI systems scale to handle complex workflows, professionals face the challenge of directing subtasks to the right tools and skills. Agents can have hundreds of tools and skills and be confused about which one to use for each step of a workflow.

To address this challenge, Alibaba researchers developed SkillWeaver, a framework that creates an execution graph for a given task and chooses the appropriate skills for each of its nodes. They also introduce skill-aware decomposition (SAD), a novel technique that uses a feedback loop to let the agent iteratively search for and examine relevant tool candidates. This compositional approach and feedback loop distinguish SkillWeaver from other tool routing frameworks that choose tools in one fell swoop.

SkillWeaver is relevant to real-world AI applications in which agents autonomously orchestrate multi-tool ecosystems, such as those built on the Model Context Protocol (MCP), to execute multi-step business operations like downloading data sets, transforming information, and creating visual reports.

The researchers’ experiments with SkillWeaver show that this routing and retrieval approach significantly increases accuracy and reduces token consumption by more than 99% compared to naively exposing agents to a full tool library.

For professionals building AI agents, the main takeaway is that the granularity of task decomposition is the biggest obstacle to accurate tool retrieval.

The Skills Routing Challenge

Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification that uses structured natural language documentation.

As enterprise agents integrate with massive tool ecosystems, precisely directing user queries to the right skills becomes a difficult task. Exposing an entire library to an LLM to find the right tool is very inefficient: it quickly exceeds context limits and consumes hundreds of thousands of tokens.

Most current tooling frameworks attempt to solve this through API retrieval, documentation comparison, or hierarchical structures that treat routing strictly as a single-skill selection or one-step problem.

However, this single-skill paradigm is insufficient for enterprise environments because real-world queries are inherently compositional. A standard business request like "Download the dataset, transform it and create visual reports" cannot be accomplished with a single tool. It requires breaking down the request and sequencing an API client, a data processor, and a visualization tool into a cohesive, multi-step execution plan.

How SkillWeaver and SAD work

To address this, the researchers frame the problem of handling complex tasks that require multiple skills as "compositional skill routing." Given a complex user request and a vast library of tools, an agent must simultaneously figure out how to break the request into a sequence of atomic subtasks, how to map each subtask to the best available skill, and how to compose those skills into an executable plan.

SkillWeaver organizes this process into three distinct stages: decompose, retrieve, and compose. In the first stage, an LLM acts as a task decomposer, dividing the user’s complex query into a sequence of subtasks, each of which requires a skill. Once the subtasks are clearly defined, the system uses an embedding model to compare each subtask against the skills library and obtain a shortlist of the top candidate tools for each step.

In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks compatibility between skills to ensure that the outputs of one tool flow naturally into the inputs of the next. It then creates a final execution plan as a directed acyclic graph (DAG) that maps dependencies so that independent tasks can potentially run in parallel.
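To make the DAG idea concrete, here is a minimal sketch using Python's standard `graphlib` module. It is not SkillWeaver's planner: the step names and the extra `summarize` branch are hypothetical, added only so that two steps can run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each key is a step, each value is the set of steps it depends on.
plan = {
    "download": set(),
    "transform": {"download"},
    "summarize": {"download"},   # invented branch, independent of "transform"
    "report": {"transform", "summarize"},
}

def parallel_batches(dag):
    """Group steps into batches; steps inside one batch do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(plan)
print(batches)
```

Each inner list is a group of steps an executor could dispatch concurrently before moving on to the next group.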

For example, consider a user who asks an AI agent to "Download the dataset, transform it and create visual reports." In the decomposition stage, the decomposer LLM divides it into three distinct subtasks: downloading the dataset, transforming the data, and creating the reports.

In the retrieval stage, the system searches the library and finds candidates such as “api-client” or “http-fetch” for task one, “csv-parser” or “etl-pipeline” for task two, and so on. Finally, the composition stage evaluates these options, selects the specific combination of “api-client”, “csv-parser”, and “chart-gen” that is most compatible, and connects them into a final workflow ready to run.
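The walkthrough above can be approximated in a few lines of Python. This is a toy sketch under loud assumptions: the skill descriptions, the input/output types, and the "pdf-report" skill are invented for illustration; token overlap stands in for embedding similarity; and the compatibility rule (output type equals next input type) is a guess at the kind of check a planner might perform, not the paper's scoring function.

```python
from itertools import product

# Hypothetical skill library. Names mostly follow the article's example;
# the docs, the input/output types, and "pdf-report" are invented.
SKILLS = {
    "api-client":   {"doc": "download dataset from rest api", "inp": "url", "out": "csv"},
    "http-fetch":   {"doc": "download raw web page over http", "inp": "url", "out": "html"},
    "csv-parser":   {"doc": "parse and transform csv data", "inp": "csv", "out": "table"},
    "etl-pipeline": {"doc": "extract transform load data from database", "inp": "db", "out": "table"},
    "chart-gen":    {"doc": "create visual reports and charts", "inp": "table", "out": "png"},
    "pdf-report":   {"doc": "create pdf reports from text", "inp": "text", "out": "pdf"},
}

def score(query, doc):
    """Jaccard token overlap; a stand-in for embedding cosine similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def retrieve(subtask, k=2):
    """Return the k best-scoring skill names for one subtask."""
    ranked = sorted(SKILLS, key=lambda name: score(subtask, SKILLS[name]["doc"]), reverse=True)
    return ranked[:k]

def compose(shortlists):
    """Pick one skill per step so that outputs match the next step's inputs as often as possible."""
    def compatibility(chain):
        return sum(SKILLS[a]["out"] == SKILLS[b]["inp"] for a, b in zip(chain, chain[1:]))
    return max(product(*shortlists), key=compatibility)

subtasks = ["download the dataset", "transform the data", "create visual reports"]
shortlists = [retrieve(s) for s in subtasks]
plan = compose(shortlists)
print(shortlists)
print(plan)
```

Enumerating every combination is only practical because the shortlists are tiny; a real planner would need a cheaper search over larger candidate sets.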

A key challenge of this process is that LLMs often produce generic step descriptions that do not match the technical vocabulary of the skills actually available in the library. To solve this problem, SkillWeaver introduces skill-aware iterative decomposition (SAD), a novel feedback loop. SAD works by having the LLM draft an initial plan, running a preliminary search to find loosely matching skills, and then feeding those retrieved skills back to the LLM as hints. This allows the LLM to rewrite its decomposition so that its granularity and vocabulary align with the tools that actually exist.
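The loop can be sketched as follows. This is an illustration, not the authors' implementation: `llm_decompose` and `retrieve` are hypothetical callables, the stubs below fake both the model and the search, and the number of rounds and the stop-when-stable rule are assumptions.

```python
def sad_decompose(query, llm_decompose, retrieve, rounds=2, k=3):
    """Draft a plan, retrieve loosely matching skills, feed them back as hints, redraft."""
    subtasks = llm_decompose(query, hints=[])
    for _ in range(rounds):
        hints = []
        for step in subtasks:
            for skill in retrieve(step, k):
                if skill not in hints:      # keep hints unique, in first-seen order
                    hints.append(skill)
        revised = llm_decompose(query, hints=hints)
        if revised == subtasks:             # plan is stable, stop early (assumed rule)
            break
        subtasks = revised
    return subtasks

# --- Stubs standing in for a real LLM and a real vector search (all hypothetical) ---
TOY_INDEX = {"data": "api-client", "clean": "csv-parser",
             "reshape": "csv-parser", "charts": "chart-gen"}

def fake_retrieve(step, k):
    hits = [skill for word, skill in TOY_INDEX.items() if word in step or skill in step]
    return list(dict.fromkeys(hits))[:k]    # de-duplicate while preserving order

def fake_llm(query, hints):
    if not hints:
        # Unguided draft: generic wording and one step too many.
        return ["get the data", "clean it", "reshape it", "make charts"]
    # Guided draft: one step per hinted skill, phrased in the library's vocabulary.
    return [f"run {skill}" for skill in hints]

result = sad_decompose("Download the dataset, transform it and create visual reports",
                       fake_llm, fake_retrieve)
print(result)
```

In this toy run the unguided draft has four vague steps, while the hinted redraft settles on three steps named after skills that exist in the index.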

SkillWeaver in action

To evaluate SkillWeaver’s performance in realistic business scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of different levels of difficulty. To reflect real-world environments, they used a library of 2,209 real-world skills obtained from the public MCP ecosystem, covering 24 functional categories such as cloud infrastructure, finance, and databases.

For the core engine, the researchers primarily used a lightweight 7-billion-parameter model (Qwen2.5-7B-Instruct) for task decomposition, along with a standard semantic search retriever (MiniLM with a FAISS index) for finding tools. SkillWeaver was evaluated against three main configurations: a brute-force "LLM-Direct" method in which all tool names were inserted into a large model's prompt, a basic LLM-based decomposition without SAD, and a ReAct-style agent loop.

Experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short with large tool libraries, but the SAD feedback loop makes a substantial difference. In the base configuration, the 7B model achieved decomposition accuracy (i.e., predicting the correct number of steps) only 51.0% of the time. With the SAD feedback loop enabled, accuracy jumped to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). On "hard" tasks requiring four to five different skills, SAD improved accuracy by 50%.

One fascinating finding was that larger models may perform worse when unguided. When tested in the base configuration, a larger 14-billion-parameter model scored below the 7B model because it tended to over-decompose tasks into unnecessary, microscopic steps. Once SAD was introduced, the retrieved tool hints anchored the model to reality and increased its accuracy. This suggests that aligning an agent with the vocabulary of specific tools is often more impactful than paying for a larger, more expensive LLM.

Another important finding is the token savings. The LLM-Direct baseline, which used the very large Qwen-Max model, showed that stuffing all tools into a large model's prompt fails. Despite near-perfect task decomposition capabilities, the large model retrieved the correct tool category only 21.1% of the time when inundated with tool options. SkillWeaver’s directed routing and retrieval approach far exceeded this in accuracy, while reducing context window consumption from approximately 884,000 tokens to approximately 1,160 tokens per query, a 99.9% reduction. For professionals, this translates directly to drastically lower API costs and faster response times.
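The reduction figure follows directly from the two token counts reported above, as this short check shows. The variable names are ours; both inputs are the approximate per-query figures from the article.

```python
full_library_tokens = 884_000   # approximate per-query context, LLM-Direct baseline
routed_tokens = 1_160           # approximate per-query context, SkillWeaver

reduction = 1 - routed_tokens / full_library_tokens
ratio = full_library_tokens / routed_tokens

print(f"reduction: {reduction:.1%}, ratio: {ratio:.0f}x")
# prints "reduction: 99.9%, ratio: 762x"
```

Put differently, the routed prompt is roughly 762 times smaller than the full-library prompt.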

Finally, the traditional ReAct baseline failed completely, achieving a decomposition accuracy of 0%. Its loop naturally collapses multi-step plans into isolated actions rather than explicitly mapping out a cohesive, multi-tool sequence.

Considerations for developers

While the researchers have not yet published SkillWeaver’s source code, their work is built on off-the-shelf tools and can be readily reproduced.

Skill-aware decomposition (SAD), the key innovation at the heart of the framework, is an intelligent retrieval and prompt engineering loop. The authors have shared the prompt templates in their paper, and developers can implement them quite easily using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.

Regarding the retrieval component, the authors built the core framework using all-MiniLM-L6-v2, an open source embedding model. They found that switching to a slightly more powerful off-the-shelf encoder (BGE-base-en-v1.5) immediately increased accuracy without any fine-tuning. An off-the-shelf bi-encoder is good at placing a relevant tool among the top 10 candidates, doing so almost 70% of the time, but it struggles to consistently rank the correct tool at number one, succeeding only about 37% of the time. To close this gap, teams will likely need to implement a secondary cross-encoder or LLM-based reranker to reorder the top 10 candidates.
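The gap between "in the shortlist" and "ranked first" is what a second-stage reranker targets. The sketch below shows the pattern with toy data. The queries, shortlists, and skill descriptions are invented, and the word-overlap scorer is only a placeholder for a real cross-encoder or LLM-based reranker.

```python
def rerank(query, candidates, score_fn):
    """Reorder a retrieved shortlist using a more expensive pairwise scorer."""
    return sorted(candidates, key=lambda skill: score_fn(query, skill), reverse=True)

def hit_at_k(ranked_lists, gold, k):
    """Fraction of queries whose correct skill appears in the first k results."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)

# Hypothetical first-stage (bi-encoder) output: the right skill is present but not always first.
first_stage = {
    "parse a csv file": ["etl-pipeline", "csv-parser", "chart-gen"],
    "draw a bar chart": ["chart-gen", "csv-parser", "api-client"],
}
gold = ["csv-parser", "chart-gen"]

# Invented skill descriptions used by the placeholder scorer.
DOCS = {
    "etl-pipeline": "extract transform load pipeline",
    "csv-parser": "parse csv file into rows",
    "chart-gen": "draw chart from table",
    "api-client": "call rest api",
}

def toy_cross_score(query, skill):
    """Placeholder for a cross-encoder: counts words shared by query and description."""
    return len(set(query.split()) & set(DOCS[skill].split()))

before = hit_at_k(list(first_stage.values()), gold, k=1)
reranked = [rerank(q, cands, toy_cross_score) for q, cands in first_stage.items()]
after = hit_at_k(reranked, gold, k=1)
print(before, after)
```

Because the reranker only sees the shortlist, its cost scales with the number of candidates per step rather than with the size of the whole library.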

One up-front requirement is to vectorize the tool library and build a FAISS index in advance. In practice, this is a negligible obstacle. Embedding and indexing all 2,209 skills in the benchmark took just 15 seconds. Once the index is built, tool retrieval adds less than 15 milliseconds of latency per query. For enterprise environments, keeping the tool index in sync is a trivial background job.

A current limitation of SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps a compatible DAG for execution, the authors’ pilot study revealed the challenges of multi-step toolchains. For example, if an API call fails in step two, the entire chain breaks. The main contribution of the paper is limited to the routing and planning phase. For a true production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the composition stage to handle real-world API timeouts or malformed results.
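As a rough idea of what such a layer could look like, the sketch below wraps a single plan step with retries and fallbacks. Nothing here comes from the paper: `run_step`, its parameters, and the simulated skills are all hypothetical.

```python
import time

def run_step(primary, fallbacks=(), retries=2, delay=0.0):
    """Try the primary skill with retries, then each fallback; raise if everything fails."""
    last_error = None
    for skill in (primary, *fallbacks):
        for attempt in range(retries + 1):
            try:
                return skill()
            except Exception as exc:  # production code should catch narrower exception types
                last_error = exc
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError("all skills failed for this step") from last_error

# --- Simulated skills (hypothetical) ---
calls = {"n": 0}

def flaky_api():
    """Fails twice with a timeout, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "dataset.csv"

def always_down():
    raise ConnectionError("simulated outage")

result = run_step(flaky_api)                                      # recovers via retries
backup = run_step(always_down, fallbacks=[lambda: "cached.csv"])  # recovers via fallback
print(result, backup)
```

A natural source of fallbacks is the retrieval shortlist itself, since the lower-ranked candidates for a step are already known when the plan is composed.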


