Alibaba’s model was never trained as an agent, yet it improved agent performance on seven benchmarks

Alibaba’s Qwen team launched Qwen-AgentWorld on Tuesday: two models trained not to act within agent environments, but to predict what those environments return. The release covers seven domains under a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web and OS.

The release expands on Alibaba’s recent push into autonomous agents. Qwen3.7-Max, launched in May, was built around a 35-hour autonomous run capability.

That shift points to a ceiling that teams training agents at scale run into directly. Real search engines return whatever results exist, with no mechanism to inject controlled conditions. Live endpoints do not allow a low-disk-space condition to be injected on demand. Agent training is limited to what production environments happen to produce, with no systematic way to surface the edge cases that agents will need to handle but rarely encounter in training.
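By contrast, a simulated environment can expose such a condition as a simple switch. The Python below is a minimal sketch of that idea only: the `SimulatedDisk` class and its `force_full` flag are hypothetical names invented for illustration and are not part of the Qwen-AgentWorld release.

```python
from dataclasses import dataclass


@dataclass
class SimulatedDisk:
    """Hypothetical toy environment used to illustrate condition injection."""

    capacity_bytes: int = 1024
    used_bytes: int = 0
    force_full: bool = False  # injected condition that a live endpoint cannot offer

    def write(self, path: str, data: bytes) -> str:
        """Return the text an agent would observe after attempting a write."""
        if self.force_full or self.used_bytes + len(data) > self.capacity_bytes:
            return f"write {path}: No space left on device"
        self.used_bytes += len(data)
        return f"wrote {len(data)} bytes to {path}"


# The same action yields a normal result or the rare edge case on demand.
normal = SimulatedDisk().write("log.txt", b"hello")
faulty = SimulatedDisk(force_full=True).write("log.txt", b"hello")
```

Here `normal` holds the success message and `faulty` holds the out-of-space error, even though both environments received the identical action.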

The research team trained agents inside the resulting simulator and found performance improvements that exceeded what training in real environments alone produced. In a separate test, using world model training as a warm-up before agent tuning improved performance on seven benchmarks, including three that the model had never seen during training.

The paper accompanying the launch identified a gap in previous agent research. "We argue that world modeling is a crucial missing piece on the path to general agents."

Qwen-AgentWorld trains on what environments return, not what agents should do

Most agent models are trained to answer one question: given what the environment has just shown me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next?
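The two questions can be written as two function signatures. This is an illustrative sketch, not the paper's interface: `Policy`, `WorldModel`, and the hand-written `ToyTerminalWorld` are hypothetical names, and the toy class uses fixed rules where the real system uses a trained language model.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Policy(Protocol):
    """Conventional agent model: observation in, next action out."""

    def act(self, observation: str) -> str: ...


class WorldModel(Protocol):
    """Inverse direction: agent action in, predicted observation out."""

    def predict(self, history: list[str], action: str) -> str: ...


@dataclass
class ToyTerminalWorld:
    """Hypothetical rule-based stand-in for a learned terminal world model."""

    files: set[str] = field(default_factory=set)

    def predict(self, history: list[str], action: str) -> str:
        parts = action.split()
        if not parts:
            return ""
        if parts[0] == "touch" and len(parts) == 2:
            self.files.add(parts[1])  # track the state change the command implies
            return ""
        if parts == ["ls"]:
            return "\n".join(sorted(self.files))
        return f"{parts[0]}: command not found"


world = ToyTerminalWorld()
world.predict([], "touch notes.txt")
listing = world.predict(["touch notes.txt"], "ls")
```

After the two calls, `listing` contains the file name the earlier action created, which is the kind of next-state prediction the article describes.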

That inversion is the core of what the paper calls a language world model: instead of optimizing action selection, the model learns to predict the next environment state in all seven domains under a single training objective. Previous work was more limited: Web World, a previous Qwen project from February, covered only web environments; Snowflake Agent World Model, released the same month, generates code-based, SQL-backed environments instead of training a model to predict states. Qwen-AgentWorld is the first to cover seven domains in a single model, with environment modeling integrated from the first pre-training stage.

Alibaba trained both models in three stages on more than 10 million environment-interaction trajectories from real agent runs. The first stage teaches the model how environments behave: file systems, terminal states, browser DOM changes, API responses. The second stage trains the model to reason about what comes next before predicting it. The third stage, reinforcement learning, refines predictions using rule-based checks and open-ended quality scores.

Both models are mixture-of-experts designs: only a fraction of the parameters are active per token. The 35B model activates 3B parameters; the 397B model activates 17B. Both support 256K context windows. For GUI domains (Android, Web, and OS), the models work from textual accessibility trees and UI view hierarchies instead of screenshots.
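As a rough worked example, the active share per token follows directly from the parameter counts quoted above. The sketch assumes the figures are the nominal, rounded sizes in the model names; the helper `active_fraction` is a hypothetical name used only here.

```python
def active_fraction(active_billions: float, total_billions: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_billions / total_billions


# (model name, active parameters in billions, total parameters in billions)
models = [("35B", 3, 35), ("397B", 17, 397)]

shares = {name: active_fraction(active, total) for name, active, total in models}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of parameters active per token")
# 35B: 8.6% of parameters active per token
# 397B: 4.3% of parameters active per token
```

On these nominal figures, the larger model uses a smaller share of its parameters per token than the smaller one.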

The 35B model weights and AgentWorldBench are available under Apache 2.0; the 397B weights have not been made public.

Training Results Matter More Than Benchmarks

Benchmark scores show how accurately the models predict what environments return. Training results show what that predictive power is really worth to teams building agents, and those are the numbers that matter most.

According to the researchers, agents trained within a controlled simulation outperformed agents trained in real environments. Injecting specific perturbations (partial responses that force agents to take additional actions, and edge cases that rarely arise in real environments) pushed MCPMark from 24.6 to 33.8. In Search, agents trained in entirely fictional worlds transferred to real search tasks, boosting WideSearch item F1 from 34.02 to 50.31 on the open 35B model. A separate warm-up test showed that world-model pre-training improved BFCL v4 from 62.29 to 71.25 and Claw-Eval from 53.60 to 64.88 without any agent-specific tuning.
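To put those four before-and-after pairs on a common footing, the short sketch below computes the absolute and relative gain for each. The scores are the ones reported above; the `gain` helper and the dictionary layout are illustrative choices, not anything from the paper.

```python
# (score before, score after) as reported in the article
reported = {
    "MCPMark": (24.6, 33.8),
    "WideSearch item F1": (34.02, 50.31),
    "BFCL v4": (62.29, 71.25),
    "Claw-Eval": (53.60, 64.88),
}


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, gain relative to the starting score)."""
    absolute = after - before
    return absolute, absolute / before


gains = {name: gain(before, after) for name, (before, after) in reported.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} points ({relative:.1%} relative)")
# MCPMark: +9.20 points (37.4% relative)
# WideSearch item F1: +16.29 points (47.9% relative)
# BFCL v4: +8.96 points (14.4% relative)
# Claw-Eval: +11.28 points (21.0% relative)
```

The benchmarks use different metrics and scales, so the relative figures are a reading aid and should not be compared across rows as if they measured the same thing.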

Researchers flag benchmark and overfitting risks

The paper drew an immediate reaction from AI researchers on X. The concerns they raised point to what practitioners should verify before acting on the findings.

On the training objective and the transfer result, one AI/ML researcher’s assessment was direct. "All other “agent” models have been trained to act in environments," wrote @drawais_ai, who has a PhD and regularly analyzes AI papers. "Qwen reversed the question. They trained the model to predict the environment itself… That predictive knowledge is then transferred to the agent’s tasks even without any agent-specific tuning." They identified the Controllable Sim RL result as "the receipt" backing the claim that synthetic training can replace real-environment RL at scale, noting that three of the seven transfer benchmarks were completely out of domain.

The benchmark margin drew immediate scrutiny. "AgentWorldBench is a benchmark that Alibaba created and published in the same article," wrote @TheSignal_Desk, which focuses on honest opinions and key figures in AI research. "They wrote the test and then passed it with a 0.46."

The sim-RL methodology is the result that @limalemonnn, who builds production AI agents, identified as most in need of scrutiny before the headline number is cited. "Simulation-trained agents traditionally adapt too much to the peculiarities of the simulator," they wrote. "If the world model is too clear, the agent learns the model, not the task." They pointed to the relevant section of the paper as the one practitioners should read before acting on the numbers.

The concern about overfitting is partially answered in the data. The gap between uncontrolled Sim RL (MCPMark 24.6) and controlled Sim RL (MCPMark 33.8) suggests that the gains depend substantially on the controllability mechanism, not just the simulation accuracy. The search result in a fictional world, where agents trained in invented environments are transferred to real search tasks, is the paper’s strongest evidence against the overfitting concern.

What this means for teams building agent pipelines

For AI engineering teams building and scaling agent pipelines, this work signals a significant shift in how agent capability is built. Teams training agents at scale now have a third option between real-world RL and static benchmarks: controlled simulation that injects edge cases that production doesn’t surface.

Synthetic environments are a legitimate training layer. Controlled simulation that injects conditions that real environments will not produce is a complement to RL in real environments, not a shortcut.

What a model learns before agent training begins matters more than most pipelines assume. The warm-up finding (performance improvements on unseen benchmarks without agent-specific training) suggests that environment grounding belongs earlier in development than current practice puts it.
