How Sakana Trained a 7B Model to Orchestrate GPT, Claude, and Gemini LLMs



Every LangChain pipeline your team hand-codes starts breaking the moment the query distribution changes, and it always does. That bottleneck is what Sakana AI set out to eliminate.

Sakana AI researchers have presented the "RL Conductor," a small language model trained with reinforcement learning to automatically orchestrate a diverse group of LLM workers. The Conductor dynamically analyzes inputs, distributes work among workers, and coordinates communication among agents.

This automated coordination achieves state-of-the-art results on difficult reasoning and coding benchmarks, outperforming individual frontier models like GPT-5 and Claude Sonnet 4, as well as expensive human-designed multi-agent pipelines. It achieves this performance at a fraction of the cost and with fewer API calls than the competition. RL Conductor is the backbone of Fugu, Sakana AI’s commercial multi-agent orchestration service.

The limitations of manual agent frameworks

Large language models have strong latent capabilities, but drawing them out is a major challenge. In practice, doing so relies heavily on manually designed agent workflows, which serve as critical components in commercial AI products.

However, these frameworks fall short because they are inherently rigid and limited. In comments to VentureBeat, Yujin Tang, co-author of the paper, explained the exact breaking point of current systems: "While using frameworks with hardcoded channels like LangChain and Mixture-of-Agents can work well for specific use cases… In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands."

Tang pointed out that achieving "Real-world generalization across such heterogeneous applications inherently requires going beyond human-coded designs."

Another obstacle to building robust agent systems is that no model is optimal for all tasks. Different models are fine-tuned to specialize in different domains. One model may excel at scientific reasoning, while another is superior at code generation, mathematical logic, or high-level planning.

Because models have these variable characteristics and complementary abilities, predicting and manually coding the ideal combination of models for each query is virtually impossible. An optimal agent framework should be able to analyze a problem and delegate subtasks to the most appropriate expert in the group.

Leading an orchestra of agents

The RL Conductor is designed to overcome the limitations of rigid, human-designed frameworks. As the name implies, it directs an orchestra of agents by breaking down challenging problems, delegating specific subtasks, and designing communication topologies for a set of LLM workers.

Instead of relying on fixed code or static routing, the Conductor organizes these models by generating a custom workflow. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to carry it out, and defines an "access list" that dictates which past subtasks and responses from other agents are included in that agent’s context.

By defining everything in natural language, the Conductor creates flexible workflows tailored to each input. It can build simple sequential chains, parallel tree structures, or even recursive loops, depending on the demands of the problem.
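To make the step/access-list idea concrete, here is a minimal sketch of what such a generated workflow might look like as a data structure. The field names, worker labels, and helper function are illustrative assumptions, not Sakana's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    """One step of an orchestrated workflow (hypothetical schema):
    a natural-language instruction, the worker assigned to it, and an
    access list naming which earlier steps' outputs enter its context."""
    instruction: str        # natural-language subtask description
    worker: str             # e.g. "planner-model" (placeholder name)
    access_list: list       # indices of prior steps visible to this worker

def build_context(steps, outputs, i):
    """Assemble the prompt for step i from the outputs it is allowed to see."""
    visible = [f"[step {j}] {outputs[j]}" for j in steps[i].access_list]
    return "\n".join(visible + [steps[i].instruction])

# A two-step sequential chain: a planner whose output feeds a coder.
steps = [
    WorkflowStep("Outline an approach to the problem.", "planner-model", []),
    WorkflowStep("Implement the outlined approach.", "coder-model", [0]),
]
outputs = ["1. parse input 2. sort 3. emit result"]
prompt = build_context(steps, outputs, 1)
```

An empty access list yields an independent step, while overlapping access lists produce the parallel or recursive topologies described above.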

Importantly, the model learns these strategies not through human design but through reinforcement learning (RL) and reward maximization. During training, the model is assigned a task, a group of workers, and a reward signal based on whether its response and output format are correct.

Through a simple trial-and-error RL algorithm, the model organically discovers which combinations of instructions and communication structures generate the greatest reward. As a result, it automatically adopts advanced orchestration strategies, such as specific prompt engineering, iterative refinement, and meta-prompt optimization.
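The reward signal described above (correctness plus output format) can be sketched with a simple shaped reward, a common recipe in RL training for LLMs. The `\boxed{}` answer format and the partial-credit value are assumptions for illustration, not details from the paper:

```python
import re

def reward(response: str, gold_answer: str) -> float:
    """Hypothetical shaped reward: full credit only when the final answer
    is both well-formatted and correct, small partial credit for a
    correctly formatted but wrong answer, zero for malformed output."""
    m = re.search(r"\\boxed\{(.+?)\}", response)
    if m is None:
        return 0.0                       # malformed: no boxed answer
    if m.group(1).strip() == gold_answer:
        return 1.0                       # correct and well-formatted
    return 0.1                           # formatted but incorrect

print(reward(r"Final answer: \boxed{42}", "42"))
```

During training, the orchestrator's sampled workflows are scored by a function like this, and policy updates push it toward instruction-and-topology combinations that earn higher reward.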

The model learns to dynamically adjust its strategies and leverage the different strengths of its worker agents without any human developers having to code the process.

The Conductor in action

To put the RL Conductor to the test, the researchers fine-tuned Qwen2.5-7B, a 7-billion-parameter model, using the framework. During training, the Conductor was tasked with designing agent workflows of up to five steps. It was given access to a worker pool of seven different models: three closed-source giants (Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B).

The team evaluated the Conductor across a variety of highly challenging benchmarks, comparing it to individual frontier models acting alone, self-reflecting agents prompted iteratively to improve their own responses, and state-of-the-art multi-agent routing frameworks such as MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie. The small 7B Conductor set new records across the board, scoring an average of 77.27% across all tasks, including 93.3% on the AIME25 math benchmark, 87.5% on GPQA-Diamond, and 83.93% on LiveCodeBench, according to the researchers.

Remarkably, it achieved these results while remaining highly efficient. While baselines like MoA consumed 11,203 tokens per question, the Conductor used an average of just 1,820 tokens and about three steps per workflow.

A closer look at the experimental details shows exactly why the framework is so effective. The Conductor automatically learned to gauge the difficulty of a task. For simple fact-recall questions, it often solved the problem in a single step or used a basic two-agent setup. For complex coding problems, however, it created extensive workflows involving up to four agents with dedicated planning, implementation, and verification phases.

The Conductor also learned that frontier models have different strengths. To achieve record scores on coding tests, it frequently assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level planners, bringing GPT-5 in only at the end to write the final optimized code. In a particularly clever display of adaptability, the Conductor sometimes ceded its own role entirely, handing the planning process over to Gemini 2.5 Pro and letting it dictate subtasks to the rest of the group.

Beyond the math and coding benchmarks, Sakana AI is already putting the underlying architecture to work on practical business tasks. "We’ve been using our Conductor technology-based Fugu models internally for various practical business applications: software development, deep research, strategy development, and even visual tasks like slide generation," Tang said.

Bringing orchestration to the enterprise: Sakana Fugu

While the 7B model described in the research paper was an exploratory model and is not publicly available, Sakana AI has turned the Conductor framework into its flagship commercial AI product, Sakana Fugu. Now in beta, Fugu is a multi-agent orchestration system accessible through a standard OpenAI-compatible API.

Tang said Fugu targets "the large market of industries where AI adoption has not yet generated large productivity gains due to generalizability limitations of current codified processes, such as finance and defense."

For enterprise developers, this enables seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks between different providers. Behind the API interface, Fugu automates complex collaboration topologies and role assignments across a set of models. To meet diverse business needs, Sakana launched two variants: Fugu Mini, designed for low-latency operations, and Fugu Ultra, designed to deliver maximum performance in demanding workloads.
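Because the interface is OpenAI-compatible, integration amounts to pointing a standard chat-completions request at a different base URL. The endpoint URL, model name, and API key below are placeholders (Sakana has not published these details); the request shape itself is the standard `/chat/completions` schema:

```python
import json
import urllib.request

def make_chat_request(base_url: str, model: str, prompt: str):
    """Build a standard OpenAI-style /chat/completions request.
    Any OpenAI-compatible client pointed at base_url sends this shape."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder credential; a real deployment would read a key
            # from the environment.
            "Authorization": "Bearer $FUGU_API_KEY",
        },
        method="POST",
    )

# Hypothetical endpoint and model name for illustration only.
req = make_chat_request(
    "https://api.example.com/v1", "fugu-mini", "Summarize this contract."
)
```

The orchestration (worker selection, workflow topology, role assignment) happens entirely behind the endpoint, which is why existing OpenAI-client code paths need no structural changes.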

Addressing governance concerns around autonomous agents generating opaque workflows, Tang noted that the interpretability risks are functionally similar to the hidden reasoning traces of today’s top closed-model APIs, and that the system runs with guardrails in place to minimize hallucinations.

For enterprise architects weighing RL-based orchestration against traditional routing, the decision often comes down to engineering resources. "We believe the absolute sweet spot comes when users and their teams feel like they are spending a disproportionate amount of time guiding their underlying agents," Tang said. However, he cautioned that the framework is not necessary for everything, noting that "It’s hard to beat the economical proposition of a local model that runs directly on the user’s machine for simple queries."

As the diversity of specialized open- and closed-source AI models continues to grow, hard-coded static pipelines will inevitably become obsolete. Looking ahead, this dynamic orchestration will likely extend beyond text and code environments. "Indeed, there is great potential to fill this gap with multi-modal Conductor frameworks that become the basis for more autonomous and self-coordinated physical AI systems," Tang said.


