
Test-time scaling (TTS) has emerged as a proven method for improving the performance of large language models in real-world applications by providing them with additional compute cycles at inference time. However, TTS strategies have historically been crafted by hand, relying heavily on human intuition to dictate the rules of the model’s reasoning.
To address this bottleneck, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually adjusting heuristics.
By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce token usage and the operational costs of deploying advanced reasoning models in production environments. In experimental tests, AutoTTS managed inference budgets efficiently and successfully reduced token consumption by up to 69.5% without sacrificing accuracy.
The manual bottleneck in test-time scaling
Test-time scaling enhances LLMs by giving them additional computation when generating answers. This additional computation allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final answer.
The main challenge in designing TTS strategies is determining how to optimally allocate this additional computation. Historically, researchers have designed these strategies manually, relying on guesswork to construct rigid heuristics. Engineers must hypothesize rules and thresholds for when a model should branch into new reasoning paths, delve deeper into an existing path, prune an unpromising branch, or stop reasoning altogether.
Because this manual tuning process is limited by human intuition, a large number of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computing costs.
Current TTS algorithms can be mapped to a width-depth control space: "width" is the number of reasoning branches explored, and "depth" is how far each one is developed. Self-consistency (SC) samples a fixed number of trajectories and picks the answer by majority vote. Adaptive consistency (ASC) saves computation by stopping early once a confidence threshold is reached. Parallel probe takes a more granular approach, pruning unpromising branches while extending the rest. All three are handcrafted, and that’s the restriction AutoTTS is designed to break.
While some more advanced methods employ richer structures such as tree search or external verifiers, they all share one key characteristic: they are meticulously handcrafted. This manual approach restricts the scope of strategy discovery, leaving a large portion of the potential resource allocation space unexplored.
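To make the width dimension concrete, here is a minimal Python sketch of the two simplest baselines. It is an illustration under stated assumptions, not code from the AutoTTS work: `sample` is a hypothetical callable that returns one final answer per reasoning trajectory, and the adaptive variant uses a plain vote-share threshold as a stand-in for whatever confidence estimate a real ASC implementation uses.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most common answer, or None if nothing was sampled."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample: Callable[[], str], n: int) -> Tuple[Optional[str], int]:
    """Fixed width: always draw n trajectories, then vote."""
    answers = [sample() for _ in range(n)]
    return majority_vote(answers), len(answers)


def adaptive_consistency(
    sample: Callable[[], str],
    max_n: int,
    threshold: float = 0.8,  # assumed stopping threshold, not a published value
    min_n: int = 4,          # assumed minimum number of samples before stopping
) -> Tuple[Optional[str], int]:
    """Variable width: stop as soon as the leading answer's share is high enough."""
    answers: List[str] = []
    for _ in range(max_n):
        answers.append(sample())
        if len(answers) >= min_n:
            _, votes = Counter(answers).most_common(1)[0]
            if votes / len(answers) >= threshold:
                break
    return majority_vote(answers), len(answers)


# Deterministic stand-in for a model: replays a fixed list of answers.
stream = iter(["42", "42", "17", "42", "42", "42", "17", "42"])
print(adaptive_consistency(lambda: next(stream), max_n=8, threshold=0.75))
# prints ('42', 4): three of the first four samples agree, so sampling stops early
```

Both functions return the number of trajectories consumed alongside the answer, which is the quantity a cost comparison between strategies would track.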
Automating strategy discovery with AutoTTS
AutoTTS reframes how test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment.
This framework redefines the roles of both the human engineer and the AI model. Instead of manually crafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer’s role shifts to building the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives that balance accuracy with cost, and specific feedback mechanisms.
An LLM explorer, such as Claude Code, designs the strategy. This explorer acts as an autonomous agent that iteratively proposes TTS “controllers.” These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource allocation policy.
To make this automated search computationally affordable, AutoTTS relies on an “offline playback environment.” If the LLM explorer had to invoke a base reasoning model to generate new tokens every time it tried a new strategy, the computational costs would be astronomical. Instead, the environment draws on thousands of reasoning trajectories previously collected from the base LLM. These trajectories include "probe signals," which are intermediate responses that help the controller evaluate progress in different branches of reasoning.
During the discovery cycle, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it allocated computation over time. By analyzing these traces, the agent can diagnose specific failure modes, such as a controller pruning branches too aggressively in a specific scenario. This provides an advantage over simply seeing the end result. The agent then iteratively rewrites its code to improve the accuracy-cost trade-off.
Inside the AI-designed controller
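The evaluation loop described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than the AutoTTS API: the `Recorded` schema, the action vocabulary (`"stop"`, `"branch"`, `("deepen", i)`), and the fixed per-step token cost are all assumptions. The point is only that a controller can be scored against pre-collected probe answers without calling the model again, and that the replay yields a trace as well as a final score.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Recorded:
    """Pre-collected data for one question (hypothetical schema)."""
    truth: str
    branches: List[List[str]]   # branches[b][t] = probe answer of branch b after step t
    tokens_per_step: int = 100  # assumed flat cost per revealed step


@dataclass
class Result:
    answer: Optional[str]
    tokens: int
    trace: List[str] = field(default_factory=list)


def replay(controller, rec: Recorded, max_actions: int = 1000) -> Result:
    """Run a controller against recorded data; no new tokens are generated."""
    depth = [0] if rec.branches else []  # steps revealed so far, per open branch
    tokens, trace = 0, []
    for _ in range(max_actions):
        visible = [rec.branches[b][:d] for b, d in enumerate(depth)]
        action = controller(visible)
        trace.append(repr(action))
        if action == "branch" and len(depth) < len(rec.branches):
            depth.append(0)  # open the next recorded branch
        elif (isinstance(action, tuple) and action[0] == "deepen"
              and 0 <= action[1] < len(depth)
              and depth[action[1]] < len(rec.branches[action[1]])):
            depth[action[1]] += 1  # reveal one more recorded step
            tokens += rec.tokens_per_step
        else:
            break  # "stop", or an action the recording cannot serve
    latest = [rec.branches[b][d - 1] for b, d in enumerate(depth) if d > 0]
    answer = Counter(latest).most_common(1)[0][0] if latest else None
    return Result(answer, tokens, trace)


def evaluate(controller, dataset: List[Recorded]):
    """Return (accuracy, mean tokens per question) for a controller."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    results = [replay(controller, rec) for rec in dataset]
    hits = sum(r.answer == rec.truth for r, rec in zip(results, dataset))
    return hits / len(dataset), sum(r.tokens for r in results) / len(dataset)


def two_steps_then_stop(visible):
    """Toy controller: deepen the first branch twice, then stop."""
    return ("deepen", 0) if len(visible[0]) < 2 else "stop"


data = [Recorded(truth="7", branches=[["3", "7", "7"], ["7", "7"]])]
print(evaluate(two_steps_then_stop, data))  # prints (1.0, 200.0)
```

The `trace` field is the part that matters for diagnosis: it records each decision in order, so an over-eager stop or prune is visible even when the final answer happens to be correct.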
Because the explorer agent is not limited by human intuition, it can discover complex, highly coordinated rules that a human engineer would probably never code manually. An optimal controller discovered by AutoTTS, called the Confidence Momentum Controller (CMC), uses several non-obvious mechanisms to manage computation:
- Trend-based stopping: Handcrafted strategies often instruct the model to stop reasoning once it reaches a certain instantaneous confidence threshold. The AutoTTS agent found that instantaneous confidence can be misleading due to temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and only stops if the overall confidence level is high and the trend is not actively declining.
- Width-depth coupled control: Manually designed algorithms generally treat "widening" into new reasoning paths and "deepening" current paths as separate decisions. AutoTTS discovered a closed feedback loop where the two actions are linked. If the confidence of current branches stagnates or regresses, the controller automatically triggers the generation of new branches.
- Alignment-based depth allocation: Instead of giving all active reasoning branches an equal computation budget, the controller dynamically identifies which branches agree with the current top answer. It then gives those branches priority "bursts" of extra computation. This focuses the computational budget on the emerging consensus to quickly verify whether it is correct.
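The three mechanisms can be combined into a single decision step. The Python sketch below is a simplified reading of the description above, not the released controller: the confidence measure (share of branches agreeing with the top answer), the smoothing factor `alpha`, and the thresholds `stop_at` and `min_gain` are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MomentumState:
    ema: float = 0.0  # starts at 0 so a single early spike cannot trigger a stop


@dataclass
class Decision:
    action: str                 # "stop", "widen", or "deepen"
    top_answer: Optional[str]
    boost: List[int] = field(default_factory=list)  # branches given extra depth


def controller_step(
    state: MomentumState,
    latest_answers: List[str],  # current probe answer of each active branch
    alpha: float = 0.5,         # assumed EMA smoothing factor
    stop_at: float = 0.8,       # assumed confidence level required to stop
    min_gain: float = 0.01,     # assumed minimum EMA gain that counts as progress
) -> Decision:
    if not latest_answers:
        return Decision("widen", None)  # nothing to judge yet: open a branch

    top, votes = Counter(latest_answers).most_common(1)[0]
    confidence = votes / len(latest_answers)

    prev = state.ema
    state.ema = alpha * confidence + (1 - alpha) * prev
    trend = state.ema - prev

    # Trend-based stopping: smoothed confidence is high and not declining.
    if state.ema >= stop_at and trend >= 0:
        return Decision("stop", top)

    # Width-depth coupling: stalled or regressing confidence triggers new branches.
    if trend < min_gain:
        return Decision("widen", top)

    # Alignment-based allocation: extra depth goes to branches backing the top answer.
    aligned = [i for i, a in enumerate(latest_answers) if a == top]
    return Decision("deepen", top, aligned)


state = MomentumState()
print(controller_step(state, ["A", "A", "A", "A"]).action)
# prints deepen: unanimous agreement, but the EMA is only 0.5 after one step
```

Starting the EMA at zero is one simple way to get the spike resistance the article describes; a controller seeded with the first observed confidence would behave differently on the first step.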
Cost savings and increased accuracy on real-world benchmarks
To test whether an AI could autonomously discover a better test-time scaling strategy, the researchers established a rigorous evaluation framework. The main experiments were performed on Qwen3 models with parameters ranging from 0.6B to 8B. The researchers also tested the system’s ability to generalize on an 8B distilled version of the DeepSeek-R1 model.
The explorer agent was initially tasked with discovering an optimal strategy using the mathematical reasoning benchmark AIME24. This discovered strategy was then tested on two mathematics benchmarks, AIME25 and HMMT25, as well as the GPQA-Diamond graduate-level general reasoning benchmark.
The controller discovered by AutoTTS was compared against four manually designed test-time scaling algorithms. These baselines included self-consistency with 64 parallel reasoning paths (SC@64), adaptive consistency (ASC), parallel probe, and self-consistency with early stopping (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when the answer appears stable.
When configured in a balanced, cost-efficient mode, the controller discovered by AutoTTS reduced total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across all four Qwen models. When the inference budget was increased, AutoTTS boosted maximum accuracy beyond all handcrafted baselines in five of eight test cases.
This efficiency carried over to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant reduced the inference token cost from 510,000 tokens to just 151,000 tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark and reduced token spending by almost half.
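For readers who want to compare such figures on a common scale, the relative saving is simply one minus the ratio of the two token counts. The small Python helper below (a hypothetical convenience function, not part of AutoTTS) applies that to the GPQA-Diamond numbers quoted above.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when usage drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


saving = token_reduction(510_000, 151_000)
print(f"{saving:.1%}")  # prints 70.4%
```

That works out to roughly 70% on GPQA-Diamond, in the same range as the approximately 69.5% reduction reported against SC@64.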
For professionals building enterprise AI applications, these experiments highlight two important operational benefits:
- Increased maximum performance: AutoTTS does more than save money on token consumption; it actively increases the maximum achievable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and continually redirecting its compute budget toward the branches that generate the most useful reasoning signals.
- Cost-effective custom development: Because the framework is based on an offline playback environment, the entire discovery process cost only $39.90 and took 160 minutes. For enterprise teams, that means optimized reasoning strategies tailored to proprietary models and internal tasks are now within reach, without a dedicated research budget.
Both the AutoTTS framework and the Confidence Momentum Controller are available on GitHub. The CMC can be used as a drop-in replacement for other TTS controllers.





