Train-to-test scaling explained: How to optimize your end-to-end AI compute budget for inference



Standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques, such as drawing multiple reasoning samples from a model at deployment time, to increase the accuracy of model responses.

To close this gap, researchers at the University of Wisconsin-Madison and Stanford University have introduced train-to-test (t2) scaling laws, a framework that jointly optimizes a model’s parameter count, its training data volume, and the number of inference samples drawn at test time.

In practice, their approach shows that it is compute-optimal to train substantially smaller models on much more data than traditional rules prescribe, and then spend the saved compute on generating multiple repeated samples at inference time.

For enterprise AI application developers who are training their own models, this research provides a principled blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending large amounts on frontier models. Instead, smaller models can deliver stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets.

Conflicting scaling laws

Scaling laws are an important part of developing large language models. Pre-training scaling laws dictate the best way to allocate compute during model creation, while test-time scaling laws guide how to allocate compute during deployment, for example by letting the model “think harder” or generating multiple reasoning samples to solve complex problems.

The problem is that these scaling laws have developed completely independently of each other despite being fundamentally intertwined.

A model’s parameter count and training duration directly dictate both the quality and the per-query cost of its inference samples. Currently, the industry gold standard for pre-training is the Chinchilla rule, which suggests a compute-optimal ratio of approximately 20 training tokens for each model parameter.
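As a rough illustration of that ratio (a sketch for intuition, not code from the paper; the function name is invented here):

```python
def chinchilla_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-optimal training-token count: roughly 20 tokens per parameter."""
    return tokens_per_param * num_params

# A 7B-parameter model would call for roughly 140B training tokens.
print(f"{chinchilla_tokens(7e9):.3g}")  # 1.4e+11
```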

However, creators of modern AI model families, such as Llama, Gemma, and Qwen, regularly break this rule by intentionally overtraining their smaller models with massive amounts of data.

As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach breaks down when building complex agentic workflows: "In my opinion, the inference stack breaks down when each individual inference call is expensive. This is the case when the models are large and many repeated samplings are necessary." Instead of relying on massive models, developers can use compact, overtrained models to perform this repeated sampling at a fraction of the cost.

But because training and test-time scaling laws have been studied in isolation, there is no rigorous framework for calculating how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment.

As a result, there was previously no formula that jointly optimized model size, training data volume, and inference budgets at test time.

This framework is difficult to formulate because pre-training and test-time scaling speak two different mathematical languages. During pre-training, a model’s performance is measured by “loss,” a continuous metric that tracks prediction errors as the model learns.

At test time, developers use downstream real-world metrics to evaluate a model’s reasoning capabilities, such as pass@k, which measures the probability that a model produces at least one correct answer in k repeated, independent trials.
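The pass@k metric can be estimated without bias from n generated samples, c of which are correct, using the standard combinatorial estimator (this sketch follows the widely used formulation; it is not code from the t2 paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n generated samples, c of
    which are correct, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # ≈ 0.3: one draw, 3 of 10 samples correct
print(pass_at_k(10, 3, 5))  # ≈ 0.92: more draws, more chances to succeed
```

Note how sharply accuracy rises with k even when the per-sample success rate is modest, which is exactly the effect that repeated sampling at test time exploits.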

Train-to-test scaling laws

To resolve the disconnect between training and deployment, the researchers introduce train-to-test (t2) scaling laws. At a high level, this framework predicts a model’s reasoning performance by treating three variables as a single equation: the model’s size (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k).

t2 combines pre-training and inference budgets into a single optimization formula that accounts for both the cost to train the model (6ND) and the cost of querying it repeatedly at inference time (2Nk). The researchers tested two modeling approaches: modeling either pre-training loss or test-time performance (pass@k) as functions of N, D, and k.
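A sketch of how these two budget terms trade off, using the standard per-token cost approximations (about 6 FLOPs per parameter per training token and 2 per generated token). The helper names and the 7B-versus-1B comparison are illustrative, not taken from the paper:

```python
def train_flops(N: float, D: float) -> float:
    """Approximate training cost: ~6 FLOPs per parameter per training token."""
    return 6 * N * D

def inference_flops(N: float, k: int, gen_tokens: float) -> float:
    """Approximate inference cost: ~2 FLOPs per parameter per generated
    token, multiplied across k repeated samples."""
    return 2 * N * k * gen_tokens

# Shrinking N cuts the per-sample inference cost linearly, so a smaller,
# longer-trained model leaves budget headroom for a larger k at test time.
budget = train_flops(7e9, 20 * 7e9)   # Chinchilla-style budget for a 7B model
D_small = budget / (6 * 1e9)          # retrain a 1B model on the same budget
print(D_small / 1e9)                  # 980.0 tokens per parameter, far above 20
```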

The first approach takes the well-known mathematical equation used for Chinchilla scaling (which predicts a model’s loss, or prediction error) and modifies it directly by adding a new variable representing the number of repeated samples at test time (k). This lets developers see how increasing inference compute reduces the model’s overall error rate.
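For reference, the Chinchilla equation being modified has the parametric form below, with constants fitted by Hoffmann et al. (2022). The paper’s k-dependent extension is its own contribution and is not reproduced here:

```python
def chinchilla_loss(N: float, D: float) -> float:
    """Chinchilla parametric loss L(N, D) = E + A/N**alpha + B/D**beta,
    using the constants fitted by Hoffmann et al. The t2 work adds a
    test-time sampling term in k to this form (see the paper)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

# More training data lowers the predicted loss, with diminishing returns.
print(chinchilla_loss(1e9, 2e10) > chinchilla_loss(1e9, 2e11))  # True
```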

The second approach directly models downstream pass@k accuracy. It tells developers how likely their application is to solve a problem given a specific compute budget.

But should companies use this framework for all applications? Roberts clarifies that the approach is specialized. "I imagine you wouldn’t see as much benefit in knowledge-heavy applications like chat models," he said. Instead, "t2 is designed for reasoning-intensive applications, such as coding, where repeated sampling would typically be used as a test-time scaling method."

What it means for developers

To validate the t2 scaling laws, the researchers built an extensive testbed of more than 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, highly overtrained checkpoints from scratch to test whether their mathematical predictions held true in practice. They then compared the models on eight diverse tasks, including real-world datasets such as SciQ and OpenBookQA, along with synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge retrieval.

Both mathematical models showed that the compute-optimal frontier departs drastically from standard Chinchilla scaling. To maximize performance on a fixed budget, the optimal choice is a model that is significantly smaller and trained on much more data than the traditional 20-tokens-per-parameter rule dictates.

In their experiments, the small, highly overtrained models consistently outperformed the larger, Chinchilla-optimal models on all eight evaluation tasks when sampling costs at test time were taken into account.

For developers who want to implement these findings, the technical barrier is surprisingly low.

"Nothing sophisticated is required to perform test-time scaling with our current models," Roberts said. "At deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g., KV caching if you are using a transformer)."

KV caching helps by storing the pre-processed context so that the model does not have to re-read the initial prompt from scratch for each new reasoning sample.
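A back-of-the-envelope sketch of that saving (the numbers, function name, and 2-FLOPs-per-parameter-per-token approximation are illustrative assumptions, not figures from the paper):

```python
def prompt_flops(N: float, prompt_tokens: int, k: int, kv_cache: bool) -> float:
    """Rough prompt-processing cost (~2 FLOPs per parameter per token).
    With a KV cache, the prompt is encoded once and its cached keys and
    values are reused by all k samples; without it, every sample
    re-processes the prompt from scratch."""
    passes = 1 if kv_cache else k
    return 2 * N * prompt_tokens * passes

with_cache = prompt_flops(1e9, 2048, 16, kv_cache=True)
without_cache = prompt_flops(1e9, 2048, 16, kv_cache=False)
print(without_cache / with_cache)  # 16.0: k times more prompt compute uncached
```

The saving grows linearly with k, which matters precisely in the high-k regime that t2 pushes developers toward.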

However, extreme overtraining comes with practical trade-offs. Overtrained models can be notoriously stubborn and harder to fine-tune, but Roberts notes that when the team applied supervised fine-tuning, "while this effect was present, it was not strong enough to cause the optimal model to return to Chinchilla." The compute-optimal strategy remains firmly biased towards compact models.

Teams that push this approach to the absolute limit must also be careful not to hit physical data limits. "Another angle is that if you take our overtraining recommendations to the extreme, you may run out of training data," Roberts said, referring to the looming "data wall" where high-quality Internet data runs out.

These experiments confirm that if an application relies on generating multiple reasoning samples at test time, aggressively overtraining a compact model is practically and mathematically the most effective way to spend an end-to-end compute budget.

To help developers get started, the research team plans to open source its checkpoints and code soon, allowing companies to plug in their own data and test scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry.

This is especially crucial as the high price of frontier models can become a barrier as agent applications that rely on reasoning models scale.

"t2 fundamentally changes who gets to build strong reasoning models," Roberts concludes. "You may not need huge compute budgets to get state-of-the-art reasoning. Instead, you need good data and smart allocation of your training and inference budget."


