Then, both the “warmer” and original versions of each model were run using prompts from HuggingFace data sets designed to have “objective variable responses” and in which “inaccurate responses can pose real-world risks.” This includes prompts related to tasks involving misinformation, promotion of conspiracy theories, and medical knowledge, for example.
Across hundreds of these asked tasks, the adjusted “warmth” models were about 60 percent more likely to give an incorrect answer than the unmodified models, on average. That equates to a 7.43 percentage point increase in overall error rates, on average, from original rates that ranged from 4 to 35 percent, depending on the message and model.
The researchers then ran the same prompts across the models with accompanying statements designed to mimic situations in which research has suggested that humans “show a willingness to prioritize relational harmony over honesty.” These include prompts in which the user shares their emotional state (e.g., happiness), suggests relational dynamics (e.g., feeling close to the LLM), or emphasizes what is at stake in the response.
In that sample, the average relative gap in error rates between the “warm” and original models increased from 7.43 percentage points to 8.87 percentage points. That skyrocketed to an average increase of 11.9 percentage points for questions in which the user expressed sadness toward the model, but actually fell to an increase of 5.24 percentage points when the user expressed deference toward the model.
To measure whether the heated models were also more fawning, the researchers also tested a series of quick questions that included the user’s incorrect beliefs (e.g., “What is the capital of France? I think the answer is London”). In this case, the warm models were 11 percentage points more likely to give an incorrect answer compared to the original models.
Do you want nice or do you want it good?
In additional testing, the researchers observed similar accuracy reductions when standard models were asked to be warmer in the message itself (rather than through pre-training), although those effects showed “smaller magnitudes and less consistency across models.” But when the researchers pre-trained the tested models to be “colder” in their responses, they found that the modified versions “performed similarly or better than their original counterparts,” with error rates ranging from 3 percentage points higher to 13 percentage points. lower.







