
Our system did one thing, and it did it well: it converted natural language questions into API calls.
Our users were analysts, account managers, and operations leaders. They knew what data they needed, but gathering it manually meant pulling it from four dashboards, two BI tools, and a Salesforce report builder. With our system, they wrote the request in plain English. A request like "Prepare a report on sales volume from January to March 2026 for the Northeast region, broken down by city" resulted in an API call that the system could act on:
```json
{
  "description": "Sales volume requested by the user for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}
```
The rest of the pipeline was conventional engineering. The system sent the call to the correct backend (we had integrations with internal reporting portals, Salesforce, and several on-premise services), applied a JSON query generated by a large language model (LLM) to filter and shape the response, and delivered it via email, as a Drive document, or as a graph in the browser.
By mid-2025, the system was generating several hundred reports per month. These reports were consumed by leaders and analysts and distributed to external stakeholders. It had become the default way most teams pulled data on an ad hoc basis.
The contract between the LLM and the rest of the system was a JSON object structured as described in the example above.
We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 and then to 4.0, both without incident. By the time Sonnet 4.5 was released, we had become complacent about the stability and predictability of LLMs in solving what we thought was a simple problem. Model updates had become routine, like bumping a minor version of a well-behaved library.
Then we deployed 4.5. For a significant percentage of requests, the model started folding the post_body content into the description field. Two failure modes followed.
First, the filter parameters never made it to the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API called, the backend either returned sales volume for all time or all regions, or it returned a 500 error.
Second, the model began asking clarifying questions in its response. This was new. Previous versions always made a best-effort attempt at an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, sometimes responded with a question. Our system had no path for this. It was built on the assumption that each model invocation would result in an API call. There was no human in the loop and no state to hold a partially completed request. This caused downstream systems to fail in multiple ways.
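One way to surface both failure modes before anything reaches a backend is a small interpretation layer between the model and the dispatcher. The sketch below is illustrative only and is not the system described here: `REQUIRED_PARAMS`, `ContractError`, `ApiCall`, `Clarification`, and `interpret` are hypothetical names, and the set of required parameters per endpoint is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Union

# Hypothetical table: post_body keys each endpoint needs before it is safe to call.
REQUIRED_PARAMS = {"/api/sales_volume": {"start_date", "end_date", "region"}}


class ContractError(ValueError):
    """The model output cannot be turned into a safe API call."""


@dataclass
class ApiCall:
    endpoint: str
    post_body: dict


@dataclass
class Clarification:
    question: str


def interpret(raw: str) -> Union[ApiCall, Clarification]:
    """Turn raw model output into an explicit outcome, or raise ContractError."""
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: non-JSON output is a message for the user, such as a question.
        return Clarification(question=text)
    if not isinstance(parsed, dict):
        raise ContractError("top-level value must be a JSON object")

    endpoint = parsed.get("api_call")
    if not isinstance(endpoint, str) or endpoint not in REQUIRED_PARAMS:
        raise ContractError(f"unknown endpoint: {endpoint!r}")

    body = parsed.get("post_body")
    if not isinstance(body, dict):
        raise ContractError("post_body must be an object")

    missing = REQUIRED_PARAMS[endpoint] - set(body)
    if missing:
        # Failure mode one: filters folded into description, post_body left empty.
        raise ContractError(f"post_body is missing {sorted(missing)}")
    return ApiCall(endpoint=endpoint, post_body=body)
```

With a layer like this, an empty payload becomes a raised error instead of an unfiltered query, and a question becomes a value the caller has to handle explicitly. It does not by itself provide the state needed to resume a partially completed request.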
We rolled back to 4.0. That was harder than it should have been: between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which had been qualified against 4.5. Reverting the model meant requalifying each of them against 4.0 under time pressure.
Why does traditional engineering discipline fail here?
Software engineering is based on the ability to limit the effect of a change. When you update a driver or library, you read the release notes to see whether to expect any breaking changes. Unit tests circumscribe what could possibly have shifted. You are relying on one property: the system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is limited by construction.
LLM-backed systems break this assumption. The component that produces your result is not under your control. You cannot diff a model between version 4.0 and version 4.5. An upgrade is a total replacement of the function your system depends on.
This is what we mean by an infinite blast radius: a change whose aftereffects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model can do differently) are unbounded.
Anatomy of failure
The postmortem revealed that our prompt had always been poorly specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We had never explicitly stated that the description must be a natural language string and must not contain serialized representations of other fields.
Previous versions of the model inferred this constraint from context. Sonnet 4.5, evidently more inclined to be "helpful" in its formatting choices, decided that asking for clarification or including the body of the request in the description made the response more useful. From the model's perspective, this was a reasonable interpretation of an ambiguous instruction. However, it violated the assumptions under which our system was built.
The error was not in the model. The error was in our assumption that the model would continue to fill our specification gaps as it always had. Three successful upgrades had taught us to believe that those gaps were safe.
Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We did not use them for engineering reasons outside the scope of this article. But schemas only restrict syntax, not semantics. A schema cannot specify that a clarification question should not appear in a system with no clarification path, or that a date range should never be silently defaulted to all time. Schemas solve the easier half of the problem.
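The gap between syntax and semantics can be made concrete with a toy example. In the sketch below, `matches_shape` stands in for a structural schema check and `semantic_violations` encodes two of the rules mentioned above. Both helpers and the `regressed` object are hypothetical, written only for illustration.

```python
def matches_shape(obj: object) -> bool:
    """Structural check only: the right keys with the right JSON types."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("description"), str)
        and isinstance(obj.get("api_call"), str)
        and isinstance(obj.get("post_body"), dict)
    )


def semantic_violations(obj: dict) -> list[str]:
    """Rules about meaning that the shape check above does not cover (illustrative)."""
    problems = []
    if obj["description"].rstrip().endswith("?"):
        problems.append("description asks a question, but there is no clarification path")
    body = obj["post_body"]
    if not ("start_date" in body and "end_date" in body):
        problems.append("date range missing: the backend may default to all time")
    return problems


# Hypothetical output of the kind described above: well-formed, yet wrong.
regressed = {
    "description": "Which cities should I include?",
    "api_call": "/api/sales_volume",
    "post_body": {},
}
```

Here `matches_shape(regressed)` is true while `semantic_violations(regressed)` reports two problems: the object is valid as a shape and still unusable as a request.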
The evals-first architecture
The discipline that closes this gap is to treat the eval suite (not the prompt) as the formal system specification. The prompt is an implementation of the specification. The model is an interpreter. The evals are the specification itself, and any model or prompt change is valid if and only if it passes them.
In practice, an eval is a triple: an input, a property that the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks something like this:
```python
def test_description_contains_no_serialized_payload(response):
    # The description must be plain prose, so flag tokens that only show up
    # when a request body, command, or URL has leaked into it.
    desc = response["description"].lower()
    forbidden = ("curl", "post_body", "{", "http://", "https://")
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"
```
A few hundred such properties, some handwritten for known important invariants, some generated as regression tests from real production traffic, some scored by an LLM judge for fuzzier qualities like tone, become a gate. Model updates and prompt changes should be treated as pull requests that must turn the suite green before merging.
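As a rough illustration of the triple and the gate, the sketch below pairs each input with a named property and a scoring function, then opens the gate only when every case scores high enough. `EvalCase`, `run_suite`, and the `baseline` and `candidate` stand-ins are hypothetical. A real harness would call the model where these stubs return canned objects.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    user_request: str                  # the input
    property_name: str                 # the property the output must satisfy
    scorer: Callable[[dict], float]    # scoring function: 1.0 satisfied, 0.0 violated


def no_payload_in_description(response: dict) -> float:
    """Binary scorer for the invariant that the description stays plain prose."""
    desc = response["description"].lower()
    forbidden = ("curl", "post_body", "{", "http://", "https://")
    return 0.0 if any(token in desc for token in forbidden) else 1.0


def run_suite(model: Callable[[str], dict], cases: list[EvalCase],
              min_score: float = 1.0) -> tuple[bool, list[str]]:
    """Return (gate_open, failed_properties) for a candidate model or prompt."""
    failed = [c.property_name for c in cases
              if c.scorer(model(c.user_request)) < min_score]
    return (not failed, failed)


CASES = [
    EvalCase(
        user_request="Sales volume for the Northeast, January to March 2026",
        property_name="description contains no serialized payload",
        scorer=no_payload_in_description,
    ),
]


# Stand-ins for "model + prompt"; a real harness would call the model here.
def baseline(_request: str) -> dict:
    return {"description": "Sales volume for the requested range.",
            "api_call": "/api/sales_volume",
            "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31",
                          "region": "northeast"}}


def candidate(_request: str) -> dict:
    return {"description": 'Call with post_body {"region": "northeast"}',
            "api_call": "/api/sales_volume",
            "post_body": {}}
```

Calling `run_suite(baseline, CASES)` opens the gate, while `run_suite(candidate, CASES)` closes it and names the violated property, which is the information a reviewer needs before approving a model or prompt change.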
Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance into the results. And the suite can only detect failure modes that you have thought to specify; you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: no one on our team had ever written an assertion that said "the description field must not contain a curl command," because no one had thought that the model would put one there.
Evals are not a silver bullet. They give you the ability to limit the blast radius of a change in the only way available when the underlying function is a black box: by densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior shifts.
The roadmap
The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what "coverage" means over a natural language input space. CI/CD systems were not built to gate on probabilistic test results. As agents take on more autonomous work (writing code, moving money, scheduling infrastructure changes), the gap between "the model passed our smoke tests" and "we know what this system will do in production" becomes the central engineering problem of the coming years.
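To show what gating on a probabilistic result could look like, the sketch below runs the same eval many times and gates on a statistical lower bound for the pass rate, not on a single green run. This is one possible approach and not an established standard. The Wilson score interval is a swapped-in textbook technique, and `wilson_lower_bound`, `gate`, and the 0.95 threshold are assumptions made for illustration.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


def gate(passes: int, trials: int, required_rate: float = 0.95) -> bool:
    """Open the gate only if the pass rate is credibly at or above required_rate."""
    return wilson_lower_bound(passes, trials) >= required_rate
```

Under these assumptions, 10 passes out of 10 trials gives a lower bound of about 0.72 and does not clear a 0.95 bar, while 100 out of 100 does. A handful of green runs says much less about a probabilistic component than it does about a deterministic one.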
The teams that close that gap will be the ones that stop treating evals as a QA afterthought and start treating them as the actual specification of what their system is.
Vijay Sagar Gullapalli is an artificial intelligence engineer, founder of Adopt AI, and a USPTO patented inventor.
Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.





