
Enterprise AI programs rarely fail due to bad ideas. Most often they get stuck in an ungoverned pilot mode and never come to fruition. At a recent VentureBeat event, tech leaders at MassMutual and Mass General Brigham explained how they avoided that trap and what the results look like when discipline replaces dispersion.
At MassMutual, the results are concrete: 30% increase in developer productivity, IT help desk resolution times reduced from 11 minutes to one, and customer service calls reduced from 15 minutes to just one or two.
“We always start with the question: why do we care about this problem?” Sears Merritt, head of enterprise experience and technology at MassMutual, said at the event. “If we solve the problem, how will we know we’ve solved it? And how much value is associated with doing so?”
Define metrics and establish strong feedback loops
MassMutual, a 175-year-old company serving millions of policyholders and customers, has driven AI into production across the business: customer service, IT, customer acquisition, underwriting, services, claims and other areas.
Merritt said his team follows the scientific method, starting with a hypothesis and testing to see if it has a result that tangibly moves the business forward. Some ideas are great, but they may prove “business intractable” due to factors such as lack of data or access, or regulatory restrictions.
“We won’t go any further with an idea until we’re very clear about how we’re going to measure and how we’re going to define success.”
Ultimately, it’s up to individual departments and their leaders to define what quality means: pick a metric and set a minimum quality bar before putting a tool in the hands of teams and partners.
That starting point creates a rapid feedback loop. “What we find that holds us back is when there is no shared clarity about what outcome we are trying to achieve,” which can lead to confusion and constant readjustments, Merritt said. “We don’t go into production until there is a business partner who says, ‘Yes, that works.'”
His team is strategic when it comes to evaluating emerging tools and “extremely rigorous” when it comes to testing and measuring how well they perform. For example, they use confidence scoring to reduce hallucination rates, establish thresholds and evaluation criteria, and monitor feature and outcome drift.
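Merritt didn’t describe the mechanics, but confidence gating of this kind is usually a threshold check on whatever confidence signal a team trusts (token log-probs, self-consistency votes, a judge model). A minimal sketch, where `route_answer`, the threshold value, and the confidence signal are all illustrative assumptions, not MassMutual’s implementation:

```python
# Illustrative confidence gate: answers below the threshold are never
# shown to a user; they are flagged for human review instead.
CONFIDENCE_THRESHOLD = 0.85  # hypothetical value, tuned per use case

def route_answer(answer: str, confidence: float) -> dict:
    """Release the answer only if confidence clears the threshold;
    otherwise withhold it and flag the request for human review."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"answer": answer, "needs_review": False}
    return {"answer": None, "needs_review": True}
```

The point is the policy, not the scoring method: whatever produces the confidence number, there is an explicit, measurable bar an answer must clear before it reaches production.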
Merritt also maintains a model-agnostic policy, meaning the company is not locked into any particular model. It has what he calls an “incredibly heterogeneous” technology environment that combines top models with mainframes running COBOL. That flexibility is not accidental. His team created layers of services, microservices, and common APIs that sit between the AI layer and everything beneath it, so when a better model comes along, swapping it out doesn’t mean starting over.
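The swap-without-rewrite property comes from coding business logic against an interface rather than a vendor. A minimal sketch of that layering, with hypothetical names (`TextModel`, `VendorAModel`, `summarize_claim`) standing in for whatever MassMutual’s actual service layer exposes:

```python
from typing import Protocol

class TextModel(Protocol):
    """Common API every model provider must satisfy."""
    def complete(self, prompt: str) -> str: ...

class VendorAModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"

class VendorBModel:
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"

def summarize_claim(model: TextModel, claim_text: str) -> str:
    # Business logic depends only on the interface, never on a
    # concrete vendor, so a better model is a one-line swap upstream.
    return model.complete(f"Summarize: {claim_text}")
```

Swapping `VendorAModel()` for `VendorBModel()` at the call site changes nothing in the claims logic, which is the whole point of the intermediate layer.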
Because, Merritt explained, “today’s best could be tomorrow’s worst, and we don’t want to be left behind.”
Weeding instead of letting a thousand flowers bloom
Mass General Brigham (MGB), for its part, took more of a spray-and-pray approach, at least at first.
Roughly 15,000 researchers at the nonprofit health system have been using AI, ML and deep learning for the past 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.
But last year he made a bold decision: His team shut down a series of ungoverned AI pilots. Initially, “we followed the thousand-flower bloom (methodology), but we didn’t have a thousand flowers, we probably had a few dozen flowers trying to bloom,” he said.
Like Merritt’s team at MassMutual, MGB took a more holistic view and examined why they were developing certain tools for specific workflow departments. They asked themselves what capabilities they wanted and needed and what investment they required.
Sriraman’s team also spoke to its major platform vendors (Epic, Workday, ServiceNow, Microsoft) about their roadmaps. This was a “pivotal moment,” he noted, when they realized they were building internal tools that vendors were already providing (or planning to implement).
As Sriraman said: “Why are we building it ourselves? We’re already on the platform. It will be in the workflow. Take advantage of it.”
That said, the market is still nascent, which can make decision-making difficult. “The analogy I will give is when you ask six blind men to touch an elephant and say: what is this elephant like?” Sriraman said. “You’ll get six different answers.”
There’s nothing wrong with that, he noted; it’s just that everyone is discovering and experimenting as the landscape continues to change.
Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the company and uses a “small landing zone” where they can safely test more sophisticated products and monitor token usage.
They also began to “consciously embed AI champions” across business groups. “This is kind of the reverse of letting a thousand flowers bloom, planting them and nurturing them carefully,” Sriraman said.
Observability is another important consideration: he described real-time dashboards that track model drift and safety and let IT teams govern AI “a little more pragmatically.” Health monitoring is critical in AI systems, he noted, and his team has established principles and policies around the use of AI, including least-privilege access.
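Sriraman didn’t detail how MGB’s dashboards detect drift; a common minimal check is comparing a recent window of a model input or output against its baseline distribution and alerting when the mean shifts by more than a few standard deviations. A sketch under that assumption, with an illustrative threshold:

```python
import statistics

def drift_alert(baseline: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent window's mean moves more than
    z_threshold baseline standard deviations from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(recent) - mu) / sigma
    return shift > z_threshold
```

Production systems typically use richer statistics (population stability index, KS tests) over many features, but the dashboard pattern is the same: a per-metric threshold that turns a silent distribution change into a visible alert.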
In clinical settings, the guardrails are absolute: AI systems never make the final decision. “There will always be a doctor or medical assistant on hand to make the decision,” Sriraman said. He cited radiology reporting as an area where AI is widely used, but where a radiologist always signs off.
Sriraman was blunt about the rules: “Here’s what you will not do: you do not put PHI (protected health information) into Perplexity. Simple as that, right?”
And, most importantly, safety mechanisms must be in place. “We need a big red button, a kill switch,” Sriraman emphasized. “We don’t put anything into production without that.”
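A kill switch in this sense is usually just a globally checked flag that every AI-backed request must pass before serving, with a non-AI fallback behind it. A minimal sketch, with hypothetical names (`KillSwitch`, `serve`) rather than MGB’s actual mechanism:

```python
class KillSwitch:
    """Global flag every AI-backed request checks before serving."""
    def __init__(self) -> None:
        self._enabled = True

    def trip(self) -> None:
        """The 'big red button': disable the AI feature immediately."""
        self._enabled = False

    @property
    def enabled(self) -> bool:
        return self._enabled

def serve(switch: KillSwitch, request: str) -> str:
    # Check the switch on every request so tripping it takes effect
    # instantly, without a redeploy.
    if not switch.enabled:
        return "AI feature disabled; routing to human workflow."
    return f"AI answer for: {request}"
```

In practice the flag lives in a shared config store rather than process memory, so one operator action disables the feature across every instance at once.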
Ultimately, while agentic AI is a transformative technology, the business approach doesn’t have to be dramatically different. “There is nothing new in this,” Sriraman said. “You can replace the word BPM (business process management) from the 90s and 2000s with AI. The same concepts apply.”





