
When the One Big Beautiful Bill (OBBB) arrived, it came as a 900-page unstructured document with no standardized outline, no published IRS forms, and a strict filing deadline. Intuit’s TurboTax team had a question: could AI compress a months-long implementation into days without sacrificing accuracy?
What they built to answer it is less a tax story than a template: a workflow combining commercial AI tools, a proprietary domain-specific language, and a custom unit testing framework that any development team working in a specialized, high-stakes domain can learn from.
Joy Shaw, Intuit’s chief tax officer, has spent more than 30 years at the company and worked through both the Tax Cuts and Jobs Act (TCJA) and the OBBB. "There was a lot of noise in the law itself and we were able to extract the tax implications, limit them to the individual tax provisions, limit them to our clients," Shaw told VentureBeat. "That kind of distillation was really quick using the tools and then allowed us to start coding before we even received forms and instructions."
How OBBB raised the bar
When the Tax Cuts and Jobs Act passed in 2017, the TurboTax team implemented the legislation without the help of AI. It took months, and the precision requirements left no room for shortcuts.
"We used to have to review the law, and we would codify sections that referenced other sections of the legal code and try to figure it out on our own," Shaw said.
The OBBB came with the same accuracy requirements but with a different profile. At more than 900 pages, it was structurally more complex than the TCJA. It arrived as an unstructured document without a standardized outline. The House and Senate versions used different language to describe the same provisions. And the team had to begin implementation before the IRS released any official forms or instructions.
The question was whether AI tools could compress the timeline without compromising the outcome. The answer required a specific sequence and tools that did not yet exist.
From unstructured document to domain-specific code
The OBBB was still moving through Congress when the TurboTax team began working on it. Using large language models, the team summarized the House version, then the Senate version, and then reconciled the differences. Both chambers referenced the same underlying sections of the tax code, a consistent anchor point that allowed the models to draw comparisons between structurally inconsistent documents.
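The anchoring step can be sketched in ordinary code: before any LLM comparison, group each chamber's text by the Internal Revenue Code section it references, so two structurally inconsistent documents line up on the same units. A minimal sketch assuming plain-text paragraphs; the function names and the citation regex are illustrative, not Intuit's.

```python
import re
from collections import defaultdict

# Matches citations like "Section 164(b)" or "section 199A".
SECTION_RE = re.compile(r"[Ss]ection\s+(\d+[A-Za-z]?(?:\([a-z0-9]+\))*)")

def index_by_section(paragraphs):
    """Group bill paragraphs by each IRC section they cite."""
    index = defaultdict(list)
    for para in paragraphs:
        for sec in SECTION_RE.findall(para):
            index[sec].append(para)
    return index

def pair_versions(house_paras, senate_paras):
    """Pair House and Senate text that touches the same IRC section --
    the comparison units an LLM would then be asked to reconcile."""
    house, senate = index_by_section(house_paras), index_by_section(senate_paras)
    return {
        sec: (house.get(sec, []), senate.get(sec, []))
        for sec in sorted(set(house) | set(senate))
    }
```

A provision that appears in only one chamber's version surfaces as a pair with an empty side, flagging it for closer review rather than silently dropping out of the comparison.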
By signing day, the team had already distilled the provisions affecting TurboTax customers, filtered to specific tax situations and customer profiles. Analyzing, reconciling and filtering provisions went from weeks to hours.
Those tasks were handled by general-purpose LLMs, including ChatGPT. But those tools reached a limit when the work moved from analysis to implementation. TurboTax does not run on a standard programming language: its tax calculation engine is built on a proprietary domain-specific language maintained internally at Intuit. Any model generating code for that codebase has to translate legal text into a syntax it was never trained on, and has to identify how new provisions interact with decades of existing code without breaking what already works.
Claude became the main tool for that translation and dependency-mapping work. Shaw said it could identify what changed and what didn’t, letting developers focus only on the new provisions.
"It is able to integrate with things that do not change and identify the dependencies of what did change," she said. "That sped up the development process and allowed us to focus only on those things that did change."
Building tools to match a near-zero error threshold
The general-purpose LLMs got the team to working code. Shipping that code required two proprietary tools created during the OBBB cycle.
The first automatically generated TurboTax product screens directly from the detected changes in the law. Previously, developers designed those screens individually for each layout. The new tool handled most of them automatically, with manual customization only where needed.
The second was a purpose-built unit testing framework. Intuit had always done automated testing, but the previous system only produced pass/fail results. When a test failed, developers had to manually open the underlying tax return data file to trace the cause.
"The automation would tell you pass, fail; you would have to dig deeper into the actual tax data file to see what could have been wrong," Shaw said. The new framework identifies the specific code segment responsible, generates an explanation, and allows the fix to be made within the framework itself.
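The idea behind that framework can be shown in miniature: check every named intermediate of a calculation rather than only the final number, so a failure names the step responsible instead of a bare pass/fail. A toy sketch, not Intuit's actual system; the step functions and figures are invented for illustration.

```python
def diagnose(steps, value, expected):
    """Run named calculation steps in order, checking each output
    against an expected intermediate. Returns None on success, or a
    diagnostic naming the first step that diverged."""
    for name, fn in steps:
        value = fn(value)
        if name in expected and value != expected[name]:
            return {"failed_step": name,
                    "got": value,
                    "expected": expected[name]}
    return None

# Toy return: AGI -> standard deduction -> flat tax rate.
steps = [
    ("apply_deduction", lambda agi: agi - 14_600),
    ("compute_tax", lambda taxable: round(taxable * 0.12, 2)),
]

# All intermediates match: no diagnostic is produced.
assert diagnose(steps, 50_000, {"apply_deduction": 35_400,
                                "compute_tax": 4_248.0}) is None

# A divergent intermediate pinpoints the failing step by name.
report = diagnose(steps, 50_000, {"apply_deduction": 36_400})
assert report["failed_step"] == "apply_deduction"
```

The payoff is that a failing test arrives already localized: the developer starts from the step that broke, not from the raw tax data file.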
Shaw said the accuracy of a consumer tax product has to be close to 100 percent. Sarah Aerni, Intuit’s vice president of technology for the Consumer Group, said the architecture has to produce deterministic results.
"Having the kinds of capabilities around determinism and correcting them in a verifiable way through testing, that’s what leads to that kind of trust," Aerni said.
Tooling drives speed. But Intuit also uses LLM-based evaluation tools to validate AI-generated results, and even those require a human tax expert to assess whether the result is correct. "It all comes down to having human experience to be able to validate and verify almost anything," Aerni said.
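That division of labor amounts to a routing policy: an LLM judge scores each generated result, low-confidence items always go to a human expert, and a sample of auto-accepted items is spot-checked as well. A sketch under assumed interfaces; `judge`, the threshold, and the spot-check rate are stand-ins, not details Intuit has disclosed.

```python
import random

def route_for_review(results, judge, threshold=0.99, spot_check_rate=0.05,
                     rng=random.Random(0)):
    """Split AI-generated results into an auto-accepted list and a
    human-review queue. `judge` maps a result to a score in [0, 1];
    low scores always get a human reviewer, and a random sample of
    accepted results is spot-checked as well."""
    accepted, review = [], []
    for result in results:
        if judge(result) < threshold or rng.random() < spot_check_rate:
            review.append(result)
        else:
            accepted.append(result)
    return accepted, review
```

A toy judge that trusts everything except one provision sends exactly that provision to the human queue:

```python
accepted, review = route_for_review(
    ["salt_cap", "child_credit", "tip_income"],
    judge=lambda r: 0.5 if r == "tip_income" else 1.0,
)
```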
Four components that any team in a regulated industry can use
The OBBB was a tax problem, but the underlying conditions are not unique to taxes. Healthcare, financial services, legal technology, and government contracting teams regularly face the same combination: complex regulatory documents, tight deadlines, proprietary code bases, and near-zero fault tolerance.
As implemented by Intuit, four elements of the workflow transfer to other domain-constrained development environments:

- Use commercial LLMs for document analysis. General-purpose models handle analyzing, reconciling and filtering provisions well. That’s where they add speed without creating accuracy risk.
- Switch to domain-aware tools when analysis turns to implementation. General-purpose models that generate code in a proprietary environment without understanding it will produce results that cannot be trusted at scale.
- Build testing infrastructure before the deadline, not during the sprint. Generic automated tests produce pass/fail results. Domain-specific testing tools that identify flaws and enable fixes in context are what make AI-generated code shippable.
- Deploy AI tools throughout the organization, not just in engineering. Shaw said Intuit trained and monitored usage across all functions, spreading AI fluency throughout the organization rather than concentrating it among early adopters.
"We continue to lean into the AI and human intelligence opportunity here, so our customers get what they need from the experiences we create," Aerni said.
