Thinking Machines Previews AI Voice and Video Conversations in Near Real-Time with New ‘Interaction Models’



Is AI leaving the era of turn-based chat?

At this point, all of us who use AI models regularly for work or in our personal lives know that the basic mode of interaction, whether with text, images, audio, or video, remains the same: the human user provides an input, waits anywhere from milliseconds to minutes (or, for particularly difficult queries, hours or days), and the AI model returns a result.

But for AI to truly take on the load of jobs that require natural interaction, it will have to do more than provide this kind of turn-based interactivity: ultimately, it will need to respond more fluidly and naturally to human input, including responding while it is still processing the next human input, whether text or another modality.

That, at least, seems to be the claim of Thinking Machines, the well-funded AI startup founded last year by former OpenAI CTO Mira Murati and former OpenAI researcher and co-founder John Schulman, among others.

Today, the firm announced a research advance in what it calls "interaction models," a new class of natively multimodal systems that treat interactivity as a first-class citizen of the model architecture rather than an external software add-on, posting some impressive gains on third-party benchmarks and reduced latency as a result.

However, the models are not yet available to the general public or even to companies, the company says in its announcement blog post: "In the coming months, we will open a limited research preview to gather feedback, with a broader release later this year."

Simultaneous ‘full duplex’ input/output processing

At the center of this announcement is a fundamental change in the way AI perceives time and presence. Current frontier models experience reality in a single thread: they wait for a user to finish an input before they begin processing, and their perception freezes while they generate a response.

In their blog post, Thinking Machines researchers described the status quo as a limitation that forces humans to "contort" themselves to AI interfaces, phrasing questions like emails and batching their thoughts.

To solve this "collaboration bottleneck," Thinking Machines has moved away from the standard sequence of alternating turns.

Instead, they use a multi-stream, micro-turn design that processes input and output simultaneously in 200 ms chunks.

This "full duplex" The architecture allows the model to listen, speak, and see in real time, allowing it to step back while a user speaks or intervene when it notices a visual cue, such as a user typing an error into a code snippet or a friend entering a video frame. Technically, the model uses encoderless early fusion.

Technically, the model uses encoderless early fusion: instead of relying on massive standalone encoders like Whisper for audio, the system feeds raw audio features (dMel) and 40×40 image patches through a lightweight embedding layer, co-training all components from scratch within the transformer.
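As a rough illustration of that idea, the sketch below projects dMel audio frames and flattened 40×40 patches directly into the transformer's token space; the model width, mel-bin count, and vocabulary size are assumptions, not published numbers.

```python
# Minimal sketch of encoderless early fusion (dimensions are guesses).
import torch
import torch.nn as nn

D_MODEL = 1024       # transformer width (assumed)
N_MEL = 80           # dMel bins per audio frame (typical value, assumed)
PATCH = 40 * 40 * 3  # one flattened 40x40 RGB patch

class EarlyFusionEmbedder(nn.Module):
    """Maps raw audio frames and image patches straight into token space,
    with no Whisper-style encoder in front of the transformer."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(N_MEL, D_MODEL)  # lightweight, trained from scratch
        self.image_proj = nn.Linear(PATCH, D_MODEL)
        self.text_embed = nn.Embedding(32000, D_MODEL)

    def forward(self, audio_frames, image_patches, text_ids):
        tokens = [
            self.audio_proj(audio_frames),              # (n_audio, D_MODEL)
            self.image_proj(image_patches.flatten(1)),  # (n_patches, D_MODEL)
            self.text_embed(text_ids),                  # (n_text, D_MODEL)
        ]
        # One interleavable token stream for the shared transformer.
        return torch.cat(tokens, dim=0)

embedder = EarlyFusionEmbedder()
seq = embedder(
    torch.randn(10, N_MEL),          # ~200 ms of dMel audio frames
    torch.randn(4, 3, 40, 40),       # four image patches
    torch.tensor([101, 2009, 102]),  # a few text token ids
)
print(seq.shape)  # torch.Size([17, 1024])
```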

Dual-model system

The research advance introduces TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts (MoE) model with 12 billion active parameters. Because real-time interaction requires near-instant response times that often conflict with deep reasoning, the company designed a two-part system:

  1. The interaction model: Stays in constant exchange with the user, handling dialogue management, presence, and immediate follow-ups.

  2. The background model: An asynchronous agent that handles sustained reasoning, web browsing, or complex tool calls, passing results back to the interaction model so they can be woven naturally into the conversation.

This setup allows the AI to handle tasks like live translation or generating a UI chart while continuing to listen to the user, a capability demonstrated in the announcement video, where the model responded with human-typical reaction times to several cues while simultaneously generating a bar chart.
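As a hedged illustration of how such a split could work, the following sketch keeps a fast conversational loop running while an asynchronous background task finishes slower work; the function names and timings are illustrative, not Thinking Machines' actual API.

```python
# Speculative sketch of the interaction-model / background-model split.
import asyncio

async def background_model(task: str) -> str:
    # Stands in for deep reasoning, web browsing, or complex tool calls.
    await asyncio.sleep(2.0)
    return f"[chart ready for: {task}]"

async def interaction_model():
    pending = asyncio.create_task(background_model("bar chart of Q3 sales"))
    # The interaction model keeps its sub-second cadence with the user
    # instead of blocking on the slow task.
    while not pending.done():
        print("model: (still listening...) mm-hmm")
        await asyncio.sleep(0.4)  # matches the reported 0.40 s turn latency
    # When the background result lands, weave it into the conversation.
    print(f"model: here you go -- {pending.result()}")

asyncio.run(interaction_model())
```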

Impressive performance on major benchmarks vs. fast interaction models from other leading AI labs

To demonstrate the effectiveness of this approach, the lab used FD-Bench, a benchmark designed specifically to measure interaction quality rather than raw intelligence. The results show that TML-Interaction-Small significantly outperforms existing real-time systems:

  • Responsiveness: Achieved a turn latency of 0.40 seconds, compared to 0.57 seconds for Gemini-3.1-flash-live and 1.18 seconds for GPT-realtime-2.0 (mini).

  • Interaction quality: On FD-Bench v1.5 it scored 77.8, almost double its main competitors (GPT-realtime-2.0 mini scored 46.8).

  • Visual proactivity: In specialized tests such as RepCount-A (counting physical repetitions on video) and ProactiveVideoQA, the Thinking Machines model successfully engaged with the visual world, while other frontier models remained silent or gave incorrect answers.

Metric                    | TML-Interaction-Small | GPT-realtime-2.0 (mini) | Gemini-3.1-flash-live (mini)
Turn latency (s)          | 0.40                  | 1.18                    | 0.57
Interaction quality (avg) | 77.8                  | 46.8                    | 54.3
IFEval (voice bench)      | 82.1                  | 81.7                    | 67.6
HarmBench (% refusal)     | 99.0                  | 99.5                    | 99.0

A potentially huge help for companies, once the models are available

If made available to the enterprise sector, Thinking Machines' interaction models would represent a fundamental change in how companies integrate AI into their operational workflows.

A native interaction model like TML-Interaction-Small enables several business capabilities that are currently impossible or very fragile with standard multimodal models:

Today's enterprise AI requires a "turn" to complete before data can be analyzed. In a manufacturing or laboratory environment, a native interaction model could monitor a video stream and proactively intervene the moment it detects a safety violation or a deviation from protocol, without waiting for a worker to request feedback.

The model’s success on visual benchmarks like RepCount-A (accurate repetition counting) and ProactiveVideoQA (answering questions as visual evidence appears) suggests it could serve as a real-time auditor for high-stakes physical tasks.

The main friction in voice customer service today is the 1-2 second "processing" delay common in the standard APIs of 2026. Thinking Machines' model achieves a turn latency of 0.40 seconds, roughly the speed of a natural human conversation.

Because it handles simultaneous voice natively, an enterprise support bot could listen to a customer's frustration, provide "backchannel" signals (such as "I see" or "mm-hmm") without interrupting the user, and offer live translation that feels like a natural conversation rather than a series of disconnected recordings.

Standard LLMs lack an internal clock; they "know" the time only if it is provided in the prompt. Interaction models are natively time-aware, allowing them to handle time-sensitive requests such as "Remind me to check the temperature every 4 minutes" or "Let me know if this process takes longer than the previous one." This is essential for industrial maintenance and pharmaceutical research, where time is a critical variable.
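The blog post does not say how this time-awareness is implemented, but one plausible mechanism, offered purely as an assumption, is stamping each micro-turn chunk with wall-clock time so timer conditions can be checked inside the ordinary per-chunk step:

```python
# Speculative sketch: timers evaluated against per-chunk timestamps.
import time
from dataclasses import dataclass

@dataclass
class MicroTurn:
    t: float       # wall-clock time this chunk was captured
    payload: str   # audio/text/video content of one 200 ms slice

REMIND_EVERY_S = 4 * 60  # "check the temperature every 4 minutes"
last_reminder = time.time()

def on_chunk(chunk: MicroTurn):
    """Per-chunk step: elapsed time is visible on every chunk, so the
    timer condition is just another thing to check each micro-turn."""
    global last_reminder
    if chunk.t - last_reminder >= REMIND_EVERY_S:
        last_reminder = chunk.t
        print("model: time to check the temperature")

# Simulate a chunk arriving four minutes later -- the reminder fires.
on_chunk(MicroTurn(t=time.time() + REMIND_EVERY_S, payload="..."))
```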

Background on Thinking Machines

This launch marks the second major milestone for Thinking Machines, after the launch of Tinker in October 2025, a managed API for fine-tuning language models that lets researchers and developers control their data and training methods while Thinking Machines handles the infrastructure load of distributed training.

The company said Tinker supports both small and large open models, including mixture-of-experts models, and early users included groups from Princeton, Stanford, Berkeley, and Redwood Research.

At its launch in early 2025, Thinking Machines billed itself as an AI research and product company attempting to make advanced AI systems “more widely understood, customizable, and generally more capable.”

In July 2025, Thinking Machines said it had raised around $2 billion at a $12 billion valuation in a round led by Andreessen Horowitz, with participation from Nvidia, Accel, ServiceNow, Cisco, AMD, and Jane Street, in what Wired described as the largest seed funding round in history.

The Wall Street Journal reported in August 2025 that rival tech CEO Mark Zuckerberg approached Murati about acquiring Thinking Machines Lab, and after she declined, Meta sought out more than a dozen of the startup’s roughly 50 employees.

In March and April 2026, the company also became known for its computing ambitions: it announced a partnership with Nvidia to deploy at least one gigawatt of next-generation Vera Rubin systems, then expanded its relationship with Google Cloud to use Google's AI Hypercomputer infrastructure with Nvidia GB300 systems for model research, reinforcement learning workloads, frontier model training, and Tinker.

By April 2026, Business Insider reported that Meta had hired seven founding members of Thinking Machines, including Mark Jen and Yinghai Lu, while another Thinking Machines researcher, Tianyi Zhang, also moved to Meta. The same report said that Joshua Gross, who helped build Thinking Machines’ flagship tuning product, Tinker, had joined Meta Superintelligence Labs and that the company had grown to about 130 employees despite the departures.

Thinking Machines wasn’t just losing people, though: it also hired Meta veteran Soumith Chintala, creator of PyTorch, as CTO, and added other high-profile technical talent like Neal Wu. TechCrunch separately reported in April 2026 that Weiyao Wang, an eight-year Meta veteran who worked on multimodal perception systems, had joined Thinking Machines, underscoring that the flow of talent was not one-way.

Thinking Machines previously stated that it was committed to "important open source components" in its releases to empower the research community. It is unclear whether these new interaction models will be governed by the same values and release terms.

But one thing is certain: by making interactivity native to the model, Thinking Machines believes that scaling a model will now make it a smarter, more effective collaborator.


