Multi-agent systems, designed to handle long-running tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of standard chats, threatening the economics of deploying them for business tasks.
But today, Nvidia sought to help solve this problem with the launch of Nemotron 3 Super, a hybrid model with 120 billion parameters, with weights published on Hugging Face.
By fusing disparate architectural philosophies (state-space models, Transformers, and a novel latent mixture-of-experts design), Nvidia is attempting to provide the specialized depth agentic workflows need without the overhead typical of dense reasoning models, all available for commercial use under mostly open weights.
At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precise reasoning. The model uses a Mamba-Transformer hybrid backbone that interleaves Mamba-2 layers with strategically placed Transformer attention layers.
To understand the implications for production deployments, consider the "needle in a haystack" problem. The Mamba-2 layers act as an express highway, handling the vast majority of stream processing in linear time. This allows the model to maintain a massive 1-million-token context window without blowing up the memory footprint of the KV cache. Pure state-space models, however, often struggle with associative recall.
To address this, Nvidia strategically inserts Transformer attention layers as "global anchors," ensuring the model can accurately retrieve specific facts buried deep in a codebase or a stack of financial reports.
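To see why the hybrid backbone matters at a 1M-token context, a back-of-the-envelope comparison is useful. The layer counts and dimensions below are hypothetical (Nvidia has not published them in this article); only the scaling behavior is the point:

```python
# Illustrative KV-cache arithmetic for a hybrid Mamba/attention stack.
# All dimensions are hypothetical, chosen only to show the scaling behavior.

def kv_cache_bytes(attn_layers, context_len, kv_heads=8, head_dim=128, bytes_per_val=2):
    """KV cache grows linearly with context length, but only attention
    layers pay for it; Mamba-2 layers keep a fixed-size state instead."""
    return attn_layers * context_len * kv_heads * head_dim * 2 * bytes_per_val  # 2 = K and V

CONTEXT = 1_000_000  # 1M-token context window

pure_transformer = kv_cache_bytes(attn_layers=60, context_len=CONTEXT)
hybrid = kv_cache_bytes(attn_layers=6, context_len=CONTEXT)  # ~10% attention layers

print(f"pure transformer: {pure_transformer / 1e9:.0f} GB")  # 246 GB
print(f"hybrid backbone:  {hybrid / 1e9:.1f} GB")            # 24.6 GB
```

With only a small fraction of layers paying the attention toll, cache memory drops an order of magnitude while the anchor layers preserve recall.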
Beyond the backbone, the model introduces a Latent Mixture of Experts (LatentMoE). Traditional mixture-of-experts (MoE) designs route tokens to experts at their full hidden dimension, creating a computational bottleneck as models scale. LatentMoE avoids this by projecting tokens into a compressed latent space before routing them to specialists.
This "expert compression" lets the model consult four times as many specialists for exactly the same computational cost. That granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
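The arithmetic behind the "four times as many experts" claim can be sketched as follows. The projection and expert widths here are hypothetical, not Nvidia's published sizes; the sketch only shows how a narrower routing dimension trades width for expert count:

```python
# Illustrative FLOPs comparison: routing tokens to experts at the full
# hidden dimension vs. in a compressed latent space. Numbers are hypothetical.

def expert_flops(num_experts_active, in_dim, ffn_dim):
    """Approximate FLOPs for one token through its active experts
    (two matmuls per expert: in_dim -> ffn_dim -> in_dim)."""
    return num_experts_active * 2 * (in_dim * ffn_dim) * 2

HIDDEN, LATENT = 4096, 1024  # assumed 4x compression before routing
FFN = 2048

dense_moe = expert_flops(num_experts_active=2, in_dim=HIDDEN, ffn_dim=FFN)
latent_moe = expert_flops(num_experts_active=8, in_dim=LATENT, ffn_dim=FFN)

# Projecting to 1/4 the width lets the router consult 4x as many experts
# for the same per-token cost (ignoring the small projection overhead).
print(dense_moe == latent_moe)  # True
```

The compression projection itself adds a small fixed cost, but it amortizes quickly once the expert matmuls dominate.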
Further accelerating the model is multi-token prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This acts as a built-in draft model, enabling native speculative decoding that can deliver up to 3x wall-clock speedups on structured generation tasks such as code or tool calls.
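The speculative-decoding loop can be sketched in miniature. This is a toy simulation, not Nvidia's implementation: the drafter and verifier are stand-ins, and the per-token acceptance rate is an assumption:

```python
import random

def draft_tokens(prefix, k=4):
    """Stand-in for the MTP head: cheaply propose k future tokens at once."""
    return [f"t{len(prefix) + i}" for i in range(k)]

def verify(prefix, proposed):
    """Stand-in for the full model: accept the longest prefix of the
    proposal it agrees with (here, simulated with a fixed accept rate)."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.8:  # assumed 80% per-token acceptance
            accepted.append(tok)
        else:
            break
    return accepted

random.seed(0)
sequence, steps = [], 0
while len(sequence) < 20:
    proposal = draft_tokens(sequence)
    accepted = verify(sequence, proposal)
    # On full rejection, a real verifier emits its own corrected token;
    # the toy version just advances one token so progress is guaranteed.
    sequence.extend(accepted or proposal[:1])
    steps += 1

print(f"20 tokens in {steps} full-model passes instead of 20")
```

Because each expensive verification pass can commit several tokens at once, the number of full-model passes falls well below the token count, which is where the wall-clock speedup comes from.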
For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for Nvidia's Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has made a huge leap forward in production efficiency.
On Blackwell, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, without loss of accuracy.
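The memory side of that gain is easy to estimate from the parameter count alone. This rough sketch ignores real NVFP4 details such as per-block scale factors:

```python
# Rough weight-memory comparison for a 120B-parameter model at different
# precisions. Real NVFP4 adds small per-block scale factors, ignored here.

PARAMS = 120e9

def weight_gb(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name:>6}: {weight_gb(bits):.0f} GB")  # 240, 120, 60 GB
```

Halving the weight footprint relative to FP8 is what lets a 120B model fit comfortably on fewer Blackwell GPUs and feeds the inference speedup.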
In practice, Nemotron 3 Super is a specialized tool for agentic reasoning.
It currently holds the number one position on the DeepResearch Bench, a benchmark that measures an AI’s ability to perform comprehensive, multi-step research on large sets of documents.
| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| **General knowledge** | | | |
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| **Reasoning** | | | |
| AIME25 (without tools) | 90.21 | 90.36 | 92.50 |
| HMMT Feb 2025 (without tools) | 93.67 | 91.40 | 90.00 |
| HMMT Feb 2025 (with tools) | 94.73 | 89.55 | — |
| GPQA (without tools) | 79.23 | 86.60 | 80.10 |
| GPQA (with tools) | 82.70 | — | 80.09 |
| LiveCodeBench (v5, 2024-07 to 2024-12) | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| HLE (without tools) | 18.26 | 25.30 | 14.90 |
| HLE (with tools) | 22.82 | — | 19.00 |
| **Agentic** | | | |
| Terminal-Bench (hard subset) | 25.78 | 26.80 | 24.00 |
| Terminal-Bench 2.0 (core) | 31.00 | 37.50 | 18.70 |
| SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.90 |
| SWE-Bench (open source) | 59.20 | 67.40 | — |
| SWE-Bench (Codex) | 53.73 | 61.20 | — |
| SWE-Bench Multilingual (OpenHands) | 45.78 | — | 30.80 |
| **Tau2-Bench** | | | |
| Airline | 56.25 | 66.00 | 49.20 |
| Retail | 62.83 | 62.60 | 67.80 |
| Telecom | 64.36 | 95.00 | 66.00 |
| Average | 61.15 | 74.53 | 61.00 |
| BrowseComp (with search) | 31.28 | — | 33.89 |
| BIRD-Bench | 41.80 | — | 38.25 |
| **Chat & instruction following** | | | |
| IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Scale AI MultiChallenge | 55.23 | 61.50 | 58.29 |
| Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| **Long context** | | | |
| AA-LCR | 58.31 | 66.90 | 51.00 |
| RULER @ 256k | 96.30 | 96.74 | 52.30 |
| RULER @ 512k | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
| **Multilingual** | | | |
| MMLU-ProX (language average) | 79.36 | 85.06 | 76.59 |
| WMT24++ (en→xx) | 86.67 | 87.84 | 88.89 |
It also demonstrates significant throughput advantages, delivering up to 2.2x the throughput of gpt-oss-120B and 7.5x that of Qwen3.5-122B in high-volume serving environments.
The launch of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, although it comes with several "guardrail" clauses that differentiate it from pure open-source licenses such as MIT or Apache 2.0.
Key provisions for business users:
Commercial usability: The license explicitly states that the models are "commercially usable" and grants a perpetual, worldwide, royalty-free license to sell and distribute products based on the model.
Output ownership: Nvidia claims no ownership of the outputs the model generates; responsibility for those outputs (and ownership of them) lies entirely with the user.
Derivative works: Companies are free to create and own "Derived models" (enhanced versions), provided they include the required attribution notice: "Licensed by Nvidia Corporation under the Nvidia Open Model License."
The "red lines":
The license includes two critical termination triggers that production teams must monitor:
Guardrails: The license terminates automatically if a user bypasses or removes the model's built-in guardrails (technical limitations or safety hyperparameters) without implementing a "substantially similar" replacement appropriate for the use case.
Litigation trigger: If a user initiates copyright or patent litigation against Nvidia alleging that the model infringes their intellectual property, their license to use the model terminates immediately.
This structure allows Nvidia to foster a commercial ecosystem while protecting itself from "IP trolling" and ensuring the model is not stripped of its safety features for malicious use.
The release has generated considerable buzz in the developer community. Chris Alexiuk, a senior product research engineer at Nvidia, announced the launch on X under his handle @llm_wizard as a "SUPER DAY," emphasizing the model's speed and transparency. "The model is: FAST. The model is: SMART. The model is: THE MOST OPEN MODEL WE HAVE MADE YET," he posted, highlighting the release of not only the weights but also training recipes and 10 billion tokens of training data.
Industry adoption reflects this enthusiasm:
Cloud and hardware: The model is being deployed as an Nvidia NIM microservice, allowing it to run on-premises via Dell AI Factory or HPE, as well as on Google Cloud and Oracle, with AWS and Azure coming soon.
Production agents: Companies like CodeRabbit (software development) and Greptile are integrating the model to handle large-scale codebase analysis, while industry leaders like Siemens and Palantir are deploying it to automate complex workflows in manufacturing and cybersecurity.
As noted by Kari Briski, vice president of AI software at Nvidia: "As companies move beyond chatbots and adopt multi-agent applications, they are encountering… an explosion of context."
Nemotron 3 Super is Nvidia’s response to that explosion: a model that delivers the "intellectual capacity" of a 120B-parameter system with the operational efficiency of a much smaller specialist. For the company, the message is clear: the "thinking tax" is finally coming down.