

Companies that have been juggling separate models for reasoning, multimodal tasks, and agent coding can simplify their stack: Mistral’s new Small 4 brings all three together into a single open source model, with adjustable reasoning levels under the hood.
Small 4 enters a crowded field of small models, including Qwen and Claude Haiku, that compete on inference cost and baseline performance. Mistral's argument: shorter outputs that translate into lower latency and cheaper tokens.
Mistral Small 4 updates Mistral Small 3.2, which came out in June 2025, and is available under an Apache 2.0 license. “With Small 4, users no longer need to choose between a rapid instruction model, a powerful reasoning engine, or a multimodal assistant: one model now offers all three, with configurable reasoning effort and best-in-class efficiency,” Mistral said in a blog post.
The company said that despite its smaller size (Mistral Small 4 has 119 billion total parameters with only 6 billion active parameters per token), the model combines the capabilities of all Mistral models. It has the reasoning capabilities of Magistral, the multimodal understanding of Pixtral, and the agentic coding performance of Devstral. It also has a 256K context window which the company says works well for long conversations and analysis.
Rob May, co-founder and CEO of small language model marketplace Neurometric, told VentureBeat that Mistral Small 4 stands out for its architectural flexibility. However, it joins a growing number of smaller models which he says risk adding further fragmentation to the market.
"From a technical perspective, yes, it can be competitive against other models,” May said. “The biggest issue is that it has to overcome market confusion. Mistral has to earn mindshare to have a chance to be part of that set of tests first. Only then will they be able to show the technical capabilities of the model.”
Small models remain a good option for enterprise builders who want a comparable LLM experience at lower cost.
Like other Mistral models, Small 4 is built on a mixture-of-experts architecture. It has 128 experts, with four active per token, which Mistral says allows for efficient scaling and specialization.
This lets Mistral Small 4 respond faster, even on prompts that require more intensive reasoning. It can also process and reason over text and images, letting users analyze documents and graphics.
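The mechanics of that sparsity can be illustrated with a toy routing function. This is a simplified sketch, not Mistral's actual implementation: a router scores every expert for each token, and only the top-k experts run, so most parameters stay inactive on any given token.

```python
# Toy sketch of mixture-of-experts top-k routing (illustrative only).
# A router scores all experts per token; only the top-k actually execute,
# which is why active parameters are a small fraction of total parameters.

def route_token(router_scores, k=4):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return ranked[:k]

# 128 experts, as in Mistral Small 4; the scores below are made up.
scores = [0.0] * 128
for i, s in [(3, 0.9), (17, 0.8), (42, 0.7), (99, 0.6), (5, 0.1)]:
    scores[i] = s

active = route_token(scores, k=4)
print(active)  # the four highest-scoring experts: [3, 17, 42, 99]
```

In a real model the router is a learned layer and the selected experts' outputs are combined with the router's weights; the sketch only shows the selection step that keeps per-token compute low.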
Mistral said the model features a new parameter it calls reasoning_effort, which would allow users to “dynamically adjust the behavior of the model.” According to Mistral, companies could configure Small 4 to offer quick, lightweight answers in the same style as Mistral Small 3.2, or make it more detailed along the lines of Magistral, providing step-by-step reasoning for complex tasks.
Mistral said the Small 4 runs on fewer chips than comparable models, with a recommended configuration of four Nvidia HGX H100 or H200, or two Nvidia DGX B200.
“Delivering advanced open source AI models requires extensive optimization. Through close collaboration with Nvidia, inference has been optimized for both vLLM and open source SGLang, ensuring efficient, high-performance service across all deployment scenarios,” Mistral said.
According to Mistral benchmarks, Small 4 performs close to the level of Mistral Medium 3.1 and Mistral Large 3, particularly in MMLU Pro.
Mistral said the instruction-following performance makes the Small 4 suitable for high-volume enterprise tasks, such as document comprehension.
While competitive with small models from other companies, Small 4 still trails other popular open source models, especially on intensive reasoning tasks. Qwen 3.5 122B and Qwen 3-next 80B outperform Small 4 on LiveCodeBench, as does Claude Haiku in instruct mode.
Mistral Small 4 was able to beat OpenAI’s GPT-OSS 120B in the LCR.
Mistral maintains that Small 4 achieves these scores with “significantly shorter outputs” that translate into lower latency and inference costs than the other models. In instruct mode specifically, Small 4 produces the shortest outputs of any model tested: 2.1K characters versus 14.2K for Claude Haiku and 23.6K for GPT-OSS 120B. In reasoning mode, its outputs are much longer (18.7K), which is expected for that use case.
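Taking the reported figures at face value, the gap is easy to quantify. Shorter outputs mean fewer generated tokens per request, which is the main driver of both latency and per-request inference cost.

```python
# Back-of-the-envelope comparison of the instruct-mode output lengths
# reported above (thousands of characters per response).

small4_instruct = 2.1
claude_haiku = 14.2
gpt_oss_120b = 23.6

print(f"vs Claude Haiku: {claude_haiku / small4_instruct:.1f}x shorter")
print(f"vs GPT-OSS 120B: {gpt_oss_120b / small4_instruct:.1f}x shorter")
```

By these numbers, Small 4's instruct-mode responses are roughly 6.8x shorter than Claude Haiku's and about 11.2x shorter than GPT-OSS 120B's, which is the basis of Mistral's latency and cost claim.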
May said that while the choice of model depends on an organization’s goals, latency is one of the three pillars they should prioritize. “It depends on your goals and what you are optimizing your architecture for. Enterprises should prioritize these three pillars: reliability and structured throughput, latency-to-intelligence ratio, and fine-tuning and privacy,” May said.