Cohere open-sources a coding agent model that runs on a single H100



Engineering teams building agentic coding pipelines now have a concrete open source alternative to managed models like Claude Fable 5, one that runs on a single H100. The trade-off: Cohere’s North Mini Code, which launched Tuesday, generated three times as many output tokens as comparable models in independent tests, a verbosity cost that compounds in high-volume production workloads.

The new open source model is a 30 billion parameter mixture-of-experts (MoE) model with 3 billion active parameters per token, built for agentic software engineering including subagent orchestration, architecture mapping, code review, and terminal work. The model supports a context window of 256,000 tokens with a maximum generation length of 64,000 tokens and is available on Hugging Face under an Apache 2.0 license.

What North Mini Code can do

North Mini Code targets the full agentic coding stack. Here is what the model does and what it works with.

Software engineering. Cohere built North Mini Code specifically for agentic software engineering rather than adapting a general-purpose base model. It has native tool-use capabilities and supports interleaved thinking, which Cohere says improves performance in multi-step agent work.

Architecture mapping and code review. North Mini Code can analyze and map system architecture, surface dependencies, and perform code reviews on large codebases. With a context window of 256,000 tokens, it can hold substantial multi-file projects in a single context pass.

Terminal-based agentic tasks. The model is trained for terminal environments, handling shell interactions, package scripts, and command line tools. Cohere benchmarked it on Terminal-Bench v2, which tests agents in real terminal environments rather than on synthetic code generation tasks.

How it was built

North Mini Code is a sparse mixture-of-experts model with 128 experts, of which 8 are activated per token. The compute requirement at inference time is closer to that of a 3 billion parameter model despite the 30 billion total parameters. Nick Frosst, co-founder of Cohere, demonstrated it running on a Mac Studio via MLX with around 20 gigabytes of RAM, the same machine he uses for his own local coding work.
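To make the "8 of 128" figure concrete, here is a minimal sketch of top-k expert routing in plain Python. It illustrates the general technique only, not Cohere's implementation: the function name `route_token`, the random router scores, and the choice to normalize weights with a softmax over the selected experts are all assumptions for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # experts activated per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and weight them.

    ASSUMPTION: weights are a softmax over the selected logits only.
    This is one common convention; the article does not say which
    scheme North Mini Code uses.
    """
    # Indices of the top_k largest router scores.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routing = route_token(logits)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts run per token
```

Only the 8 selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total. The active share of parameters (3 of 30 billion) is higher than 8/128, presumably because non-expert layers such as attention and embeddings run for every token; the article does not break this down.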

Cohere trained the model through two stages of supervised fine-tuning followed by reinforcement learning with verifiable rewards on over 70,000 verifiable tasks spanning approximately 5,000 repositories, deduplicated against SWE-Bench.

Instead of optimizing against a single agent scaffold, Cohere trained on three. SWE-Agent uses a rich CLI with specialized commands. Mini-SWE-Agent uses a single bash tool with raw shell output. OpenCode uses individually written tools that return structured JSON. Cohere reports a 10 percentage point gain on its OpenCode evaluation from the multi-harness approach while maintaining SWE-Agent performance.

Where it fits

North Mini Code enters a market that now includes Mistral Devstral Small 2, GitHub Copilot, Cursor, and Claude Fable 5, each with different cost and deployment tradeoffs.

Cohere’s main benchmark comparison is against Mistral Devstral Small 2, a dense model of 24 billion parameters. In vendor-reported internal testing on identical hardware configurations, Cohere claims 2.8x the output throughput and a 30% inter-token latency advantage over Devstral Small 2. Cohere also states, in its Hugging Face technical post, that North Mini Code outperforms open source models of up to four times its parameter count on its reported benchmarks, including models with 120 billion parameters.

Independently, Artificial Analysis ranks it eighth out of 127 comparable open weight models for output speed, at 210 tokens per second, with a time to first token of 0.25 seconds versus a class average of 1.95 seconds. It ranks 18th out of 127 on the Artificial Analysis Intelligence Index. A caveat from the same data: the model generated 75 million output tokens to complete the Intelligence Index against a class average of 25 million. In high-volume agent pipelines, that verbosity becomes latency and inference cost.
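The arithmetic behind the verbosity caveat can be sketched as follows. The speed, time-to-first-token, and token totals are the figures cited above; the 1,000-token baseline response and the helper name `response_seconds` are hypothetical, and the model assumes a single uninterrupted stream at the reported speed.

```python
OUTPUT_TOKENS_PER_SECOND = 210.0      # reported output speed
TIME_TO_FIRST_TOKEN_S = 0.25          # reported time to first token
INDEX_TOKENS_MODEL = 75_000_000       # tokens North Mini Code generated
INDEX_TOKENS_CLASS_AVG = 25_000_000   # class average on the same index

# How many output tokens the model emits per token of the class average.
verbosity_ratio = INDEX_TOKENS_MODEL / INDEX_TOKENS_CLASS_AVG


def response_seconds(output_tokens,
                     tps=OUTPUT_TOKENS_PER_SECOND,
                     ttft=TIME_TO_FIRST_TOKEN_S):
    """Wall-clock time for one streamed response at a fixed speed."""
    return ttft + output_tokens / tps


# ASSUMPTION: a hypothetical answer that an average model in the class
# would deliver in 1,000 output tokens.
baseline_tokens = 1_000
terse_seconds = response_seconds(baseline_tokens)
verbose_seconds = response_seconds(baseline_tokens * verbosity_ratio)
```

Under these assumptions, fast token generation offsets only part of the extra output: the same task takes roughly three times as long to stream once the verbosity ratio is applied, and the gap widens with every step in a multi-turn agent loop.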

"Suddenly people think: Am I getting enough economic value from a model’s tokens?" Frosst said in the launch video. "Deploying locally is a way to empower people and make AI something that really works for them."

GitHub Copilot, Cursor, and Claude Code operate on per-use or subscription pricing with no local option. Anthropic’s Claude Fable 5, now the most capable publicly available managed coding model, costs $50 per million output tokens. For Frosst, the model is the polar opposite of Fable.

"It’s small, cost-effective, Apache 2.0 and can be deployed locally. This is how LLMs should work: small, open source, transparent and sovereign, versus large, expensive, proprietary and hegemonic," Frosst wrote in a post on X.

What this means for companies

For teams building production agentic coding pipelines, the release of North Mini Code clarifies a set of decisions that have been taking shape for months.

Purpose-built agentic training is now an evaluation criterion. The distinction between models tuned for code and models trained specifically for agent workflows, with verified tool calls and multi-harness robustness, is now a material factor in model selection. Any model vendor that claims agentic coding capability should be able to say whether its training used verifiable agent tasks or was adapted from a general-purpose base.

Verbosity is a hidden cost that benchmarks don’t reveal. Artificial Analysis measured North Mini Code generating three times as many output tokens as comparable models. That verbosity compounds inference cost and latency in high-volume pipelines. Testing against actual workload volume is the evaluation step that benchmark rankings skip.

The price gap at the frontier is now a real architectural decision. Fable 5 at $50 per million output tokens and North Mini Code on a single H100 represent a genuine trade-off between cost control and data residency on one side and offloading infrastructure overhead to a managed service on the other. Teams running high-volume agentic coding pipelines should model both cost paths against their actual workload before committing to either.
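A rough way to model the two cost paths is sketched below. Only the $50 per million output tokens price comes from the reporting above. The H100 hourly rate is a placeholder, the 210 tokens per second figure is borrowed from the Artificial Analysis speed measurement and may not match a self-hosted single-H100 setup, and applying the 3x verbosity ratio against Fable 5 specifically is an assumption, since that ratio was measured against a class average. The sketch also ignores input tokens, batching, idle GPU time, and staffing.

```python
FABLE_USD_PER_M_OUTPUT = 50.0    # from the article
H100_USD_PER_HOUR = 3.0          # ASSUMPTION: placeholder rental price
LOCAL_TOKENS_PER_SECOND = 210.0  # ASSUMPTION: reuses the reported speed
VERBOSITY_FACTOR = 3.0           # ASSUMPTION: class-average ratio applied


def managed_cost_usd(output_tokens):
    """Output-token bill on the managed model."""
    return output_tokens / 1_000_000 * FABLE_USD_PER_M_OUTPUT


def self_hosted_cost_usd(baseline_output_tokens):
    """GPU rental cost for the same work on one H100, single stream.

    The baseline token count is inflated by the verbosity factor
    before converting generation time into GPU hours.
    """
    tokens = baseline_output_tokens * VERBOSITY_FACTOR
    gpu_hours = tokens / LOCAL_TOKENS_PER_SECOND / 3600
    return gpu_hours * H100_USD_PER_HOUR


# Compare both paths for a hypothetical workload of one million
# baseline output tokens.
workload = 1_000_000
managed = managed_cost_usd(workload)
self_hosted = self_hosted_cost_usd(workload)
```

Swapping in a team's own hourly rate, measured throughput, and observed token counts matters more than the placeholder values here; the point of the sketch is that verbosity enters the self-hosted path as extra GPU time rather than as a per-token charge.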


