Cohere’s open ASR model achieves a word error rate of 5.4%, low enough to replace closed speech APIs in production pipelines.



Companies building voice-enabled workflows have had limited options for production-grade transcription: closed APIs with data residency risks or open models that trade accuracy for deployability. Cohere’s new open ASR model, Transcribe, is designed to compete on four key differentiators: contextual accuracy, latency, control and cost.

Cohere says Transcribe outperforms current leaders in accuracy and, unlike closed APIs, can run on an organization’s own infrastructure.

Transcribe, which can be accessed via an API or in Cohere’s Model Vault as cohere-transcribe-03-2026, has 2 billion parameters and is licensed under Apache-2.0. The company said Transcribe has an average word error rate (WER) of just 5.42%, lower than comparable models.
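WER is the standard accuracy metric for speech recognition: the number of word-level substitutions, insertions, and deletions needed to turn the model’s transcript into the reference, divided by the number of reference words. A minimal sketch of the computation in Python (illustrative only, not Cohere’s evaluation code, and omitting the text normalization real benchmarks apply first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, d[j] = d[j], min(
                d[j] + 1,         # deletion (reference word dropped)
                d[j - 1] + 1,     # insertion (extra hypothesis word)
                prev_diag + cost, # substitution, or match if cost == 0
            )
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```

A 5.42% WER means roughly one error in every 18 or 19 words of reference text.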

The model is trained in 14 languages: English, French, German, Italian, Spanish, Greek, Dutch, Polish, Portuguese, Chinese, Japanese, Korean, Vietnamese and Arabic. The company did not specify which Chinese dialect the model was trained on.

Cohere said it trained the model “with a deliberate focus on minimizing WER, while keeping production readiness a priority.” According to Cohere, the result is a model that companies can connect directly to voice-driven automations, transcription pipelines, and audio search workflows.

Self-hosted transcription for production pipelines

Until recently, enterprise transcription has been a trade-off: closed APIs offered accuracy but locked in data; open models offered control but lagged in performance. Unlike Whisper, which was released as a research model under an MIT license, Transcribe is positioned for commercial use at launch and can run on an organization’s local GPU infrastructure. Early adopters noted that the commercially ready open-weight approach was meaningful for enterprise deployments.

Organizations can deploy Transcribe on their own on-premises instances, as Cohere said the model has a manageable inference footprint for on-premises GPUs. The company said it was able to do this because the model “extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while maintaining best-in-class performance (high RTFx) within the cohort of billion-plus parameter models.”
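RTFx, the throughput metric in Cohere’s quote, is the inverse real-time factor: seconds of audio processed per second of wall-clock time, so higher means faster. A quick illustration (the timings below are hypothetical, not Cohere’s benchmark numbers):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / wall-clock processing time."""
    return audio_seconds / processing_seconds

# A hypothetical 60-minute recording transcribed in 30 seconds of GPU time:
print(rtfx(3600, 30))  # → 120.0, i.e. 120x faster than real time
```

For capacity planning, an RTFx of 120 on a given GPU would mean one device can keep up with roughly 120 concurrent real-time audio streams, ignoring batching and queuing overhead.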

How Transcribe Compares

Transcribe beat out voice model stalwarts including OpenAI’s Whisper, which powers ChatGPT’s voice feature, and ElevenLabs, whose models many big retail brands deploy. It currently leads the Hugging Face Open ASR Leaderboard with an average word error rate of 5.42%, ahead of Whisper Large v3 at 7.44%, ElevenLabs Scribe v2 at 5.83%, and Qwen3-ASR-1.7B at 5.76%.

Transcribe also performed well on the other datasets Hugging Face tests. On the AMI dataset, which measures meeting transcription and dialogue analysis, Transcribe recorded a WER of 8.15%. On the VoxPopuli dataset, which tests understanding of different accents, the model scored 5.87%, second only to Zoom Scribe.

Early adopters have pointed to accuracy and local deployment as notable factors, especially for teams that have been routing audio data through external APIs and want to bring that workload in-house.

For engineering teams creating RAG pipelines or agent workflows with audio inputs, Transcribe offers a path to production-grade transcription without the data residency and latency penalties of closed APIs.


