Google releases Gemma 4 under Apache 2.0, and that license change may matter more than benchmarks



For the past two years, companies evaluating open-weight models have faced an uncomfortable trade-off. Google's Gemma line consistently offered solid performance, but its custom license (with usage restrictions and terms Google could update at will) pushed many teams toward Mistral or Alibaba's Qwen instead. The legal review added friction. Compliance teams flagged edge cases. And as capable as Gemma 3 was, "open" with asterisks is not the same as open.

Gemma 4 eliminates that friction entirely. Google DeepMind's new family of open models ships under a standard Apache 2.0 license, the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem.

No custom clauses, no "harmful use" exceptions requiring legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that have been waiting for Google to play by the same licensing rules as the rest of the industry, the wait is over.

The timing is remarkable. While some Chinese AI labs (most notably Alibaba with its latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun to back away from fully open releases, Google is moving in the opposite direction: opening up its most capable Gemma generation yet, while explicitly stating that the architecture draws on its commercial Gemini 3 research.

Four models, two tiers: from edge to workstation in a single family

Gemma 4 arrives as four models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B A4B mixture-of-experts model; both support text and image input with 256,000-token context windows. The "edge" tier consists of the compact E2B and E4B models designed for phones, embedded devices, and laptops, supporting text, image, and audio input with 128,000-token context windows.

The naming convention takes some unpacking. The "E" prefix denotes "effective parameters": the E2B has 2.3 billion effective parameters but 5.1 billion in total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model performs like a 2B even though it technically weighs more.

The "A" in 26B A4B means "active parameters": only 3.8 billion of the MoE model's 25.2 billion total parameters are activated during inference, meaning it delivers roughly 26B-class intelligence at compute costs comparable to a 4B model.
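The naming arithmetic above can be sanity-checked in a few lines. This is a back-of-envelope sketch using the figures quoted in this article; the ratios are simple division, not measured numbers.

```python
# Ratios behind Gemma 4's "effective" and "active" parameter naming,
# using the counts quoted in the article.

def active_fraction(active_b: float, total_b: float) -> float:
    """Share of total parameters that matter per token."""
    return active_b / total_b

# E2B: 2.3B effective parameters out of 5.1B total (PLE embedding
# tables account for most of the difference -- big on disk, cheap to run).
e2b = active_fraction(2.3, 5.1)

# 26B A4B: only 3.8B of 25.2B total parameters are active per token.
a4b = active_fraction(3.8, 25.2)

print(f"E2B compute-relevant share: {e2b:.0%}")  # ~45%
print(f"A4B active share:           {a4b:.0%}")  # ~15%
```

The second ratio is why the MoE model can price out like a 4B despite carrying 26B-class weights.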

For IT leaders sizing GPU requirements, this translates directly into deployment flexibility. The MoE model can run on consumer GPUs and should quickly appear in tools like Ollama and LM Studio. The dense 31B model needs more headroom (think an NVIDIA H100 or RTX 6000 Pro for unquantized inference), but Google is also shipping Quantization-Aware Training (QAT) checkpoints to preserve quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, scaling down to zero when idle.

The MoE bet: 128 small experts to cut inference costs

The architectural choices inside the 26B A4B model deserve attention from teams evaluating inference economics. Instead of following the pattern of recent large MoE models, which use a handful of big experts, Google opted for 128 small experts, activating eight per token plus one always-active shared expert. The result is a model that competes with dense models in the 27B-31B range while running at roughly the speed of a 4B model during inference.
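The routing pattern described above can be sketched in a few lines of NumPy. This is an illustrative toy, not Gemma 4's actual implementation: the hidden dimension, the softmax router, and the random weights are all assumptions made for the sake of the example.

```python
# Toy sketch of 128-small-expert routing: top-8 experts per token
# plus one always-active shared expert. All dimensions and weights
# here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 8, 64

router_w = rng.standard_normal((D, NUM_EXPERTS)) * 0.02
experts = [rng.standard_normal((D, D)) * 0.02 for _ in range(NUM_EXPERTS)]
shared = rng.standard_normal((D, D)) * 0.02  # shared expert, always runs

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                 # router scores, one per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the 8 chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen experts only
    out = x @ shared                      # shared expert contributes every token
    for w, i in zip(weights, top):
        out += w * (x @ experts[i])       # only 8 of 128 experts do any work
    return out

y = moe_layer(rng.standard_normal(D))
print(y.shape)  # (64,)
```

The key point the sketch makes concrete: per token, compute touches 9 of 129 expert matrices, which is where the 4B-class inference cost comes from even though all 128 experts sit in memory.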

This is not just a benchmark curiosity: it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class speed means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document-processing pipelines, or multi-turn agent workflows, the MoE variant may be the most practical option in the family.

Both workstation models use a hybrid attention mechanism that interleaves local sliding-window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable, an important consideration for teams processing large documents, codebases, or multi-turn agent conversations.
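The interleaving described above is easy to picture as a per-layer schedule. The ratio of local to global layers and the "one global every six" cadence below are assumptions for illustration; the article only states that local and global layers alternate and that the final layer is global.

```python
# Sketch of a hybrid attention layer schedule: mostly local
# sliding-window layers, periodic full-global layers, last layer
# always global. The 1-in-6 cadence is an illustrative assumption.
def layer_schedule(num_layers: int, global_every: int = 6) -> list:
    kinds = []
    for i in range(num_layers):
        if i == num_layers - 1 or (i + 1) % global_every == 0:
            kinds.append("global")  # attends over the full context
        else:
            kinds.append("local")   # sliding-window attention only
    return kinds

sched = layer_schedule(12)
print(sched)
```

The memory win comes from the local layers: their KV caches are capped at the window size regardless of context length, so only the sparse global layers pay full 256K-context memory costs.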

Native multimodality: vision, audio, and function calling built in from the ground up

Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model would cooperate. Gemma 4 integrates all of these capabilities at the architectural level.

All four models handle variable-aspect-ratio image input with configurable visual token budgets, a significant improvement over the old Gemma 3n vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets of 70 to 1,120 tokens per image, letting developers trade detail for compute depending on the task.

The lowest budgets work for classification and captioning; higher budgets handle OCR, document analysis, and detailed visual reasoning. Multi-image and video input (processed as sequences of frames) is natively supported, enabling visual reasoning across multiple documents or screenshots.
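The task-to-budget guidance above could be encoded as a simple lookup. The specific numbers below are assumptions chosen from within the 70-1,120 range the article quotes, not official recommendations, and the helper function is hypothetical.

```python
# Illustrative mapping of task type to visual token budget, based on
# the guidance in the text. The individual values are assumptions
# within the quoted 70-1120 range, not published defaults.
BUDGET_BY_TASK = {
    "classification": 70,   # a coarse gist of the image is enough
    "captioning": 256,
    "document_qa": 768,
    "ocr": 1120,            # fine-grained text needs maximum detail
}

def pick_budget(task: str, default: int = 256) -> int:
    """Return a visual token budget for a task (hypothetical helper)."""
    return BUDGET_BY_TASK.get(task, default)

print(pick_budget("ocr"))             # 1120
print(pick_budget("unknown_task"))    # falls back to 256
```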

The two edge models add native audio processing, covering automatic speech recognition and speech-to-text translation, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration has been cut from 160ms to 40ms for more responsive transcription. For teams building voice applications that need to keep data local (think healthcare, field service, or multilingual customer interaction), running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.

Function calling is also native across all four models, building on Google's FeatureGemma research released late last year. Unlike previous approaches that relied on instruction following to coax models into structured tool use, Gemma 4's function calling was trained into the model from the start, optimized for multi-turn agent flows with multiple tools. This shows up in agent benchmarks, but more importantly it reduces the prompt-engineering overhead enterprise teams typically spend building tool-using agents.
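To make concrete what native function calling replaces, here is a generic tool-use round trip. The tool schema style, tool name, and fields below are hypothetical conventions for illustration, not Gemma 4's actual interface.

```python
# Generic illustration of a structured tool-call turn. Names and
# schema fields are hypothetical; the point is that a natively
# trained model emits a machine-parseable call rather than prose
# coaxed out through prompt engineering.
import json

tools = [
    {
        "name": "get_invoice",  # hypothetical tool
        "description": "Fetch an invoice by ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    }
]

# The model's turn: a structured call the runtime can execute
# directly, then feed the result back for the next turn.
model_turn = {"tool_call": {"name": "get_invoice",
                            "arguments": {"invoice_id": "INV-1234"}}}

call = model_turn["tool_call"]
assert call["name"] in {t["name"] for t in tools}  # validate before executing
print(json.dumps(call["arguments"]))
```

The validation step matters in multi-tool agent loops: a runtime can reject a hallucinated tool name before it ever executes, which is far harder when tool calls arrive embedded in free-form text.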

Benchmarks in context: where Gemma 4 lands in a crowded field

The benchmark figures tell a clear story of generational improvement. The dense 31B model scores 89.2% on AIME 2026 (a rigorous test of mathematical reasoning), 80.0% on LiveCodeBench v6, and hits a Codeforces Elo of 2150, numbers that would have been frontier-class for proprietary models not long ago. On vision, it reaches 76.9% on MMMU-Pro and 85.6% on MATH-Vision.

For comparison, Gemma 3 27B, which had no thinking mode, scored 20.8% on AIME and 29.1% on LiveCodeBench.

The MoE model follows closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond, a graduate-level scientific reasoning benchmark. The performance gap between the MoE and dense variants is modest given the MoE architecture's significant inference cost advantage.

The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench, strong for a model that runs on a T4 GPU. The even smaller E2B manages 37.5% and 44.0% respectively. Both handily outperform Gemma 3 27B on most benchmarks despite being a fraction of its size, thanks to built-in reasoning.

These numbers should be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 compete aggressively in this parameter range, and the field is moving fast. What sets Gemma 4 apart is less any single benchmark than the combination: robust reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license, all in one model family with deployment options spanning edge devices to serverless cloud.

What business teams should watch next

Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to tune for specific domains. Gemma's base models have historically been solid foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.

The serverless deployment option via Cloud Run with GPU support is worth a look for teams that need inference capacity that scales to zero. Paying only for actual compute during inference, rather than maintaining always-on GPU instances, could significantly change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.
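The economics argument above reduces to simple arithmetic. The hourly GPU rate and traffic profile below are made-up assumptions for illustration; substitute real Cloud Run GPU pricing for your region before drawing conclusions.

```python
# Toy cost comparison: always-on GPU instance vs scale-to-zero
# serverless inference. The rate and usage figures are illustrative
# assumptions, not real pricing.
GPU_RATE_PER_HOUR = 3.00   # assumed hourly rate, not a real quote
HOURS_PER_MONTH = 730

def always_on_cost() -> float:
    """Monthly cost of a GPU instance that never scales down."""
    return GPU_RATE_PER_HOUR * HOURS_PER_MONTH

def serverless_cost(busy_hours_per_day: float) -> float:
    """Monthly cost when you pay only for active inference time."""
    return GPU_RATE_PER_HOUR * busy_hours_per_day * 30

print(f"Always-on:           ${always_on_cost():,.0f}/mo")
print(f"Serverless @ 2h/day: ${serverless_cost(2):,.0f}/mo")
```

Under these assumed numbers, an internal tool busy two hours a day costs roughly a tenth of an always-on instance, which is the shape of the argument for scale-to-zero deployments of open models.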

Google has hinted that this may not be the full Gemma 4 family, and additional model sizes are likely to follow. But the combination available today (workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing on Gemini 3 research) represents the most comprehensive open model release Google has ever shipped. For enterprise teams that have been waiting for Google's open models to compete on licensing as well as performance, the evaluation can finally begin without a call to legal first.
