Local LLMs have now become useful tools and can easily handle tasks you wouldn’t have thought of even a year ago. The latest from Google is Gemma 4and while there are four models in the family, each one is adapted for different tasks.
That makes them interesting to use: you can choose the one that fits your hardware needs, and they are all published under the Apache 2.0 license, making them safe to build on. The smaller models work on laptops or mobile phones, while the two larger ones are designed for the best quality results on more capable hardware.
Gemma 4 comes with different capacities
Chances are, your device can run at least one of these
Most of the time, when four different model weights are released, they are the same model, just quantized to smaller sizes. That makes them behave similarly, but with reduced accuracy as the models get smaller.
Gemma 4 does something different. The four models are all multi-modal, but are designed for different use cases appropriate to the hardware they can run on.
|
Model |
VRAM Q4 (4 bits) |
8-bit VRAM |
VRAM FP16 |
best for |
|---|---|---|---|---|
|
E2B(2B) |
~3GB |
~5GB |
~5GB |
Lightweight, integrated chat |
|
E4B (4B) |
5GB |
7.5GB |
15GB |
General talk, summary. |
|
26B Ministry of Education (A4B) |
~16GB |
25GB |
48GB |
RAG, coding assistance |
|
31B Dense |
24GB |
34GB |
62–80GB |
High quality generation |
The 31B Dense model is the flagship model and comfortably scores well in AI benchmarks used across the industry. So well that they can outperform models with 10 times the parameters, which is impressive, but that’s not the model most people will use. It still requires hardware that is out of reach for many, but that’s where the other models come in.
The 26B MoE consumes even less system resources and will serve as your coding assistant. But the E2B and E4B models are more interesting. These can run on relatively low-powered smartphones or laptops to enable PDF summaries, chatting to understand local storage, or other light tasks you would have done for cloud LLMs not long ago.
Downloadable and usable with your choice of LLM server
You can run Gemma 4 on your phone via the Google AI Edge Gallery appor on PC with Be, vllm, call.cpp, LM Studioor any other LLM server of your choice. That means you can easily choose the LLM model that suits your device while still giving you enough resources for a decent context window and other important settings.
Gemma 4 is the perfect on-premise solution for older hardware
Maybe you already have what you need
Gemma 4 doesn’t need high-end GPUs that cost five figures. You can run it on them, sure, but they’re not strictly necessary unless you want to run the Model 31B with FP16 accuracy.
The 26B MoE model, with a little quantization, works very well on RTX 5090 or RX 7900 XTX; with CPU offloading, you can run it on 16GB VRAM. This is because only a few billion parameters are used at any given time, so offloading does not cause as much of a performance hit as it does with other types of models.
Apple Silicon can run E4B with 8GB of RAM, or 26B MoE with 16GB (although it’s more comfortable with 32GB), and 64GB of RAM will happily run the 31B Dense model. It won’t run as fast as a dedicated GPU, but this underscores the benefits of unified memory architectures like Apple Silicon, AMD’s Strix Halo, and Nvidia’s DGX Spark.
The only thing to remember is that you will also need enough system RAM, because token generation speed requires more than just VRAM. 24GB is a good start if you have it, and anything more is a plus.
You don’t even need to stress your hardware
If you are using Gemma 4 31B up to Google AI StudyThe API for Gemma 4 gives you 1500 free requests per day as long as you stay below 15 requests per minute. There is no limit to the number of tokens you can use, so you can go crazy with whatever you want to build with the Gemma 4 model.
We don’t know how long it will last, as all other Google AI APIs have switched to token billing, but it’s worth using it while you can. That’s the full model, which would typically need a $10,000 GPU to run locally.
Even the smallest models can increase productivity
Once you stop treating them like a chatbot
Gemma’s smallest model, E2BIt was designed for laptop or mobile phone use. It’s small, uses around 5GB of RAM in total, and can happily run on your CPU rather than a GPU. That gives you a 128K context window and still has functional tool calls, thinking modes, and system prompt support to make your LLM feel like it’s yours.
That’s a good size for use in Home Assistantto create automations, troubleshoot, and other general tasks. It’s probably enough to run it as your local voice assistant too, and that means it won’t send data to Google, Amazon or Apple in the process.
we have tested E2B beforeand while it did the job, it has some quirks. Some of them may be because it is running through LM Studio, i.e. YMMV, but sometimes it ignores prompts telling it not to show thought or to exchange temperature symbols. Still, these are minor issues when it still does what is asked of it, and on a 2B model at that.
You don’t need powerful hardware to run local LLMs like Gemma 4
With the release of Gemma 4, Google made it possible to run capable LLMs with very modest hardware requirements. This is a great advance, since although the four models are designed for different uses, they all share the same training data and underlying characteristics. It also means you can run AI tasks privately, without transferring data from your device and with more modest power requirements as they only run when you request them.







