Google's free Gemma 4 model runs on hardware you probably already have

Local LLMs have now become useful tools and can easily handle tasks you wouldn’t have thought of even a year ago. The latest from Google is Gemma 4and while there are four models in the family, each one is adapted for different tasks.

That makes them interesting to use: you can choose the one that fits your hardware needs, and they are all published under the Apache 2.0 license, making them safe to build on. The smaller models work on laptops or mobile phones, while the two larger ones are designed for the best quality results on more capable hardware.

Google’s Gemma 4 isn’t the smartest local LLM I’ve taken, but it’s the one I look for the most

Google’s newest Gemma 4 models are powerful and useful.

Gemma 4 comes with different capacities

Chances are, your device can run at least one of these

lm studio showing variants of google's gemma 4 model that can be downloaded

Most of the time, when four different model weights are released, they are the same model, just quantized to smaller sizes. That makes them behave similarly, but with reduced accuracy as the models get smaller.

Gemma 4 does something different. The four models are all multi-modal, but are designed for different use cases appropriate to the hardware they can run on.

Model	VRAM Q4 (4 bits)	8-bit VRAM	VRAM FP16	best for
E2B(2B)	~3GB	~5GB	~5GB	Lightweight, integrated chat
E4B (4B)	5GB	7.5GB	15GB	General talk, summary.
26B Ministry of Education (A4B)	~16GB	25GB	48GB	RAG, coding assistance
31B Dense	24GB	34GB	62–80GB	High quality generation

The 31B Dense model is the flagship model and comfortably scores well in AI benchmarks used across the industry. So well that they can outperform models with 10 times the parameters, which is impressive, but that’s not the model most people will use. It still requires hardware that is out of reach for many, but that’s where the other models come in.

The 26B MoE consumes even less system resources and will serve as your coding assistant. But the E2B and E4B models are more interesting. These can run on relatively low-powered smartphones or laptops to enable PDF summaries, chatting to understand local storage, or other light tasks you would have done for cloud LLMs not long ago.

Downloadable and usable with your choice of LLM server

You can run Gemma 4 on your phone via the Google AI Edge Gallery appor on PC with Be, vllm, call.cpp, LM Studioor any other LLM server of your choice. That means you can easily choose the LLM model that suits your device while still giving you enough resources for a decent context window and other important settings.

Gemma 4 is the perfect on-premise solution for older hardware

Maybe you already have what you need

Gemma 4 doesn’t need high-end GPUs that cost five figures. You can run it on them, sure, but they’re not strictly necessary unless you want to run the Model 31B with FP16 accuracy.

The 26B MoE model, with a little quantization, works very well on RTX 5090 or RX 7900 XTX; with CPU offloading, you can run it on 16GB VRAM. This is because only a few billion parameters are used at any given time, so offloading does not cause as much of a performance hit as it does with other types of models.

Apple Silicon can run E4B with 8GB of RAM, or 26B MoE with 16GB (although it’s more comfortable with 32GB), and 64GB of RAM will happily run the 31B Dense model. It won’t run as fast as a dedicated GPU, but this underscores the benefits of unified memory architectures like Apple Silicon, AMD’s Strix Halo, and Nvidia’s DGX Spark.

The only thing to remember is that you will also need enough system RAM, because token generation speed requires more than just VRAM. 24GB is a good start if you have it, and anything more is a plus.

You don’t even need to stress your hardware

If you are using Gemma 4 31B up to Google AI StudyThe API for Gemma 4 gives you 1500 free requests per day as long as you stay below 15 requests per minute. There is no limit to the number of tokens you can use, so you can go crazy with whatever you want to build with the Gemma 4 model.

We don’t know how long it will last, as all other Google AI APIs have switched to token billing, but it’s worth using it while you can. That’s the full model, which would typically need a $10,000 GPU to run locally.

Your old GPU can still run great LLMs – you just need the right settings

There are many things you can do with these models.

Even the smallest models can increase productivity

Once you stop treating them like a chatbot

Gemma’s smallest model, E2BIt was designed for laptop or mobile phone use. It’s small, uses around 5GB of RAM in total, and can happily run on your CPU rather than a GPU. That gives you a 128K context window and still has functional tool calls, thinking modes, and system prompt support to make your LLM feel like it’s yours.

That’s a good size for use in Home Assistantto create automations, troubleshoot, and other general tasks. It’s probably enough to run it as your local voice assistant too, and that means it won’t send data to Google, Amazon or Apple in the process.

we have tested E2B beforeand while it did the job, it has some quirks. Some of them may be because it is running through LM Studio, i.e. YMMV, but sometimes it ignores prompts telling it not to show thought or to exchange temperature symbols. Still, these are minor issues when it still does what is asked of it, and on a 2B model at that.

My self-hosted LLMs are much more than just a chat replacement – this is how they increase my productivity

My local LLMs are enough to replace cloud platforms for my productivity tasks

You don’t need powerful hardware to run local LLMs like Gemma 4

With the release of Gemma 4, Google made it possible to run capable LLMs with very modest hardware requirements. This is a great advance, since although the four models are designed for different uses, they all share the same training data and underlying characteristics. It also means you can run AI tasks privately, without transferring data from your device and with more modest power requirements as they only run when you request them.

Source link

Google’s free Gemma 4 model runs on hardware you probably already have

Google’s Gemma 4 isn’t the smartest local LLM I’ve taken, but it’s the one I look for the most

Gemma 4 comes with different capacities

Chances are, your device can run at least one of these

Downloadable and usable with your choice of LLM server

Gemma 4 is the perfect on-premise solution for older hardware

Maybe you already have what you need

You don’t even need to stress your hardware

Your old GPU can still run great LLMs – you just need the right settings

Even the smallest models can increase productivity

Once you stop treating them like a chatbot

My self-hosted LLMs are much more than just a chat replacement – this is how they increase my productivity

You don’t need powerful hardware to run local LLMs like Gemma 4

Leave a ReplyCancel Reply

Epic Games Store Finally Gets Profiles, Reviews, and Controller Support to Take on Steam

6 apps that make any Android phone look like a Google Pixel

Snap unveils smart glasses specs at AWE 2026

Google’s Gemma 4 isn’t the smartest local LLM I’ve taken, but it’s the one I look for the most

Gemma 4 comes with different capacities

Chances are, your device can run at least one of these

Downloadable and usable with your choice of LLM server

Gemma 4 is the perfect on-premise solution for older hardware

Maybe you already have what you need

You don’t even need to stress your hardware

Your old GPU can still run great LLMs – you just need the right settings

Even the smallest models can increase productivity

Once you stop treating them like a chatbot

My self-hosted LLMs are much more than just a chat replacement – this is how they increase my productivity

You don’t need powerful hardware to run local LLMs like Gemma 4

Leave a ReplyCancel Reply

Trending now

Epic Games Store Finally Gets Profiles, Reviews, and Controller Support to Take on Steam

6 apps that make any Android phone look like a Google Pixel

Snap unveils smart glasses specs at AWE 2026