I ran my local LLM for hours and watched it get sillier in real time

Local LLMs are at a point where I can use them for most of my coding tasks, but moving up to larger models comes with some caveats. If you have unified memory, like on a Mac, you’re golden, but I’ve been running models on an NVIDIA RTX 5090 and I noticed that the model seems to get worse over time.

Yes, the fastest consumer GPU money can buy right now becomes a paperweight the more I talk to my local LLM. I thought it was the model or the server, since I use Qwen 3.6 27B at the Q4_K_M quant inside LM Studio, but I tried other, smaller models and vLLM, and the same patterns emerged. I did some math and decided that context length was to blame, but it wasn’t that simple, and the real reasons were much more interesting.

The obvious answer was the wrong one.

The usual calculations for VRAM usage do not apply to Qwen 3.6

On paper, the RTX 5090 is a dream local inference beast, with 32 GB of GDDR7 memory and plenty of bandwidth to supply data to the CUDA cores. LM Studio says Qwen 3.6 27B in Q4 is within its capabilities, and it performs well when first running. But over time the responses vary, and token generation slows down and gets worse even when I’m not actively using the chat.

The first gut reaction is to blame the model, but it hasn’t changed, since it’s pre-trained and you’re not training it while running it locally. The next guess is that days of activity have degraded it, but that is also incorrect. Instead, the context length, which LM Studio raised to the maximum of 262,144 tokens, caused all the damage.

Obvious things really aren’t obvious.

Here’s the thing: even the simple calculations I used to diagnose these problems were wrong. The KV cache for a 256K context should work out to 64 GB, double the VRAM of the RTX 5090. Job done, this model is useless. Except Qwen 3.6 27B is not a standard transformer; it is built on a hybrid architecture where only 16 of the 64 layers use full attention. The other layers do not scale with context, which leaves you with about 16 GB of KV cache at the full 256K context.
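The arithmetic behind those two figures can be sketched roughly as follows. The head count and head dimension below are hypothetical, picked only so the naive estimate reproduces the 64 GB figure; they are not confirmed specifications for Qwen 3.6 27B. Only the 64 versus 16 layer split comes from the text above.

```python
def kv_cache_bytes(context_tokens, attention_layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, per attention layer."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return context_tokens * attention_layers * per_token_per_layer


GIB = 2**30
CONTEXT = 262_144  # the maximum context length from the article

# Hypothetical head layout (assumption), chosen so the naive case lands on 64 GiB.
KV_HEADS, HEAD_DIM = 8, 128

naive = kv_cache_bytes(CONTEXT, 64, KV_HEADS, HEAD_DIM)   # as if all 64 layers cached K/V
hybrid = kv_cache_bytes(CONTEXT, 16, KV_HEADS, HEAD_DIM)  # only the 16 full-attention layers

print(f"naive: {naive / GIB:.0f} GiB, hybrid: {hybrid / GIB:.0f} GiB")
```

With these assumed values the script prints `naive: 64 GiB, hybrid: 16 GiB`. The point is the ratio: cutting the number of cache-holding layers to a quarter cuts the cache to a quarter, whatever the real head layout is.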

However, that’s still enough to use up all your VRAM, because the KV cache isn’t the only thing that needs memory. You need to add 16.8 GB of weights, and then add the overhead for Windows 11, your browser, the vision encoder, and CUDA buffers. Now you’re over 32 GB, and that affects everything.

Context length	KV cache (fp16)	KV cache + 16.8 GB weights	Fits in 32 GB?
262K (as configured)	~16 GB	~33 GB + overhead	No, slightly over, so it spills
128K	~8 GB	~25 GB	Yes
64K	~4 GB	~21 GB	Yes, easily
32K	~2 GB	~19 GB	Yes, with lots of room
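The table can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes the KV cache scales linearly with context length from the 16 GB figure at 262,144 tokens; the `overhead_gb` parameter is a placeholder for the operating system, browser, and CUDA buffers, which the article does not quantify.

```python
WEIGHTS_GB = 16.8             # model weights, from the article
VRAM_GB = 32.0                # RTX 5090
KV_AT_FULL_CONTEXT_GB = 16.0  # article's KV cache figure at full context
FULL_CONTEXT = 262_144


def vram_needed_gb(context_tokens, overhead_gb=0.0):
    """KV cache (scaled linearly with context) plus weights plus any extra overhead."""
    kv_gb = KV_AT_FULL_CONTEXT_GB * context_tokens / FULL_CONTEXT
    return kv_gb + WEIGHTS_GB + overhead_gb


def fits(context_tokens, overhead_gb=0.0):
    """True when the estimate stays within the card's VRAM."""
    return vram_needed_gb(context_tokens, overhead_gb) <= VRAM_GB


for ctx in (262_144, 131_072, 65_536, 32_768):
    verdict = "fits" if fits(ctx) else "spills"
    print(f"{ctx:>7} tokens: ~{vram_needed_gb(ctx):.1f} GB -> {verdict}")
```

Even with zero overhead the full context comes out at roughly 32.8 GB, which is why only the top row of the table fails.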

That’s the short explanation of why my local LLM seems dumber. With LM Studio’s defaults you run out of VRAM and parts have to be offloaded to the CPU side. “Slightly over” turns out to be “far too much” when it comes to handling prompts, queries, and returning results, and it makes your capable LLM behave like a child’s toy.

One thing makes your local LLM feel “dumber”

Once the context window fills up, your conversation loses history

It’s important to understand what’s going on under the hood while talking to a local LLM. Each turn of a conversation feeds the whole history back into the model so it has the context to generate a new response. That means more tokens per turn, and more text for the model to read before it can spend tokens on an answer.
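To make that growth concrete, here is a small sketch of how the context fills as turns accumulate. The per-turn token counts are made-up illustration values, not measurements from LM Studio.

```python
from itertools import accumulate


def context_size_after_each_turn(tokens_added_per_turn):
    """Running total of tokens held in context when the full history is resent every turn."""
    return list(accumulate(tokens_added_per_turn))


# Hypothetical conversation: tokens added by each turn (user message plus model reply).
turns = [500, 1200, 800, 2000]
print(context_size_after_each_turn(turns))
```

Each new turn is processed on top of everything before it, so the fourth turn here sits on a context of 4,500 tokens even though it only added 2,000 itself.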

This exposes a quirk in how transformer models operate. They are considerably worse at recalling information from the middle of a long context window, preferring to read the beginning and the end and make do with that. When you talk to your local LLM long enough, it starts to discard older instructions and forget your initial message.

The setting you thought would give you intelligence, the 256K context window, becomes the Achilles’ heel of the whole thing. Add that Qwen 3.6 is a reasoning model, with a hidden thought trail, and you’ll burn through that context window quickly and degrade your experience, not the model.
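A back-of-the-envelope estimate shows how much faster a reasoning trail can eat the window. The token counts are hypothetical, and the sketch assumes reasoning tokens stay in the history, which depends on how the chat client handles them.

```python
def turns_until_full(context_limit, visible_per_turn, reasoning_per_turn=0):
    """How many whole turns fit before the context limit is reached."""
    per_turn = visible_per_turn + reasoning_per_turn
    if per_turn <= 0:
        raise ValueError("each turn must add at least one token")
    return context_limit // per_turn


CONTEXT_LIMIT = 262_144  # the maximum context from the article

# Hypothetical averages: 1,500 visible tokens per turn, 4,000 hidden reasoning tokens.
print(turns_until_full(CONTEXT_LIMIT, 1_500))         # without reasoning
print(turns_until_full(CONTEXT_LIMIT, 1_500, 4_000))  # with reasoning kept in history
```

Under these assumed numbers the conversation lasts 174 turns without reasoning tokens and 47 with them.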

The model is not changing; the surrounding scaffolding is falling apart

We should clarify one thing. The model does not change one bit. The weights are frozen at inference; it’s not learning anything new, it can’t acquire bad habits and no, you’re not making it worse. However, all the other pieces of the puzzle around that model can change, and context, memory pressure, and thermals all contribute to the LLM operating at suboptimal levels.

For the record, frontier LLMs in the cloud have the same limitations. Long inputs and attention issues cause even the best LLMs to lose track of the conversation over time, and it’s a constant battle for companies to find solutions that don’t break tools.

Slowdowns are due to another setting

KV cache is something to keep in mind

The more I learn about LLMs, the more I realize that I know nothing, and the sheer amount of configuration required to make them work well is something I hadn’t appreciated. The KV cache is essential, but its size is set by the context length you configure, regardless of whether you use the entire context window.
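Because the cache is sized from the configured context length, one way to pick that setting is to work backwards from the VRAM you can spare. This sketch reuses the article’s figures (16.8 GB of weights, about 16 GB of KV cache at 262,144 tokens, 32 GB of VRAM); the overhead value is a hypothetical placeholder you would replace with your own measurement.

```python
WEIGHTS_GB = 16.8
VRAM_GB = 32.0
KV_AT_FULL_CONTEXT_GB = 16.0
FULL_CONTEXT = 262_144


def max_context_tokens(overhead_gb, step=1024):
    """Largest context length (rounded down to a multiple of `step`) whose KV cache fits."""
    spare_gb = VRAM_GB - WEIGHTS_GB - overhead_gb
    if spare_gb <= 0:
        return 0
    tokens = int(spare_gb / KV_AT_FULL_CONTEXT_GB * FULL_CONTEXT)
    tokens = min(tokens, FULL_CONTEXT)  # never exceed the model's maximum
    return tokens // step * step


# Hypothetical 3 GB reserved for the OS, browser, vision encoder, and CUDA buffers.
print(max_context_tokens(overhead_gb=3.0))
```

With 3 GB of assumed overhead this comes out at 199,680 tokens, comfortably below the 262,144 maximum that caused the spill.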

That hidden memory cost may be enough to cause your VRAM to overflow. Interestingly, the fact that you’re using an Nvidia GPU on Windows is part of the problem, because the driver doesn’t crash when VRAM fills up. Instead, it silently dumps the excess into the system RAM via the PCIe bus, slowing down the entire computer.
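Since the driver gives no warning, it can help to watch the VRAM margin yourself. The sketch below shells out to `nvidia-smi` with its CSV query output; the helper names are my own. Treat a shrinking margin as a hint rather than proof of spilling, because this query reports dedicated memory only, not what has moved to system RAM.

```python
import subprocess


def parse_memory_csv(line):
    """Parse a 'used, total' line (MiB, no units) as printed by nvidia-smi."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib(gpu_index=0):
    """Ask nvidia-smi for used and total VRAM of one GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
            "-i", str(gpu_index),
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout.strip().splitlines()[0])


def headroom_fraction(used_mib, total_mib):
    """Share of VRAM still free, between 0.0 and 1.0."""
    return (total_mib - used_mib) / total_mib
```

Calling `headroom_fraction(*gpu_memory_mib())` before and after a long chat gives a rough picture of how much margin the session has consumed.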

A new chat solves many things.

Whether it’s one long chat or smaller chats over the course of a week, the VRAM margin starts to disappear. Once it’s gone, the excess spills back into system memory and everything slows down faster than before. Opening a new chat, reloading the model, or even restarting LM Studio fixes things again, which tells me the cache was full and not clearing.

Yes, as with many computer problems, the solution is a simple “have you tried turning it off and on again?” It’s funny that such a basic step helps with AI, but here we are. A new chat resets the nonsense, and reloading the model fixes the slowdowns.

Your local LLM isn’t really dumber, but the way you use it is

The degradation of local LLM performance over time is due to multiple factors working together. Problems with the context window make results less useful over time, while thermal and memory problems contribute to slowdowns. LLMs are tools; they are not so mysterious. Treat them like any other computer tool and restart them from time to time, and your experience will feel less variable.
