Local LLMs are at a point where I can use them for most of my coding tasks, but moving up to larger models comes with some caveats. If you have unified memory, like on a Mac, you're golden, but I've been running models on an NVIDIA RTX 5090, and I noticed that the model seems to get worse over time.
Yes, the fastest consumer GPU money can buy right now becomes a paperweight the more I talk to my local LLM. I thought it was the model or the server, since I use Qwen 3.6 27B at the Q4_K_M quant inside LM Studio, but I tried other, smaller models and vLLM, and the same patterns emerged. I did some math and decided that context length was to blame, but that wasn't the whole story, and the real reasons were much more interesting.
The obvious answer was the wrong one.
The usual calculations for VRAM usage do not apply to Qwen 3.6
On paper, the RTX 5090 is a dream machine for local inference: 32 GB of GDDR7 memory, with plenty of bandwidth to feed the CUDA cores. LM Studio says Qwen 3.6 27B at Q4 is within its capabilities, and it performs well when first loaded. But over time the responses vary, and token generation slows down and keeps getting worse, even when I'm not actively using the chat.
The first gut reaction is to blame the model, but it hasn't changed, since it's pre-trained and you're not training it while running it locally. The next is to assume that days of activity have degraded it, but that is also incorrect. It was the context length, which LM Studio raised to the maximum of 262,144 tokens, that caused all the damage.
Obvious things really aren’t obvious.
Here's the thing: even the simple calculations I used to diagnose the problem were wrong. By the usual formula, the KV cache for a 256K context should work out to 64 GB, double the VRAM of the RTX 5090. Job done, this model is unusable. Except Qwen 3.6 27B is not a standard transformer; it is built on a hybrid architecture where only 16 of the 64 layers use full attention. The other layers do not scale with context, so the KV cache comes to about 16 GB at the full 256K context.
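The arithmetic behind the 64 GB and 16 GB figures can be sketched in a few lines. The layer counts (64 total, 16 with full attention) and fp16 storage come from this article; the `kv_heads=4` and `head_dim=256` values are assumptions chosen only because they reproduce those totals, not confirmed model specifications.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens, attn_layers, kv_heads=4, head_dim=256, bytes_per_elem=2):
    """Estimate KV cache size for layers whose cache grows with context.

    kv_heads and head_dim are illustrative assumptions, not published specs.
    bytes_per_elem=2 corresponds to fp16.
    """
    # Each token stores one key and one value vector per KV head, per layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return tokens * attn_layers * per_token_per_layer

context = 262_144
naive = kv_cache_bytes(context, attn_layers=64)   # treat every layer as full attention
hybrid = kv_cache_bytes(context, attn_layers=16)  # only the full-attention layers

print(f"all 64 layers:            {naive / GIB:.0f} GiB")
print(f"16 full-attention layers: {hybrid / GIB:.0f} GiB")
```

Because the cache is linear in both token count and layer count, cutting the number of context-dependent layers to a quarter cuts the cache to a quarter as well.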
However, that's still enough to use up all your VRAM, because the KV cache isn't the only thing that needs memory. You need to add 16.8 GB of weights, and then the overhead for Windows 11, your browser, the vision encoder, and CUDA buffers. Now you're over 32 GB, and that affects everything.
| Context length | KV cache (fp16) | + 16.8 GB weights | Fits in 32 GB? |
|---|---|---|---|
| 262K (as configured) | ~16 GB | ~33 GB + overhead | No, slightly over → spills |
| 128K | ~8 GB | ~25 GB | Yes |
| 64K | ~4 GB | ~21 GB | Yes, easily |
| 32K | ~2 GB | ~19 GB | Yes, lots of room |
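For anyone who wants to redo the table with their own numbers, here is a rough sketch. The 16.8 GB of weights, the ~16 GB cache at 262,144 tokens, and the 32 GB card are the article's figures; the `overhead_gb=2.0` default is a placeholder assumption for the OS, browser, vision encoder, and CUDA buffers, not a measurement.

```python
WEIGHTS_GB = 16.8             # quantized weights, from the article
KV_GB_AT_FULL_CONTEXT = 16.0  # approximate fp16 KV cache at 262,144 tokens
FULL_CONTEXT = 262_144
VRAM_GB = 32.0

def vram_needed_gb(context_tokens, overhead_gb=2.0):
    """Approximate VRAM need; overhead_gb is a hypothetical placeholder."""
    # The KV cache scales linearly with the configured context length.
    kv_gb = KV_GB_AT_FULL_CONTEXT * context_tokens / FULL_CONTEXT
    return kv_gb + WEIGHTS_GB + overhead_gb

for ctx in (262_144, 131_072, 65_536, 32_768):
    need = vram_needed_gb(ctx)
    verdict = "fits" if need <= VRAM_GB else "spills"
    print(f"{ctx:>7} tokens: ~{need:.1f} GB -> {verdict}")
```

The exact overhead matters most at the top of the range: the full 262K configuration is already past 32 GB before any overhead is counted.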
That's the short explanation of why my local LLM seems dumber. With LM Studio's defaults, you end up slightly over budget, and parts of the workload have to be offloaded to the CPU. "Slightly over" is more than enough to hurt prompt processing, token generation, and returning results, and it makes your capable LLM behave like a child's toy.
One thing makes your local LLM feel “dumber”
Once the context window fills up, your conversation loses history
It's important to understand what's going on under the hood while you talk to a local LLM. Each turn of a conversation feeds the whole history back into the model so it has the context it needs for the next response. That means more tokens per turn, and more text to read before the model can spend tokens on an answer.
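To make the growth concrete, here is a small sketch of how the prompt gets longer when the full history is resent on every turn. The per-turn token counts are made-up illustrative values, and `prompt_tokens_per_turn` is a hypothetical helper, not part of any chat front end.

```python
def prompt_tokens_per_turn(turn_sizes):
    """Return the prompt length the model sees at each turn.

    turn_sizes is a list of (user_tokens, reply_tokens) pairs.
    Assumes the whole conversation so far is sent with every request.
    """
    prompts = []
    history = 0
    for user_tokens, reply_tokens in turn_sizes:
        history += user_tokens
        prompts.append(history)  # everything the model must read before answering
        history += reply_tokens  # the reply becomes part of the next prompt
    return prompts

# Illustrative only: five turns of a 200-token question and an 800-token answer.
turns = [(200, 800)] * 5
print(prompt_tokens_per_turn(turns))  # [200, 1200, 2200, 3200, 4200]
```

Servers that cache the shared prefix can avoid recomputing it, but those tokens still occupy the context window.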
This exposes a peculiarity in how transformer models operate. They are considerably worse at recalling information from the middle of a long context window, preferring to read the beginning and the end and make do with that. When you talk to your local LLM for long enough, it starts to discard older instructions and forget your initial message.
The setting you thought would give you more intelligence, the 256K context window, becomes the Achilles' heel of the whole thing. Add that Qwen 3.6 is a reasoning model, with a hidden chain of thought, and you'll burn through that context window quickly and degrade your experience, not the model.
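One way a chat front end can end up forgetting the first message is a rolling window that keeps only the newest messages that fit. The sketch below is a generic illustration under that assumption, not a description of how LM Studio or any specific tool handles overflow; `trim_history` is a hypothetical helper, and counting words is a crude stand-in for a real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the newest messages that fit in max_tokens; drop the rest.

    count_tokens defaults to a word count, which only approximates tokens.
    """
    kept = []
    used = 0
    # Walk backwards from the newest message until the budget runs out.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "always answer in French",        # the initial instruction
    "hello there",
    "what is a KV cache",
    "explain it again more briefly",
]
print(trim_history(history, max_tokens=10))
```

With a budget of 10 "tokens", only the last two messages survive, so the initial instruction is the first thing to go.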
The model is not changing; the surrounding scaffolding is falling apart
We should clarify one thing. The model does not change one bit. The weights are frozen at inference; it's not learning anything new, it can't acquire bad habits, and no, you're not making it worse. However, all the other pieces of the puzzle around the model are variable, and context, memory pressure, and thermals all contribute to the LLM operating at suboptimal levels.
For the record, frontier cloud LLMs have the same limitations. Long inputs and attention issues cause even the best LLMs to lose track of the conversation over time, and it's a constant battle for companies to find solutions that don't break tools.
Slowdowns are due to another setting
KV cache is something to keep in mind
The more I learn about LLMs, the more I realize that I know nothing, and I had no idea how much configuration it takes to make them work well. The KV cache is essential, but its size is determined by the context length you set, regardless of whether you use the entire context window.
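Since the cache is sized by the configured context length, it can help to work backwards from the VRAM you have to the largest context length that fits. This is a rough sketch: the per-token cost is derived from this article's ~16 GB at 262,144 tokens, the 2 GB overhead is a placeholder assumption, and `max_context_tokens` is a hypothetical helper.

```python
def max_context_tokens(vram_gb, weights_gb, overhead_gb, kv_gb_per_token, step=1024):
    """Largest context length (rounded down to a multiple of step) that fits."""
    free_gb = vram_gb - weights_gb - overhead_gb
    if free_gb <= 0:
        return 0  # the weights and overhead alone already exceed VRAM
    tokens = int(free_gb / kv_gb_per_token)
    return tokens - tokens % step

# ~16 GB of KV cache at 262,144 tokens, per the article.
KV_GB_PER_TOKEN = 16.0 / 262_144

# 32 GB card, 16.8 GB of weights, and an assumed 2 GB of overhead.
limit = max_context_tokens(32.0, 16.8, 2.0, KV_GB_PER_TOKEN)
print(limit)
```

Setting the context length at or below a figure like this, with some extra margin, keeps the preallocated cache inside VRAM.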
That hidden memory cost may be enough to cause your VRAM to overflow. Interestingly, running an NVIDIA GPU on Windows is part of the problem, because the driver doesn't crash when VRAM fills up. Instead, it silently spills the excess into system RAM over the PCIe bus, slowing down the entire computer.
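Because the spill is silent, it is worth checking how close dedicated VRAM is to full. The sketch below parses the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`; the 95% threshold and the sample line are illustrative assumptions, and a nearly full reading is a warning sign rather than proof that a spill has happened.

```python
import subprocess

def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB) from nvidia-smi CSV output."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Ask nvidia-smi for used and total memory per GPU (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)

def near_full(used_mib, total_mib, threshold=0.95):
    """True when dedicated VRAM usage is at or above the chosen threshold."""
    return used_mib / total_mib >= threshold

# Made-up sample line for illustration; call query_gpu_memory() on a real machine.
sample = "31200, 32607\n"
for used, total in parse_memory_csv(sample):
    print(used, total, near_full(used, total))
```

Running this before and after a long chat session gives a quick sense of how much margin is left.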
A new chat solves many things.
Whether it's one long chat or smaller chats over the course of a week, the VRAM margin starts to disappear. Once it's gone, the excess spills into system memory and everything slows down further. Opening a new chat, reloading the model, or even restarting LM Studio fixes things again, which tells me that the cache was full and not clearing.
Yes, as with many computer problems, the solution is a simple "turn it off and on again." It's funny that this old trick helps with AI too, but here we are. A new chat resets the nonsense, and reloading the model fixes the slowdowns.
Your local LLM isn’t really dumber, but the way you use it is
The degradation of local LLM performance over time is due to multiple factors working together. Problems with the context window make results less useful over time, while thermal and memory problems contribute to slowdowns. LLMs are tools; they are not so mysterious. Treat them like any other computer tool and restart them from time to time, and your experience will feel less variable.







