Although I have started to reduce the VS Code extensions in my coding arsenal, I consider some of them to be almost essential for my programming tasks. For example, I still rely on extensions for C++, Python, Terraform, ansibleand other coding/IaC languages that I use to train my DevOps skills. Likewise, I have Container Tools for my self-hosting experiments, while Prettier makes my terribly formatted code a little more readable.
However, there is one extension that I consider more important than anything else in my setup: llama-vscode. If you haven’t heard of it, llama-vscode is designed to pair large language models with VS Code, and I dare say it’s better than GitHub Copilot for my coding needs, especially once I pair it with the bulky LLMs running on my local workstations.
I don’t like the Copilot functionality built into VS Code
Its subscription fees and privacy issues make it terrible for my workloads.
Let’s be clear: I’m not trying to say that the Copilot integration built into VS Code isn’t powerful enough. If anything, it is far superior to my local models when it comes to processing hundreds of billions of parameters. Numbers aren’t everything, however, and certain 26B-35B models are powerful enough to serve as decent replacements for their cloud counterparts (and I’ll get to that in a moment).
What really makes me avoid using Copilot is its subscription-heavy, cloud-based nature. The free version places restrictions on the number of chat and autocomplete messages, and I’m forced to hit those limits in a few coding sessions. Sure, it may be cheaper than other AI-powered VS Code rivals, but I’d rather not spend extra money on subscription fees every month.
Even if I give up my stingy nature, there is also the issue of privacy (or rather, the lack thereof) when I rely on an external server for my coding tasks. I often use LLM to debug complex projects or to understand what a certain function does, and this involves uploading several fragments (and sometimes entire configuration files) to the clanker. Between the sensitive nature of many project files and the fact that I often include sensitive information like user credentials and network details when I ask AI for help, you can see why I don’t want to use cloud-based models in my workflow.
The llama-vscode extension has all the AI features I could ask for
It’s enough to replace Copilot in my VS Code setup
Despite its self-hosted nature, llama-vscode is capable enough to hold its own against the Copilot functionality built into VS Code. The auto-suggest feature works very well, especially when combined with a decent LLM. I also love that there are different shortcuts to accept the first word, line, or even the entire suggested chunks.
The chat feature is equally useful for asking my LLMs about random features, and I can even add entire files as context when I ping clankers to help me troubleshoot or debug a project. Better yet, VS even supports agent coding and I can fine-tune the tools and MCP servers I want my LLMs to take advantage of during a coding session. While its user interface is a little more complicated to use than VS Code’s Copilot, I got used to llama-vscode within just a few hours of using it for the first time.
The extension can even activate a llama.cpp environment
But I’ve paired it with bulky models running on local instances of the called server.
As for models, llama-vscode includes built-in templates for common LLMs, ranging from simple Qwen 2.5 encoder models that can run on CPU to full GPT OSS (20B). There are even provisions for accessing OpenRouter-based models, but I stay away from them for obvious reasons. I currently use two dedicated llama.cpp servers that I already set up before transitioning to llama-vscode, as it’s much easier to tune the model parameters on a separate LLM hosting server.
On my main PC, I have an RTX 3080 Ti running Qwen3.6-35B-A3Band I use it for most of my VS Code tasks. But for the rest of my self-hosted application stack, I implemented a Gemma-4-26B-A4B instance on my GTX 1080. Since they are both Expert Mix models, I can simply offload the experts and less-used parts of the LLM to system RAM, while leaving the attention layers on the GPU, thus running the models on hardware without VRAM and still getting reasonable token generation speeds. Connecting them to llama-vscode was as easy as heading to the Settings menu and entering my systems’ IP addresses into the endpoint URL fields.
Qwen3.6-35B-A3B, in particular, is extremely useful for my coding projects. I rely on it for everything from debugging strange functions to troubleshooting terminal outputs from failed Proxmox experiments, and it hasn’t let me down once. The best part? Since inference tasks only take a few seconds, my LLM hosting servers have almost no impact on my energy bills.







