I left Copilot on VS Code for this free extension and it's miles ahead

Although I have started to reduce the VS Code extensions in my coding arsenal, I consider some of them to be almost essential for my programming tasks. For example, I still rely on extensions for C++, Python, Terraform, ansibleand other coding/IaC languages that I use to train my DevOps skills. Likewise, I have Container Tools for my self-hosting experiments, while Prettier makes my terribly formatted code a little more readable.

However, there is one extension that I consider more important than anything else in my setup: llama-vscode. If you haven’t heard of it, llama-vscode is designed to pair large language models with VS Code, and I dare say it’s better than GitHub Copilot for my coding needs, especially once I pair it with the bulky LLMs running on my local workstations.

I rebuilt my VS Code setup from scratch this year and it’s the fastest it’s ever been.

My VS Code was drowning in extensions

I don’t like the Copilot functionality built into VS Code

Its subscription fees and privacy issues make it terrible for my workloads.

Accessing the Copilot integration in VS Code

Let’s be clear: I’m not trying to say that the Copilot integration built into VS Code isn’t powerful enough. If anything, it is far superior to my local models when it comes to processing hundreds of billions of parameters. Numbers aren’t everything, however, and certain 26B-35B models are powerful enough to serve as decent replacements for their cloud counterparts (and I’ll get to that in a moment).

What really makes me avoid using Copilot is its subscription-heavy, cloud-based nature. The free version places restrictions on the number of chat and autocomplete messages, and I’m forced to hit those limits in a few coding sessions. Sure, it may be cheaper than other AI-powered VS Code rivals, but I’d rather not spend extra money on subscription fees every month.

Even if I give up my stingy nature, there is also the issue of privacy (or rather, the lack thereof) when I rely on an external server for my coding tasks. I often use LLM to debug complex projects or to understand what a certain function does, and this involves uploading several fragments (and sometimes entire configuration files) to the clanker. Between the sensitive nature of many project files and the fact that I often include sensitive information like user credentials and network details when I ask AI for help, you can see why I don’t want to use cloud-based models in my workflow.

Your old GPU can still run great LLMs – you just need the right settings

There are many things you can do with these models.

The llama-vscode extension has all the AI features I could ask for

It’s enough to replace Copilot in my VS Code setup

Despite its self-hosted nature, llama-vscode is capable enough to hold its own against the Copilot functionality built into VS Code. The auto-suggest feature works very well, especially when combined with a decent LLM. I also love that there are different shortcuts to accept the first word, line, or even the entire suggested chunks.

The chat feature is equally useful for asking my LLMs about random features, and I can even add entire files as context when I ping clankers to help me troubleshoot or debug a project. Better yet, VS even supports agent coding and I can fine-tune the tools and MCP servers I want my LLMs to take advantage of during a coding session. While its user interface is a little more complicated to use than VS Code’s Copilot, I got used to llama-vscode within just a few hours of using it for the first time.

The extension can even activate a llama.cpp environment

But I’ve paired it with bulky models running on local instances of the called server.

As for models, llama-vscode includes built-in templates for common LLMs, ranging from simple Qwen 2.5 encoder models that can run on CPU to full GPT OSS (20B). There are even provisions for accessing OpenRouter-based models, but I stay away from them for obvious reasons. I currently use two dedicated llama.cpp servers that I already set up before transitioning to llama-vscode, as it’s much easier to tune the model parameters on a separate LLM hosting server.

On my main PC, I have an RTX 3080 Ti running Qwen3.6-35B-A3Band I use it for most of my VS Code tasks. But for the rest of my self-hosted application stack, I implemented a Gemma-4-26B-A4B instance on my GTX 1080. Since they are both Expert Mix models, I can simply offload the experts and less-used parts of the LLM to system RAM, while leaving the attention layers on the GPU, thus running the models on hardware without VRAM and still getting reasonable token generation speeds. Connecting them to llama-vscode was as easy as heading to the Settings menu and entering my systems’ IP addresses into the endpoint URL fields.

Qwen3.6-35B-A3B, in particular, is extremely useful for my coding projects. I rely on it for everything from debugging strange functions to troubleshooting terminal outputs from failed Proxmox experiments, and it hasn’t let me down once. The best part? Since inference tasks only take a few seconds, my LLM hosting servers have almost no impact on my energy bills.

Source link

I left Copilot on VS Code for this free extension and it’s miles ahead

I rebuilt my VS Code setup from scratch this year and it’s the fastest it’s ever been.

I don’t like the Copilot functionality built into VS Code

Its subscription fees and privacy issues make it terrible for my workloads.

Your old GPU can still run great LLMs – you just need the right settings

The llama-vscode extension has all the AI features I could ask for

It’s enough to replace Copilot in my VS Code setup

The extension can even activate a llama.cpp environment

But I’ve paired it with bulky models running on local instances of the called server.

Leave a ReplyCancel Reply

Run it again: a budget Nothing Ear 3a is all these rumors can talk about

This is what Jeff Bezos’ new startup, Prometheus, will do

Google CEO Sundar Pichai was the target of protests during his commencement speech at Stanford. It wasn’t because of the AI

I rebuilt my VS Code setup from scratch this year and it’s the fastest it’s ever been.

I don’t like the Copilot functionality built into VS Code

Its subscription fees and privacy issues make it terrible for my workloads.

Your old GPU can still run great LLMs – you just need the right settings

The llama-vscode extension has all the AI ​​features I could ask for

It’s enough to replace Copilot in my VS Code setup

The extension can even activate a llama.cpp environment

But I’ve paired it with bulky models running on local instances of the called server.

Leave a ReplyCancel Reply

Trending now

Run it again: a budget Nothing Ear 3a is all these rumors can talk about

This is what Jeff Bezos’ new startup, Prometheus, will do

Google CEO Sundar Pichai was the target of protests during his commencement speech at Stanford. It wasn’t because of the AI

The llama-vscode extension has all the AI features I could ask for