I finally stopped forcing local LLMs and went back to cloud AI


Until a few months ago, you could probably have gotten by using nothing but the free tiers of AI tools. Most tools didn’t have hard limits, and if you hit one, you could switch to a different tool and keep going. If you reached your limit on Claude’s free tier, you simply switched to your free ChatGPT account. If that ran out too (or ChatGPT didn’t give you what you needed), Gemini was waiting. For a long stretch, that was the workflow for most people.

Premium tiers gave you more generous limits and access to more powerful models (and to new features earlier than everyone else). The first part was covered by bouncing between the free tiers of different tools, and the average person didn’t really care about early access to features. The free stuff worked, and that was enough. Fast forward to today, and the free tiers of most tools are no longer enough. Claude’s free tier has brutal limits, Gemini now has weekly limits too, and the same can be said for ChatGPT. Rotating between free tiers technically still works, but each tool now has its own strengths, and bouncing between them means giving up the best of each.

Another problem is how expensive it has become to run multiple AI subscriptions. Paying for a basic paid plan is fine, but those tiers don’t unlock the features that justify paying in the first place. Those sit behind the higher tiers, and you typically have to pay over $100 a month to get them. As a result, you may have seen many people running AI models locally on their own servers. What most people leave out is that running LLMs locally is not for everyone, including me.

First of all, you need really powerful hardware.

I hope you like the sound of the fans.

Lenovo Thinkstation PGX, displaying the PGX logo

LLMs are trained on massive data sets. Generally, the larger the data set and the more parameters a model has, the more capable it will be. More parameters also translate into a much larger model file. With cloud-based LLMs, all of this is stored and processed on the provider’s own infrastructure: massive data centers equipped with hardware designed for this type of workload. Every time you send a message, that hardware does all the work and you get your response. There is no burden on your device, meaning all you really need is a browser and an internet connection.


However, running that same model locally is a completely different story. Instead of offloading the job to a remote server, you need to host the entire model on your own machine. Basically, that means your CPU, GPU, and RAM are doing the work that an entire data center used to do on your behalf. Each part of the process, including loading the model, processing the prompt, and generating a response, must be done on your own hardware. And while companies are now clearly directing their efforts toward making smaller, more efficient models that can run on consumer hardware, the reality is that models worth running still require a serious machine. In fact, you’ll probably need more powerful hardware than you already have to run even some of the smaller models that are actually usable.

The “small” models that people recommend for beginners still need at least 16GB of RAM to run comfortably, and that’s just to get anything working. And while the results may meet your expectations depending on the task you’re performing, the speed at which the model generates answers depends entirely on your hardware. I have an Apple Silicon Mac, and while those are known to be good for running LLMs locally, mine only has 8GB of RAM, because past Mahnoor had no idea that running LLMs locally would one day be on the wish list.
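To make the RAM requirement a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the weights take roughly `parameters × bits per weight ÷ 8` bytes, and it ignores everything else a running model needs (context cache, runtime overhead, the operating system itself). The function names and the example model sizes are illustrative assumptions, not figures from any particular tool.

```python
def weights_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    One billion parameters at 8 bits each is about 1 GB, so the size in GB
    is simply billions of parameters * bits per weight / 8.
    """
    return params_billions * bits_per_weight / 8


def headroom_gb(params_billions: float, bits_per_weight: int, ram_gb: float) -> float:
    """RAM left over for the OS, apps, and context once the weights are loaded.

    A negative result means the weights alone do not fit in memory.
    """
    return ram_gb - weights_size_gb(params_billions, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical example: a 7-billion-parameter model on an 8 GB machine,
    # at full 16-bit precision and at a 4-bit quantization.
    for bits in (16, 4):
        size = weights_size_gb(7, bits)
        left = headroom_gb(7, bits, ram_gb=8)
        print(f"7B at {bits}-bit: ~{size:.1f} GB of weights, {left:.1f} GB left on an 8 GB machine")
```

Even in the optimistic 4-bit case, the weights eat a large share of an 8GB machine before the operating system and other apps get anything, which lines up with how quickly a low-memory laptop starts to struggle.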

Even the smallest models I’ve tested run noticeably slower than anything I’d get from a cloud tool (and typically cause my Mac to run out of application memory mid-stream). Anything beyond a quick message leaves me staring at the screen, waiting for the tokens to appear. So unless you already have some pretty capable hardware or are willing to shell out thousands of dollars, running local LLMs simply isn’t a realistic option.

The setup alone is enough to scare most people.

Five hours later and still not a single message has been sent

Running a llama.cpp server on a Raspberry Pi

Since I’m someone who writes about technology for a living and is also majoring in Computer Science, I’m much more comfortable with the technical side of things than most people. I’m comfortable with command lines, configuration files, and the general weirdness of running unknown software on my machine. I can usually fix things when they inevitably break, and I don’t mind digging into the documentation and figuring things out on the fly when necessary.

That said, despite this familiarity, I still feel like a three-year-old who has never seen a computer when I hear my colleagues talk about running local LLMs on their own servers. The local AI rabbit hole is deeper than most people realize, and all you need to do is take a look at one of the XDA articles on self-hosting setups to see what I mean. People run entire AI stacks on home servers and dedicate Mac Minis to the cause.

And while I could see myself doing all of that if my work needed the privacy benefits that come with local LLMs, plus no rate limits and complete control over what I’m running, the reality is that most of what I do day to day doesn’t need any of that. My own workload doesn’t justify the effort, and that realization was enough to send me back to cloud AI without much guilt.

You miss out on all the features that make cloud AI really useful

Congratulations, you have a 2022 chatbot.

Desktop applications for Claude, Gemini, and ChatGPT open in a cluster

As I mentioned above, everything a local LLM needs to run is installed and contained on your own machine. That includes the knowledge it has, the things it can do, and the limits of what it can respond to. While newer models now support MCP servers and all the fancy stuff, not all local LLMs do, and getting them to work locally is a project in itself. Additionally, with local LLMs, you also lose the entire ecosystem that you get with cloud-based AI providers.



For example, one of my favorite AI features lately is Projects. Claude and ChatGPT both have it, and NotebookLM is pretty much built on the idea of knowledge-based projects. Being able to upload research papers, drafts, transcripts, and reference documents into a single space and have the AI work across all of them is exactly the type of AI application I need in my life. However, in this case, it’s not really the model that does the heavy lifting. Rather, it is the entire workflow built around it. And Projects is just one example. With cloud AI tools that are constantly updated and have large teams working on them, you get much more than just access to a model. You get everything that cloud providers have built on top of it, and that’s where most of what makes a chatbot useful for real work actually lives.

I don’t really need the privacy that local LLMs promise

The biggest benefits of local LLMs at the moment seem to be the lack of rate limits and complete privacy. Since the model resides on your machine and never connects to anyone else’s servers, nothing you type leaves your device. For people working with sensitive data, that alone may be reason enough to accept all the other trade-offs. I, on the other hand, use AI for studying, as an occasional replacement for Grammarly, for coding tools and automations for my workflow, as a research partner, for brainstorming when I write, and so on.

None of that involves data that is even remotely sensitive, or that I would be uncomfortable with a company seeing. While rate limits are certainly an issue, I’d rather pay for a higher tier of the cloud tools I’m already using than fight with local hardware just to avoid them. Even at $100 a month, I’m still spending less than I would on a machine capable of running the models I really want to use locally. I may change my mind when local LLMs get even better, but for now, cloud AI wins for me.
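For anyone who wants to run the same comparison with their own numbers, the arithmetic can be sketched in a few lines of Python. The $3,000 hardware price below is a purely hypothetical placeholder (the only figure taken from this article is the $100-a-month tier), and the sketch ignores electricity, resale value, and the time spent on setup.

```python
def break_even_months(hardware_cost: float, monthly_subscription: float) -> float:
    """Months of subscription fees that add up to the one-time hardware cost."""
    if monthly_subscription <= 0:
        raise ValueError("monthly_subscription must be positive")
    return hardware_cost / monthly_subscription


if __name__ == "__main__":
    hardware_cost = 3000.0        # hypothetical price of a capable local machine
    monthly_subscription = 100.0  # the higher cloud tier mentioned above
    months = break_even_months(hardware_cost, monthly_subscription)
    print(f"The hardware pays for itself after about {months:.0f} months of subscription fees")
```

The longer that break-even period is for your own numbers, the easier it is to justify staying on a cloud plan, especially if your workload doesn’t need the privacy of a local setup.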


