Sorry, but has anyone in this thread actually tried running local LLMs on CPU? You can easily run a 7B model at various levels of quantization (e.g. 5-bit) and get a general-purpose, promptable LLM. Yeah, of course it’s going to take ~4GB of RAM (the weights are memory-mapped and paged in as needed), but you can also fine-tune smaller, more specific models (like the translation one mentioned above) and get surprising intelligence at a fraction of the resources.
Take, for example, phi-2, which reportedly performs about as well as 13B-param models with only 2.7B params. Yeah, that’s still going to take ~1.5GB of RAM, which Firefox wouldn’t reasonably ship, but many lighter-weight specialized tasks could easily use something like a fine-tuned, quantized 0.3B model.
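For anyone who wants to sanity-check the RAM figures above, here is a rough back-of-the-envelope sketch in Python. It only counts the quantized weights (parameters × bits per weight) and ignores the KV cache, runtime overhead, and per-block quantization metadata, so real usage will be somewhat higher. The function name is made up for illustration, and the 4-bit setting for phi-2 and the 0.3B model is my assumption, not something stated above.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes).

    Ignores KV cache, activations, and quantization scale metadata.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# (label, parameters in billions, bits per weight); bit widths other than
# the 5-bit 7B case are assumed for illustration.
examples = [
    ("7B @ 5-bit", 7.0, 5),
    ("phi-2 (2.7B) @ 4-bit", 2.7, 4),
    ("0.3B @ 4-bit", 0.3, 4),
]

for label, params, bits in examples:
    print(f"{label}: ~{weight_memory_gb(params, bits):.2f} GB of weights")
```

That works out to roughly 4.4GB, 1.35GB, and 0.15GB respectively, which lines up with the ballpark numbers above once you add a bit of overhead.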
When you consider the price of a used Android phone (e.g. a OnePlus 6T for $80 on eBay) and compare it spec for spec with a Raspberry Pi, it’s actually a really good deal. Like you get:
The way I set mine up is to run the server directly on Android using Termux, with an app that autostarts Termux on boot, and with battery optimizations disabled for the app. Then I just kept the phone plugged into the outlet at all times to maintain the battery (and of course Android just trickle charges / stops charging once fully charged).
Of course this isn’t perfect, because you still have much more variability in play at the OS level than on an RPi (and no standard environment like Debian unless you use proot), but overall it is a very powerful setup that works quite well.