interesting thread
Please wake me up when I can run it locally.
I prefer privacy over convenience.
This thread/fork split from: AI is getting ridiculously productive
Qwen3.5 27B can run on a Tesla P40 for less than $300 USD, and there are a lot of use cases you can cover at that level.
Right now I'm playing around with qwen3.5:35b-a3b-q8_0 on my workstation. It's not perfect, but it's definitely got utility.
To be fair, the most frequent issue is speed, because I'm running right at the edge of what my hardware can do… but it's not unusable or anything. The larger issue, I suspect, is that you really do need some skill to get the most out of it locally… I'm getting there, but the road is long…
I've got a test server with a couple of GPUs, and I'm also waiting for the day when local inference is more competitive. Things are moving in the right direction. While we wait, IMO you'll be well served to buy a low-end LLM subscription and learn the tools now.
Edit to say: if you want to run local inference now for real workloads, you could get an RTX 4070 (~$550) or a 3060 (~$350), put it in an old Ubuntu box, and run OpenClaw/NemoClaw. Ubuntu 26.04 is supposed to ship with pre-configured CUDA drivers (sudo apt install cuda).
I would much prefer local too if the quality were there. But it's not a matter of convenience for me. It's a matter of extending what I can do and how much I can do, and in the end making more money. Come to think of it, that's the theme of most tools.
I do think local will get there one day. When that day comes, I hope experience with today's solutions will make the transition easy.
The headline question was when. In 2019 or so, the Nvidia Jetson AGX Xavier offered roughly 30 TOPS and 32 GB of RAM. In 2022, the Nvidia Jetson AGX Orin was about 200 TOPS and 64 GB. In 2025 we got the Nvidia Thor with roughly 2000 TOPS and 128 GB. Extrapolating that trend (roughly 10x TOPS and 2x RAM per generation) would give 20000 TOPS and 256 GB of RAM in 2028, and 200000 TOPS and 512 GB in 2031.
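The per-generation trend implied by those three data points (about 10x TOPS and 2x RAM every three years) can be sketched in a few lines; this is just the arithmetic on the figures quoted above, not a prediction of real products:

```python
# Extrapolate edge-hardware specs from the generational trend quoted above:
# roughly 10x TOPS and 2x RAM every 3 years (2019 -> 2022 -> 2025).
def extrapolate(year, base_year=2025, base_tops=2000, base_ram_gb=128):
    gens = (year - base_year) // 3  # whole 3-year generations ahead
    return base_tops * 10 ** gens, base_ram_gb * 2 ** gens

for year in (2028, 2031):
    tops, ram = extrapolate(year)
    print(f"{year}: ~{tops:,} TOPS, {ram} GB RAM")
```

Whether the trend actually continues that long is, of course, the whole question.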
For local models to produce quality code, though, I think the models need to be built for that, so quality keeps up at smaller sizes. There is a lot in this world that coding models have no need to know, so local models for specialized tasks like coding could be a lot smaller (and use less power and fewer resources in the process).
I'm by no means an expert, but recently I was also wondering whether there could be benefits from a bigger, faster pipe to the model when it runs locally. Sending everything in a slow ping-pong of text feels kind of inefficient, but I guess it's what we have right now (at least for the text part).
To draw a potential parallel to the past: universities used to have powerful central time-shared servers for the computation power people might today hold in their pocket.
One thing to keep in mind is that small local models are not useless: they make great fuzzy-logic implementations and great text wranglers.
I use Qwen through ollama on my MacBook Air for various little tasks: receipt classification (my accountant is very old school), enrichment of transcripts (transcribed locally with whisper) with project context, and structuring data from quick text notes. The scripts that run those tools were also partly written locally by Qwen!
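A receipt-classification script of that kind can be tiny. The sketch below talks to ollama's default local HTTP endpoint; the model name and category list are placeholder assumptions, and it obviously requires an ollama server running on the machine:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default endpoint

def build_request(receipt_text, model="qwen3:8b",
                  categories=("travel", "meals", "office", "other")):
    """Build an ollama /api/generate payload asking for exactly one category."""
    prompt = (
        "Classify this receipt into exactly one of: "
        + ", ".join(categories)
        + ". Reply with the category only.\n\n"
        + receipt_text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def classify(receipt_text):
    """Send the receipt to the local model and return its category string."""
    payload = json.dumps(build_request(receipt_text)).encode()
    req = request.Request(OLLAMA_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Everything stays on the laptop, which is the point: the receipts never leave the machine.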
If the Talaas exploration (a startup that burns LLM weights onto silicon) succeeds, little specialized models might become way more common. Llama-8B-on-a-USB-stick running at 10000 tokens/second can be very useful. (If you have not tried the demo, https://chatjimmy.ai/ demonstrates supposedly on-chip inference of Llama 8B; my tries hover around 15000 tokens/second.)
So small local models seem to be on track to become "edge inference" with lower consumption… still bringing another new generation of hardware and gadgets, though.
The problem I see here is a "misalignment of expectations" within the AI retail market (i.e., us). NVIDIA's market is essentially data centers. Why would they jeopardize their customers' revenues from token sales with a product they could only sell to us every once in a while? IMO, their strategy so far has gone in the opposite direction: financing their customers' purchases of their own products en masse (it hardly gets any bubblier than that, btw).
A rough breakdown of NVIDIA's sales for the last fiscal year (according to Grok):
Also, what incentive would the LLM providers have to train models we can then use for free? That's not the name of the game.
Maybe, just maybe, this opens a new window of entrepreneurial opportunity: training and selling models for private use, but again, with the customers running them on what?
Exactly this!
And even if there is something that runs reasonably on smaller (local) hardware, there will always be a much faster, way more "intelligent", much more desirable LLM.
It has always been like this, and not only for LLMs.
Thinking about all the AI models that companies like Google, Meta, and others have made available for free, it seems to me that is part of the game. For most of us, training one of the big models (not just LLMs) from scratch is impossible, for lack of both data and processing resources. But here the freebies have made a lot of difference, since fine-tuning such available models is achievable for many. Either way, there are free LLMs available today too, and I hope they will keep improving as well.
I think the big commercial ones will likely always be better. But all LLMs keep improving, while the difficulty of programming stays fairly constant, if it isn't getting easier. Thus my hope is that at some point local LLMs will produce production-quality code. Then I will ask myself whether paying a service for even better quality or speed is really worth it. Good enough is good enough at some point.
I don't know. At some level these huge LLM providers in the sky remind me of laser-printing services: great businesses until price and quality reached the point where every office just got its own laser printer. I think cloud LLMs for programming might be temporary. Will we still be doing it like we are now in 5 to 10 years? I'm not so sure.
Like with crypto, I suspect the answer to efficiency will be found in purpose-built ASICs. The model certainly doesn't have to produce perfect one-shot results if the agents can quorum/loop at 17K tps.
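One way to read "quorum": when tokens are nearly free at ASIC speeds, you can sample the same question many times and keep the majority answer (self-consistency voting). A minimal sketch, with a stubbed-out noisy model standing in for real inference:

```python
import random
from collections import Counter

def quorum(generate, prompt, n=5):
    """Sample n candidate answers and keep the majority vote.
    `generate` is any callable prompt -> answer (e.g. a fast local model)."""
    votes = Counter(generate(prompt) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # winning answer and agreement ratio

# Stub model: each single shot is only 70% reliable, but a 25-way
# quorum lands on the majority answer almost every time.
def noisy_model(prompt):
    return "4" if random.random() < 0.7 else "5"

answer, agreement = quorum(noisy_model, "2 + 2 = ?", n=25)
```

At thousands of tokens per second, the 25x cost of the quorum is noise; the reliability gain is the whole point.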
I often run 2-3 sessions at once with Claude. That's no problem for my machine, since the hard work is done in the cloud. But to move that workflow to local models, the models would have to be 1) smart enough to handle my tasks AND 2) cheap enough to run that I could run multiple simultaneously.
I'd also take some convincing on item 1; frontier LLMs have improved drastically in the last few months, and although I'm sympathetic to wanting to run local models for privacy / control / independence / openness, I wouldn't want to give up capability either.
What I find is that not all asks I might make of an LLM are equal. Many of the more routine, simpler kinds of work are well within the capabilities of local models. More advanced work, however, either runs into performance limitations or the fact that the SOTA/frontier models are just better. Mix in things like search and you get even more flavor.
So, for that simple work, I run it locally. Why burn subscription tokens on work that doesn't need the power? For the more advanced work, I use Cursor with whatever model I feel is best for the task. I subscribe on the low end, so what I do via the subscription can max it out fairly quickly. Finally, for streamlining initial research or search-oriented interactions, I'll use Kagi Assistant, which I've found very productive… to be fair, I've not used many other similarly search-aligned assistants, so I could be missing out.
I still have a fair amount to go to get the most out of the local setup, but this perspective that it doesn't need to be all cloud or all local offers, I think, some ways to get the most out of everything I've spent on computers… both capital (hardware) and expense (subscriptions).
Waiting for Intel Panther Lake and AMD Strix Halo to be available for general and wide scale usage.
As of March 2026, providers can lose a lot of money on inference because investors value their companies based on growth, not on their probability of profit. We've circled this drain many times.
As long as this is the case, local inference will be very hard to do in a fiscally responsible manner.
When the investor market inevitably changes its mind and actually does want to see profits, the money for subsidizing inference will dry up. That's when the playing field will be leveled, and if you buy hardware after that time and keep your local GPU busy for enough hours of the day, I suspect you would come out on top financially.
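The "keep it busy enough hours" intuition is easy to put into numbers. Every figure below is a made-up placeholder (GPU price, power draw, electricity rate, and what an hour of equivalent hosted inference would cost), not a claim about real pricing:

```python
def breakeven_hours(gpu_price, power_kw, electricity_rate, hosted_rate):
    """Busy-hours at which buying the GPU overtakes renting inference.
    hosted_rate is the cost of one hour of equivalent hosted inference."""
    saving_per_hour = hosted_rate - power_kw * electricity_rate
    if saving_per_hour <= 0:
        raise ValueError("hosted inference is cheaper than your electricity")
    return gpu_price / saving_per_hour

# Illustrative numbers only: $1800 GPU, 0.35 kW under load,
# $0.15/kWh electricity, $0.60/hour of equivalent hosted inference.
hours = breakeven_hours(1800, 0.35, 0.15, 0.60)  # ~3288 busy hours
days_at_8h = hours / 8                           # ~411 days at 8 h/day
```

The real uncertainty is the hosted rate: as long as it is subsidized below cost, the break-even point recedes, which is exactly the argument above.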
I am imagining long-running background agents with many separate checks for the correctness of feature implementations, & every kind of automated code-quality check. Keep the GPU busy for hours. Only when all checks pass does it open a PR.
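That loop is simple to express. A minimal sketch, where the patch generator and the checks are stand-ins (a real version would call a local model and run your actual test, lint, and review tooling):

```python
def background_agent(generate_patch, checks, max_iters=100):
    """Regenerate a patch until every automated check passes, feeding the
    failing check names back in as context. Only a fully green patch is
    returned (i.e. only then would a PR be opened); None if out of budget."""
    feedback = ""
    for _ in range(max_iters):
        patch = generate_patch(feedback)
        failures = [name for name, check in checks if not check(patch)]
        if not failures:
            return patch  # all checks green: safe to open the PR
        feedback = "failed checks: " + ", ".join(failures)
    return None

# Stand-in usage: "patches" are just integers, and the single check
# wants a value of at least 3, so the third attempt succeeds.
attempts = []
def fake_generator(feedback):
    attempts.append(feedback)
    return len(attempts)

patch = background_agent(fake_generator, [("big_enough", lambda p: p >= 3)])
```

Nothing here is time-critical, so a slow local GPU grinding through iterations overnight is a perfectly good engine for it.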
financially
In my case, local execution is about privacy, not financial optimization.
I am imagining long-running background agents with many separate checks for the correctness of feature implementations, & every kind of automated code-quality check. Keep the GPU busy for hours. Only when all checks pass does it open a PR.
Yes - IMO non-time-critical work will be the sweet spot for local inference. Long-running Claw/Jido/Hermes-type agent networks.