For development and prototyping, I’d like to retain a basic ability to perform LLM inference on my own hardware, using open source models. My go-to LLM runner is Ollama.
Currently I run an Nvidia eGPU connected by Thunderbolt to an Ubuntu server. I absolutely hate the setup, mostly because the Nvidia driver configuration is terrible, and secondarily because the GPU memory is limited (12GB for a $375 RTX 3060, 16GB for a $1300 RTX 4080).
In theory, an APU can support on the order of 190GB of unified memory, of which a large portion (maybe ~120GB) can be allocated to the ‘GPU’ for large models and big context windows.
Does anyone have experience running an APU machine with unified memory for LLM inference? I’m curious about costs, performance, driver configuration, and Ollama compatibility.
If you can wait 6 months or so, when the Mini M5s come out (or a bit later for the Studios) you might be able to pick some up on the second-hand market and build a cluster.
Worth keeping an eye on the AI portal on DT too, as it’ll usually show the most interesting threads in the trending lists: https://devtalk.com/ai
Keep in mind that current autoregressive LLMs are heavily bottlenecked by memory bandwidth with low batch sizes (e.g. local inference). If you divide the memory bandwidth by the size of the model you get a reasonably accurate inference speed estimate.
So e.g. if you have a chip with 300GB/s bandwidth and you run a q4 70B model, about 35GB, you will get around 8 t/s. If you look at benchmarks for M chips that’s about right. Frankly 8 t/s feels pretty slow, but it’s not unusable.
However, if you were to jump to, say, a 100GB model, now you’re at 3 t/s. So even if you spend big on 128GB of RAM, there is not enough bandwidth to stream those weights into the cores for each token, and the performance is not good.
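To make that concrete, here’s a tiny back-of-envelope sketch. The bandwidth and model sizes are just the example figures above, not measurements:

```python
# Rough decode-speed estimate: at batch size 1, every generated token has to
# stream the full set of weights through the memory bus once.
def estimate_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

print(estimate_tps(300, 35))   # ~8.6 t/s -- q4 70B (~35GB) on a 300 GB/s chip
print(estimate_tps(300, 100))  # 3.0 t/s  -- 100GB dense model on the same chip
```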
However, MoE models are designed to only load a subset of weights into the cores for each token. So if you have a 100GB model which only uses 10GB of weights for each token, now you have 30 t/s (pretty good) for your 100GB of RAM. This is why MoE models exist. DeepSeek R1 (the real one) and Llama 4 are in this category.
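Same arithmetic for the MoE case, except only the active expert weights cross the bus per token (the 10GB active figure is the hypothetical one from above):

```python
# MoE: only the ~10GB of activated expert weights stream per token,
# even though the whole 100GB model has to sit in RAM.
bandwidth_gb_s = 300
active_weights_gb = 10
print(bandwidth_gb_s / active_weights_gb)  # 30 t/s
```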
Anyway, the point being: these chips with 200-300 GB/s bandwidth might look nice, but keep in mind a 5090 is delivering 1800 GB/s memory bandwidth for its 32 GB of VRAM, i.e. generation will be roughly 6-9 times faster, FLOPS notwithstanding (and it has plenty of FLOPS too). The amount of RAM is not the only variable.
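That 6-9x figure is just the bandwidth ratio, assuming the model fits in the 5090’s 32GB:

```python
# Bandwidth ratio of a ~1800 GB/s 5090 vs a 200-300 GB/s unified-memory chip.
for apu_bw_gb_s in (200, 300):
    print(f"{1800 / apu_bw_gb_s:.0f}x faster per token")  # 9x and 6x
```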
Also diffusion models might become a thing at any time and render this entire equation irrelevant, so who knows.
As I’ve mentioned a few times on here, while I have an active interest in “AI” and have been keeping up with the advancements, I am not actually an LLM user myself other than the occasional experiment to keep track of progress. In the medium term I think models could be useful for some product features I have in mind (better reader mode/summarization, search, categorization of bookmarks/feeds), so that’s why I keep up with things.
Running models locally is very expensive (for the reasons I mentioned; batching is king), so most people aren’t doing it. The open models available are also not as good as the top closed models.
The problem with “closed models” is that your code effectively ends up on someone else’s server. So, unless all you’re developing is open source, it’s not an option.
I fully agree with you. If I ever do ship those LLM features I will run my own inference. It’s not something I’m planning for in the near term, but that’s why I keep up with it as I said.
For personal use, well, that’s part of why I’m not using LLMs at all. It does seem like llama.cpp plus their Neovim plugin on an M-series Mac might be a decent “better autocomplete” type setup, though. I have considered trying that out.