APU for LLM Inference?

For development and prototyping, I’d like to retain a basic ability to perform LLM inference on my own hardware, using open source models. My go-to LLM runner is Ollama.

Currently I run an Nvidia eGPU connected by Thunderbolt to an Ubuntu server. I absolutely hate the setup, mostly because the Nvidia driver configuration is terrible, and secondarily because the GPU memory is limited (12GB on a $375 RTX 3060, 16GB on a $1300 RTX 4080).

Here and there I read about APUs (Accelerated Processing Units), processors that combine a CPU and a GPU on one chip. Examples include the AMD Ryzen AI Max+ Pro 395, Intel Core Ultra 9 275HX, and Apple M4 Pro.

Example APU machines include the Framework Desktop and the Mac Studio with M3 Ultra.

In theory, an APU can support on the order of ~190GB of unified memory, of which a large portion (like maybe ~120GB) can be allocated to the ‘GPU’ for large models and big context windows.

Does anyone have experience running an APU machine with unified memory for LLM inference? I’m curious about costs, performance, driver configuration, and Ollama compatibility.

4 Likes

What about a cluster of M4 Minis? (or Mac Studios)

1 Like

What about a cluster of M4 Minis?

Yes - clustering seems like an emerging thing for local inference. The NVIDIA DGX Spark is built to cluster. So is the Framework Desktop.

A popular option for clustering is the open source tool EXO.

2 Likes

Please continue to post any updates you might have on the matter. I too am looking to find the most bang for the buck.

2 Likes

If you can wait six months or so, until the M5 Minis come out (or a bit later for the Studios), you might be able to pick some up on the second-hand market and build a cluster.

Worth keeping an eye on the AI portal on DT too, as it’ll usually show the most interesting threads in the trending lists: https://devtalk.com/ai

3 Likes

Keep in mind that current autoregressive LLMs are heavily bottlenecked by memory bandwidth at low batch sizes (e.g. local inference). If you divide the memory bandwidth by the size of the model, you get a reasonably accurate estimate of inference speed in tokens per second.

So e.g. if you have a chip with 300GB/s bandwidth and you run a q4 70B model, about 35GB, you will get around 8 t/s. If you look at benchmarks for M chips, that’s about right. Frankly 8 t/s feels pretty slow, but it’s not unusable.
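As a rough back-of-the-envelope sketch of that rule of thumb (numbers taken from the example above; the helper name is just illustrative):

```python
def dense_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on tokens/s for a dense model at batch size 1:
    each generated token streams the full set of weights from memory once."""
    return bandwidth_gb_s / weights_gb

# ~300 GB/s chip, q4 70B model ~= 35 GB of weights.
print(round(dense_decode_tps(300, 35), 1))  # -> 8.6 t/s
```

The same division reproduces the other figures in this post (a bigger model on the same chip, or a much higher ceiling on a high-bandwidth card).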

However, if you were to jump to, say, a 100GB model, now you’re at 3 t/s. So even if you spend big on 128GB of RAM, there is not enough bandwidth to stream those weights through the compute units for each token, and the performance is not good.

However, MoE models are designed to load only a subset of weights into the cores for each token. So if you have a 100GB model which only uses 10GB of weights per token, now you get 30 t/s (pretty good) out of your 100GB of RAM. This is why MoE models exist. DeepSeek R1 (the real one) and Llama 4 are in this category.
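A minimal extension of the same arithmetic for MoE, assuming the only change is that just the active weights per token have to cross the memory bus (again, the helper name is illustrative):

```python
def moe_decode_tps(bandwidth_gb_s: float, active_gb_per_token: float) -> float:
    """For an MoE model only the activated experts (plus shared layers) are read
    per token, so the bandwidth limit applies to the active bytes, not the total size."""
    return bandwidth_gb_s / active_gb_per_token

# Example from the post: 100 GB MoE model touching ~10 GB of weights per token.
print(moe_decode_tps(300, 10))  # -> 30.0 t/s
```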

Anyway, the point being: these chips with 200-300 GB/s bandwidth might look nice, but keep in mind a 5090 delivers 1800 GB/s of memory bandwidth for its 32 GB of VRAM, i.e. generation will be roughly 6-9 times faster (1800/300 to 1800/200), flops notwithstanding (and it has plenty of flops too). The amount of RAM is not the only variable.

Also diffusion models might become a thing at any time and render this entire equation irrelevant, so who knows.

6 Likes

What config are you using?

As I’ve mentioned a few times on here, while I have an active interest in “AI” and have been keeping up with the advancements, I am not actually an LLM user myself other than the occasional experiment to keep track of progress. In the medium term I think models could be useful for some product features I have in mind (better reader mode/summarization, search, categorization of bookmarks/feeds), so that’s why I keep up with things.

Running models locally is very expensive (for the reasons I mentioned: batching is king), so most people aren’t doing it. The open models available are also not as good as the top closed models.

The problem with “closed models” is that effectively you are the server when it comes to the code. So, unless all you’re developing is open source, it’s not an option.

1 Like

I fully agree with you. If I ever do ship those LLM features I will run my own inference. It’s not something I’m planning for in the near term, but that’s why I keep up with it as I said.

For personal use, well, that’s part of why I’m not using LLMs at all :slight_smile: It does seem like llama.cpp plus their Neovim plugin on an M-series Mac might be a decent “better autocomplete” type setup, though. I have considered trying that out.

1 Like