Bumblebee Mistral Resource Optimization

I am new to Nx/Bumblebee and am trying to use the out-of-the-box Mistral support. I’m trying to run a “Small” model, but my computer runs out of RAM (31 GiB) and swap (8 GiB) and crashes. Key specs:

OS: Fedora Linux 42 (KDE Plasma Desktop Edition) x86_64
CPU: AMD Ryzen 7 5800X (16) @ 4.85 GHz
GPU: AMD Radeon RX 5700 XT [Discrete]  
Memory: 6.35 GiB / 31.28 GiB (20%)
Swap: 2.09 GiB / 8.00 GiB (26%)

Is this normal and I just need a better computer, or is there something in my setup that’s killing me? Most of the code below is blindly copied from examples I found online. Here’s how I have it configured:

def setup_llm() do
  # token = File.read!("token.txt")
  repo = {:hf, "mistralai/Mistral-Small-3.2-24B-Instruct-2506"}

  {:ok, model_info} =
    Bumblebee.load_model(repo,
      backend: EXLA.Backend,
      module: Bumblebee.Text.Mistral,
      architecture: :base
    )

  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  {:ok, generation_config} = Bumblebee.load_generation_config(repo)

  generation_config =
    Bumblebee.configure(generation_config,
      max_new_tokens: 256,
      strategy: %{type: :multinomial_sampling, top_p: 0.6}
    )

  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 10, sequence_length: 512],
    # stream: true,
    defn_options: [compiler: EXLA]
  )
end

If the answer is that I need a better computer - got it - but what should I be looking for, and what can my current machine actually handle?

I think you can usually estimate RAM as roughly 4 bytes per parameter (full-precision float32 weights), so in this case 4 × 24B ≈ 96 GB. It’s not a precise formula, but either way that won’t fit on your machine. As you see, “small” is relative.
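
As a back-of-the-envelope check (a sketch only; the 4-bytes-per-parameter figure covers the float32 weights and ignores activations, the KV cache, and compilation overhead):

# Rough weight-memory estimate: parameter count times bytes per parameter.
params = 24.0e9
bytes_per_param = 4  # float32
gib = params * bytes_per_param / :math.pow(2, 30)
IO.puts("~#{Float.round(gib, 1)} GiB just for the weights")
# => ~89.4 GiB, well past 31 GiB RAM + 8 GiB swap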

You can try SmolLM2 instead: HuggingFaceTB/SmolLM2-1.7B-Instruct · Hugging Face

Or, if you want to stay with Mistral, one of their older, smaller models should work: mistralai/Mistral-7B-Instruct-v0.3 · Hugging Face

Or any other model at or below roughly 8B params; a minimal swap of your setup is sketched below. The output quality of older and smaller models will usually be worse than that of newer and larger ones.
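
For example, something like this should drop in where your current setup function is (a sketch only, not tested; it assumes the Mistral-7B-Instruct-v0.3 repo loads fine with Bumblebee’s existing Mistral support):

def setup_llm() do
  repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.3"}

  # No :module/:architecture overrides; Bumblebee infers them from the repo config.
  {:ok, model_info} = Bumblebee.load_model(repo, backend: EXLA.Backend)
  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  {:ok, generation_config} = Bumblebee.load_generation_config(repo)

  generation_config =
    Bumblebee.configure(generation_config,
      max_new_tokens: 256,
      strategy: %{type: :multinomial_sampling, top_p: 0.6}
    )

  # batch_size: 1 keeps the compiled graph (and its memory footprint) as small
  # as possible for single-user use.
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 512],
    defn_options: [compiler: EXLA]
  )
end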


Thanks - this is a very helpful metric.

Is there a guide for how to write adapters for models that Bumblebee does not natively support? Or should I be looking at Axon directly for that?


Basically, you need to implement the model in Bumblebee if it’s not supported yet. We wrote about that on the bitcrowd blog a while ago.

Often it’s just some small changes to existing implementations, so once you understand how your model is different, it’s actually not a lot of code you have to write.

It takes a while to get into because everything is based on Nx, which also means your usual Elixir debugging techniques won’t work (you’re building a computational graph with the Elixir code; see the sketch below).
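
For instance (a minimal sketch; it assumes your Nx version ships the print_expr/print_value helpers from Nx.Defn.Kernel, which are auto-imported inside defn):

defmodule DebugExample do
  import Nx.Defn

  defn scaled_sum(x) do
    x
    |> Nx.multiply(2)
    # print_expr shows the expression node being built at trace time;
    # print_value prints the actual numbers when the compiled graph runs.
    |> print_expr()
    |> print_value()
    |> Nx.sum()
  end
end

DebugExample.scaled_sum(Nx.tensor([1, 2, 3]))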

There are other ways to debug, and there are also some blog posts about Nx, Axon, and Bumblebee on the DockYard blog, e.g. Nx for Absolute Beginners - DockYard.

And finally, you can also try throwing an LLM at the problem. It might not get you 100% there, but it can give you an idea of what’s missing.

Here is a recent PR that I think was first written mainly by an LLM: https://github.com/elixir-nx/bumblebee/pull/423

Here’s another (also a first pass by an LLM, then I rewrote most of it): https://github.com/elixir-nx/bumblebee/pull/422

For the new Mistral models specifically, I’m not sure, but I think there are two main obstacles:

  1. I think they use a different tokenizer (Tekken?), which could cause trouble if it isn’t supported in Bumblebee yet.
  2. I think these are Mixture of Experts (MoE) models, and I don’t think there is an MoE implementation in Bumblebee yet, so I guess it would be a welcome contribution.

This is an incredibly useful response - and that blog post is excellent.

Thank you - I will look into attempting to build the adapter, and if it works, I’ll pay it forward upstream.
