Bumblebee Mistral Resource Optimization

I am new to Nx/Bumblebee and am trying to use the out-of-the-box Mistral support. I’m trying to run a “Small” model, but my computer runs out of RAM (31 GiB) and swap (8 GiB) and crashes. Key specs:

OS: Fedora Linux 42 (KDE Plasma Desktop Edition) x86_64
CPU: AMD Ryzen 7 5800X (16) @ 4.85 GHz
GPU: AMD Radeon RX 5700 XT [Discrete]  
Memory: 6.35 GiB / 31.28 GiB (20%)
Swap: 2.09 GiB / 8.00 GiB (26%)

Is this normal and I just need a better computer, or is there something in my setup that’s killing me? Most of the code below is blindly copied from examples I found online. Here’s how I have it configured:

def setup_llm() do
  # token = File.read!("token.txt")
  repo = {:hf, "mistralai/Mistral-Small-3.2-24B-Instruct-2506"}

  {:ok, model_info} =
    Bumblebee.load_model(repo,
      backend: EXLA.Backend,
      module: Bumblebee.Text.Mistral,
      architecture: :base
    )

  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  {:ok, generation_config} = Bumblebee.load_generation_config(repo)

  generation_config =
    Bumblebee.configure(generation_config,
      max_new_tokens: 256,
      strategy: %{type: :multinomial_sampling, top_p: 0.6}
    )

  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 10, sequence_length: 512],
    # stream: true,
    defn_options: [compiler: EXLA]
  )
end

If the answer is that I need a better computer - got it - but what should I be looking for, and what can my current machine actually handle?

I think you can usually estimate RAM as roughly 4 bytes per parameter (full-precision float32 weights), so in this case 4 × 24B ≈ 96 GB. It’s not a precise formula, but either way that won’t fit on your machine. As you see, “small” is relative.
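
As a back-of-the-envelope check (a sketch only; the 4-bytes-per-parameter figure covers the float32 weights and ignores activations, the KV cache, and compilation overhead):

# Rough weight-memory estimate: parameter count times bytes per parameter.
params = 24.0e9
bytes_per_param = 4  # float32
gib = params * bytes_per_param / :math.pow(2, 30)
IO.puts("~#{Float.round(gib, 1)} GiB just for the weights")
# => ~89.4 GiB, well past 31 GiB RAM + 8 GiB swap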

You can try SmolLM2 instead: HuggingFaceTB/SmolLM2-1.7B-Instruct · Hugging Face

Or, if you want to stay with Mistral, one of their older, smaller models should work: mistralai/Mistral-7B-Instruct-v0.3 · Hugging Face

Or any other model at or below roughly 8B params; a minimal swap of your setup is sketched below. The output quality of older and smaller models will usually be worse than that of newer and larger ones.
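
For example, something like this should drop in where your current setup function is (a sketch only, not tested; it assumes the Mistral-7B-Instruct-v0.3 repo loads fine with Bumblebee’s existing Mistral support):

def setup_llm() do
  repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.3"}

  # No :module/:architecture overrides; Bumblebee infers them from the repo config.
  {:ok, model_info} = Bumblebee.load_model(repo, backend: EXLA.Backend)
  {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
  {:ok, generation_config} = Bumblebee.load_generation_config(repo)

  generation_config =
    Bumblebee.configure(generation_config,
      max_new_tokens: 256,
      strategy: %{type: :multinomial_sampling, top_p: 0.6}
    )

  # batch_size: 1 keeps the compiled graph (and its memory footprint) as small
  # as possible for single-user use.
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 512],
    defn_options: [compiler: EXLA]
  )
end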


Thanks - this is a very helpful metric.

Is there a guide for how to write adapters for models that Bumblebee does not natively support? Or should I be looking at Axon directly for that?


Basically, you need to implement the model in Bumblebee if it’s not supported yet. We wrote about that on the bitcrowd blog a while ago.

Often it’s just some small changes to existing implementations, so once you understand how your model is different, it’s actually not a lot of code you have to write.

It takes a while to get into because everything is based on Nx, which also means your usual Elixir debugging techniques won’t work (you’re building a computational graph with the Elixir code; see the sketch below).
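
For instance (a minimal sketch; it assumes your Nx version ships the print_expr/print_value helpers from Nx.Defn.Kernel, which are auto-imported inside defn):

defmodule DebugExample do
  import Nx.Defn

  defn scaled_sum(x) do
    x
    |> Nx.multiply(2)
    # print_expr shows the expression node being built at trace time;
    # print_value prints the actual numbers when the compiled graph runs.
    |> print_expr()
    |> print_value()
    |> Nx.sum()
  end
end

DebugExample.scaled_sum(Nx.tensor([1, 2, 3]))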

There are other ways to debug, and there are also some blog posts about Nx, Axon, and Bumblebee on the DockYard blog, e.g. Nx for Absolute Beginners - DockYard.

And finally, you can also try throwing an LLM at the problem. It might not get you 100% there, but it can give you an idea of what’s missing.

Here is a recent PR that I think was first written mainly by an LLM: https://github.com/elixir-nx/bumblebee/pull/423

Here’s another (also a first pass by an LLM, then I rewrote most of it): https://github.com/elixir-nx/bumblebee/pull/422

For the new Mistral models specifically, I’m not sure, but I think there are two main obstacles:

  1. I think they use a different tokenizer (Tekken?), which could cause trouble if it isn’t supported in Bumblebee yet.
  2. I think these are Mixture of Experts (MoE) models, and I don’t think there is an MoE implementation in Bumblebee yet, so I guess it would be a welcome contribution.

This is an incredibly useful response - and that blog post is excellent.

Thank you - I will look into attempting to build the adapter, and if it works, I’ll pay it forward upstream.
