Bumblebee: Slow load_model in GenServer, slow Nx.Serving.run in exs file

I’ve run into an issue that I don’t understand. I’m running on an M1 Mac with 32 GB of RAM, trying to run a very small LLM, and I’m seeing huge times for certain function calls.

In my GenServer, I have the following:

  # @model "meta-llama/Llama-3.1-8B-Instruct"
  # @model "mistralai/Mistral-7B-Instruct-v0.3"
  # @model "HuggingFaceTB/SmolLM2-1.7B-Instruct"
  @model "meta-llama/Llama-3.2-1B-Instruct"

  def handle_cast(:load_model, _nothing) do
    Logger.info("Async model load beginning...", [service: "RagLLM"])

    repo = {:hf, @model, auth_token: System.fetch_env!("HF_TOKEN")}

    {:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
    Logger.info("Model loaded", [service: "RagLLM"])
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
    Logger.info("Tokenizer loaded", [service: "RagLLM"])
    {:ok, generation_config} = Bumblebee.load_generation_config(repo)
    Logger.info("Generation Config loaded", [service: "RagLLM"])

    generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

    Logger.info("Model, tokenizer defined, bumblebee configured", [service: "RagLLM"])

    serving =
      Bumblebee.Text.generation(model_info, tokenizer, generation_config,
        compile: [batch_size: 1, sequence_length: 6000]
      )

    Logger.info("Serving created", [service: "RagLLM"])

    {:ok, _server} = Nx.Serving.start_link(serving: serving, name: LLMServing, batch_timeout: 100)

    {:noreply, serving}
  end

The Bumblebee.load_model call alone can take 5 minutes to finish. I’m using EMLX on my Mac by setting the following in my application start function:

    if :os.type() == {:unix, :darwin} do
      IO.puts("Loading EMLX for MacOS")
      Nx.default_backend({EMLX.Backend, device: :gpu})
      Nx.Defn.default_options(compiler: EMLX)
    else
      Nx.default_backend({EXLA.Backend, device: :gpu})
      Nx.Defn.default_options(compiler: EXLA)
    end

But in an invoke_model.exs file, the load_model call finishes instantly and it instead hangs on the Nx.Serving.run/2 call:

if :os.type() == {:unix, :darwin} do
  Application.ensure_all_started(:emlx)
  Nx.default_backend({EMLX.Backend, device: :gpu})
  Nx.Defn.default_options(compiler: EMLX)
else
  Application.ensure_all_started(:exla)
end
Application.ensure_all_started(:bumblebee)

:observer.start()

repo = {:hf, "meta-llama/Llama-3.2-1B-Instruct", auth_token: System.fetch_env!("HF_TOKEN")}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
IO.puts("Model loaded")
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
IO.puts("Tokenizer loaded")
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
IO.puts("Generation Config loaded")

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

IO.puts("Model, tokenizer defined, bumblebee configured")

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 6000]
  )
IO.puts("Starting serving")

{:ok, _server} = Nx.Serving.start_link(serving: serving, name: LLMServing, batch_timeout: 100)

:timer.sleep(10_000)

IO.puts("Simple run call") # Last output that shows up

IO.inspect(Nx.Serving.run(serving, "What is the meaning of life?"))

IO.puts("Running model...")

IO.inspect(Nx.Serving.batched_run(LLMServing, "What is the meaning of life?"))

IO.puts("Done!")

I am not sure what I am doing wrong. Any ideas?

You want this:

-Nx.default_backend({EMLX.Backend, device: :gpu})
+Nx.global_default_backend({EMLX.Backend, device: :gpu})

default_backend applies only to the calling process; global_default_backend applies to all processes.
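For example, your application start code would become something like this (just swapping in the global variants; I’d also switch Nx.Defn.default_options to its global counterpart for the same reason):

    if :os.type() == {:unix, :darwin} do
      IO.puts("Loading EMLX for MacOS")
      Nx.global_default_backend({EMLX.Backend, device: :gpu})
      Nx.Defn.global_default_options(compiler: EMLX)
    else
      Nx.global_default_backend({EXLA.Backend, device: :gpu})
      Nx.Defn.global_default_options(compiler: EXLA)
    end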

but hangs on the Nx.Serving.run/2 call

What about the batch_run? For how long does it hang? Does it work in the same setup if you use EXLA instead?


Yes, I can’t get this function to complete either. I’ve waited over 30 minutes and nothing.

Even on an EC2 machine with 23 GB of GPU memory I can’t load the "meta-llama/Llama-3.2-1B-Instruct" model.

20:19:19.812 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

20:19:19.812 [error] Memory usage: 2097610752 bytes free, 23696375808 bytes total.

20:19:19.813 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

20:19:19.813 [error] Memory usage: 2097610752 bytes free, 23696375808 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.9.2) lib/exla/mlir/module.ex:147: EXLA.MLIR.Module.unwrap!/1
    (exla 0.9.2) lib/exla/mlir/module.ex:124: EXLA.MLIR.Module.compile/5
    (stdlib 7.0) timer.erl:599: :timer.tc/2
    (exla 0.9.2) lib/exla/defn.ex:432: anonymous fn/14 in EXLA.Defn.compile/8
    (exla 0.9.2) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    (exla 0.9.2) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 7.0) timer.erl:599: :timer.tc/2

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Do you have cuDNN installed on the machine? Which version?

Note that you can also try EXLA with CPU, if you have enough RAM. It will take a while, but if it finishes, we will know that the issue is likely in EMLX.
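Something along these lines should pin it to the CPU client for the test (a rough sketch; client: :host is the option I have in mind, but double-check it against the EXLA docs for your version):

Application.ensure_all_started(:exla)
Nx.global_default_backend({EXLA.Backend, client: :host})
Nx.Defn.global_default_options(compiler: EXLA, client: :host)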

It appears that yes, it is already installed. I used the deep learning image, so I expect most things should already be there.

sudo apt-get install zlib1g
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
zlib1g is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
zlib1g set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.

What is the cuDNN version? You can check using something like apt list --installed | grep libcudnn.

I was able to get it working by changing my PATH and LD_LIBRARY_PATH to point at the latest 12.8, but it uses 20 GB of GPU memory when the Hugging Face calculator says it should only require 1-4 GB (18 GB to train). I’m only doing inference, so it seems to me it’s using way more memory than needed.

Every 2.0s: nvidia-smi                                                                                                         ip-172-31-34-180: Wed Jul 30 20:43:31 2025

Wed Jul 30 20:43:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   39C    P0             63W /  300W |   20603MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           54025      C   ...g/28.0/erts-16.0/bin/beam.smp      20594MiB |
+-----------------------------------------------------------------------------------------+

It does seem like EMLX is having issues for me. On this machine it takes 2 minutes to load the model and execute. I also tried on an arm64 machine but couldn’t get it to work; perhaps it will if I make fixes similar to the ones I made here.

I made the same changes on the arm64 machine and no luck: I’m getting warnings and a crash. It used only 13 GB of GPU memory though, out of the 15 GB it has.

Every 2.0s: nvidia-smi                                                                                                          ip-172-31-42-52: Wed Jul 30 20:53:08 2025

Wed Jul 30 20:53:08 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     On  |   00000000:00:1F.0 Off |                    0 |
| N/A   65C    P0             39W /   70W |   13543MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          173457      C   ...g/28.0/erts-16.0/bin/beam.smp      13540MiB |
+-----------------------------------------------------------------------------------------+
20:56:05.810 [info] InUse at ec968ea00000 of size 2097152 next 141
** (RuntimeError) Out of memory while trying to allocate 11793234816 bytes.
    (exla 0.9.2) lib/exla/executable.ex:130: EXLA.Executable.unwrap!/1
    (exla 0.9.2) lib/exla/executable.ex:31: EXLA.Executable.run/3
    (exla 0.9.2) lib/exla/defn.ex:312: EXLA.Defn.maybe_outfeed/7
    (stdlib 7.0) timer.erl:599: :timer.tc/2
    (exla 0.9.2) lib/exla/defn.ex:244: anonymous fn/7 in EXLA.Defn.__compile__/4
    (nx 0.9.2) lib/nx/defn.ex:332: anonymous fn/4 in Nx.Defn.compile/3
    (bumblebee 0.6.3) lib/bumblebee/text/text_generation.ex:73: anonymous fn/4 in Bumblebee.Text.TextGeneration.generation/4
    (nx 0.9.2) lib/nx/serving.ex:1833: anonymous fn/2 in Nx.Serving.Default.handle_batch/3

So it looks like it needs the full 20 GB it is using on the other machine. I don’t understand why it needs so much, nor do I understand why it doesn’t work with EMLX.

I came across this issue, and I wonder if it is the root cause since it’s still open. It could have crashed without reporting anywhere.

Forcing {:nx, "~> 0.10.0", override: true} got the examples in that issue working, so perhaps there have been updates that EMLX should pick up. Just not my issue for some reason… :confused:

it uses 20 G of GPU memory

By default XLA preallocates 80% of the GPU memory upfront, so nvidia-smi will always show such high usage.

You can disable this behaviour by setting preallocate: false in the client config:

config :exla, :clients,
  host: [platform: :host],
  cuda: [platform: :cuda, preallocate: false],
  rocm: [platform: :rocm],
  tpu: [platform: :tpu]

Still, once XLA allocates some memory it won’t give it back, so the overall usage may still be inflated.
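If I remember the EXLA client options correctly, there is also a :memory_fraction setting (the fraction of GPU memory the client is allowed to use) that you can combine with the above, for example:

config :exla, :clients,
  host: [platform: :host],
  cuda: [platform: :cuda, preallocate: false, memory_fraction: 0.5]

but please verify that option against the EXLA docs for the version you are running.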

I built a GenServer which communicates with a simple Python script over stdin/stdout using Exile.Process in order to run the LLM model. It is not the preferable solution, but it’s all I have at the moment.

I’m still seeing abysmal model download speeds when my application boots. For document ingest I’m using thenlper/gte-small for embeddings, which is only 66.75 MB, but it is terribly slow to download from an EC2 machine, which makes no sense. When the Python LLM booted up, it downloaded its model quickly, but the Elixir one is super slow and I don’t understand why. I estimate the download speed to be a few KB per second, no more than 10 KB/s at most. Yet when I booted the openai/whisper-tiny model, it downloaded 151.09 MB nearly instantly, from the exact same machine while this was going on. It would be nice to be able to understand the format of the cache, or to have a mix task which primes the cache with a given model (useful for infra setup too!) to avoid having to load the entire application; I sketched what I mean below. When I do this with iex -S mix run --no-start, then Application.ensure_all_started(:bumblebee) and Bumblebee.load_model({:hf, "thenlper/gte-small"}), the download is also nearly instant. I’m not sure what is different about my project that is slowing it down like this.
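For the mix task idea, this is roughly what I have in mind. It is an untested sketch that just leans on Bumblebee.load_model/load_tokenizer populating the same on-disk cache the application reads later, and the task name bumblebee.prime is something I made up:

defmodule Mix.Tasks.Bumblebee.Prime do
  @moduledoc "Downloads a Hugging Face model into Bumblebee's local cache."
  use Mix.Task

  @impl true
  def run([model]) do
    # Load code paths and config, then start Bumblebee without booting the whole app.
    Mix.Task.run("app.config")
    {:ok, _} = Application.ensure_all_started(:bumblebee)

    repo = {:hf, model}

    # Loading once is enough to download and cache the files for later boots.
    {:ok, _model_info} = Bumblebee.load_model(repo)
    {:ok, _tokenizer} = Bumblebee.load_tokenizer(repo)

    Mix.shell().info("Cached #{model}")
  end
end

Then something like mix bumblebee.prime thenlper/gte-small during infra setup would warm the cache.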

Once I got past that manually, I still see infinite hanging on the batched_run call, even with smaller and simpler models like thenlper/gte-small. For example, within my GenServer when I add a document:

    # Chunk up
    chunks =
      text
      |> String.codepoints()
      |> Enum.chunk_every(@chunk_size)
      |> Enum.map(&Enum.join/1)


    Logger.info("Created #{length(chunks)} chunks")

    # Run the embedding serving
    results = Nx.Serving.batched_run(DocsServing, chunks) |> IO.inspect()

    Logger.info("Vectorized #{length(results)} chunks")

I don’t see the second logger line, only:

iex(processing@127.0.0.1)1> Docs.add_doc("test/assets/doc.txt")

14:42:16.697 [info] Created 74 chunks

Is there some kind of deadlock occurring? I am creating the model in a handle_cast as the first message after init like so:

    # Load model
    repo = {:hf, "thenlper/gte-small"}

    {:ok, model_info} = Bumblebee.load_model(repo)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

    serving =
      Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
        compile: [batch_size: 64, sequence_length: 512],
        output_attribute: :hidden_state,
        output_pool: :mean_pooling
      )

    # Start serving
    {:ok, _server} = Nx.Serving.start_link(serving: serving, name: DocsServing, batch_timeout: 100)

I set preallocate: false like you said, and I only see 362 MB of GPU memory being used while the process is running. I also switched from using batched_run to run, and I get a response now. It still runs out of memory, though, and that makes no sense to me.

I see that XLA has allocated its 13 GB out of the 15 GB available.

Every 2.0s: nvidia-smi                                                                                                          ip-172-31-42-52: Fri Aug  1 15:07:32 2025

Fri Aug  1 15:07:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     On  |   00000000:00:1F.0 Off |                    0 |
| N/A   57C    P0             37W /   70W |   13545MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          194428      C   ...g/28.0/erts-16.0/bin/beam.smp      13542MiB |
+-----------------------------------------------------------------------------------------+

But still I get this:

15:07:12.370 [info] Created 74 chunks

15:07:29.172 [warning] Allocator (GPU_0_bfc) ran out of memory trying to allocate 192.00MiB (rounded to 201326592)requested by op
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.

15:07:29.173 [info] BFCAllocator dump for GPU_0_bfc

...


15:07:29.179 [error] GenServer {OfflineRadioLlm.Registry, OfflineRadioLlm.Ingest.Docs} terminating
** (RuntimeError) Out of memory while trying to allocate 201326592 bytes.
    (exla 0.10.0) EXLA.NIF.run_io(#Reference<0.2779148131.354025504.5762>, [[#Reference<0.2779148131.354025510.8544>, #Reference<0.2779148131.354025510.8743>]], 0)
    (exla 0.10.0) lib/exla/executable.ex:31: EXLA.Executable.run/3
    (exla 0.10.0) lib/exla/defn.ex:128: EXLA.Defn.maybe_outfeed/7
    (stdlib 7.0) timer.erl:599: :timer.tc/2
    (exla 0.10.0) lib/exla/defn.ex:60: anonymous fn/7 in EXLA.Defn.__compile__/4
    (nx 0.10.0) lib/nx/defn/compiler.ex:134: Nx.Defn.Compiler.__jit__/4
    (nx 0.10.0) lib/nx/defn.ex:452: Nx.Defn.do_jit_apply/3
    (nx 0.10.0) lib/nx/defn/evaluator.ex:461: Nx.Defn.Evaluator.eval_apply/4
Last message (from #PID<0.255.0>): {:ingest, "test/assets/doc.txt"}
State: {%Nx.Serving{module: Nx.Serving.Default, arg: #Function<1.7074203/2 in Bumblebee.Text.TextEmbedding.text_embedding/3>, client_preprocessing: #Function<2.7074203/1 in Bumblebee.Text.TextEmbedding.text_embedding/3>, client_postprocessing: #Function<3.7074203/2 in Bumblebee.Text.TextEmbedding.text_embedding/3>, streaming: nil, batch_size: 64, distributed_postprocessing: &Function.identity/1, process_options: [batch_keys: [sequence_length: 512]], defn_options: []}, %HNSWLib.Index{space: :cosine, dim: 384, reference: #Reference<0.2779148131.354025473.10498>}, %{}}
Client #PID<0.255.0> is alive

...

15:07:29.211 [info] Sum Total of in-use chunks: 12.94GiB


15:07:29.211 [info] Total bytes in pool: 14074321152 memory_limit_: 14074321305 available bytes: 153 curr_region_allocation_bytes_: 17179869184


15:07:29.211 [info] Stats:
Limit:                     14074321305
InUse:                     13892561408
MaxInUse:                  13892561408
NumAllocs:                        2196
MaxAllocSize:                922746880
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

It looks like it tried to allocate another 192 MiB but it was not available. Why does it try to use so much for this small embeddings model? It doesn’t make any sense to me what’s happening here. I tried running a single chunk to see if that helps, like:

results = Nx.Serving.run(serving, [Enum.at(chunks, 0)]) |> IO.inspect()

That worked, but it still grabs the full 13 GB of GPU memory. Perhaps, like you said, it’s just taking its 80%, but I don’t understand why it would need more for this model with only 74 chunks; that’s not very much, I don’t think. 74 chunks × a 384-dimensional vector × 4 bytes per element (f32) is about 114 KB, barely a tenth of a megabyte. When I chunk by 10, 2, or 1 items at a time, it always succeeds on the first loop and then fails on the second loop.

I know that’s a lot… Any thoughts on what to try? I am at a loss for how to make a small, simple model like this work in Elixir, and that doesn’t bode well for my proposal to my team to try it.

I don’t know what happened, but I’m now able to run models that previously did not work, and I didn’t change anything.

I am, however, still running into out-of-memory problems. Normally I would just say: OK, it needs more GPU memory to run, duh. But I’m running into issues with much smaller models than I can run with Python. For example, meta-llama/Llama-3.2-1B-Instruct runs out of memory in Elixir, yet running the meta-llama/Meta-Llama-3-8B-Instruct model in Python (with the Elixir app shut down, obviously) works perfectly fine on the exact same machine (an Amazon Graviton machine, ARM64). Perhaps there are still some optimizations that are not yet implemented in EXLA?