I built a GenServer which communicates with a simple Python script over stdin/stdout using Exile.Process in order to run the LLM model. It is not the preferred solution, but it's all I have at the moment. I'm still seeing abysmal model download speeds when my application boots. For document ingest I'm using thenlper/gte-small for embeddings, which is only 66.75 MB, yet it is terribly slow to download from an EC2 machine, which makes no sense. When the Python LLM model booted up, it downloaded quickly, but the Elixir one is super slow and I don't understand why. I estimate the download speed at a few KB per second, 10 KB/s at most. Yet when I booted the openai/whisper-tiny model, it downloaded 151.09 MB nearly instantly, from the exact same machine, while this was going on.

It would be nice to understand the format of the cache, or to have a mix task that primes the cache with a given model (useful for infra setup too!) so I could avoid loading the entire application; I sketch what I have in mind below. When I load the model with iex -S mix run --no-start, then Application.ensure_all_started(:bumblebee) and Bumblebee.load_model({:hf, "thenlper/gte-small"}), the download is also nearly instant, so I'm not sure what is different about my project that slows it down like this.
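Since the manual load from iex is fast, the mix task could presumably just wrap that. A minimal sketch of the idea, with a made-up task name (this is not something Bumblebee ships today):

```elixir
# lib/mix/tasks/prime_model_cache.ex
# Hypothetical task: downloads a Hugging Face repo into the local cache
# without booting the whole application.
defmodule Mix.Tasks.PrimeModelCache do
  @shortdoc "Primes the Bumblebee cache with a model repo"
  use Mix.Task

  @impl true
  def run([repo_id]) do
    # Load config and start only Bumblebee and its dependencies.
    Mix.Task.run("app.config")
    {:ok, _apps} = Application.ensure_all_started(:bumblebee)

    repo = {:hf, repo_id}
    {:ok, _model_info} = Bumblebee.load_model(repo)
    {:ok, _tokenizer} = Bumblebee.load_tokenizer(repo)

    Mix.shell().info("Cached #{repo_id}")
  end
end
```

Then something like mix prime_model_cache thenlper/gte-small could run during provisioning, so the app never has to download weights at boot.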
Once I got past that manually, it still hangs forever on the batched_run call, even with smaller and simpler models like thenlper/gte-small. For example, within my GenServer, when I add a document:
# Chunk up
chunks =
  text
  |> String.codepoints()
  |> Enum.chunk_every(@chunk_size)
  |> Enum.map(&Enum.join/1)

Logger.info("Created #{length(chunks)} chunks")

# Run tokenizer
results = Nx.Serving.batched_run(DocsServing, chunks) |> IO.inspect()
Logger.info("Vectorized #{length(results)} chunks")
I don’t see the second logger line, only:
iex(processing@127.0.0.1)1> Docs.add_doc("test/assets/doc.txt")
14:42:16.697 [info] Created 74 chunks
Is there some kind of deadlock occurring? I am creating the model in a handle_cast, as the first message after init, like so (a sketch of the surrounding GenServer follows the snippet):
# Load model
repo = {:hf, "thenlper/gte-small"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
    compile: [batch_size: 64, sequence_length: 512],
    output_attribute: :hidden_state,
    output_pool: :mean_pooling
  )

# Start serving
{:ok, _server} = Nx.Serving.start_link(serving: serving, name: DocsServing, batch_timeout: 100)
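For context, the surrounding GenServer has roughly this shape; the module name and registration are simplified placeholders here, and only the init-then-handle_cast flow matches my real code:

```elixir
# Simplified sketch of the GenServer around the snippet above (names are placeholders).
defmodule Docs do
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Don't block init: ask ourselves to load the model as the very first message.
    GenServer.cast(self(), :load_model)
    {:ok, opts}
  end

  @impl true
  def handle_cast(:load_model, state) do
    # The Bumblebee.load_model / text_embedding / Nx.Serving.start_link code
    # shown above runs here, and the serving ends up in the state.
    {:noreply, state}
  end
end
```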
I set preallocate: false like you said, and I only see 362 MB of GPU memory being used while the process is running. I also switched from using batched_run to run, and I get a response now. It still runs out of memory, though, which makes no sense.
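For reference, this is roughly where I am setting that; I am assuming the usual EXLA client configuration in config/config.exs, and the commented-out memory_fraction line is only there as an illustration of the related knob:

```elixir
# config/config.exs (sketch of the EXLA client settings)
import Config

config :exla, :clients,
  cuda: [
    platform: :cuda,
    # Allocate GPU memory on demand instead of reserving most of it up front.
    preallocate: false
    # memory_fraction: 0.8  # cap on the fraction of GPU memory EXLA may take
  ]
```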
I see that XLA has allocated 13 GB out of the 15 GB available:
Every 2.0s: nvidia-smi                             ip-172-31-42-52: Fri Aug 1 15:07:32 2025

Fri Aug 1 15:07:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     On  |   00000000:00:1F.0 Off |                    0 |
| N/A   57C    P0              37W / 70W  |    13545MiB / 15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          194428      C   ...g/28.0/erts-16.0/bin/beam.smp      13542MiB |
+-----------------------------------------------------------------------------------------+
But still I get this:
15:07:12.370 [info] Created 74 chunks
15:07:29.172 [warning] Allocator (GPU_0_bfc) ran out of memory trying to allocate 192.00MiB (rounded to 201326592)requested by op
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
15:07:29.173 [info] BFCAllocator dump for GPU_0_bfc
...
15:07:29.179 [error] GenServer {OfflineRadioLlm.Registry, OfflineRadioLlm.Ingest.Docs} terminating
** (RuntimeError) Out of memory while trying to allocate 201326592 bytes.
    (exla 0.10.0) EXLA.NIF.run_io(#Reference<0.2779148131.354025504.5762>, [[#Reference<0.2779148131.354025510.8544>, #Reference<0.2779148131.354025510.8743>]], 0)
    (exla 0.10.0) lib/exla/executable.ex:31: EXLA.Executable.run/3
    (exla 0.10.0) lib/exla/defn.ex:128: EXLA.Defn.maybe_outfeed/7
    (stdlib 7.0) timer.erl:599: :timer.tc/2
    (exla 0.10.0) lib/exla/defn.ex:60: anonymous fn/7 in EXLA.Defn.__compile__/4
    (nx 0.10.0) lib/nx/defn/compiler.ex:134: Nx.Defn.Compiler.__jit__/4
    (nx 0.10.0) lib/nx/defn.ex:452: Nx.Defn.do_jit_apply/3
    (nx 0.10.0) lib/nx/defn/evaluator.ex:461: Nx.Defn.Evaluator.eval_apply/4
Last message (from #PID<0.255.0>): {:ingest, "test/assets/doc.txt"}
State: {%Nx.Serving{module: Nx.Serving.Default, arg: #Function<1.7074203/2 in Bumblebee.Text.TextEmbedding.text_embedding/3>, client_preprocessing: #Function<2.7074203/1 in Bumblebee.Text.TextEmbedding.text_embedding/3>, client_postprocessing: #Function<3.7074203/2 in Bumblebee.Text.TextEmbedding.text_embedding/3>, streaming: nil, batch_size: 64, distributed_postprocessing: &Function.identity/1, process_options: [batch_keys: [sequence_length: 512]], defn_options: []}, %HNSWLib.Index{space: :cosine, dim: 384, reference: #Reference<0.2779148131.354025473.10498>}, %{}}
Client #PID<0.255.0> is alive
...
15:07:29.211 [info] Sum Total of in-use chunks: 12.94GiB
15:07:29.211 [info] Total bytes in pool: 14074321152 memory_limit_: 14074321305 available bytes: 153 curr_region_allocation_bytes_: 17179869184
15:07:29.211 [info] Stats:
Limit:                  14074321305
InUse:                  13892561408
MaxInUse:               13892561408
NumAllocs:                     2196
MaxAllocSize:             922746880
Reserved:                         0
PeakReserved:                     0
LargestFreeBlock:                 0
It looks like it tried to allocate another ~200 MB (201326592 bytes) but it was not available. Why does it try to use so much for this small embeddings model? I don't understand what's happening here. I tried running a single chunk to see if that helps, like:
results = Nx.Serving.run(serving, [Enum.at(chunks, 0)]) |> IO.inspect()
That worked, though it still grabs the full 13 GB of GPU memory. Perhaps, like you said, it's just taking 80%, but I don't understand why it would need more for this model with only 74 chunks; that isn't very much. 74 chunks * 384-dimensional vectors * 4 bytes per element (f32) is about 114 thousand bytes, well under a megabyte (quick check below). When I chunk by 10, 2, or 1 items at a time, it always succeeds on the first loop and then fails on the second.
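As a quick sanity check of that arithmetic (just the numbers from this run, nothing model-specific):

```elixir
# Rough size of the embedding output alone: 74 chunks of 384-dim f32 vectors.
chunks = 74
dims = 384
bytes_per_f32 = 4
IO.puts("#{chunks * dims * bytes_per_f32} bytes")
# => 113664 bytes, about 111 KiB
```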
I know that's a lot… Any thoughts on what to try? I am at a loss for how to make a small, simple model like this work in Elixir, and that doesn't bode well for my proposal to my team to try it.