Livebook CUDA 12.2: XLA out of memory at 11006MiB

How do I restrict memory usage in EXLA?

This is my memory usage before loading EXLA.

nvidia-smi
Sun Sep  3 13:13:11 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0  On |                  N/A |
| 45%   53C    P5              30W / 170W |    398MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1914      G   /usr/lib/xorg/Xorg                          131MiB |
|    0   N/A  N/A      2128      G   /usr/bin/gnome-shell                         63MiB |
|    0   N/A  N/A      7334      G   ...irefox/3068/usr/lib/firefox/firefox      193MiB |
+---------------------------------------------------------------------------------------+

And after.

❯ nvidia-smi
Sun Sep  3 13:18:13 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P3              30W / 170W |  11446MiB / 12288MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1914      G   /usr/lib/xorg/Xorg                          131MiB |
|    0   N/A  N/A      2128      G   /usr/bin/gnome-shell                         65MiB |
|    0   N/A  N/A      7334      G   ...irefox/3068/usr/lib/firefox/firefox      229MiB |
|    0   N/A  N/A     15749      C   ...ang/25.2.1/erts-13.1.4/bin/beam.smp    11006MiB |
+---------------------------------------------------------------------------------------+

As you can see, Firefox and GNOME grew a little, and the BEAM is trying to use the rest.

The issue is that I can get it to run once, but if I re-evaluate, say to change the seed or the number of steps, I run out of memory.

13:20:48.865 [info] Total bytes in pool: 11350867968 memory_limit_: 11350867968 available bytes: 0 curr_region_allocation_bytes_: 22701735936

13:20:48.865 [info] Stats: 
Limit:                     11350867968
InUse:                     11229623808
MaxInUse:                  11261113856
NumAllocs:                        6164
MaxAllocSize:               3514198016
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0


13:20:48.865 [warning] ****************************************************************************************************

13:20:48.865 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 6553600 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    6.25MiB
              constant allocation:         0B
        maybe_live_out allocation:    6.25MiB
     preallocated temp allocation:         0B
                 total allocation:   12.50MiB
              total fragmentation:         0B (0.00%)
Peak buffers:
	Buffer 1:
		Size: 6.25MiB
		Entry Parameter Subshape: f32[1638400]
		==========================

	Buffer 2:
		Size: 6.25MiB
		XLA Label: copy
		Shape: f32[1280,1280]
		==========================

	Buffer 3:
		Size: 8B
		XLA Label: tuple
		Shape: (f32[1280,1280])
		==========================

So is there a way to limit MaxInUse for XLA from the start? 10 GiB should be enough, right? It's trying to grab 11 now.
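
For reference, the byte counts in the allocator stats above can be converted to GiB with a few lines of plain Elixir. This is only a unit-conversion sketch; the module name PoolStats is made up for illustration and the two numbers are copied from the log.

```elixir
defmodule PoolStats do
  # Hypothetical helper, not part of EXLA or XLA.
  @gib 1024 * 1024 * 1024

  # Convert a byte count to GiB, rounded to two decimals.
  def to_gib(bytes), do: Float.round(bytes / @gib, 2)
end

limit = 11_350_867_968   # "Limit" from the log
in_use = 11_229_623_808  # "InUse" from the log

IO.puts("Limit: #{PoolStats.to_gib(limit)} GiB")
IO.puts("InUse: #{PoolStats.to_gib(in_use)} GiB")
```

So the pool limit is already a bit above 10 GiB, and almost all of it is in use at the time of the failure.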

XLA reserves memory upfront and then allocates within that reservation as needed. This behaviour can be customized with the client options preallocate: false or a different :memory_fraction. However, I don’t think this will help with the OOM error.

We have yet to do more optimisations for Stable Diffusion, but there are two things you can try:
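
As a rough sketch of where those client options go, assuming a Mix project with a client named :cuda (the client name and the 0.8 value are illustrative choices, not recommendations from this thread):

```elixir
# config/config.exs
import Config

# Cap the upfront reservation at a fraction of GPU memory.
# The 0.8 here is an illustrative value only.
config :exla, :clients,
  cuda: [platform: :cuda, memory_fraction: 0.8],
  host: [platform: :host]

# Alternative: skip the upfront reservation and allocate on demand.
# config :exla, :clients,
#   cuda: [platform: :cuda, preallocate: false],
#   host: [platform: :host]
```

In Livebook, the same keyword list can be passed under the :config option of Mix.install/2 in the setup cell.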

  1. Load the parameters into the CPU with Bumblebee.load_model(..., backend: {EXLA.Backend, client: :host})
  2. Enable lazy transfers in serving defn options: defn_options: [compiler: EXLA, lazy_transfers: :always]

This way, instead of placing all parameters on the GPU, they will be transferred as needed.

Also make sure to try with batch size 1.
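
Put together, the suggestions above might look roughly like the following sketch. The repository ("CompVis/stable-diffusion-v1-4"), the params_filename values, the step count and the sequence length are assumptions taken from typical Bumblebee Stable Diffusion examples of that period, not from this thread; other Bumblebee versions may need different options, and the safety checker is left out for brevity.

```elixir
repository_id = "CompVis/stable-diffusion-v1-4"

# Keep all parameters on the CPU (host client) instead of the GPU.
host = [backend: {EXLA.Backend, client: :host}]

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"}, host)

{:ok, unet} =
  Bumblebee.load_model(
    {:hf, repository_id, subdir: "unet"},
    [params_filename: "diffusion_pytorch_model.bin"] ++ host
  )

{:ok, vae} =
  Bumblebee.load_model(
    {:hf, repository_id, subdir: "vae"},
    [architecture: :decoder, params_filename: "diffusion_pytorch_model.bin"] ++ host
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    # One image per prompt and batch size 1 keep peak memory down.
    num_images_per_prompt: 1,
    compile: [batch_size: 1, sequence_length: 60],
    # Transfer parameters to the GPU lazily, as the computation needs them.
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

Nx.Serving.run(serving, "numbat, forest, high quality")
```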

@jonatanklosko what you wrote in your response makes total sense but it seems to either conflict or is just a different way of saying what is in the Bumblebee docs here … https://github.com/elixir-nx/bumblebee/tree/main/examples/phoenix#configuring-nx

My question is … is there a reason or benefit to doing it one way or the other? Because I tried it the way it’s explained in the Bumblebee docs and I’m still getting an OOM error.

@nutheory they aren’t conflicting, but aim to do the same thing in a slightly different way. :preallocate_params is a recent addition, I updated the docs to mention it instead. Recently I’ve also added this docs page that goes into more details on how to minimise memory usage by trading off inference time : )

@jonatanklosko thanks for clarifying that. So I got everything working in a Docker container using EXLA on an A100 (Ubuntu/CUDA) at Lambda Labs (I’m working on a proof of concept). I’m noticing EXLA takes about 3x the inference time of oobabooga running on the same server. I feel like maybe this could be a PyTorch/libtorch/Torchx difference? I really want to test that theory, but I can’t figure out how to build libtorch with CUDA 11.6. It results in a working Docker build that breaks on app startup because of a mismatched, old build of cuDNN (8.3.2), which is older than the oldest image I could find on Docker Hub (8.4). I don’t know if you can help or point me in the right direction, but I would love help from anyone with more experience in Docker/CUDA/libtorch.

If you paste how you build the serving (which options you use) perhaps there’s something that can be improved.

One thing that can make a difference is quantization, if the Python version you test uses it. We don’t support quantization in Bumblebee yet, but there’s ongoing work in EXLA that should allow it in the future.

@jonatanklosko ah shit, you’re right, I was using quantization, good catch. So the LLM I’m using is NousResearch/Llama-2-7b-hf. I tried two scenarios: 1) I loaded the .bin files (25GB) onto the CPU; 2) I loaded the .safetensors (12GB) with preallocate: true. Both gave the same inference time.

def load_text_serving do
  model = System.fetch_env!("TEXT_GENERATION_MODEL")
  auth_token = System.fetch_env!("HF_TOKEN")

  {:ok, txt_model} =
    Bumblebee.load_model({:hf, model, auth_token: auth_token},
      backend: {EXLA.Backend, client: :host},
      params_filename: "model.safetensors"
    )

  {:ok, tokenizer} =
    Bumblebee.load_tokenizer({:hf, model, auth_token: auth_token})

  {:ok, generation_config} =
    Bumblebee.load_generation_config({:hf, model, auth_token: auth_token})

  config =
    Bumblebee.configure(generation_config,
      min_new_tokens: 120,
      max_new_tokens: 600,
      strategy: %{type: :contrastive_search, top_k: 4, alpha: 0.6}
    )

  Bumblebee.Text.generation(txt_model, tokenizer, config,
    compile: [batch_size: 1, sequence_length: 1000],
    defn_options: [compiler: EXLA, preallocate_params: true]
  )
end

With sequence_length: 1000 we always pad the input text to 1000 tokens. You can create a few variants of the computation for different input lengths by passing sequence_length: [100, 300, 1000]. Note that this does not require more memory.

As for the f16 .safetensors, you can try changing the Axon policy to force f16 at every step, though I haven’t tested whether this makes a difference:
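
To illustrate the idea in plain Elixir: with a list of lengths, each input is matched to the smallest compiled length that fits it, so short prompts are no longer padded all the way to 1000 tokens. The pick_length function below is a hypothetical stand-in written for this illustration, not Bumblebee's code, and the fallback to the largest length for oversized inputs is an assumption.

```elixir
# The compile option would become:
#   compile: [batch_size: 1, sequence_length: [100, 300, 1000]]
lengths = [100, 300, 1000]

# Hypothetical helper: pick the smallest length that fits the input,
# falling back to the largest one for inputs that are too long.
pick_length = fn token_count ->
  Enum.find(lengths, List.last(lengths), &(&1 >= token_count))
end

IO.inspect(pick_length.(42), label: "42 tokens padded to")
IO.inspect(pick_length.(250), label: "250 tokens padded to")
IO.inspect(pick_length.(1000), label: "1000 tokens padded to")
```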

policy = Axon.MixedPrecision.create_policy(params: {:f, 16}, compute: {:f, 16}, output: {:f, 16})
model_info = update_in(model_info.model, &Axon.MixedPrecision.apply_policy(&1, policy))

Ah wait, you have preallocate_params: true inside :defn_options, it should be outside : )
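
In other words, the last call in the function above would become something like this (txt_model, tokenizer and config are the values loaded earlier in that function):

```elixir
Bumblebee.Text.generation(txt_model, tokenizer, config,
  compile: [batch_size: 1, sequence_length: 1000],
  # Only compiler options belong in :defn_options.
  defn_options: [compiler: EXLA],
  # :preallocate_params is a top-level serving option.
  preallocate_params: true
)
```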

This is why I got confused …

  1. defn_options: [compiler: EXLA], preallocate_params: true - move and keep all parameters to the GPU upfront. This requires the most memory, but should provide the fastest inference time.
  2. defn_options: [compiler: EXLA] - copy all parameters to the GPU before each computation and discard afterwards. This requires less memory, but the copying increases the inference time.
  3. defn_options: [compiler: EXLA, lazy_transfers: :always] - lazily copy parameters to the GPU during the computation as needed. This requires the least memory, at the cost of inference time.

https://hexdocs.pm/bumblebee/llama.html

It’s my bad, obviously, but it’s an easy trip-up…

When you do BB.load_model, it’s BB.load_model(hf_params, repo_params), which makes total sense…

With defn_options, why is one (i.e. “lazy transfers”) in the keyword list, whereas “preallocate_params: true” is another parameter altogether?

With defn_options, why is one (i.e. “lazy transfers”) in the keyword list, whereas “preallocate_params: true” is another parameter altogether?

All the top-level options (including :preallocate_params) are handled by Bumblebee. The keyword list given as :defn_options is what we pass to Nx.Defn.jit/compile, where compiler: EXLA specifies the compiler and the remaining options are passed to the compiler.

So in other words :preallocate_params is for Bumblebee, :lazy_transfers is for the EXLA compiler : )
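
The split can be pictured with plain keyword-list operations. This is only an illustration of who reads which option, not Bumblebee's or Nx's actual implementation; the variable names are made up.

```elixir
serving_opts = [
  compile: [batch_size: 1, sequence_length: 1000],
  defn_options: [compiler: EXLA, lazy_transfers: :always],
  preallocate_params: true
]

# Top-level options are read by Bumblebee itself.
{preallocate?, rest} = Keyword.pop(serving_opts, :preallocate_params, false)

# :defn_options is handed on to Nx.Defn, which picks out the compiler
# and forwards everything else to it.
{defn_options, _other} = Keyword.pop(rest, :defn_options, [])
{compiler, compiler_opts} = Keyword.pop(defn_options, :compiler)

IO.inspect(preallocate?, label: "read by Bumblebee")
IO.inspect(compiler, label: "compiler")
IO.inspect(compiler_opts, label: "passed to the compiler")
```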

Thanks for catching that preallocate issue… and all your help.

Hi @jonatanklosko, I’m trying Stable Diffusion from Livebook,
but I also hit the memory limit during smart cell execution.
How can I increase the memory limit in Livebook to run the model?

@vgrechin to reduce memory usage you can set the number of images to 1. Another thing you can do is convert the smart cell to code and try the changes in this comment.

How much RAM does your GPU have? If it’s 4GB then IIRC that’s not enough to run Stable Diffusion.

I use an Orin Nano 8GB, so I wonder why 4GB is the limit. I’m able to run Stable Diffusion in a Docker container from dusty-nv/jetson-containers on GitHub (Machine Learning Containers for NVIDIA Jetson and JetPack-L4T). That’s why I’m trying it here in Elixir.

Where is this log coming from? Do you log it yourself, or is it just the default XLA behaviour on that device?

cc @seanmor5

I run EXLA and Livebook itself on the device, but access it from a browser running on another machine. I have access to both boxes, so it’s no problem to collect any logs. I’m not sure whether it is the default XLA behaviour, because I use XLA here for nothing but Livebook.

Oh, I meant, where is the existing “Limit” log coming from : )

Also, is there a different way you can see GPU usage, and is 50% of the GPU actually free?

I usually track GPU load in ‘jtop’ for other models, but in the case of Stable Diffusion it doesn’t involve much GPU; it just consumes memory, reducing the free space until the error.