Trying BERT fine-tuning on the GPU

I have successfully fine-tuned the BERT model using my CPU. I would like to try fine-tuning using my GPU, an RTX 4090 with 24 GB of VRAM, but I am getting an out-of-memory error. I am a little surprised, and I would like to make sure that memory is truly the issue and it's not a misconfiguration.
The out-of-memory error is raised when loading the model:

{:ok, model} = Bumblebee.load_model({:hf, "bert-base-cased"}, spec: spec)
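For completeness, here is roughly how the session is set up (a sketch, not my exact notebook; the dependency versions and the `EXLA.Client.fetch!/1` check are assumptions on my part):

```elixir
# Sketch: make sure tensors actually land on the GPU rather than the host.
# Version requirements below are illustrative.
Mix.install(
  [
    {:bumblebee, "~> 0.3"},
    {:exla, "~> 0.5"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Optional sanity check: this should succeed and report the :cuda platform,
# not :host, if EXLA was compiled against CUDA.
EXLA.Client.fetch!(:cuda)
```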



Have you configured it to run on the GPU and not the CPU?

E.g. the appropriate XLA_TARGET?

Maybe post more code…
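For example, something along these lines (a sketch; the exact target value depends on your local CUDA toolkit):

```shell
# Check which CUDA toolkit is installed locally:
nvcc --version

# Set the matching target *before* installing/compiling EXLA, e.g. in the
# shell that launches Livebook (cuda120 = precompiled archive for CUDA 12.0+):
export XLA_TARGET=cuda120
# export XLA_TARGET=cuda   # alternative: build XLA locally against your CUDA
livebook server
```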

The code is the one from the example here, with the difference that I have updated the libraries in order to be able to use CUDA 12.

XLA_TARGET is set to cuda120, and my CUDA version is 12.2. Should I downgrade to CUDA 12.0? From my understanding, cuda120 means CUDA 12.0+, i.e. 12.0 and above.
Running nvidia-smi, I can see the GPU memory is filled up.

I have managed to get it to run using XLA_TARGET=cuda, so it compiles its own version.

To get it to work, the batch size needs to be decreased; on my GPU that means a batch size of 4.


@steffel this is very interesting: the model is around 0.5 GB, so loading it should be far from running out of memory. Note that XLA preallocates memory upfront, so your GPU memory usage will jump to a high value, but an OOM is definitely unexpected.
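If the preallocation itself makes debugging confusing, it can be tuned via EXLA's client options (a sketch based on the options EXLA documents; exact values are illustrative, and disabling preallocation costs some performance):

```elixir
# Sketch: configure the EXLA CUDA client so XLA does not grab (nearly) all
# GPU memory upfront. Useful for telling "preallocated" apart from "used".
Mix.install(
  [{:exla, "~> 0.5"}],
  config: [
    exla: [
      clients: [
        cuda: [platform: :cuda, preallocate: false]
        # or cap it instead:
        # cuda: [platform: :cuda, memory_fraction: 0.7]
      ]
    ]
  ]
)
```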

What is the CUDA and cuDNN version you built with locally to make it work?

@steffel can you please run with XLA_ARCHIVE_URL= and see if it makes a difference?

Regarding the versions:
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
cuDNN: v8

The model loads without any memory issues.
Now, when it comes to training, I am using mrm8488/codebert-base-finetuned-detect-insecure-code, which is based on roberta-base. When compiling myself, I am not able to push sequence_length up to 1024 (with batch_size=1), but with your version it works. I would like to know what is different.

Oh, so just to make sure we are on the same page: was the original OOM during loading or during training?

The archive I sent is precompiled with CUDA 12.1, rather than 12.0. According to the Jax README it should still be compatible with CUDA 12.0, in which case we would be fine precompiling with 12.1. I will need to do some further tests.

First I had an OOM during loading.
I managed to solve it using my own build of the XLA extension.
I was able to train with batch_size=1 and sequence_length=64 or 128.
With sequence_length above 128, I get an OOM error during training.
Something like this:

** (RuntimeError) Out of memory while trying to allocate 4290931080 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    1.86GiB
              constant allocation:     2.2KiB
        maybe_live_out allocation:    1.86GiB
     preallocated temp allocation:    4.00GiB
  preallocated temp fragmentation:     9.1KiB (0.00%)
                 total allocation:    7.71GiB
              total fragmentation:  656.76MiB (8.32%)
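For intuition on why going above sequence_length=128 blows up even at batch_size=1, here is a rough back-of-the-envelope (illustrative numbers, not actual XLA buffers; the helper function is hypothetical). Standard self-attention materialises score matrices of shape (batch, heads, seq_len, seq_len) per layer, so that part of the activation memory grows quadratically with sequence length, and training keeps activations around for the backward pass on top of parameters and gradients:

```python
def attention_scores_bytes(batch, heads, layers, seq_len, bytes_per_elem=4):
    """Memory for the attention score matrices alone, across all layers (fp32)."""
    return batch * heads * layers * seq_len * seq_len * bytes_per_elem

# roberta-base has 12 layers and 12 attention heads.
for seq_len in (64, 128, 512, 1024):
    mib = attention_scores_bytes(1, 12, 12, seq_len) / 1024**2
    print(f"seq_len={seq_len:>4}: ~{mib:7.1f} MiB just for attention scores")
# seq_len=1024 comes out to ~576 MiB for the scores alone, 256x the
# seq_len=64 case -- before counting any other activations or gradients.
```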

The OOM while training seems to be back; I am a little confused.
Here is my livemd, if that helps.

First I had an OOM during loading.

Ok, so there must be buggy behaviour in the archive when compiling with 12.0, and compiling with 12.1 fixes it.

The OOM while training seems to be back

If you restart the runtime and run the whole notebook once, is the OOM behaviour consistent or is it random?

I think I was too optimistic in my previous response.
With my custom-built version it OOMed on epoch two; here it OOMed on epoch four. I only watched until epoch three and concluded that it would not OOM. Conclusion: it happens later, but it still OOMs.

I see, thanks for all the details! So the OOM during training is not version-specific for the most part. As for the initial one, I will investigate more and probably update the precompiled versions : )

Do you have anything to suggest regarding the OOM during training? You said the model is quite small, so it should fit on a card with 24 GB of RAM.
Is it the

     preallocated temp allocation:    4.00GiB

that is too small?

I was talking about loading specifically, comparing the 0.5 GB model to the available 24 GB. Training can definitely be memory-expensive. There are likely memory optimisations we could do in axon/bumblebee specific to this kind of training; so far we've been mostly focused on inference and deployment in bumblebee.

@steffel what’s the exact version of cuDNN you are using? (apt-cache policy libcudnn8 | head -n 3) I tried to reproduce this on a T4 (16 GB) using the latest CUDA and cuDNN, but the model loaded fine.

The model loads fine when I build XLA myself or use your suggested version. It’s during fine-tuning that it runs out of memory.

Regarding the version:

Yeah, I tried to reproduce using the usual XLA_TARGET=cuda120. Maybe there is something specific to the GPU model, the driver, or something else. I will probably just bump the precompiled/required versions to match Jax.

@steffel I published xla v0.5.1 with the updated archive; feel free to update (or run Mix.install([...], force: true)) to give it a try.
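Concretely, something like this in the first Livebook cell (a sketch; the bumblebee/exla requirements are placeholders, only the xla v0.5.1 bump is from this thread):

```elixir
# force: true makes Mix.install re-fetch and rebuild the dependencies,
# picking up the newly published precompiled XLA archive.
Mix.install(
  [
    {:bumblebee, "~> 0.3"},
    {:exla, "~> 0.5"},
    {:xla, "~> 0.5.1"}
  ],
  force: true
)
```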
