Memory issues following Bumblebee example

Hello! I’m following the fine-tuning example here (bumblebee/fine_tuning.livemd at main · elixir-nx/bumblebee · GitHub) to fine-tune the BERT model from Hugging Face using Bumblebee, and I keep running into memory issues.

I’m setting up a Tesla T4 GPU on GCP (posted my setup steps here: Getting set up to do GPU accelerated machine learning in Elixir on a fresh Google Cloud Platform VM · GitHub – would love feedback on whether there’s an easier way to get started).

After running into OOM a few times on smaller VMs, I set up an n1-standard-16 with 60GB of memory and ran into issues again (I’ll post the output below).

Is 60GB not enough to run this example? Is it an issue with the T4’s memory? Is there a recommended resource allocation for running Bumblebee examples?

[edit] It does seem like the memory issue is on the GPU. Can anyone chime in on whether it’s simply not possible to run this notebook on a single T4, versus something I’m doing wrong?
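One knob worth checking before resizing VMs is how the EXLA CUDA client allocates GPU memory. This is only a sketch of a Livebook setup cell, assuming the documented EXLA client options `:preallocate` and `:memory_fraction`; the package versions shown are illustrative, not the notebook's exact ones:

```elixir
# Livebook setup cell: configure the EXLA CUDA client before anything
# allocates on the GPU. Disabling preallocation trades some speed for
# a clearer picture of actual usage; memory_fraction caps the share of
# GPU memory the client may claim.
Mix.install(
  [
    {:bumblebee, "~> 0.3"},
    {:exla, "~> 0.5"},
    {:axon, "~> 0.5"}
  ],
  config: [
    exla: [
      clients: [
        cuda: [platform: :cuda, preallocate: false, memory_fraction: 0.9]
      ]
    ]
  ]
)

# Route Nx computations to the CUDA client.
Nx.global_default_backend({EXLA.Backend, client: :cuda})
```

This won't create memory that isn't there, but it makes OOM failures easier to attribute to the actual working set rather than up-front preallocation.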

Thank you SO much!


14:58:53.921 [info] Sum Total of in-use chunks: 13.06GiB

14:58:53.921 [info] total_region_allocated_bytes_: 14019467520 memory_limit_: 14019467673 available bytes: 153 curr_region_allocation_bytes_: 28038935552

14:58:53.921 [info] Stats:
Limit:                     14019467673
InUse:                     14019467264
MaxInUse:                  14019467264
NumAllocs:                      304098
MaxAllocSize:                945403392
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0


14:58:53.921 [warning] ****************************************************************************************************

14:58:53.921 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2359296 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    1.61GiB
              constant allocation:     2.1KiB
        maybe_live_out allocation:    1.61GiB
     preallocated temp allocation:  901.61MiB
  preallocated temp fragmentation:   11.25MiB (1.25%)
                 total allocation:    4.11GiB
Peak buffers:
        Buffer 1:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 2:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 3:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 4:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 5:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 6:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 7:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 8:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 9:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 10:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 11:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 12:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 13:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 14:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 15:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================
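As a sanity check on the shapes in that dump: the repeated 84.95MiB buffers match an f32[28996, 768] BERT embedding table (cased vocabulary × hidden size), and the 24MiB fusion buffers match the notebook's batch size of 32 and sequence length of 64 against BERT's 3072-wide intermediate layer. The arithmetic, as a quick iex check:

```elixir
bytes_per_f32 = 4
mib = 1024 * 1024

# f32[28996, 768] — embedding table (vocab x hidden size)
28_996 * 768 * bytes_per_f32 / mib
# ≈ 84.95

# f32[32, 64, 3072] — batch x sequence length x intermediate size
32 * 64 * 3072 * bytes_per_f32 / mib
# 24.0
```

So the peak buffers line up with the notebook's default hyperparameters rather than anything unusual.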

Did you alter the livebook code in any way? Especially the sequence length and batch size parameters?
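If those two knobs are the bottleneck, shrinking them is the usual first step: halving batch size and sequence length roughly quarters the activation memory per step. A hypothetical sketch, not the notebook's exact code — `texts`, `tokenizer`, and the variable names are illustrative:

```elixir
# Illustrative: smaller batch and sequence length than the notebook's
# defaults (32 and 64) to reduce per-step activation memory.
batch_size = 16
sequence_length = 32

# Tokenize with a shorter padded length (truncates longer inputs).
inputs = Bumblebee.apply_tokenizer(tokenizer, texts, length: sequence_length)
```

The trade-off is slower convergence per wall-clock hour and truncated long inputs, but it's a cheap way to confirm whether memory is really the limiting factor.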

Thanks for your reply! No, pure copy and paste except for the CSV file path.

It also failed on my MacBook Pro when doing CPU training, which is what led me down the path of provisioning a GPU.

It’s odd to me that it’s failing then. The batch sizes are small, and the BEAM process seems to fluctuate around 5GB of RAM when running on the CPU.

Your GPU seems to have 14GB available, so I would expect that to work properly.
I recommend opening an issue on GitHub, because this does smell like some sort of memory leak.

Thanks :pray:, I’ll open an issue on Bumblebee

For what it’s worth, I watched the memory usage with a CPU backend for a while and saw it spike to 32GB used in the second epoch… maybe I just underestimated how memory-hungry this training process is.
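For scale: BERT base is roughly 110M parameters, and Adam keeps two extra moment tensors per parameter, so the persistent training state alone is about four copies of the model (parameters, gradients, and the two Adam moments). A rough back-of-envelope only, not an exact accounting:

```elixir
params = 110_000_000
bytes_per_f32 = 4
copies = 4  # params + gradients + Adam first and second moments

params * bytes_per_f32 * copies / (1024 * 1024 * 1024)
# ≈ 1.64 GiB of persistent training state
```

A 32GB spike is far beyond that, so most of the observed usage would have to be intermediate activation buffers and host-side copies rather than the training state itself.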

Ah, could you try forcing version 0.5.3 instead of 0.5.1, just as a sanity check?
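For reference, a hypothetical way to pin exact versions in the Livebook setup cell, assuming the 0.5.x versions refer to the Nx/EXLA packages (which were on that series at the time); the other entries are illustrative:

```elixir
# Pin exact versions (no "~>" range) to force 0.5.3 of Nx and EXLA.
Mix.install([
  {:nx, "0.5.3"},
  {:exla, "0.5.3"},
  {:axon, "~> 0.5"},
  {:bumblebee, "~> 0.3"}
])
```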

No difference with 0.5.3 :frowning:

I was able to get this fine-tuning example working with the medium BERT model from Hugging Face. Here is the full source if you want to see it in action

For those curious to learn more about the full setup, and even what the feedback loop is like between CPU and GPU :slight_smile:, I wrote a blog post about my experience this weekend:

https://toranbillups.com/blog/archive/2023/04/29/training-axon-models-with-nvidia-gpus/


Awesome!