wrgoldstein

Memory issues following Bumblebee example

Hello! I’m following the fine-tuning example from the Bumblebee repository (bumblebee/notebooks/fine_tuning.livemd at main · elixir-nx/bumblebee · GitHub) to fine-tune the BERT model from Hugging Face using Bumblebee, and I keep running into memory issues.

I’m setting up a Tesla T4 GPU on GCP. I posted my setup steps on GitHub (“Getting set up to do GPU accelerated machine learning in Elixir on a fresh Google Cloud Platform VM”) and would love feedback on whether there’s an easier way to get started.

After running into OOM errors a few times on smaller VMs, I set up an n1-standard-16 with 60GB of memory and ran into the same issue again (I’ll post the output below).

Is 60GB not enough to run this example? Is it an issue with the T4’s memory? Is there a recommended resource allocation for running Bumblebee examples?

[edit] It does seem like the memory issue is on the GPU. Can anyone chime in on whether it’s simply not possible to run this notebook on a single T4, or whether I’m doing something wrong?

Thank you SO much!


14:58:53.921 [info] Sum Total of in-use chunks: 13.06GiB

14:58:53.921 [info] total_region_allocated_bytes_: 14019467520 memory_limit_: 14019467673 available bytes: 153 curr_region_allocation_bytes_: 28038935552

14:58:53.921 [info] Stats:
Limit:                     14019467673
InUse:                     14019467264
MaxInUse:                  14019467264
NumAllocs:                      304098
MaxAllocSize:                945403392
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0


14:58:53.921 [warning] ****************************************************************************************************

14:58:53.921 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2359296 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    1.61GiB
              constant allocation:     2.1KiB
        maybe_live_out allocation:    1.61GiB
     preallocated temp allocation:  901.61MiB
  preallocated temp fragmentation:   11.25MiB (1.25%)
                 total allocation:    4.11GiB
Peak buffers:
        Buffer 1:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 2:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 3:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 4:
                Size: 84.95MiB
                Entry Parameter Subshape: f32[28996,768]
                ==========================

        Buffer 5:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 6:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 7:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 8:
                Size: 84.95MiB
                XLA Label: fusion
                Shape: f32[28996,768]
                ==========================

        Buffer 9:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 10:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 11:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[32,64,3072]
                ==========================

        Buffer 12:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 13:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 14:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================

        Buffer 15:
                Size: 24.00MiB
                XLA Label: fusion
                Shape: f32[2048,3072]
                ==========================
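
The figures in this log can be checked by hand. The sketch below uses Python purely for the arithmetic (the helper name `f32_buffer_bytes` is made up for illustration and is not part of Bumblebee, Nx, or XLA). It recomputes the sizes of the peak buffers from their shapes, assuming 4 bytes per f32 element, and converts the allocator's `Limit` to GiB. That limit of roughly 13.06 GiB is what the allocator had available on the GPU, not the VM's 60GB of host RAM, which fits the observation in the edit that the shortage is on the GPU.

```python
from math import prod

BYTES_PER_F32 = 4       # assumption: f32 means 32-bit floats, 4 bytes each
MIB = 1024 ** 2
GIB = 1024 ** 3


def f32_buffer_bytes(shape):
    """Bytes needed for a dense float32 tensor of the given shape."""
    return prod(shape) * BYTES_PER_F32


# Shapes copied from the "Peak buffers" section of the log.
peak_shapes = [(28996, 768), (32, 64, 3072), (2048, 3072)]
for shape in peak_shapes:
    print(shape, f"{f32_buffer_bytes(shape) / MIB:.2f} MiB")

# "Limit" from the allocator stats, converted to GiB.
limit_bytes = 14019467673
print(f"limit: {limit_bytes / GIB:.2f} GiB")
```

The computed sizes (84.95 MiB and 24.00 MiB) match the log. One possible reading is that `f32[32,64,3072]` is batch size × sequence length × a 3072-wide layer, in which case those buffers would shrink in proportion to a smaller batch size or sequence length. That reading is an inference from the shapes, not something the log states.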

Most Liked Responses

toranb

I was able to get this fine-tuning example working with the medium BERT model from Hugging Face. Here is the full source if you want to see that in action.

For those curious to learn more about the full setup, and even what the feedback loop is like between CPU and GPU, I wrote a blog post about my experience this weekend.
