Hello! I’m following the example at bumblebee/fine_tuning.livemd at main · elixir-nx/bumblebee · GitHub for fine-tuning the BERT model from Hugging Face using Bumblebee, and I keep running into memory issues.
I’m set up with a Tesla T4 GPU on GCP (I posted my setup steps here: Getting set up to do GPU accelerated machine learning in Elixir on a fresh Google Cloud Platform VM · GitHub – I’d love feedback on whether there’s an easier way to get started).
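For reference, my Livebook setup cell is essentially the one from the notebook. The version constraints and XLA_TARGET value below are from my own notes, so treat them as one working configuration rather than the canonical one:

```elixir
# Sketch of my setup cell; the versions and XLA_TARGET are assumptions
# from my notes and may need adjusting for your CUDA install.
Mix.install(
  [
    {:bumblebee, "~> 0.4"},
    {:axon, "~> 0.6"},
    {:exla, "~> 0.6"},
    {:explorer, "~> 0.7"}
  ],
  config: [nx: [default_backend: EXLA.Backend]],
  system_env: [{"XLA_TARGET", "cuda120"}]
)
```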
After running into OOM a few times on smaller VMs, I set up an n1-standard-16 with 60 GB of RAM and ran into the same issue (I’ll post the output below).
Is 60 GB not enough to run this example? Is it an issue with the T4’s memory (16 GB)? Is there a recommended resource allocation for running the Bumblebee examples?
[edit] It does seem like the memory issue is on the GPU rather than in system RAM. Can anyone chime in on whether it’s simply not possible to run this notebook on a single T4, or whether I’m doing something wrong?
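For context, the f32[32,64,3072] buffers in the log below line up with the notebook’s batch size of 32 and sequence length of 64, and the f32[28996,768] buffers match BERT’s token-embedding matrix (vocab 28996 × hidden 768). The cells I’m running look roughly like this (a sketch from memory, not an exact copy of the notebook):

```elixir
# Load the model and tokenizer as in the fine-tuning notebook
# (num_labels: 5 is for the Yelp star-rating task it uses).
{:ok, spec} =
  Bumblebee.load_spec({:hf, "bert-base-cased"},
    architecture: :for_sequence_classification
  )

spec = Bumblebee.configure(spec, num_labels: 5)

{:ok, model} = Bumblebee.load_model({:hf, "bert-base-cased"}, spec: spec)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-cased"})

# These two values are what show up in the OOM report as f32[32,64,3072];
# shrinking either should cut activation memory roughly proportionally.
batch_size = 32
sequence_length = 64
```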
Thank you SO much!
```
14:58:53.921 [info] Sum Total of in-use chunks: 13.06GiB
14:58:53.921 [info] total_region_allocated_bytes_: 14019467520 memory_limit_: 14019467673 available bytes: 153 curr_region_allocation_bytes_: 28038935552
14:58:53.921 [info] Stats:
Limit: 14019467673
InUse: 14019467264
MaxInUse: 14019467264
NumAllocs: 304098
MaxAllocSize: 945403392
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
14:58:53.921 [warning] ****************************************************************************************************
14:58:53.921 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2359296 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 1.61GiB
constant allocation: 2.1KiB
maybe_live_out allocation: 1.61GiB
preallocated temp allocation: 901.61MiB
preallocated temp fragmentation: 11.25MiB (1.25%)
total allocation: 4.11GiB
Peak buffers:
Buffer 1:
Size: 84.95MiB
Entry Parameter Subshape: f32[28996,768]
==========================
Buffer 2:
Size: 84.95MiB
Entry Parameter Subshape: f32[28996,768]
==========================
Buffer 3:
Size: 84.95MiB
Entry Parameter Subshape: f32[28996,768]
==========================
Buffer 4:
Size: 84.95MiB
Entry Parameter Subshape: f32[28996,768]
==========================
Buffer 5:
Size: 84.95MiB
XLA Label: fusion
Shape: f32[28996,768]
==========================
Buffer 6:
Size: 84.95MiB
XLA Label: fusion
Shape: f32[28996,768]
==========================
Buffer 7:
Size: 84.95MiB
XLA Label: fusion
Shape: f32[28996,768]
==========================
Buffer 8:
Size: 84.95MiB
XLA Label: fusion
Shape: f32[28996,768]
==========================
Buffer 9:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[32,64,3072]
==========================
Buffer 10:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[32,64,3072]
==========================
Buffer 11:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[32,64,3072]
==========================
Buffer 12:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[2048,3072]
==========================
Buffer 13:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[2048,3072]
==========================
Buffer 14:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[2048,3072]
==========================
Buffer 15:
Size: 24.00MiB
XLA Label: fusion
Shape: f32[2048,3072]
==========================
```
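One thing I’m planning to try next, in case anyone can confirm it’s the right knob: the EXLA docs list :preallocate and :memory_fraction among the client options, so something like the following should stop XLA from grabbing nearly all of the T4’s memory up front (this is my reading of the docs, so corrections welcome):

```elixir
# Assumed EXLA client configuration (based on the EXLA.Client docs):
# disable up-front preallocation and cap how much of the GPU's memory
# the CUDA client may claim.
Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, preallocate: false, memory_fraction: 0.8],
  host: [platform: :host]
)
```

In Livebook this would go in the config: option of Mix.install/2 instead, so it’s set before the client starts.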