Bumblebee Llama 2 example runs out of memory on a Kaggle CPU/GPU instance

I created a Jupyter Notebook to run Livebook on Kaggle. It is a CPU+GPU instance with 29GB of CPU memory and 15.9GB of GPU memory; the GPU is a P100.

The Livebook app came up successfully. However, when running the Llama 2 example from the Bumblebee documentation, I got the following error:

** (RuntimeError) Out of memory while trying to allocate 90177536 bytes.
(exla 0.6.1) lib/exla/device_buffer.ex:55: EXLA.DeviceBuffer.unwrap!/1
(exla 0.6.1) lib/exla/device_buffer.ex:22: EXLA.DeviceBuffer.place_on_device/4
(exla 0.6.1) lib/exla/backend.ex:46: EXLA.Backend.from_binary/3
(bumblebee 0.4.2) lib/bumblebee/conversion/pytorch/loader.ex:79: Bumblebee.Conversion.PyTorch.Loader.object_resolver/1
(unpickler 0.1.0) lib/unpickler.ex:828: Unpickler.resolve_object/2
(unpickler 0.1.0) lib/unpickler.ex:818: anonymous fn/2 in Unpickler.finalize_stack_items/2
(elixir 1.15.7) lib/map.ex:957: Map.get_and_update/3
#cell:aq3ma36lrddcxej7ksiueluzuh4wwxhw:4: (file)

It happened while loading the model and creating the serving, before Livebook could move on to the inference step.

One thing I noticed is that Kaggle shows 29GB of CPU memory, while the Livebook runtime reports 32GB. Is this a problem?

I’m able to run the same example on a Windows laptop with WSL. The Ubuntu instance is assigned 38GB of memory, and everything works fine.

What are the CPU and GPU memory requirements to run the Llama-2-7b-chat-hf model in Livebook?

After trying some combinations, I managed to bring up the Llama-2-7b model on Kaggle’s P100 GPU.

To summarize, the following two changes made the difference:

  • Load the model onto the CPU instead of the GPU, which saves GPU memory:
{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
  • Change the sequence_length from 1028 to 768:
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 768],
    stream: true,
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
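For context, here is how the two changes fit into the full setup, as a hedged sketch. It assumes the `meta-llama/Llama-2-7b-chat-hf` repo from the Bumblebee example and an `HF_TOKEN` environment variable for the gated weights; your repo and auth setup may differ.

```elixir
# Hypothetical repo spec; Llama 2 weights are gated, so an auth token is assumed.
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: System.fetch_env!("HF_TOKEN")}

# Load the parameters onto the CPU (host) backend to spare the ~16GB of GPU memory.
{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    # A smaller sequence_length shrinks the compiled computation's buffers.
    compile: [batch_size: 1, sequence_length: 768],
    stream: true,
    # lazy_transfers: :always moves parameters to the GPU on demand
    # instead of uploading them all at once.
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```

The combination matters: the parameters live on the host, and `lazy_transfers: :always` lets EXLA copy them to the device only as the computation needs them.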

Changing max_new_tokens between 128 and 512 didn’t help with the OOM.

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
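For completeness, a minimal sketch of running the serving, assuming the `serving` built above and a placeholder prompt. With stream: true, Nx.Serving.run/2 returns a stream of text chunks rather than a single result.

```elixir
# Consume the generated text chunk by chunk as the model produces them.
serving
|> Nx.Serving.run("What is the capital of France?")
|> Enum.each(&IO.write/1)
```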