Using Phi3.5 with Bumblebee: (ArgumentError) could not match the class name "Phi3ForCausalLM" to any of the supported models

I am trying to see how I can use different HF models with Bumblebee. I am trying to load microsoft/Phi-3.5-mini-instruct:

{:ok, microsoft} = Bumblebee.load_model({:hf, "microsoft/Phi-3.5-mini-instruct"})

and I am getting the following error:

** (ArgumentError) could not match the class name "Phi3ForCausalLM" to any of the supported models, please specify the :module and :architecture options
    (bumblebee 0.5.3) lib/bumblebee.ex:409: Bumblebee.do_load_spec/4
    (bumblebee 0.5.3) lib/bumblebee.ex:578: Bumblebee.maybe_load_model_spec/3
    (bumblebee 0.5.3) lib/bumblebee.ex:566: Bumblebee.load_model/2
    #cell:etc7eozxbver4bzi:1: (file

but lib/bumblebee.ex lists Phi3ForCausalLM. Shouldn't it be able to load this, since the module and architecture are listed there? Am I missing something? I have just started exploring this space.

It was added after the last release, so try bumblebee main : )
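For reference, pointing Mix.install at the GitHub repo is enough to pick up main. A minimal sketch (the Nx/EXLA versions here are just the ones used later in this thread):

```elixir
Mix.install([
  # Bumblebee main, where Phi-3 support landed after the 0.5.3 release
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, "~> 0.8.0", override: true},
  {:exla, "~> 0.8.0", override: true}
])

Nx.global_default_backend(EXLA.Backend)

{:ok, model_info} = Bumblebee.load_model({:hf, "microsoft/Phi-3.5-mini-instruct"})
```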


Thanks! Not sure if I made progress or not; I got a new error:

** (RuntimeError) conversion failed, invalid format for "rope_scaling", got: %{"long_factor" => [1.0800000429153442, 1.1100000143051147, 1.1399999856948853, 1.340000033378601, 1.5899999141693115, 1.600000023841858, 1.6200000047683716, 2.620000123977661, 3.2300000190734863, 3.2300000190734863, 4.789999961853027, 7.400000095367432, 7.700000286102295, 9.09000015258789, 12.199999809265137, 17.670000076293945, 24.46000099182129, 28.57000160217285, 30.420001983642578, 30.840002059936523, 32.590003967285156, 32.93000411987305, 42.320003509521484, 44.96000289916992, 50.340003967285156, 50.45000457763672, 57.55000305175781, 57.93000411987305, 58.21000289916992, 60.1400032043457, 62.61000442504883, 62.62000274658203, 62.71000289916992, 63.1400032043457, 63.1400032043457, 63.77000427246094, 63.93000411987305, 63.96000289916992, 63.970001220703125, 64.02999877929688, 64.06999969482422, 64.08000183105469, 64.12000274658203, 64.41000366210938, 64.4800033569336, 64.51000213623047, 64.52999877929688, 64.83999633789062], "short_factor" => [1.0, 1.0199999809265137, 1.0299999713897705, 1.0299999713897705, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0699999332427979, 1.0999999046325684, 1.1099998950958252, 1.1599998474121094, 1.1599998474121094, 1.1699998378753662, 1.2899998426437378, 1.339999794960022, 1.679999828338623, 1.7899998426437378, 1.8199998140335083, 1.8499997854232788, 1.8799997568130493, 1.9099997282028198, 1.9399996995925903, 1.9899996519088745, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0799996852874756, 2.0899996757507324, 2.189999580383301, 2.2199995517730713, 2.5899994373321533, 2.729999542236328, 2.749999523162842, 2.8399994373321533], "type" => "longrope"}
    (bumblebee 0.5.3) lib/bumblebee/shared/converters.ex:20: anonymous fn/3 in Bumblebee.Shared.Converters.convert!/2
    (elixir 1.17.2) lib/enum.ex:2531: Enum."-reduce/3-lists^foldl/2-0-"/3
    (bumblebee 0.5.3) lib/bumblebee/shared/converters.ex:14: Bumblebee.Shared.Converters.convert!/2
    (bumblebee 0.5.3) lib/bumblebee/text/phi3.ex:447: Bumblebee.HuggingFace.Transformers.Config.Bumblebee.Text.Phi3.load/2
    (bumblebee 0.5.3) lib/bumblebee.ex:452: Bumblebee.do_load_spec/4
    (bumblebee 0.5.3) lib/bumblebee.ex:603: Bumblebee.maybe_load_model_spec/3
    (bumblebee 0.5.3) lib/bumblebee.ex:591: Bumblebee.load_model/2
    #cell:ihx26ou7pekcazv2:1: (file)

I have a few more queries; let me know if I should open another issue, or whether they can be answered here if it's in your domain.

I am using Livebook in a container and have the /data folder mapped to a volume.

  1. My notebooks are getting saved, but the models are downloaded on every load. Where do models get downloaded to? Can I map that directory to avoid redownloading, or is it all in memory for now?
  2. I have env vars passed to the running container and can see them available in the running container, but they are not available to Livebook. What is needed to make sure the env vars in the container are accessible to the Livebook notebooks?

Mapping /home/livebook solves the issue of redownloading models on every restart. Is there any downside to mapping `/home/livebook`? It also helps to persist secrets, which is for now the easy option until I can figure out how to access the env vars in Livebook that are available in the container.

Not sure if I made progress or not; I got a new error

Ah, there have been recent changes in huggingface/transformers around the rope scaling strategies. I plan to look into that soon and will let you know once it's fixed.

My notebooks are getting saved, but the models are downloaded on every load.

By default, Bumblebee caches downloads in {user_cache}/bumblebee, so ~/.cache/bumblebee on Linux. This can be customized with BUMBLEBEE_CACHE_DIR, but mounting a volume at /home/livebook sounds good.
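If it's easier than wiring it into the compose file, the env var can also be set from the notebook itself before the first download. A minimal sketch (the /data/bumblebee_cache path is just an illustration for a mounted volume):

```elixir
# Point Bumblebee's download cache at a persistent volume so models
# survive container restarts (illustrative path).
System.put_env("BUMBLEBEE_CACHE_DIR", "/data/bumblebee_cache")

{:ok, model_info} = Bumblebee.load_model({:hf, "microsoft/Phi-3.5-mini-instruct"})
```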

What is needed to make sure the env vars in the container are accessible to the Livebook notebooks?

Can you provide a minimal reproduction? It works as expected when I run the container and specify -e FOO=foo : )

@darnahsan I fixed the rope scaling, so you should be able to get past that error on Bumblebee main : )


I am past the rope scaling error, but I keep getting this:
[screenshot of the error]

I have tried Phi-3.5 and nvidia/Llama-3.1-Minitron-4B-Width-Base, and the same error pops up after downloading the model.

{:ok, model} = Bumblebee.load_model({:hf, "nvidia/Llama-3.1-Minitron-4B-Width-Base"})

The logs don't have any errors in them.

My guess is that this is the runtime running out of memory. How much RAM do you have? (or GPU memory)

Btw, I noticed some params warnings when loading nvidia/Llama-3.1-Minitron-4B-Width-Base and discovered there's a new config parameter we don't load; I've just fixed it on main.

I can successfully load it (CPU, 32 GB RAM).

It was indeed a memory issue; though I could see RAM available on the Livebook home screen, it was not available to the container. I managed to run it, but the Llama-3.1-Minitron and SmolLM models are outputting garbage; both have the same architecture, LlamaForCausalLM. I can see the below message in the logs while running them; could this be the reason?

Phi-3.5 seems to be working well.

I am passing the env vars through the compose file and can see them set in the container as well.

```yaml
services:
  # LIVEBOOK
  mantis:
    image: "darnahsan/mantis:${LIVEBOOK_TAG}"
    container_name: mantis
    environment:
      - LIVEBOOK_TAG=${LIVEBOOK_TAG}
      - LIVEBOOK_PASSWORD=${LIVEBOOK_PASSWORD}
      - LIVEBOOK_PORT=${LIVEBOOK_PORT:-8080}
      - LIVEBOOK_IFRAME_PORT=${LIVEBOOK_IFRAME_PORT:-8081}
      - LIVEBOOK_HF_TOKEN=${LIVEBOOK_HF_TOKEN}
      - LIVEBOOK_SHUTDOWN_ENABLED=${LIVEBOOK_SHUTDOWN_ENABLED}
      - EXTERNAL_PORT=${EXTERNAL_PORT:-18080}
      - EXTERNAL_IFRAME_PORT=${EXTERNAL_IFRAME_PORT:-18081}
    volumes:
      - "livebook_data:/data"
      - "livebook_home:/home/livebook/"
    ports:
      - "${EXTERNAL_PORT}:${LIVEBOOK_PORT}"
      - "${EXTERNAL_IFRAME_PORT}:${LIVEBOOK_IFRAME_PORT}"
    # mem_reservation: ${LIVEBOOK_MEM_MIN:-8G}
    restart: unless-stopped

volumes:
  livebook_data: {}
  livebook_home: {}
```

Llama-3.1-Minitron and SmolLM models are outputting garbage

I tried nvidia/Llama-3.1-Minitron-4B-Width-Base and it worked fine, but SmolLM has another option related to parameters that we didn’t respect, hence the debug message. I’ve just fixed it on main and HuggingFaceTB/SmolLM-1.7B-Instruct also works.

For the record, here’s the notebook:

# llama3

```elixir
Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, "~> 0.8.0", override: true},
  {:exla, "~> 0.8.0", override: true},
  {:kino, "~> 0.14.0"}
])

Nx.global_default_backend(EXLA.Backend)
```

## Section

```elixir
# repo = {:hf, "nvidia/Llama-3.1-Minitron-4B-Width-Base"}
repo = {:hf, "HuggingFaceTB/SmolLM-1.7B-instruct"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok
```

<!-- livebook:{"output":true} -->

```

12:14:16.182 [info] Loaded cuDNN version 90400

```

<!-- livebook:{"output":true} -->

```
:ok
```

```elixir
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256
    # strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 256],
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: Llama, serving: serving})
```

<!-- livebook:{"output":true} -->

```
{:ok, #PID<0.261.0>}
```

```elixir
prompt = "Complete the paragraph: our solar system is"

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
```

<!-- livebook:{"output":true} -->

```
 a vast and complex system of celestial bodies that orbit around the sun. The planets in our solar system are divided into two categories: terrestrial planets and gas giants. The terrestrial planets are rocky and have a solid surface, while the gas giants are composed mostly of hydrogen and helium gases. The four terrestrial planets are Mercury, Venus, Earth, and Mars, while the four gas giants are Jupiter, Saturn, Uranus, and Neptune. The gas giants are much larger than the terrestrial planets, with Jupiter being the largest. The gas giants are also much more massive than the terrestrial planets, with Jupiter being over 10 times larger than Earth. The gas giants are also much more distant from the sun than the terrestrial planets, with Jupiter being over 10 times farther away than Earth.
```

<!-- livebook:{"output":true} -->

```
:ok
```

LIVEBOOK_HF_TOKEN

All env vars with the LIVEBOOK_ prefix are used to configure Livebook itself and we remove them from the env on startup; that's why it is not propagated to the runtime. You can rename it to LB_HF_TOKEN and it should work. You can also set the token using Livebook secrets; see the lock icon in the session sidebar.
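For example, once the variable is exposed as LB_HF_TOKEN, the notebook can read it and pass it along when loading a model. A minimal sketch (the repo name is just an example):

```elixir
hf_token = System.fetch_env!("LB_HF_TOKEN")

{:ok, model_info} =
  Bumblebee.load_model({:hf, "microsoft/Phi-3.5-mini-instruct", auth_token: hf_token})
```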


Thanks, I managed to get the models to run on CPU, though they tend to get stuck in a loop after a couple of sentences :grinning:.

I have one question: I have used Ollama to run the same models with the Ollama web UI and they seem to work fine on GPU, but when I load them using Bumblebee they always seem to run out of memory. The smaller, under-1B models tend to work fine. Is it something Ollama does that makes it magically work on GPU even for larger models, or could they be quantized versions?

Thanks, I managed to get the models to run on CPU, though they tend to get stuck in a loop after a couple of sentences :grinning:.

If you can provide a notebook that reproduces it, I can have a look : )

Regarding Ollama, if quantization is involved then that is definitely a major memory usage reduction. There was recent work to support quantization in Axon, but currently it still requires loading the full-precision parameters first and converting them to a quantized version.


Bumblebee 101

```elixir
Mix.install([
  {:bumblebee, [github: "elixir-nx/bumblebee", override: true]},
  {:nx, "~> 0.8.0", [override: true]},
  {:exla, "~> 0.8.0", [override: true]},
  {:kino, "~> 0.14.0", [override: true]},
  {:kino_bumblebee, [github: "livebook-dev/kino_bumblebee", override: true]},
  {:kino_db, "~> 0.2.12"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})
```

Untitled

```elixir
hf_token = System.fetch_env!("LB_HF_TOKEN")

{:ok, bert} = Bumblebee.load_model({:hf, "google-bert/bert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google-bert/bert-base-uncased"})

serving = Bumblebee.Text.fill_mask(bert, tokenizer)

text_input = Kino.Input.text("Sentence with mask", default: "The capital of [MASK] is Paris.")
text = Kino.Input.read(text_input)

Nx.Serving.run(serving, text)
```

```elixir
phi3_5 = {:hf, "microsoft/Phi-3.5-mini-instruct"}
t5_flant = {:hf, "google/flan-t5-large"}
smollm = {:hf, "HuggingFaceTB/SmolLM-1.7B"}
llama_minitron = {:hf, "nvidia/Llama-3.1-Minitron-4B-Width-Base"}

repo = phi3_5

{:ok, model} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 256)

serving =
  Bumblebee.Text.generation(model, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 256],
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: Model, serving: serving})

prompt = "Complete the paragraph: our solar system is"

Nx.Serving.batched_run(Model, prompt) |> Enum.each(&IO.write/1)
```

@darnahsan I’ve just run this on both CPU and GPU and got this:

After printing “planets.” it terminated immediately (didn’t get stuck).

(The only change I had to make is `defn_options: [compiler: EXLA, client: :host]` to make sure the serving also runs on CPU.)
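Applied to the serving definition from the notebook above, that change looks like this:

```elixir
serving =
  Bumblebee.Text.generation(model, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 256],
    stream: true,
    # run the serving computation on the CPU (host) client
    defn_options: [compiler: EXLA, client: :host]
  )
```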

The generation strategy here is deterministic, so I think you should get the same result. Is there something I missed?

Phi-3.5 runs well, but if you change to SmolLM or Llama you will see the repetitions after the first few words. That's what I encounter.

Oh, you mean the model repeats words, not that it gets stuck without output. So that is not a bug; it's just how the model behaves under the given configuration.

There are several factors that determine the model output. First of all, sometimes a model has separate "base" and "instruct" checkpoints. Base models are trained on a bunch of text to get "text understanding", and instruct models are further fine-tuned on conversations and tasks, to make them more usable in a chat-like interaction. For example, there is HuggingFaceTB/SmolLM-1.7B and HuggingFaceTB/SmolLM-1.7B-instruct; you probably want to use the latter.

Next, you want to make sure you use a prompt relevant for the given model. Instruct checkpoints usually have a certain template for specifying the conversation history. With the transformers Python library you can specify a template on the tokenizer, however it uses a Python-specific template, so we can't reliably load it. You can, however, find the template by looking for "chat_template" in tokenizer_config.json in the given repo. Sometimes the template is also present in the model readme. Continuing with SmolLM instruct, the template is here. So here is the prompt with the template:

```elixir
prompt = """
<|im_start|>user
Complete the paragraph: our solar system is<|im_end|>\
<|im_start|>assistant
"""
```

This produces a much better result.

Finally you can make the output non-deterministic (and more creative) by using a sampling strategy for the generation, such as this:

```elixir
generation_config =
  Bumblebee.configure(generation_config,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
```

(Sidenote: there is also :no_repeat_ngram_length to explicitly avoid repetitions, however it is a trade-off, because sometimes there are longer phrases, city names, etc., and preventing the model from reusing them worsens the usability.)
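For instance, a sketch (the n-gram length here is arbitrary):

```elixir
# Forbid the model from repeating any 4-token sequence it has already generated
generation_config =
  Bumblebee.configure(generation_config, no_repeat_ngram_length: 4)
```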


Thank you for your help.