Using Bumblebee for Text Generation

I’m brand new to AI and to using Nx/Bumblebee. I have attempted to adapt this example from the Bumblebee docs to generate a narrative description of a kids’ math problem using GPT-2:

  def make_story_question(%MathQuiz.Models.MathQuizItem{} = question) do
    model_name = "openai-community/gpt2"

    # Load the model, tokenizer, and generation config from the Hugging Face hub
    {:ok, model_info} = Bumblebee.load_model({:hf, model_name})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, model_name})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, model_name})

    serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)

    question_prompt =
      "Write a narrative story question for children for the math problem #{question.first_num} plus #{question.second_num}."
      |> IO.inspect(label: "Prompt")

    # text_input = Kino.Input.text(question_prompt, default: "Tomorrow it will be")
    # text = Kino.Input.read(text_input)

    Nx.Serving.run(serving, question_prompt)
  end

When I run the code above, the function never completes (even after 10 minutes) on a moderately performant desktop. An example of the prompt it prints is:

Prompt: “Write a narrative story question for children for the math problem 3 plus 3.”

I suspect that the Nx.Serving.run call is opening a server and never closing it. How do I get this function to return the response from the model? Am I doing something stupidly basic wrong?


Do you have the configuration for EXLA set? You can find it right at the top of that reference link.
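
For example, something along these lines in config/config.exs (a minimal sketch, assuming exla is already a dependency; see the docs for the exact recommended setup):

# config/config.exs
import Config

# Use EXLA as the default backend so Nx does not fall back to the
# pure-Elixir BinaryBackend, which is far too slow for model inference
config :nx, default_backend: EXLA.Backend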

I did not - I added the following line to my application.ex start function:

Nx.global_default_backend(EXLA.Backend)

The app ran, but the text generation was really wonky. It always gives one of two answers: either “The answer to the question is” or “No human can do this problem”.

I can keep playing with the model itself, but what’s particularly strange (in that I don’t fully understand what’s happening under the hood) is how this process relates to the Phoenix app I’m using to present the output.

Ideally I’d like this to work in an agentic sense: I send the LLM a prompt, it gives me a response, and that’s the end of the interaction. I’ve had a few different anomalies in how this plays out, but in general the page is not fully loading/mounting.

For reference, here are the relevant portions of my LiveView module:

def mount(_params, _session, socket) do
    # quiz_id = params["quiz_id"]
    IO.puts("Mounting Narrative Component")
    quiz_id = "1"

    quiz =
      case MathQuiz.Quiz.fetch_quiz(quiz_id) do
        {:ok, quiz} -> quiz |> Quiz.shuffle_quiz()
        {:error, _msg} -> :error
        _ -> :error
      end
      |> make_quiz_narrative()

    IO.puts("Narrative Component Created.")

    {:ok,
     socket
     |> assign(quiz_id: quiz_id)
     |> assign(quiz: quiz)}
end

def make_quiz_narrative(%Models.MathQuiz{} = quiz) do
    quiz
    |> Map.update!(:questions, fn questions ->
      Enum.map(questions, &add_narrative_description/1)
    end)
    |> IO.inspect(label: "Story Maker")
end

def add_narrative_description(%Models.MathQuizItem{} = item) do
    response = Quiz.make_story_question(item)

    # The non-streaming generation serving returns %{results: [%{text: text, ...}]}
    text =
      case response do
        %{results: [%{text: text} | _]} -> text
        _ -> "N/A"
      end

    Map.put(item, :story_text, text) |> IO.inspect()
end

Is there a different way I should be evaluating the model?

I believe your make_story_question is loading the model on every call. You should probably have a named serving in your application tree and call it by name from your view.

You also likely want assign_async so that the rest of the page can load while the first model call is being evaluated.
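
Roughly something like this in the mount (a sketch only, assuming LiveView 0.20+ for assign_async and a serving already started under a name such as MyLLM; the function and assign names mirror the code above but are illustrative):

def mount(_params, _session, socket) do
  quiz_id = "1"

  {:ok,
   socket
   |> assign(quiz_id: quiz_id)
   # Run the expensive quiz + LLM work off the mount path so the page can render
   |> assign_async(:quiz, fn ->
     with {:ok, quiz} <- MathQuiz.Quiz.fetch_quiz(quiz_id) do
       {:ok, %{quiz: quiz |> Quiz.shuffle_quiz() |> make_quiz_narrative()}}
     end
   end)}
end

In the template the assign would then be rendered with <.async_result :let={quiz} assign={@quiz}> so a loading state can show while the serving call runs.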

I appreciate the suggestion - and I’ve been working through this.

I’ve run into two classes of problems, and I’m not quite sure what’s happening in either case.

Now I start the serving as part of my supervision tree in application.ex, like so:

children = [
  # ...
  {Nx.Serving, serving: setup_llm(), name: MyLLM},
  {MathQuiz.QuizCache, name: MathQuiz.QuizCache},
  # ...
  MathQuizWeb.Endpoint
]

def setup_llm() do
    # token = File.read!("token.txt")
    repo = {:hf, "openai-community/gpt2"}

    {:ok, model_info} = Bumblebee.load_model(repo, backend: EXLA.Backend)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
    {:ok, generation_config} = Bumblebee.load_generation_config(repo)

    generation_config =
      Bumblebee.configure(generation_config,
        max_new_tokens: 256,
        strategy: %{type: :multinomial_sampling, top_p: 0.6}
      )

    Bumblebee.Text.generation(model_info, tokenizer, generation_config,
      compile: [batch_size: 10, sequence_length: 1028],
      stream: true,
      defn_options: [compiler: EXLA]
    )
  end

The code above compiles and appears to work.

When I run this function:

def make_story_question(%MathQuiz.Models.MathQuizItem{} = question) do
    question_prompt =
      "Write a narrative story question for children for the math problem #{question.first_num} plus #{question.second_num}.  Please provide the response in json format, with the text having the key 'storyText'."
      |> IO.inspect(label: "Prompt")

    Nx.Serving.batched_run(MyLLM, question_prompt) |> IO.inspect(label: "NX Output")
  end

I get the following output on the terminal:

Prompt: "Write a narrative story question for children for the math problem 10 plus 4.  Please provide the response in json format, with the text having the key 'storyText'."
NX Output: #Function<61.70938898/2 in Stream.transform/3>

Is this expected? Looking at the batched_run docs, I thought I’d get the output from the LLM back (instead of a function or stream).

The second class of problem is that every new model I try to bring in (other than GPT-2 from the examples) raises an error saying the architecture is unrecognized. Is there a resource I should read on how to map a model’s documentation to a supported architecture in Bumblebee? Example: openai/gpt-oss-20b

I’m not sure why you’re getting a stream out of your function, but I suspect it has to do with stream: true. Try removing that for now, or calling Enum.to_list on the output of batched_run. Definitely open an issue so that we can improve documentation!

The rest of the code looks correct.
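
For example, the two options might look roughly like this (a sketch, assuming the MyLLM serving above; when streaming, the stream should emit plain text chunks):

# Option 1: drop `stream: true` from setup_llm/0 and read the result map directly
%{results: [%{text: text} | _]} = Nx.Serving.batched_run(MyLLM, question_prompt)

# Option 2: keep `stream: true` and collect the streamed chunks into one string
text =
  MyLLM
  |> Nx.Serving.batched_run(question_prompt)
  |> Enum.join()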

Regarding architectures, Bumblebee has a set of supported architectures. I don’t recall if that’s documented, or if you need to open the codebase to get the list.

You can find the list here: bumblebee/lib/bumblebee.ex at main · elixir-nx/bumblebee · GitHub
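
A quick way to check whether a given repo maps to one of those architectures is to try loading just the spec (a sketch; the exact error message will vary):

# Returns {:ok, spec} for supported models and {:error, reason} otherwise
case Bumblebee.load_spec({:hf, "openai/gpt-oss-20b"}) do
  {:ok, spec} -> IO.inspect(spec, label: "Loaded spec")
  {:error, reason} -> IO.inspect(reason, label: "Could not load spec")
end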

This is super helpful - thanks.

I can’t believe I missed the stream configuration - it tells you I’m working on this side project when I’m way too tired.

I was able to get the app to function - the GPT model just gave me a bunch of bogus references to what looked like malware sites and a plea to ask questions about bitcoin instead. But now I can work to get new models integrated that provide more appropriate responses.

I’m happy to contribute to supporting the documentation - the least I can do as a newcomer to the space is to help with the writing. Thanks for all of the help above.
