When I run this block for the first time:

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "distilbert-base-uncased"}, architecture: :base)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased"})
inputs = Bumblebee.apply_tokenizer(tokenizer, "This is a test")
Axon.predict(model_info.model, model_info.params, inputs).pooled_state
```
I get:

```
#Nx.Tensor<
  f32[1][768]
  EXLA.Backend<host:0, 0.3080042529.4070965263.103946>
  [
    [0.1084553599357605, 0.0384901687502861, 0.3454555869102478, 0.3366532623767853, 0.0, 0.0, 0.0, 0.0, 0.0, 0.24088211357593536, 0.2771815061569214, 0.31721076369285583, 0.0, 0.0, 0.0, 0.1535787135362625, 0.07263995707035065, 0.0, 0.2471829056739807, 0.24521753191947937, 0.038037270307540894, 0.0, 0.33873334527015686, 0.0, 0.08038126677274704, 0.05353151261806488, 0.1563723087310791, 0.0, 0.333274245262146, 0.0, 0.1448795050382614, 0.0, 0.13116799294948578, 0.4072483777999878, 0.0, 0.0, 0.3591342866420746, 0.0, 0.26890164613723755, 0.0, 0.0, 0.36578384041786194, 0.4254622161388397, 0.08779369294643402, 0.0, 0.2617552578449249, 0.0, 0.35009729862213135, 0.47235962748527527, 0.0, ...]
  ]
>
```
I assumed that if I ran the same block a second time, I would get the same embedding returned. Instead, re-running the exact same block — including the model and tokenizer loading lines — yields:
```
#Nx.Tensor<
  f32[1][768]
  EXLA.Backend<host:0, 0.3080042529.4070965263.104538>
  [
    [0.0, 0.0, 0.02861723303794861, 0.22635801136493683, 0.3732376992702484, 0.1636544167995453, 0.0, 0.5278993844985962, 0.32550013065338135, 0.0, 0.0, 0.008047151379287243, 0.0, 0.0, 0.18475914001464844, 0.6037576198577881, 0.0, 0.621748685836792, 0.0, 0.0, 0.3294147849082947, 0.0, 0.0, 0.12289489060640335, 0.0, 0.02070406638085842, 0.0, 0.21100129187107086, 0.24555926024913788, 0.0, 0.0, 0.0415930338203907, 0.4362318515777588, 0.0, 0.0, 0.0, 0.0, 0.0, 0.30100217461586, 0.1799420416355133, 0.4896824359893799, 3.664734831545502e-4, 0.0, 0.0, 0.15452004969120026, 0.0, 0.0, 0.0, 0.08496798574924469, 0.0, ...]
  ]
>
```
My questions
- Is there a stochastic aspect I am not controlling when re-loading the model and tokenizer?
- Which aspect of model/tokenizer loading am I not controlling that induces the variation?
- Is it possible to reload the same “version” (same “seeds”?) of a model over and over to ensure embeddings are consistent from one load to the other?
- Or do I have to control other stochastic aspects of the logic? Maybe in the pooling algorithm?
Some more things I tried

To no avail, I tried:

- Adding `mode: :inference`, i.e. running:

  ```elixir
  Axon.predict(model_info.model, model_info.params, inputs, mode: :inference).pooled_state
  ```

- Setting:

  ```elixir
  :rand.seed(:default, {123, 456, 789})
  ```

  before the `Axon.predict` call (both with and without `mode: :inference`).

In all cases, I am still getting different results on repeated runs.
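To narrow down where the randomness enters, here is a diagnostic sketch I am considering (it assumes `model_info.params` is a nested map of layer name to parameter tensors, which I have not confirmed against my Bumblebee version): load the model twice and compare the parameters directly. If the two loads already disagree, the variation comes from loading, not from `Axon.predict`.

```elixir
# Load the same checkpoint twice.
{:ok, a} = Bumblebee.load_model({:hf, "distilbert-base-uncased"}, architecture: :base)
{:ok, b} = Bumblebee.load_model({:hf, "distilbert-base-uncased"}, architecture: :base)

# Assumption: params is a map of layer name => %{param name => tensor};
# the exact structure may differ across Bumblebee versions.
Enum.each(a.params, fn {layer, tensors} ->
  Enum.each(tensors, fn {name, tensor} ->
    equal? =
      Nx.equal(tensor, b.params[layer][name])
      |> Nx.all()
      |> Nx.to_number() == 1

    unless equal?, do: IO.puts("params differ at #{layer}.#{name}")
  end)
end)
```

Any layer printed by this sketch would be one whose weights are not coming deterministically from the checkpoint.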
Notes
In case that’s relevant, I am using:

```elixir
config :nx, default_backend: EXLA.Backend
config :nx, :default_defn_options, compiler: EXLA, client: :host
```