Model no longer learning - any ideas?

I trained a model last year that we still use in prod, and I committed the code.
The model has >99% accuracy.

If I try to train the same model on the same data now, it doesn’t learn. I might have forgotten something, but I can’t figure out what the problem is. The data also seems fine.

I made a GitHub repo that contains the data and the livebook.

The model is a binary classifier, and no matter what I change it always ends up at 0.5 accuracy while the loss keeps increasing throughout training.


Could somebody take a look at it and help me realize why the model doesn’t learn?

Sorry, this is a drive-by comment. I’m not able to review the particulars.

You say “The data also seems fine.” In my experience, that’s usually the problem, even though I didn’t spot it at first. I suggest two things to sanity-check that it’s not the data:

  1. Try to train another, simpler model on the same data. Can a different model find a signal in that data?
  2. Try to train your model on a contrived example. Can that model find a signal when it’s definitely present? (See the sketch at the end of this post.)

If you find that the answers to both those questions are “yes”, then truly something odd is happening. But in my own work, I find that at least one of those answers is usually “no”.
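
For the second check, a minimal contrived example could look something like the sketch below. It assumes the notebook’s deps (axon ~> 0.6, nx ~> 0.7, exla ~> 0.7) are already installed; the dataset, model, and batch size are made up purely for illustration. The label is simply “is the first feature positive?”, so any working training setup should reach high accuracy within a few epochs; if even this stays at 0.5, the setup rather than your data is the suspect.

# Contrived, trivially learnable dataset: label = 1 when the first feature is positive.
key = Nx.Random.key(42)
{x, _key} = Nx.Random.normal(key, 0.0, 1.0, shape: {1024, 4})
y = x |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.greater(0.0) |> Nx.as_type(:f32)

# Tiny binary classifier, just for the sanity check.
model =
  Axon.input("features", shape: {nil, 4})
  |> Axon.dense(8, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)

data = Stream.zip(Nx.to_batched(x, 32), Nx.to_batched(y, 32))

model
|> Axon.Loop.trainer(:binary_cross_entropy, :adam)
|> Axon.Loop.metric(:accuracy)
|> Axon.Loop.run(data, %{}, epochs: 5, compiler: EXLA)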

2 Likes

I replicated the issue. What seems weird to me (I’m not really an expert) is that I see a lot of NaN values in the tensors output by the training step.

Is there any chance that the training data/labels are in the wrong format, or that some normalisation step is missing?
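
In case it’s useful, this is roughly how the NaNs can be located (just a sketch; trained_params stands for whatever Axon.Loop.run/4 returned in the notebook):

# Count NaN entries per parameter tensor.
# `trained_params` is a placeholder for the result of Axon.Loop.run/4.
for {layer, params} <- trained_params, {name, tensor} <- params do
  nan_count = tensor |> Nx.is_nan() |> Nx.sum() |> Nx.to_number()
  IO.puts("#{layer}/#{name}: #{nan_count} NaN values")
end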

1 Like

Sean Moriarity wrote to me that I should try Axon 0.4.
With Axon 0.4.1 the model is learning; the loss goes down during training.

Here is the diff:

<   {:axon, "~> 0.6"},
<   {:nx, "~> 0.7"},
<   {:exla, "~> 0.7"},
---
>   {:axon, "~> 0.4.1"},
>   {:nx, "~> 0.4.2"},
>   {:exla, "~> 0.4.2"},

< optimizer = Polaris.Optimizers.adamw(learning_rate: 1.0e-3)
---
> optimizer = Axon.Optimizers.adamw(1.0e-3)

https://github.com/preciz/not_learning/blob/master/model_learning.livemd

But now I would be happiest if the newest Axon version also learned. Any ideas on that?

There seems to be something wrong with binary_cross_entropy. Try categorical_cross_entropy for your loss function; it should be equivalent for two outputs:

# Categorical cross-entropy as an arity-2 loss function (y_true, y_pred)
# with mean reduction, usable directly with Axon.Loop.trainer.
loss =
  &Axon.Losses.categorical_cross_entropy(
    &1,
    &2,
    reduction: :mean
  )
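
For reference, a rough sketch of how that loss could be plugged into the training loop. The layer sizes, input shape, and train_data below are placeholders rather than the notebook’s actual values, and with two softmax outputs the labels also need to be one-hot encoded to shape {batch, 2}:

# Placeholder model with a two-output softmax head; adjust shapes to your data.
model =
  Axon.input("features", shape: {nil, 4})
  |> Axon.dense(8, activation: :relu)
  |> Axon.dense(2, activation: :softmax)

# `train_data` is a placeholder for your stream of {features, one_hot_labels} batches.
model
|> Axon.Loop.trainer(loss, Polaris.Optimizers.adamw(learning_rate: 1.0e-3))
|> Axon.Loop.metric(:accuracy)
|> Axon.Loop.run(train_data, %{}, epochs: 10, compiler: EXLA)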

If the change works on your end, could you report the issue on the Axon GitHub?

3 Likes

Awesome, with categorical_cross_entropy and a softmax output activation the loss is going down and accuracy is >0.99.
Thank you, I will open an issue on the Axon repo then.
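
For anyone else reading along: one possible way to turn a {batch, 1} tensor of 0/1 labels into the {batch, 2} one-hot targets that categorical_cross_entropy expects with two outputs (a sketch; the notebook may do this differently):

# Broadcasted comparison against [0, 1] produces one-hot rows.
labels = Nx.tensor([[0], [1], [1], [0]])
one_hot = Nx.equal(labels, Nx.iota({1, 2}))
# => a {4, 2} u8 tensor: [[1, 0], [0, 1], [0, 1], [1, 0]]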

1 Like