I trained a model last year that we still use in prod and I committed the code.
The model has >99% accuracy.
If I try to train the same model on the same data now, it doesn’t learn. I may have forgotten something, but I can’t figure out what the problem is. The data also seems fine.
I made a GitHub repo that contains the data and the livebook.
The model is a binary classifier, and no matter what I change, it always ends up at 0.5 accuracy while the loss increases continuously during training.
Could somebody take a look at it and help me realize why the model doesn’t learn?
Sorry, this is a drive-by comment. I’m not able to review the particulars.
You say “The data also seems fine.” In my experience, that’s usually the problem, even when I didn’t spot it at first. I suggest two sanity checks to rule out the data:
1. Try to train another, simpler model on the same data. Can a different model find a signal in that data?
2. Try to train your model on a contrived example. Can your model find a signal when one is definitely present?
If the answer to both of those questions is “yes”, then something truly odd is happening. But in my own work, I find that at least one of those answers is usually “no”.
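For the second check, something like the following sketch is enough. The shapes, layer sizes, and input name are assumptions for illustration, not taken from your repo:

```elixir
# Contrived, trivially learnable data: label is 1 when the first feature > 0.5.
key = Nx.Random.key(42)
{x, _key} = Nx.Random.uniform(key, shape: {1024, 4})
y = x |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.greater(0.5) |> Nx.as_type(:f32)

# A small stand-in binary classifier (not your actual architecture).
model =
  Axon.input("data", shape: {nil, 4})
  |> Axon.dense(8, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)

batches = Stream.repeatedly(fn -> {x, y} end) |> Stream.take(100)

model
|> Axon.Loop.trainer(:binary_cross_entropy, :adam)
|> Axon.Loop.metric(:accuracy)
|> Axon.Loop.run(batches, %{}, epochs: 5)
```

If accuracy stays near 0.5 even on this data, the problem is in the model or training code rather than in your dataset.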
I replicated the issue. What seems weird (to me, not really an expert) is that I observe many NaN values in the tensors in the output of the training step.
Is there any chance that the training data/labels are in the wrong format, or that some missing normalisation step is required?
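A quick way to test the normalisation hypothesis is to standardise each feature column before training. Here `x` stands in for your training tensor (an assumption on my part):

```elixir
# Standardise features column-wise: zero mean, unit variance.
mean = Nx.mean(x, axes: [0])
std = Nx.standard_deviation(x, axes: [0])

# Small epsilon so constant columns don't divide by zero (a common NaN source).
x_normalised = Nx.divide(Nx.subtract(x, mean), Nx.add(std, 1.0e-7))
```

If the NaNs disappear after this, unscaled features were likely blowing up the loss.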
There seems to be something wrong with binary_cross_entropy. Try categorical_cross_entropy for your loss function; it should be equivalent for two outputs:
```elixir
loss =
  &Axon.Losses.categorical_cross_entropy(
    &1,
    &2,
    reduction: :mean
  )
```
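Note that categorical_cross_entropy expects one prediction per class, so the model head needs two softmax units instead of a single sigmoid unit, and the labels need to be one-hot encoded. A sketch, with the layer sizes and `y` assumed:

```elixir
# Two-unit softmax head in place of a one-unit sigmoid head.
model =
  Axon.input("data", shape: {nil, 4})
  |> Axon.dense(8, activation: :relu)
  |> Axon.dense(2, activation: :softmax)

# One-hot encode 0/1 labels of shape {n, 1} into shape {n, 2}
# by comparing against [0, 1] with broadcasting.
y_onehot = y |> Nx.equal(Nx.iota({1, 2})) |> Nx.as_type(:f32)
```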
Awesome! With categorical_cross_entropy and a softmax output activation, the loss goes down and accuracy is >0.99.
Thank you, I will open an issue on the Axon repo then.