Collaborative filtering with Axon, need help explaining bad results

Hello all,

I’m (human) learning machine learning by following the Fast AI course, but instead of using Python, I’m using Elixir :blush:.
I’m kind of stuck on lecture 7 about collaborative filtering. I think my model is correct, but I can’t explain why its performance and predictions are so “bad” compared to the PyTorch model.

I described my problem in a livebook file available here: collab-filtering-issue.livemd · GitHub

If anyone is kind enough to help me out on this one, that would be awesome :blush:.


Hey @robinmonjo, are you able to share the Python model / code as well?


Yes, sure, it’s all in this public Kaggle Jupyter Notebook.

More precisely, the data loader code is here.

The model I tried to reproduce is here.

I have 2 ideas about where my model’s performance might differ:

  • Having 2 inputs might not be optimal, but I need 2 embeddings: one for users and one for movies. Maybe having one big embedding and adjusting my input indexes (offsetting the movie ids by the number of users and using a {nb_users + nb_movies, nb_factors} embedding matrix) would help (see the sketch after this list).
  • The other difference is that the PyTorch / Fast AI model is trained with fit_one_cycle, which does clever things with the learning rate.
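For reference, here is roughly what I have in mind for the single-embedding idea. It’s only a sketch and not tested: n_users, n_movies and n_factors are placeholder values, and the input is assumed to already contain the movie id offset by n_users.

```elixir
n_users = 944       # placeholder sizes, adjust to the dataset
n_movies = 1665
n_factors = 50

# One integer input of shape {batch, 2}: column 0 is the user id,
# column 1 is the movie id already offset by n_users.
model =
  Axon.input("user_movie", shape: {nil, 2})
  # single shared {n_users + n_movies, n_factors} embedding matrix
  |> Axon.embedding(n_users + n_movies, n_factors)
  |> Axon.nx(fn emb ->
    # emb has shape {batch, 2, n_factors}
    user_vec = emb |> Nx.slice_along_axis(0, 1, axis: 1) |> Nx.squeeze(axes: [1])
    movie_vec = emb |> Nx.slice_along_axis(1, 1, axis: 1) |> Nx.squeeze(axes: [1])

    # dot product per example -> {batch, 1}, to match the ratings
    Nx.sum(Nx.multiply(user_vec, movie_vec), axes: [-1], keep_axes: true)
  end)
```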

You might be having issues with the multiply-then-add implementation, which can yield unexpected results due to broadcasting, depending on the input shapes for multiply.

Try using Axon.nx with Nx.dot instead of those 2 layers.
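Something along these lines, for example (an untested sketch; I used Axon.layer to combine the two embeddings with a single Nx.dot, and the sizes are placeholders):

```elixir
n_users = 944       # placeholder sizes
n_movies = 1665
n_factors = 50

user_emb =
  Axon.input("user", shape: {nil, 1})
  |> Axon.embedding(n_users, n_factors)    # {batch, 1, n_factors}

movie_emb =
  Axon.input("movie", shape: {nil, 1})
  |> Axon.embedding(n_movies, n_factors)   # {batch, 1, n_factors}

# One custom layer doing a batched dot product, so there is no
# broadcasting surprise and the output shape is explicit.
model =
  Axon.layer(
    fn u, m, _opts ->
      # batched dot product: contract the factor axis (2), batch over axis 0
      dot = Nx.dot(u, [2], [0], m, [2], [0])   # {batch, 1, 1}

      # flatten to {batch, 1} so it matches the rating labels
      Nx.reshape(dot, {:auto, 1})
    end,
    [user_emb, movie_emb]
  )
```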

Also, double-check the shapes their model outputs vs. yours.


Thank you for your answers! I indeed had big shape issues, and my model was outputting a shape that was totally off compared to the labelled data…

I now managed to get good performance! However, I still have 2 things that “bother” me and that I’ll need to figure out:

  • I need to use a big learning rate! Something like 30!
  • I don’t understand why, but changing the batch size of the training set makes the learning rate behave differently.

That’s weird because I expect the gradient * learning rate to be applied to each row of my embeddings independently, so increasing the batch size should not mean adjusting the learning rate :man_shrugging:t2:.

My journey continues!


@robinmonjo Batch size and learning rate tend to be strongly correlated. With SGD you are computing a series of approximations of the true gradient of your loss function w.r.t. the model parameters, and the quality of each approximation depends on your batch size. A larger batch size means you can be more confident in the approximation, which means you can use a larger learning rate. Generally, larger batch size == larger learning rate.
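Written out (the standard mini-batch SGD update; the 1/|B| variance scaling is a rough statement that assumes the examples in a batch are sampled independently):

```latex
% Mini-batch gradient estimate over batch B, and the SGD step:
%   averaging |B| per-example gradients shrinks the estimate's variance
%   roughly like 1/|B|, which is why a larger batch tolerates a larger
%   learning rate \eta.
\[
\hat{g}_B = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta\, \ell(x_i, y_i; \theta),
\qquad
\theta \leftarrow \theta - \eta\, \hat{g}_B,
\qquad
\operatorname{Var}\!\left(\hat{g}_B\right) \propto \frac{1}{|B|}
\]
```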
