Let's build GPT from scratch w/ Nx and Axon

Hey @theodore :wave:
Last week I started watching Karpathy’s video and implementing GPT step by step along with it. Your livebook has been a great help so far; without it I’d have given up at the multinomial distribution implementation.
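For anyone else reading along, here’s a minimal sketch of what sampling from a multinomial distribution can look like in plain Nx, using inverse-CDF sampling; the names and values here are placeholders for illustration, not the actual code from either livebook:

```elixir
key = Nx.Random.key(1337)

# probs: one row of next-token probabilities per sequence in the batch
probs = Nx.tensor([[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]])

# One uniform draw per row
{u, _key} = Nx.Random.uniform(key, shape: {Nx.axis_size(probs, 0), 1})

# Inverse-CDF sampling: take the first index whose cumulative
# probability exceeds the uniform draw
cdf = Nx.cumulative_sum(probs, axis: 1)
next_token = Nx.argmax(Nx.greater(cdf, u), axis: 1)
```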

Here’s my WIP livebook; I’m still at training the Bigram Model.

There are some small differences compared to yours, in particular in the forward and loss function implementations. Training with my implementation seems a bit faster, for what it’s worth :man_shrugging:, but the generated text is basically the same (love the reproducibility :muscle:).
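To give an idea of the general shape I mean, here’s a rough sketch of a Bigram Model and its loss in Axon. It’s only an illustration under my assumptions (the `:adam` optimizer, the `vocab_size` value, and the `train_data` name are placeholders), not the exact code from either livebook:

```elixir
vocab_size = 65

# Bigram model: the embedding table is vocab_size x vocab_size, so each
# token's embedding row is directly its next-token logits
model =
  Axon.input("sequence")
  |> Axon.embedding(vocab_size, vocab_size)

# Flatten the {batch, time, vocab} logits and the {batch, time} targets so
# the cross-entropy sees one prediction per token
loss_fn = fn y_true, y_pred ->
  {b, t, c} = Nx.shape(y_pred)

  Axon.Losses.categorical_cross_entropy(
    Nx.reshape(y_true, {b * t}),
    Nx.reshape(y_pred, {b * t, c}),
    from_logits: true,
    sparse: true,
    reduction: :mean
  )
end

# train_data is assumed to be a stream of {input, target} batches,
# each of shape {batch, block_size}
trained_state =
  model
  |> Axon.Loop.trainer(loss_fn, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 1, compiler: EXLA)
```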

Also, how did you come up with this part of the code?

```elixir
# Cap the input sequence length from [t, block size]
context_length = min(t, block_size)
context_range = -context_length..-1
context_slice = acc[[.., context_range]]
```

I think it makes total sense, since the prediction focuses on at most the last block_size characters, but I don’t recall Karpathy mentioning that when implementing the generate function. In fact it’s not present in his version, and the generated text is the same when passing the whole acc - feels like magic :sparkles: - but it’s slower.
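For reference, here’s roughly how that slice fits into my generate loop. It’s a sketch with placeholder names (`predict_fn`, `params`) and a greedy argmax instead of multinomial sampling, just to keep it short:

```elixir
defmodule Generate do
  def generate(predict_fn, params, idx, num_new_tokens, block_size) do
    Enum.reduce(1..num_new_tokens, idx, fn _step, acc ->
      # acc has shape {batch, t} and grows by one column per iteration
      t = Nx.axis_size(acc, 1)

      # Cap the context at the last block_size tokens: the model never saw
      # longer sequences during training (and the bigram model only looks
      # at the last token anyway)
      context_length = min(t, block_size)
      context = acc[[.., -context_length..-1]]

      # Forward pass on the capped context, keep the logits of the last position
      logits = predict_fn.(params, context)
      last_logits = logits[[.., -1..-1, ..]] |> Nx.squeeze(axes: [1])

      # Greedy pick for brevity; the real loop samples from the distribution
      next_token =
        last_logits
        |> Nx.argmax(axis: 1)
        |> Nx.new_axis(-1)
        |> Nx.as_type(Nx.type(acc))

      # Append the new token and keep going
      Nx.concatenate([acc, next_token], axis: 1)
    end)
  end
end
```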

Let’s see if I manage to keep going and finish it, but looking at your livebook, I still have a long way to go :railway_track:

Again, thank you for putting together this great piece of work :man_bowing:
Best.
