Let's build GPT from scratch w/ Nx and Axon


Like many people on Earth, I’ve been trying to understand the transformer architecture. I translated the Python code from the video “Let's build GPT: from scratch, in code, spelled out.” into Elixir and packaged it in this livebook.

Some first impressions of Elixir’s ML libraries:

  • Livebook is very nice
  • Using a functional style to implement models made more sense to my brain

Performance could be better (for example, by using a cache), but it should all work.

I hope this is helpful for anyone who’s trying to understand the transformer architecture.



This is incredible. Thank you so much.


Thanks a lot! Glad it could help :smiley:

Hey @theodore
I just checked out the livebook, that’s a lot of work, impressive! Thanks for sharing :heart:

I’ll definitely watch the video together with your livebook. Maybe this time I’ll finally grasp the transformer architecture and not stop at the intricate diagram.

Thanks again, cheers!


Thanks Nick :smiley: Yeah, it was quite a grind to figure it out. I actually used some of your notebooks to learn Nx + Axon, so thank you too!


Hey @theodore :wave:
Last week I started watching Karpathy’s video and implementing the GPT step by step. Your livebook has been a great help so far; without it I’d have given up at the multinomial distribution implementation.

Here is my WIP livebook. I’m still at the training of the bigram model.

There are some small differences compared to yours, in particular in the forward and loss function implementations. Training with my implementation seems a bit faster, for what it’s worth :man_shrugging:, but the generated text is basically the same (love the reproducibility :muscle:).

Also, how did you come up with this part of the code?

# Cap the input sequence length from [t, block size]
context_length = min(t, block_size)
context_range = -context_length..-1
context_slice = acc[[.., context_range]]

I think it makes total sense, since the prediction focuses on the last block_size chars at most, but I don’t recall Karpathy mentioning it when implementing the generate function. In fact, it is not present in his version, and the generated text is the same when passing the whole acc (feels like magic :sparkles:), but it’s slower.
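For anyone following along, here is a minimal sketch of what that slice does. It is not the livebook's code: a plain list stands in for one row of the `acc` tensor, and `ContextSketch.cap/2` is a hypothetical helper name. `Enum.take/2` with a negative count plays the role of the negative range `-context_length..-1`.

```elixir
defmodule ContextSketch do
  # Keep at most the last `block_size` tokens of the sequence generated so far.
  # Assumption: a plain list stands in for one row of the `acc` tensor.
  def cap(acc, block_size) when is_list(acc) and block_size > 0 do
    t = length(acc)
    context_length = min(t, block_size)
    # A negative count takes elements from the end of the list,
    # mirroring the range -context_length..-1 in the Nx slice.
    Enum.take(acc, -context_length)
  end
end

# Shorter than block_size: the sequence is passed through unchanged.
IO.inspect(ContextSketch.cap([5, 9], 4))
# Longer than block_size: only the last 4 tokens remain.
IO.inspect(ContextSketch.cap([5, 9, 2, 7, 1, 3], 4))
```

While the sequence is still shorter than block_size nothing is dropped, so the cap only changes what the model sees once generation runs past block_size tokens.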

Let’s see if I manage to proceed and finish it, but looking at your livebook, I still have a long way to go :railway_track:

Again, thank you for putting together this great piece of work :man_bowing:


Haha that’s awesome to hear! I appreciate it.

That’s a good question. The short answer is that I added the “fully finished” generate_fn in the beginning so I could reuse it for the singlehead / multihead / fully-finished models :smile:

But you’re right, the bigram model doesn’t need to truncate the input because the predicted next char depends only on the last char, not on the earlier context. Every char already has a mapping to its likely next char in the [65][65] embedding kernel:

  • t is likely to produce h
  • h is likely to produce e
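A toy sketch of that idea, under loud assumptions: the table is 3 by 3 instead of 65 by 65, the scores are made up, and the next char is picked greedily rather than sampled. `BigramSketch` is a hypothetical name, not something from the livebook.

```elixir
defmodule BigramSketch do
  # Made-up scores for illustration only: current char => scores for the next char.
  @table %{
    "t" => %{"h" => 0.7, "e" => 0.2, "t" => 0.1},
    "h" => %{"e" => 0.8, "t" => 0.1, "h" => 0.1},
    "e" => %{"t" => 0.5, "h" => 0.3, "e" => 0.2}
  }

  # Only the last character of the context is consulted.
  def next_char(context) when is_binary(context) and context != "" do
    last = String.last(context)

    @table
    |> Map.fetch!(last)
    |> Enum.max_by(fn {_char, score} -> score end)
    |> elem(0)
  end
end

# A long context and a one-char context ending in the same char
# give the same prediction.
IO.inspect(BigramSketch.next_char("ethet"))
IO.inspect(BigramSketch.next_char("t"))
```

Because only the last char is looked up, truncating the context or not makes no difference to the output here, which matches the identical generated text reported for the bigram model.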

But if you try out the non-truncating generate_fn on the single-head model, you’ll see this gibberish:

Babthat tebrsb udinmipfsm enryarlrent cweetetpenycro oorctsstsnll miiesor in vutt? n

Crcewtoemi beoercstosl;dercta 'incasesptb wnf n f ntr',,kntn,iaessdisuyt asbitsfdcolultoitem i'osamner-mesiidr krestkuof'osdst. soutrjo la citeeemiktinavbeimtidelgkdtom ir, et, n rksyet.eyr? wstnserrrssrdewpnititoe'sirbrst i  ' m lam Tseht. mt diemrt qsspiswtatroira; rfwtocycct psoubefemt mtne' w a sfrrn. setetea acaoslt selfta.

aytdestinteet,eehiyploeyiticosa snun' g

 Itit nisvin eap rohuwprcoa rtto pn rhteteaepevit sho ta wdkilsrtsito ha quvb; lumorldwxetoteri'

 nwsfetenaseaxesllelyeeo! lo; retiameirlrsitto rsnrcihueaaeddrdda , Tt-epinmemeb gdusifkto: tlrremnert J nr ltsb rssaorisscoa .,!enedsasdmwa R I ha rrrf nwt. duiheissyrthtprereeuid. tl yt ea vunikiicsisshmjarrpreoretrei t ira,yeyyera enoaastewrdelmocylidtrnmsU

 verpseratseii? Lit atoedswottea srm' asfoyyila? pviyurnyeyigcyemt wepimirt tfwmaymo stsfolsdulfa I wsunak' vfepaa spktsytteekhopmanb' Lepswa O aegcino;elrmogymua had lei

vs. the original:

Babther te st bud hean! fad ry sis lr,
an withanu sr mou toous miche erer fo wous in
Lou byoun, I mine sthon omf th Cor Il, oum win m.

Fro hepith sanll hitro Il ous ecist wirs hal st os nchat har hinlost, art st fpivis, be Pans che-Cre, yan!
Th, headise foacer ch orsu thive ivisen prourilay, ksese.


Good luck with learning the transformer architecture. I think implementing it in Elixir helped cement the math behind it, so even though it took a long time, I found it worthwhile. Anyway, I hope that helped!


Hi, I am new to ML and AI (the practical side of it). I have gone through the theory, though.