Comparing neural network training performance between Elixir and Python

joegiralt · February 3, 2023, 10:08pm

With a wide range of libraries focused on the machine learning market, such as TensorFlow, NumPy, Pandas, Keras, and others, Python has made a name for itself as one of the main programming languages. In February 2021, José Valim and Sean Moriarity published the first version of the Numerical Elixir (Nx) library, a library for tensor operations written in Elixir. Nx aims to allow the language to be a good choice for GPU-intensive operations. This work aims to compare the results of Python and Elixir on training convolutional neural networks (CNNs) using the MNIST and CIFAR-10 datasets, concluding that Python achieved overall better results, and that Elixir is already a viable alternative.

Why would Python achieve “overall better results”? What does that mean? Is the Elixir code they used even idiomatic or current?

ChrisYammine · February 3, 2023, 11:00pm

I’m skimming the paper, and the model code they cite comes from the examples directory in Axon.

Their Python models are cited as coming from here:
[10] Available at: GitHub - sallaumen/python_neural_network_labs: Python implementation of MNIST and CIFER10 neural networks

I encourage you to read the paper to answer the questions of why and what; it’s very short.

joegiralt · February 3, 2023, 11:05pm

I read the paper. They said that the Python app had better CPU usage but the Elixir app had better memory usage. The paper summary states that the Python app was better overall.

josevalim · February 3, 2023, 11:21pm

It is also worth noting that the paper uses Nx v0.2, and a lot has changed since then, given it is a relatively new technology. In particular, the new Axon version has many improvements on training, so I would be eager to see more recent results and see if those improvements are proven on paper!

ChrisYammine · February 3, 2023, 11:43pm

OK, apologies for assuming you didn’t read anything beyond the summary, but the question of what code was used was answered in the citations.

ChrisYammine · February 3, 2023, 11:43pm

I didn’t notice that! Agreed, I’d be curious to see the updated results.

ityonemo · February 4, 2023, 6:35pm

Looking at the graphs, it looks like the time difference can be accounted for by 1) late startup of Elixir GPU usage and 2) mysterious gaps in the GPU usage. My SWAG (“scientific wild-ass guess”) here is that the late startup likely scales with training set size and not with training epochs, but those gaps scale in number with epoch count. So for a more useful machine learning training problem, it’s likely to scale to somewhere between 15-25% slower (MNIST and CIFAR-10 very nearly represent an upper bound on the pessimization).
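
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a one-off startup delay (fixed for a given dataset) plus a constant gap added to every epoch. The function name and every number below are hypothetical, chosen only to illustrate the trend; none of them come from the paper.

```python
def relative_slowdown(epochs, compute_per_epoch, startup_delay, gap_per_epoch):
    """Fractional slowdown versus an ideal run with no startup delay and no gaps.

    All arguments are in seconds and are illustrative assumptions,
    not measurements from the paper.
    """
    baseline = epochs * compute_per_epoch
    observed = startup_delay + epochs * (compute_per_epoch + gap_per_epoch)
    return (observed - baseline) / baseline


# Hypothetical values: 10 s of useful GPU work per epoch,
# a 20 s one-off startup delay, and a 2 s gap in every epoch.
for epochs in (5, 50, 500):
    slowdown = relative_slowdown(
        epochs, compute_per_epoch=10.0, startup_delay=20.0, gap_per_epoch=2.0
    )
    print(f"{epochs:>3} epochs: {slowdown:.1%} slower")
```

Under these assumptions the startup delay is amortized as the epoch count grows, so the slowdown falls toward the ratio `gap_per_epoch / compute_per_epoch`, while short runs on small datasets show the largest relative penalty.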

Maybe I missed something, but it doesn’t appear that the paper tries to explain what is happening in those gaps… My gut feeling is that there’s some GPU data shuffling back and forth with the CPU that is blocking progress and probably could be run concurrently towards the end of the first chunk. I don’t know if the Python libs proactively figure that out and schedule those data transfers in advance concurrently; it would be interesting to find out.
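
For illustration, the general idea of overlapping data loading with compute can be sketched in plain Python with a background thread and a bounded queue. This is only a generic prefetching pattern under my own assumptions; `prefetch`, `load_batches`, and `train_step` are hypothetical names, and nothing here reflects how Axon, EXLA, or any Python framework actually schedules transfers.

```python
import queue
import threading
import time


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        try:
            for batch in batches:
                q.put(batch)  # blocks while the buffer is full
        finally:
            q.put(sentinel)  # always signal the end, even if loading fails

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def load_batches(n, delay=0.01):
    """Hypothetical loader: each batch takes `delay` seconds to prepare."""
    for i in range(n):
        time.sleep(delay)  # stands in for host-side work or a transfer
        yield i


def train_step(batch, delay=0.01):
    """Hypothetical compute step that takes `delay` seconds."""
    time.sleep(delay)
    return batch * 2


def run(n, overlap):
    """Process n batches, optionally prefetching; return the step results."""
    source = load_batches(n)
    if overlap:
        source = prefetch(source)
    return [train_step(batch) for batch in source]
```

Calling `run(n, overlap=True)` lets the next batch load while the current step runs, whereas `run(n, overlap=False)` alternates strictly between loading and computing, which is the kind of pattern that would show up as gaps in a utilization graph.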

josevalim · February 4, 2023, 9:10pm

We fixed those in the next Axon. We did have some code that would take longer the more epochs you had, but it is all fixed now.

We would also recompile the network between epochs, but that is also fixed.

ityonemo · February 4, 2023, 9:21pm

Beautiful!