With a wide range of libraries focused on the machine learning market, such as TensorFlow, NumPy, Pandas, Keras, and others, Python has made a name for itself as one of the main programming languages. In February 2021, José Valim and Sean Moriarity published the first version of the Numerical Elixir (Nx) library, a library for tensor operations written in Elixir. Nx aims to allow the language to be a good choice for GPU-intensive operations. This work aims to compare the results of Python and Elixir on training convolutional neural networks (CNN) using the MNIST and CIFAR-10 datasets, concluding that Python achieved overall better results, and that Elixir is already a viable alternative.
Why would Python achieve “overall better results”? What does that mean? Is the Elixir code they used even idiomatic or current?
I read the paper. They said that the Python app had better CPU usage but the Elixir app had better memory usage. The paper's summary states that the Python app was better overall.
It is also worth noting that the paper uses Nx v0.2, and a lot has changed since then, given that it is relatively new technology. In particular, the new Axon version has many improvements to training, so I would be eager to see more recent results and see if those improvements are proven on paper!
Looking at the graphs, it looks like the time difference can be accounted for by 1) late startup of Elixir GPU usage and 2) mysterious gaps in the GPU usage. A SWAG (“scientific wild-ass guess”) here: the late startup likely scales with training set size and not with training epochs, while the gaps scale in number with epoch count. So for a more useful machine learning training problem, the Elixir version is likely to end up somewhere between 15-25% slower (MNIST and CIFAR very nearly represent an upper bound on the pessimization).
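To make the scaling guess above concrete, here is a toy cost model in Python. Everything in it is an assumption for illustration only: the function names, the idea that startup cost is linear in dataset size, the fixed-length stall per epoch, and every number. None of it comes from the paper's measurements.

```python
def training_time(n_samples, n_epochs, compute_per_sample,
                  startup_per_sample=0.0, gap_per_epoch=0.0):
    """Toy wall-clock model (seconds); all parameters are hypothetical.

    startup_per_sample: one-time cost paid before the GPU gets busy,
                        assumed to grow with dataset size only.
    gap_per_epoch:      idle GPU time assumed to recur once per epoch.
    """
    startup = startup_per_sample * n_samples
    compute = compute_per_sample * n_samples * n_epochs
    gaps = gap_per_epoch * n_epochs
    return startup + compute + gaps


def relative_slowdown(n_samples, n_epochs, compute_per_sample,
                      startup_per_sample, gap_per_epoch):
    """Extra time relative to a run with no startup delay and no gaps."""
    baseline = training_time(n_samples, n_epochs, compute_per_sample)
    observed = training_time(n_samples, n_epochs, compute_per_sample,
                             startup_per_sample, gap_per_epoch)
    return observed / baseline - 1.0


# Made-up numbers: 60,000 samples, 0.1 ms of compute per sample,
# 3 s of startup in total, and a 1 s stall in every epoch.
few_epochs = relative_slowdown(60_000, 10, 1e-4, 5e-5, 1.0)
many_epochs = relative_slowdown(60_000, 100, 1e-4, 5e-5, 1.0)
print(f"10 epochs:  {few_epochs:.1%} slower")
print(f"100 epochs: {many_epochs:.1%} slower")
```

In this model the relative overhead works out to `startup_per_sample / (compute_per_sample * n_epochs) + gap_per_epoch / (compute_per_sample * n_samples)`. The startup share shrinks as the epoch count grows, while the gap share only shrinks with a larger dataset if each gap has a fixed length, which is itself a guess.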
Maybe I missed something, but it doesn’t appear that the paper tries to explain what is happening in those gaps… My gut feeling is that there’s some GPU data shuffling back and forth with the CPU that is blocking progress, and that it could probably be run concurrently towards the end of the first chunk. I don’t know if the Python libs proactively figure that out and schedule those data transfers concurrently in advance; it would be interesting to find out.
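For anyone unfamiliar with the kind of overlap being guessed at here, below is a minimal sketch of prefetching in plain Python: a background thread prepares the next batches while the consumer works on the current one. This is not a claim about what the Python libraries or Nx actually do. `prefetch`, `fake_transfer`, and the `depth` parameter are hypothetical names, and `fake_transfer` merely stands in for a host-to-device copy.

```python
import queue
import threading


def prefetch(batches, load_fn, depth=2):
    """Yield load_fn(batch) for each batch, preparing up to `depth`
    batches ahead of the consumer in a background thread."""
    q = queue.Queue(maxsize=depth)

    def worker():
        try:
            for batch in batches:
                q.put(("ok", load_fn(batch)))
            q.put(("done", None))
        except Exception as exc:  # hand the error to the consumer
            q.put(("err", exc))

    # Daemon thread: if the consumer stops early, the worker may stay
    # blocked on q.put(), which is acceptable for a sketch.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = q.get()
        if kind == "done":
            return
        if kind == "err":
            raise payload
        yield payload


def fake_transfer(batch):
    # Stand-in for a host-to-device copy (assumption, no real GPU here).
    return [x * 2 for x in batch]


# The "training step" here is just a sum over the transferred batch.
results = [sum(b) for b in prefetch([[1, 2], [3, 4], [5, 6]], fake_transfer)]
print(results)
```

With a bounded queue, the loader can run at most `depth` batches ahead, so memory use stays capped while the transfer for batch N+1 proceeds during the compute for batch N.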