Looking for interesting topics to write my thesis on, maybe distributed training with Elixir Axon?

Hi everyone,

I just saw the talk about Axon by Sean Moriarity from ElixirConf 2022 (YouTube video). In the Current Limitations section (around 28:30), the first bullet point is "Distributed Learning (planned)". How far along are you in the planning process? Are there any concrete implementations ongoing?

I am in the last year of my computer science master's degree and am looking for interesting topics to write my thesis about. I was already experimenting with the idea before the video popped up in my feed. This would also be a topic relevant for my employer, so it's a win-win situation for me :slight_smile:

Let me know if you are interested.

FYI: @seanmor5 @josevalim

4 Likes

Distributed learning may refer both to distribution across multiple GPUs and to distribution across machines. In either case, I don't think there is a lot of work ahead of us, or, if I had to guess, not enough for a thesis.

Another area of exploration is additional compilers for Nx. We can target TVM, Google's IREE, and others. There are other ideas, but my wishlist is at home. :slight_smile:

4 Likes

[…] not enough for a thesis

Interesting, I figured there are quite a lot of software engineering problems related to the concept. Dataset access over multiple nodes and synchronization of the model training came to mind. However, I am not sure how many of these problems are already solved by Elixir/Erlang libraries.
For the scientific aspect of the thesis, we have more than enough strategies and optimizations to evaluate connected to either of those problems.

I was thinking of something like SETI@home, where multiple low-end devices contribute to a large training task.

1 Like

Hrm, you are right.

What I meant is that the mechanical bits (i.e. having data allocated on different servers or different GPUs) are probably not enough for a thesis (hopefully). But what you can build with it, and all of the different ways you can slice algorithms, could probably fill several theses. Although the second part requires the first one. When would your thesis start?

1 Like

I would like to be done by mid-September (handed in, graded). The start is flexible, but I would like to start soon.

My first experiment was running multiple parallel trainings of the same model that synchronize their model state, so all trainers continuously work from the best model so far. That way I can ignore low-level problems for now.
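
A minimal sketch of the coordinator I have in mind (the report/pull protocol and all names are just illustrative, not a finished design):

```elixir
defmodule BestModel do
  use GenServer

  def start_link(_opts) do
    # :infinity (an atom) compares greater than any number in Erlang
    # term ordering, so it works as the initial "worst" loss
    GenServer.start_link(__MODULE__, %{loss: :infinity, params: nil}, name: __MODULE__)
  end

  # called by each trainer after an epoch
  def report(params, loss), do: GenServer.cast(__MODULE__, {:report, params, loss})

  # called by each trainer before starting its next epoch
  def best, do: GenServer.call(__MODULE__, :best)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:report, params, loss}, %{loss: best_loss} = state) do
    # keep only the best model state seen so far
    if loss < best_loss do
      {:noreply, %{loss: loss, params: params}}
    else
      {:noreply, state}
    end
  end

  @impl true
  def handle_call(:best, _from, state), do: {:reply, {state.params, state.loss}, state}
end
```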

2 Likes

The AI professor I was talking to is interested in the suggested topic, but cautioned me to make sure that there is enough scientific value in the necessary experiments and implementations. We agree that useful data is difficult to generate and requires a lot of work before the first results appear.
I will continue to think about the topic though.

@josevalim Is your wishlist publicly available?

2 Likes

Here are some ideas from the wishlist:

  • Scholar
    • You can implement some complex algorithms from scikit-learn using Nx. This is novel because not all algorithms have been written in a format that benefits from a tensor compiler and, if you succeed, you should see real performance benefits (plus GPU compilation). I know a student who ported Affinity Propagation to tensors, for example, as part of their master's thesis (in Europe). There is a rough sketch of this idea right after the list.
  • Nx
    • TVM compiler backend
    • Google IREE backend
  • Something with reinforcement learning? Double cool if you do something with ALE (the Arcade Learning Environment)

This list used to be so big… :slight_smile:
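
To make the Scholar bullet concrete, here is a rough, untested sketch of what "scikit-learn algorithm as tensor ops" can look like: a single k-means assignment/update step written with defn, so a tensor compiler such as EXLA can fuse it and run it on the GPU (k-means and all names here are just my illustration):

```elixir
defmodule KMeans do
  import Nx.Defn

  # One assignment + update step of k-means written purely with tensor
  # ops. points: {n, d}, centroids: {k, d} -> updated centroids {k, d}
  defn step(points, centroids) do
    # pairwise squared distances between points and centroids: {n, k}
    diffs = Nx.new_axis(points, 1) - Nx.new_axis(centroids, 0)
    dists = Nx.sum(diffs * diffs, axes: [2])

    # each point is assigned to its nearest centroid: {n}
    assignment = Nx.argmin(dists, axis: 1)

    # one-hot membership masks {n, k}, then centroids as masked means
    masks = Nx.equal(Nx.new_axis(assignment, 1), Nx.iota(Nx.shape(dists), axis: 1))
    counts = masks |> Nx.sum(axes: [0]) |> Nx.max(1)
    Nx.dot(Nx.transpose(masks), points) / Nx.new_axis(counts, 1)
  end
end
```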

6 Likes

I don't know about Nx, but there's a great book called Neuroevolution Through Erlang. It describes an interesting, though extremely inefficient, approach to evolutionary algorithms in Erlang. And I think rewriting this book in Nx/Axon would be a great success.

3 Likes

If you wrote an Nx backend that executes ML tasks over Erlang distribution, I think that would be enough for a master's thesis. You could pilot it in a crude way by just using the cluster for data transfer. Then maybe a second iteration could use a higher-efficiency transport technique (UDP + flow? SCTP?) and only use the cluster as a control plane. Then profile the two for performance.

I think this would be relatively easy in Elixir, just about the right size for a master's, and brutally hard in, say, Python.
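
A crude sketch of that first iteration, using the cluster itself as the data plane (a real backend would implement the Nx.Backend behaviour; all names here are illustrative):

```elixir
defmodule RemoteNx do
  # Run an Nx function on `node` with the given tensor arguments.
  def run_on(node, fun, args) do
    # Move tensors to the binary backend so they serialize over
    # Erlang distribution like any other term.
    args = Enum.map(args, &Nx.backend_transfer(&1, Nx.BinaryBackend))
    :erpc.call(node, Kernel, :apply, [fun, args])
  end
end

# Usage, assuming a connected node :"worker@host" with Nx loaded:
#
#     t = Nx.iota({1000, 1000})
#     RemoteNx.run_on(:"worker@host", &Nx.sum/1, [t])
```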

1 Like

Regarding new Nx backends, I actually started writing an Nx backend for ArrayFire. I eventually abandoned the project after I saw that Torchx was barely used compared to EXLA, and that the ArrayFire codebase felt lacking: broadcasting support was missing, float16 support was basically broken in the Rust bindings, PRs I made to fix this have still not been merged, and more.

Honestly, I am not quite sure if we need a new backend… Everybody seems quite happy with using EXLA. Even Torchx seems quite neglected, let alone a backend that uses a far more obscure project.

However, I did enjoy writing the backend (except when I was wrestling with Rust), and I would love to work on another one. I actually wanted to write a backend for TVM, but I found out that it is aimed at optimizing pretrained models and can't be used the way we use EXLA. IREE seems quite interesting, but the devs mention that it's still beta.

If you're interested, we can write a compiler for TVM, or a compiler/backend for IREE, together.

1 Like

Synchronization of model state is the problem with distributed training. There is, however, an unsynchronized-state technique that I think would be very cool to implement in Elixir: [1608.05343] Decoupled Neural Interfaces using Synthetic Gradients

I think people don't use it because it's so hard to rewrite your entire model to take advantage of it. I get the feeling that in Elixir you could easily rewrite the model to do this (or design a graph transform that does this, so it's plug and play).
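
To give a feel for how little machinery the core idea needs in Nx, here is a toy, untested sketch of a synthetic-gradient module (a simple linear synthesizer, which the paper notes can already work; all names are my own, not the paper's exact setup):

```elixir
defmodule DNI do
  import Nx.Defn

  # The synthetic-gradient module: predicts the loss gradient w.r.t. a
  # layer's activations h ({n, d}) from h alone, so layers below can
  # update without waiting for the real backward pass.
  defn predict_grad(h, w), do: Nx.dot(h, w)

  # When the true gradient eventually arrives, the module itself is
  # trained to match it: plain gradient descent on an L2 regression.
  defn update(w, h, true_grad, lr) do
    g =
      grad(w, fn w ->
        d = predict_grad(h, w) - true_grad
        Nx.mean(d * d)
      end)

    w - lr * g
  end
end
```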

2 Likes

It is awesome that you have been exploring those routes @Benjamin-Philip!

I have been under the impression that the Relay language in Apache TVM could be used like EXLA but, if that's not the case, I should definitely remove it from the list.

I have also been looking into IREE, and it seems it is a bit lower level and takes XLA instructions (or similar) as inputs. So it would most likely be an option on top of EXLA, and there would be a need to write code that interfaces with IREE buffers (that API is small, generally speaking).

Finally, regarding Torchx, we improve it when we can, but I didn't list it because I don't think the work would be suitable for a thesis. One option could even be to translate Nx.Defn.Expr into TorchScript, so we could compile those. PyTorch 2.0 also seems to be heading towards an approach closer to XLA's, where there will be a number of low-level primitives that everyone builds on top of, and those could be explored too. The only concern is that the PyTorch 2.0 compiler, TorchInductor, is written in Python, but the OpenMP and Triton backend bits could likely be reused.

1 Like

I’m not entirely sure that Relay is different from EXLA’s instructions. We need to investigate this in more detail.

1 Like

For more Nx backend ideas, you can have a look at this list: Backend for Apache Arrow - #2 by seanmor5

1 Like

It seems to apply to RNNs specifically though. Are you familiar with more general approaches?

In distributed training, can’t each node work on different batches and then propagate the results of those different batches? Or is that also expensive?

There is also a fairly recent paper on using forward mode for training: [PDF] Gradients without Backpropagation | Semantic Scholar
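
For anyone curious, the trick in that paper is to sample a random direction v and use the directional derivative of the loss along v to build an unbiased gradient estimate, with no backward pass. A rough sketch in Nx terms follows; note the paper uses exact forward-mode AD, and since I am not aware of a JVP primitive in Nx, this sketch substitutes a central finite difference (all names are mine):

```elixir
defmodule ForwardGrad do
  # Forward-gradient estimate: g = f'(params; v) * v, with v ~ N(0, I).
  def estimate(f, params, key, eps \\ 1.0e-4) do
    {v, key} = Nx.Random.normal(key, shape: Nx.shape(params))

    # directional derivative f'(p; v) ~ (f(p + eps*v) - f(p - eps*v)) / (2*eps)
    up = f.(Nx.add(params, Nx.multiply(v, eps)))
    down = f.(Nx.subtract(params, Nx.multiply(v, eps)))
    dd = Nx.divide(Nx.subtract(up, down), 2 * eps)

    # scale the probe direction by the directional derivative
    {Nx.multiply(v, dd), key}
  end
end
```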

1 Like

Generally, if you want to submit this work to a conference/journal, you could have a look at other functional programming work published in this area.
Here are some directions:

  • Collective communication optimized with functional techniques
  • Low level code gen with functional primitives
  • Optimizing pipeline parallelism in training
  • Very dynamic scenarios like reinforcement learning/elastic training with fault tolerance in Erlang/Elixir

1 Like

No, they also show it for breaking up a generalized feed-forward network. The RNN part is because, at the time of the paper, the biggest models were RNNs: each recurrent unrolling was a reasonable size to fit on a GPU, and it made for a "natural place" to break the model up, probably easier on the Python transformation to go from backprop to DNI.

In distributed training, can’t each node work on different batches and then propagate the results of those different batches

I don't recall the reference, but the short story is that, due to the maths of SGD, this doesn't really work. I think the dimensional search often has small gradients that are highly coupled to the other discovered gradients, so if node A picks one direction for a certain set of weights and node B picks another, then the direction of the remaining weights can be completely scrambled. So when you train asynchronously and sum the gradients without the nodes talking to each other, you can converge even more slowly or not at all.

In order to have data-parallel training, you must either use a parameter server, which incurs huge coordination and data-transfer costs, or do ring updates, which have an O(n^2) bandwidth cost.
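
For concreteness, a fully synchronous data-parallel step looks roughly like this; the averaging is exactly the synchronization the nodes cannot skip. grad_fn, the node/batch list, and a single flattened parameter tensor are all simplifications for illustration:

```elixir
defmodule DataParallel do
  # One synchronous data-parallel SGD step: each node computes
  # gradients on its own batch, then all gradients are averaged
  # before anyone applies an update.
  def sync_step(params, batches_by_node, grad_fn, lr) do
    grads =
      batches_by_node
      |> Enum.map(fn {node, batch} ->
        Task.async(fn ->
          # grad_fn must be an externally-captured fun (&Mod.fun/2)
          # available on the remote node
          :erpc.call(node, Kernel, :apply, [grad_fn, [params, batch]])
        end)
      end)
      |> Task.await_many(:infinity)

    # average the per-node gradients, then take one shared step
    avg = grads |> Enum.reduce(&Nx.add/2) |> Nx.divide(length(grads))
    Nx.subtract(params, Nx.multiply(avg, lr))
  end
end
```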

There is a paper out there where they got around it by figuring out a sharding scheme; I think it was ca. 5 years ago. I can look for it… I think Yannic Kilcher did a video on it.

Like the DNI technique, I believe this isn't widely used because it's a pain in Python. I think it is really possible for Elixir to get a huge edge in ML if we can make the application of these two techniques, and others, composable and easy. You will have a hard time getting ML scientists to admit this, but I do believe that the current bestiary of techniques is highly biased by "how easy is it to trick Python into doing it" (like, we use dropout regularization because it's easy).

4 Likes

Thanks for clarifying! And what about relying on forward mode gradients only? Any thoughts on that?

1 Like

It's pretty new and I don't know enough about it to say one way or another, though I know Hinton has been working on it for a very long time, since he has been complaining that the brain probably doesn't do backprop. (Stanford Seminar - Can the brain do back-propagation? - YouTube, ~6 years ago)

Also, apologies, I can't find the sharding article. :sob: Hopefully it's not lost to time.

1 Like

Have a look at this discussion of distributed training: https://training-transformers-together.github.io/

Deep Learning over the Internet: Training Language Models Collaboratively, https://arxiv.org/pdf/2106.10207.pdf

1 Like