Can Nx replace Rust for number crunching?

PJUllrich · November 17, 2023, 8:20am

One “downside” of Elixir being a functional programming language is that it’s slow when lots and lots of data has to be transformed. So, what people call “number crunching”. Since data structures are immutable in Elixir, it copies e.g. a huge array every time when we change one of its fields, which is not great if you change a huge array in e.g. a loop. People usually reach out to other languages like Rust through the Rustler library for such work instead. The advantage of Rust is that it can mutate data in-place. So, it doesn’t copy the entire array just to changes one field.

This performance issue was solved for machine learning and other data operational heavy use-cases with Nx and Axon. However, these libraries are meant to run on the GPU and are focused on “AI” use-cases.

Now, I wondered: Would it be possible to reuse them for “number crunching” on the CPU as well? So, could they replace Rust to do number crunching? And which other features of Rust make it waaay faster than Elixir other than the in-place mutations?

My dream would be to simple have to replace “def” with “defn” and it would magically make a function that iterates over a huge array 1000x faster

Edit: Today I learned, that Elixir doesn’t copy the entire map when you change one field! My question still stands though. I updated it with “array” instead “map” (I hope this isn’t optimised like maps are :D)

Aetherus · November 17, 2023, 9:40am

Erlang / Elixir does not copy the whole map when updating a field unless it’s very small (e.g. size <= 32).

LostKobrakai · November 17, 2023, 10:49am

By my naive understanding what makes Nx work fast is the usage of an external backend, which is specifically catered to handle tensor data. So “focused on AI” would really mean focused on tensor based computations and everything that can be built on top of that.

Mutable datastructures is likely an important piece to the speed of transformation, but by my limited knowledge it’s just as much physical layout of data in memory and how you model the data to be able to profit from low level things like e.g. SIMD and likely others I’m not aware of. Languages with unboxed data and/or manual memory management do have much more flexibility on that front than we have on the beam (without breaking out of vm managed memory) – rust is a popular language for that right now, but I’m also seeing zig in that space even if more “niche”.

defn works because it’s meant to and expected to work with tensor based data. I don’t think having a magic defx will ever exist without the need to know what data format actually benefits the modeled computational task.

Edit: Nx backends could also be written in rust I imagine, especially for running on CPU, so it’s also a bit of a apples and oranges comparison

sabiwara · November 17, 2023, 10:53am

As far as I understand, Nx and EXLA will leverage a GPU if available, but can otherwise fallback on the CPU which should still be much faster than immutable data structures.

EXLA — EXLA v0.6.4.

EXLA will pick an available client to allocate and compute tensors, in this order: :cuda, :rocm, :tpu, and :host (CPU).

If you compare Nx with the default backend and with EXLA configured, you should witness some order of magnitude difference in performance.

cigrainger · November 17, 2023, 3:26pm

Nx’s fast backends/compilers are not in Elixir. The BinaryBackend, which is, is slowwwwwwwwwwwwwww. What makes Nx fast is how Defn works, which is explained here: https://github.com/elixir-nx/nx/tree/main/nx#numerical-definitions

You’re able to arbitrarily (mostly) choose a backend/compiler (XLA is the best supported) for these definitions. So, to be clear when it’s running quickly (and it is accelerated on CPU as well as GPU), it’s due to NIFs. They’re just in C++ (see the EXLA code here: https://github.com/elixir-nx/nx/tree/main/exla).

And there is a Rust backend already for Nx GitHub - mimiquate/candlex: An Nx backend for candle machine learning framework

Why can’t you replace def with defn today? Sure, it’s a superset, but I’m not sure I follow the use case here. It runs on CPU and absolutely speeds things up a ton. But it’s because it’s calling a NIF, the same as Rust-based libs.

As it were, you can also use Explorer for many things. Which also stole the idea of pluggable backends from Nx, so even though the main backend is Polars, you can write other backends as well. Including a pure Elixir one if you want.

polvalente · November 17, 2023, 3:43pm

There are a couple of things to consider too.
Both EXLA and Torchx, which are maintained by the Nx team, take a functional approach to implementing the Nx.Backend behaviour, mostly for sanity.

With EXLA, the defns end up being compiled to a specific computational representation that ends up being generally memory efficient. With Torchx, however, this is not exactly the case – at most, we rely on caching results.

It could be the case where you actually want to implement a backend where your tensors are mutable. You would trade-off keeping with the functional expectations with having more memory efficiency.

This would mean that, Nx.put_slice, Nx.indexed_put and Nx.indexed_add would all be really efficient in updating your tensor values, at the expense of being stateful (which opens up a whole new can of worms).

Regarding Elixir’s own memory efficiency, there’s a lot to talk about there. For instance, the BinaryBackend relies on storing data in BEAM binaries. When updating a binary, you don’t actually copy all of the data, much like maps and lists, where you copy only part of the datastructure, but most of it ends up being left as is due to immutability (so 2 different lists which share a tail will actually share said tail, if one was constructed from the other).

Finally, regarding replacing def with defn, a simple replacement would probably not work anyway, due to reasons @LostKobrakai hinted at (SIMD instructions, for the example below):

For a given collection of arrays

source = [10, 20, 30, ...]
indices = [3, 2, 0]
destination = [0, 1, 2, 3, 4, ...]

# expected result [30, 1, 20, 10, 4, ...]

One would generally write something like this in Elixir:

Enum.zip_reduce(source, indices, destination, fn val, idx, dest -> 
  List.replace_at(dest, idx, val)
end)

This is probably one of the most efficient ways if you can only replace a single item at a time. However, in Nx you have Nx.indexed_put, which can do this in a single call:
Nx.indexed_put(destination, indices, source) (let’s assume we’re dealing with tensors already, instead of lists). The Nx version would be a lot more efficient due to SIMD instructions that can move collections of data at once, which means fewer CPU and memory access cycles.

The Elixir code is not easily translateable into the proper Nx code (although you could maybe add some special cases here and there, as soon as you diverge from the simplest cases, you’re bound to end up with more inneficient Nx code).

At best, you’re using a backend that allows for mutability, but that is, by definition, a non-equivalent code to what you started with, so it doesn’t make a lot of sense to say you’ve “sped the code up” in this case.

jackalcooper · June 13, 2024, 2:55am

Now there is a compiler project for number crunching in Elixir! Some example:

cmo · June 13, 2024, 6:13am

Do you think it’s possible for an extra compilation step to handle all the pointer stuff automatically?

jackalcooper · June 13, 2024, 7:01am

Yes, that is what is expected to be added next. Open to proposal and contributions.
To be more specific, what is blocking me now is that I’m not sure how to differentiate an Elixir value binding/match a = 1 and a mutation looks like a = 1 but is actually *a_ptr = 1

samrose · August 10, 2024, 1:06pm

I’ve been interested in exploring a Julia backend for Nx.