Nx appears to be extremely slow for Singular Value Decomposition: around 50,000 times slower than NumPy

I’m trying to do SVD on a small financial correlation matrix (16x16), and it takes over 45 seconds in Nx, whereas it takes about a millisecond in Python.

Here is my Elixir code:

defmodule Svdtest do

  def finurl, do: "https://raw.githubusercontent.com/vegabook/nxtest/main/priv/data/dfc.csv"

  def url_to_tensor(url) do
    req = HTTPoison.get!(url)
    req.body 
    |> String.split("\n", trim: true) 
    |> Enum.map(fn x -> String.split(x, ",", trim: true) end) 
    |> Enum.map(fn x -> Enum.map(x, fn y -> String.to_float(y) end) end)
    |> Nx.tensor
  end

  def time_svd(tensor) do
    :timer.tc(&Nx.LinAlg.svd/1, [tensor])
  end

  def run_test(url) do
    tensor = url_to_tensor(url)
    {time, {_u, s, _vt}} = time_svd(tensor)
    seconds = time / 1000000
    IO.inspect(s)
    IO.puts("Time taken: #{seconds} seconds")
  end

end

And here is the equivalent python:

import requests
import numpy as np
import datetime as dt

finurl = "https://raw.githubusercontent.com/vegabook/nxtest/main/priv/data/dfc.csv"

def test_svd(url=finurl):
    data = requests.get(url)
    rows = [r for r in data.content.decode().split("\n") if r != ""]
    numbers = [[float(i) for i in x.split(",")] for x in rows]
    npa = np.array(numbers)
    nowtime = dt.datetime.utcnow()
    svd = np.linalg.svd(npa)
    seconds = (dt.datetime.utcnow() - nowtime).total_seconds()
    print(seconds)
    print(svd[1])
    print(f"Python time taken: {seconds} seconds")

if __name__ == "__main__":
    test_svd()

The results do come out the same, btw, as you will see if you run the code.

This is running on a Raspberry Pi, because it is brutal at exposing slow code paths, so there is no GPU backend in play. I do know that NumPy is using NEON vectorised instructions on this hardware though, so this is a best case for it.

Code is on github where you can see my mix.exs, etc, but here are my deps:

  defp deps do
    [
      # {:dep_from_hexpm, "~> 0.3.0"},
      # {:dep_from_git, git: "https://github.com/elixir-lang/my_dep.git", tag: "0.1.0"}
      # {:axon, "~> 0.3.0"},
      # {:exla, "~> 0.4.0"},
      {:nx, "~> 0.4.0"},
      {:explorer, "~> 0.4.0"},
      {:csv, "~> 3.0"},
      {:httpoison, "~> 1.8"}
    ]
  end
While perusing the Nx code yesterday, I did see that SVD, alone, seems to be written in pure Elixir, which maybe explains this? I’d really like to move to Elixir for my financial stat arb workflows, but I need it to be faster than this.

Is it because I’m on the “wrong” hardware, perhaps? Would x86 and/or a better backend help? If so, can someone run the code above and let me know what difference you get on that hardware?

Or is it maybe that Nx’s focus is ML/AI and not traditional statistics? Thanks.

1 Like

You’re not using a native backend (XLA). You’re actually using the BEAM-based Nx.BinaryBackend, which is in fact very slow. I think it was written mainly for ease of testing and is basically a “toy”.

What you need to do is uncomment the exla dependency, and configure EXLA as the default backend. More info in the docs.
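For reference, here is a minimal sketch of what that looks like. The version numbers match the deps listed above; check the EXLA docs for the options appropriate to your Nx version:

```elixir
# mix.exs — uncomment/add the EXLA dependency alongside :nx
{:exla, "~> 0.4.0"},

# config/config.exs — make EXLA the default backend so all tensors
# are allocated and operated on by XLA instead of Nx.BinaryBackend
import Config
config :nx, :default_backend, EXLA.Backend
```

After this, tensors created with `Nx.tensor/1` will report `EXLA.Backend` when inspected, as in the output further down this thread.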

6 Likes

For the sake of completeness, this will still be (a bit) slower than Python for now. We still need to reimplement Nx.eigh as per Re-implement Nx.LinAlg.eigh as defn · Issue #1027 · elixir-nx/nx · GitHub to get to JAX’s speed.

4 Likes

Another thing I forgot to point out:

The current SVD implementation is written in Nx defn. Although it seems like regular Elixir, the code will actually generate a computation graph with the Nx.Defn.Expr backend, which can then be compiled accordingly.

By default you will fall into the Nx.BinaryBackend for the compiled graph, but if you use EXLA as your compiler, the graph will be compiled by XLA under the hood.
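To sketch the difference (assuming the EXLA dependency is installed), you can either set the defn compiler globally or JIT-compile a single function:

```elixir
# Option 1: make EXLA the default compiler for all defn-compiled code
Nx.Defn.default_options(compiler: EXLA)

# Option 2: JIT-compile just the SVD call. The first invocation pays
# the XLA compilation cost; subsequent calls reuse the compiled graph.
svd = Nx.Defn.jit(&Nx.LinAlg.svd/1)
{u, s, vt} = svd.(tensor)
```

That first-call compilation cost is why the initial run below is slow (~1.5 s) and subsequent runs are in the hundreds of microseconds.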

2 Likes

Not a problem. Even if it’s 10x slower that’s fine, just not 50,000x lol. Playing with it now. One small issue is that there isn’t a pre-compiled XLA for ARM unless you have CUDA or you’re on a Mac, so I’m compiling from source to see what happens (might be very slow on the RPi).

So I got some success here, but only on x86:

iex(3)> Svdtest.run_test(Svdtest.finurl)
#Nx.Tensor<
  f32[16]
  EXLA.Backend<host:0, 0.1541237301.966656004.29243>
  [6.730013370513916, 4.988578796386719, 2.5313656330108643, 0.6920415163040161,
   0.4245455265045166, 0.18819215893745422, 0.14191943407058716, 0.07709479331970215,
   0.06985975056886673, 0.05112181976437569, 0.038872502744197845, 0.027997685596346855,
   0.016022006049752235, 0.012146148830652237, 0.008691341616213322, 0.0015574685530737042]
>
Time taken: 2.78e-4 seconds

That’s 0.000278 seconds, or 278 microseconds, after an initial slow run (1.5 seconds) which I assume is XLA compiling the code path. That’s admittedly the fastest run, with averages around 500 microseconds on a small Xeon instance.

Python (numpy) on the same instance:

🐘 ttbrowne@logic:~/code/elixir/nxtest/python$ python3 svd_test.py
0.000518
[6.73000923e+00 4.98856965e+00 2.53136193e+00 6.92040501e-01
 4.24544789e-01 1.88191962e-01 1.41919306e-01 7.70945668e-02
 6.98595938e-02 5.11216337e-02 3.88724801e-02 2.79975821e-02
 1.60219956e-02 1.21460127e-02 8.69127912e-03 1.55749019e-03]
Python time taken: 0.000518 seconds
🐘 ttbrowne@logic:~/code/elixir/nxtest/python$ python3 svd_test.py
0.000246
[6.73000923e+00 4.98856965e+00 2.53136193e+00 6.92040501e-01
 4.24544789e-01 1.88191962e-01 1.41919306e-01 7.70945668e-02
 6.98595938e-02 5.11216337e-02 3.88724801e-02 2.79975821e-02
 1.60219956e-02 1.21460127e-02 8.69127912e-03 1.55749019e-03]
Python time taken: 0.000246 seconds
🐘 ttbrowne@logic:~/code/elixir/nxtest/python$ python3 svd_test.py
0.000261
[6.73000923e+00 4.98856965e+00 2.53136193e+00 6.92040501e-01
 4.24544789e-01 1.88191962e-01 1.41919306e-01 7.70945668e-02
 6.98595938e-02 5.11216337e-02 3.88724801e-02 2.79975821e-02
 1.60219956e-02 1.21460127e-02 8.69127912e-03 1.55749019e-03]
Python time taken: 0.000261 seconds

Basically, same speed. Awesome! Here we go moving from Python to Elixir.

Keep in mind that this is on a non-GPU instance, and that NumPy is heavily optimized to use AVX[2] instructions, so the fact that Nx is keeping up is kudos-worthy, though I guess we have Google to thank too.

Interestingly, there are some precision differences in the floating point result. Would have to check out algorithmic differences, before making a judgement.

The only pity is that the XLA build is really hard on a small dev board like the Raspberry Pi, and I was unable to complete it, hence I moved to the x86 instance as per above. I like coding on an RPi because it forces algorithmic discipline with its weak power. Also I might use it for edge inference at some stage. Have filed an issue.

5 Likes

The precision issues are also due to the eigh implementation. Because it’s slow, I had to limit the eps for eigh, which ends up affecting the precision for SVD as a whole.

4 Likes

For completeness, the issue for providing pre-compiled binaries for the RPi has just been closed.

On my RPi 4 (8GB RAM), full EXLA compilation takes roughly 2 to 3 minutes.

5 Likes

Great thank you. You can see here that we’ve gone from 50k times slower to only 4x slower, so thank you for the 12500x perf increase haha.

2 Likes