Performance issue with Explorer, Nx, and their combination

Hello everyone,

I’m currently working on a project that involves using Elixir to perform data analysis tasks. As part of this work, I’ve been experimenting with different libraries and tools, including Explorer and Nx.

To get a better sense of how these tools perform, I’ve been running some benchmarks on simple functions like mean, variance, and standard deviation. However, I’ve run into a strange issue when I try to combine these libraries - specifically, when I convert an Explorer.Series to an Nx.tensor and then use Nx functions like Nx.mean.

What I’ve found is that this combined operation is much slower than either operation alone, which seems counterintuitive. I’m not sure what’s causing this issue, but I suspect it could be due to inefficiencies in the conversion process, memory usage, or other performance bottlenecks in the code.

I’m reaching out to the community to see if anyone has experienced similar issues, or has any advice on how to improve the performance of this operation. I’d be grateful for any insights or suggestions you can offer.

Thank you in advance for your help!

defmodule Bench do
  import Nx.Defn

  # Convert the series to a tensor first, then reuse the defn below.
  deftransform mean_nx_series(series) do
    series
    |> Explorer.Series.to_tensor()
    |> Bench.mean_nx()
  end

  defn mean_nx(tensor) do
    Nx.mean(tensor)
  end

  def mean_explorer(series) do
    Explorer.Series.mean(series)
  end
end

bench_means =
  Benchee.run(
    %{
      "explorer_mean" => fn -> Bench.mean_explorer(rand_series) end,
      "nx_mean_s64" => fn -> Bench.mean_nx(rand_tensor_s64) end,
      "nx_mean_s32" => fn -> Bench.mean_nx(rand_tensor_s32) end,
      "nx_mean_s16" => fn -> Bench.mean_nx(rand_tensor_s16) end,
      "nx_mean_of_series" => fn -> Bench.mean_nx(rand_series) end,
      "nx_series_with_deftransform" => fn -> Bench.mean_nx_series(rand_series) end,
      "converting_series_to_nx" => fn -> Explorer.Series.to_tensor(rand_series) end,
      "pre_converting_series_to_nx_nx_mean" => fn -> Explorer.Series.to_tensor(rand_series) |> Bench.mean_nx() end
    },
    warmup: 1,
    time: 2
  )

Results using the EXLA CUDA backend. The series and the tensors each have a length of 1 million.

Operating System: Linux
CPU Information: AMD Ryzen 9 3900X 12-Core Processor
Number of Available Cores: 24
Available memory: 31.24 GB
Elixir 1.14.2
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 2 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking converting_series_to_nx ...
Benchmarking explorer_mean ...
Benchmarking nx_mean_of_series ...
Benchmarking nx_mean_s16 ...
Benchmarking nx_mean_s32 ...
Benchmarking nx_mean_s64 ...
Benchmarking nx_series_with_deftransform ...
Benchmarking pre_converting_series_to_nx_nx_mean ...

Name                                          ips        average  deviation         median         99th %
converting_series_to_nx                   73.64 K       13.58 μs    ±40.03%       13.57 μs       17.90 μs
nx_mean_s16                                5.56 K      179.73 μs    ±57.99%      156.00 μs      774.57 μs
nx_mean_s32                                4.86 K      205.92 μs    ±51.52%      187.21 μs      775.33 μs
nx_mean_s64                                4.20 K      238.37 μs    ±46.83%      216.76 μs      814.91 μs
explorer_mean                              1.30 K      770.52 μs     ±2.35%      765.21 μs      852.85 μs
pre_converting_series_to_nx_nx_mean      0.0116 K    86121.10 μs    ±14.73%    95941.01 μs    97585.60 μs
nx_series_with_deftransform              0.0112 K    89152.43 μs    ±11.92%    95158.30 μs   101969.55 μs
nx_mean_of_series                        0.0107 K    93060.98 μs    ±10.03%    94988.81 μs   107701.51 μs

Comparison: 
converting_series_to_nx                   73.64 K
nx_mean_s16                                5.56 K - 13.23x slower +166.15 μs
nx_mean_s32                                4.86 K - 15.16x slower +192.34 μs
nx_mean_s64                                4.20 K - 17.55x slower +224.79 μs
explorer_mean                              1.30 K - 56.74x slower +756.94 μs
pre_converting_series_to_nx_nx_mean      0.0116 K - 6341.60x slower +86107.52 μs
nx_series_with_deftransform              0.0112 K - 6564.81x slower +89138.85 μs
nx_mean_of_series                        0.0107 K - 6852.62x slower +93047.40 μs

I ran the same benchmark with the EXLA CPU backend:

Operating System: Linux
CPU Information: AMD Ryzen 9 3900X 12-Core Processor
Number of Available Cores: 24
Available memory: 31.24 GB
Elixir 1.14.2
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 2 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking converting_series_to_nx ...
Benchmarking explorer_mean ...
Benchmarking nx_mean_of_series ...
Benchmarking nx_mean_s16 ...
Benchmarking nx_mean_s32 ...
Benchmarking nx_mean_s64 ...
Benchmarking nx_series_with_deftransform ...
Benchmarking pre_converting_series_to_nx_nx_mean ...

Name                                          ips        average  deviation         median         99th %
converting_series_to_nx                  83012.68      0.0120 ms    ±30.03%      0.0118 ms      0.0164 ms
nx_mean_s64                               3780.90        0.26 ms     ±3.07%        0.26 ms        0.29 ms
explorer_mean                             1302.36        0.77 ms     ±1.84%        0.76 ms        0.85 ms
nx_mean_s16                                912.97        1.10 ms     ±9.29%        1.13 ms        1.30 ms
nx_mean_s32                                826.72        1.21 ms     ±9.29%        1.24 ms        1.43 ms
pre_converting_series_to_nx_nx_mean         11.47       87.19 ms    ±15.18%       97.30 ms       99.32 ms
nx_series_with_deftransform                 11.31       88.45 ms    ±11.97%       95.49 ms       96.91 ms
nx_mean_of_series                           10.88       91.93 ms     ±9.95%       95.05 ms       98.16 ms

Comparison: 
converting_series_to_nx                  83012.68
nx_mean_s64                               3780.90 - 21.96x slower +0.25 ms
explorer_mean                             1302.36 - 63.74x slower +0.76 ms
nx_mean_s16                                912.97 - 90.93x slower +1.08 ms
nx_mean_s32                                826.72 - 100.41x slower +1.20 ms
pre_converting_series_to_nx_nx_mean         11.47 - 7237.96x slower +87.18 ms
nx_series_with_deftransform                 11.31 - 7342.16x slower +88.43 ms
nx_mean_of_series                           10.88 - 7630.95x slower +91.91 ms

What versions of Nx, EXLA and Explorer are you using?
It would also be interesting to see how the inputs you're using are defined.

For defn, I'd set a long warmup as well, just so we can fully eliminate any initialization or compilation times from the measurements.
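
To make the warmup point concrete, here is a minimal sketch that times the first call separately from later calls, using only the standard library. The module name `WarmupSketch`, the function `first_vs_steady/2`, and the stand-in workload are all made up for illustration; in this thread the function under test would be something like `fn -> Bench.mean_nx(rand_tensor_s64) end`.

```elixir
defmodule WarmupSketch do
  # Times the first call on its own, then averages `runs` later calls.
  # All times are in microseconds, as reported by :timer.tc/1.
  def first_vs_steady(fun, runs \\ 100) when is_function(fun, 0) and runs > 0 do
    {first_us, _result} = :timer.tc(fun)

    total_us =
      Enum.reduce(1..runs, 0, fn _, acc ->
        {us, _result} = :timer.tc(fun)
        acc + us
      end)

    %{first_call_us: first_us, steady_avg_us: total_us / runs}
  end
end

# Stand-in workload (hypothetical); swap in the defn call you want to measure.
stats = WarmupSketch.first_vs_steady(fn -> Enum.sum(1..100_000) end)
IO.inspect(stats)
```

If the first call is much slower than the steady-state average, one-time costs such as compilation are leaking into the numbers and a longer warmup should help. If both stay equally slow, the cost is paid on every call and warmup is not the explanation.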


These are the versions used in this test:

Mix.install(
  [
    {:explorer, "~> 0.5.6"},
    {:nx, "~> 0.5.2"},
    {:exla, "~> 0.5.2"},
    {:benchee, "~> 1.1.0"}
  ],
  system_env: [
    XLA_TARGET: "cuda118"
  ]
)

Previously I was using Explorer 0.5.2 and Nx 0.5.1, and the results were very similar.

The inputs are random tensors or a random series.

rand_tensor_s16
#Nx.Tensor<
  s16[1000000]
  EXLA.Backend<cuda:0, 0.3848215277.360316982.67444>
  [6899, 2127, 2266, 5280, 6570, 4454, 9774, 5811, 2073, 391, 4742, 0, 5959, 535, 5421, 4487, 6503, 9878, 136, 3112, 7397, 4534, 9984, 4255, 7582, 4878, 3731, 840, 1090, 1739, 9907, 2214, 4650, 1645, 3259, 7433, 2875, 1216, 6472, 9170, 4651, 2634, 8160, 8559, 9748, 7056, 1912, 218, 5767, 4991, ...]
>
rand_series
#Explorer.Series<
  Polars[1000000]
  integer [593, 67, 63, 147, 3108, 143, 4173, 2146, 5643, 7, 1282, 47740, 502, 12066, 24226, 3866,
   1551, 16989, 2352, 640, 1419, 35244, 16448, 19726, 474, 9537, 6013, 6554, 13, 6, 93, 71, 200,
   1831, 27114, 35652, 562, 1252, 350, 8696, 1376, 146, 84409, 88, 232, 140230, 257, 1633, 2373,
   129, ...]
>

This is the tensor after conversion from the series:

Explorer.Series.to_tensor(rand_series)
#Nx.Tensor<
  s64[1000000]
  EXLA.Backend<cuda:0, 0.3848215277.360316982.67467>
  [593, 67, 63, 147, 3108, 143, 4173, 2146, 5643, 7, 1282, 47740, 502, 12066, 24226, 3866, 1551, 16989, 2352, 640, 1419, 35244, 16448, 19726, 474, 9537, 6013, 6554, 13, 6, 93, 71, 200, 1831, 27114, 35652, 562, 1252, 350, 8696, 1376, 146, 84409, 88, 232, 140230, 257, 1633, 2373, 129, ...]
>

Results after increasing warmup and time to 5 seconds each:

Operating System: Linux
CPU Information: AMD Ryzen 9 3900X 12-Core Processor
Number of Available Cores: 24
Available memory: 31.24 GB
Elixir 1.14.2
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 5 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 1.33 min

Benchmarking converting_series_to_nx ...
Benchmarking explorer_mean ...
Benchmarking nx_mean_of_series ...
Benchmarking nx_mean_s16 ...
Benchmarking nx_mean_s32 ...
Benchmarking nx_mean_s64 ...
Benchmarking nx_series_with_deftransform ...
Benchmarking pre_converting_series_to_nx_nx_mean ...

Name                                          ips        average  deviation         median         99th %
converting_series_to_nx                   74.30 K       13.46 μs    ±44.85%       13.56 μs       15.17 μs
nx_mean_s16                                6.96 K      143.59 μs    ±25.35%      138.99 μs      184.70 μs
nx_mean_s32                                6.09 K      164.33 μs    ±21.69%      161.64 μs      212.03 μs
nx_mean_s64                                5.02 K      199.06 μs    ±20.14%      188.72 μs      267.11 μs
explorer_mean                              1.31 K      763.56 μs     ±0.53%      762.62 μs      770.92 μs
pre_converting_series_to_nx_nx_mean      0.0110 K    90947.54 μs    ±13.22%    94835.72 μs   104831.06 μs
nx_series_with_deftransform              0.0109 K    91932.58 μs    ±12.77%   100133.80 μs   103927.18 μs
nx_mean_of_series                        0.0103 K    97187.90 μs    ±11.27%   101938.17 μs   105582.17 μs

Comparison: 
converting_series_to_nx                   74.30 K
nx_mean_s16                                6.96 K - 10.67x slower +130.13 μs
nx_mean_s32                                6.09 K - 12.21x slower +150.87 μs
nx_mean_s64                                5.02 K - 14.79x slower +185.60 μs
explorer_mean                              1.31 K - 56.73x slower +750.10 μs
pre_converting_series_to_nx_nx_mean      0.0110 K - 6757.24x slower +90934.08 μs
nx_series_with_deftransform              0.0109 K - 6830.42x slower +91919.12 μs
nx_mean_of_series                        0.0103 K - 7220.88x slower +97174.44 μs

Ok, so there are a few things we need to discuss.
Because we're dealing with GPU data transfer, I'd expect the conversion from Explorer to Nx to take a bit of time: the Explorer data lives in CPU RAM, so we first need to turn it into an Nx tensor (done with zero-copy) and then transfer that tensor to the GPU (which takes some time).

Explorer by itself is slower because it doesn’t use the GPU, while we see similar speeds for Nx itself (because of the GPU usage).

From the “converting_series_to_nx” measurement we can see that the transfer from the CPU to the GPU is taking roughly 5~10% of the median time that Nx execution takes, so that’s something to take into account as well.
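
One way to separate the conversion cost from the computation is to convert the series once, outside the timed function, and compare that against converting on every call. The following is only a sketch under assumptions: the dependency versions are copied from this thread, the module name `ConvertOnce` and the random input are made up, and it leaves out `XLA_TARGET`, so it targets whatever EXLA client is the default on your machine.

```elixir
# Sketch only: versions copied from the thread, input is a made-up stand-in.
Mix.install([
  {:explorer, "~> 0.5.6"},
  {:nx, "~> 0.5.2"},
  {:exla, "~> 0.5.2"},
  {:benchee, "~> 1.1.0"}
])

# Allocate tensors on the EXLA backend and compile defn functions with EXLA.
Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)

defmodule ConvertOnce do
  import Nx.Defn

  defn mean_nx(tensor), do: Nx.mean(tensor)
end

# Hypothetical input: one million random integers.
rand_series =
  Explorer.Series.from_list(Enum.map(1..1_000_000, fn _ -> :rand.uniform(10_000) end))

# Converted once, outside the timed function, so only the mean is measured.
tensor_from_series = Explorer.Series.to_tensor(rand_series)

Benchee.run(
  %{
    "mean_of_preconverted_tensor" => fn -> ConvertOnce.mean_nx(tensor_from_series) end,
    "convert_then_mean" => fn ->
      rand_series |> Explorer.Series.to_tensor() |> ConvertOnce.mean_nx()
    end
  },
  warmup: 5,
  time: 5
)
```

If the pre-converted case is as fast as the plain tensor benchmarks while the convert-on-every-call case stays slow, the slowdown comes from something on the conversion path rather than from `Nx.mean` itself.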

This doesn't explain why the other defn executions are so slow. Unfortunately, I do not have CUDA on hand right now, but as a final sanity check, please make sure that EXLA is set as your Nx.Defn compiler.
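
For that sanity check, a minimal configuration sketch follows. It assumes a script or Livebook where Nx and EXLA are already installed and where nothing else has set these options; the labels passed to `IO.inspect/2` are only for readability.

```elixir
# Run once at the top of the script or notebook, before any defn call.
Nx.global_default_backend(EXLA.Backend)
Nx.Defn.global_default_options(compiler: EXLA)

# Print what the current process will actually use.
IO.inspect(Nx.default_backend(), label: "default backend")
IO.inspect(Nx.Defn.default_options(), label: "defn options")
```

If the printed defn options do not include `compiler: EXLA`, the `defn` functions are not being compiled by EXLA even though the tensors are stored on the EXLA backend, which would be worth ruling out before digging further.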
