Why is stacking lots of `EXLA.Backend` tensors slow?

Hello!

This is probably a newbie question, but I noticed that this code, which stacks 10,000 small tensors, takes 56 seconds the first time I run it:

for(_ <- 1..10_000, do: Nx.broadcast(0, {1024}))
|> Nx.stack()

while stacking with the BinaryBackend and then transferring to EXLA takes only 0.4 seconds:

zero = Nx.tensor([0], backend: Nx.BinaryBackend)
for(_ <- 1..10_000, do: Nx.broadcast(zero, {1024}))
|> Nx.stack()
|> Nx.backend_transfer(EXLA.Backend)
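For what it's worth, a variant of this workaround avoids threading a seed tensor through: scope the whole build to the BinaryBackend with Nx.default_backend/1, then move the result to EXLA in a single transfer. This is a sketch that assumes EXLA is installed and that Nx.default_backend/1 returns the previous default:

```elixir
# Sketch: build everything on the BinaryBackend so each broadcast stays on
# the CPU, then transfer the stacked result to EXLA once at the end.
prev = Nx.default_backend(Nx.BinaryBackend)

stacked =
  for(_ <- 1..10_000, do: Nx.broadcast(0, {1024}))
  |> Nx.stack()
  |> Nx.backend_transfer(EXLA.Backend)

# Restore whatever backend was the default before.
Nx.default_backend(prev)
```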

The first version also takes an additional minute whenever the length of the list changes. Why is that? I noticed the current function was stuck in EXLA.NIF.mlir_compile/7. Is a new version of stack being compiled for each input list size?
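One hedged way to check whether compilation is the bottleneck is to time the same stack twice with :timer.tc: if the second call with an identical list length is fast, the first call's cost was compilation rather than execution (the sizes below are illustrative, not from the original measurement):

```elixir
# Sketch: time the same stack twice to separate compile time from run time.
tensors = for _ <- 1..100, do: Nx.broadcast(0, {1024})

{first_us, _result} = :timer.tc(fn -> Nx.stack(tensors) end)
{second_us, _result} = :timer.tc(fn -> Nx.stack(tensors) end)

# If a new computation is compiled per list length, first_us should dwarf
# second_us, and changing length(tensors) should make the next call slow again.
IO.inspect({first_us, second_us}, label: "microseconds")
```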


Confirmed on my laptop with an NVIDIA GeForce RTX 4070 Mobile GPU.

When running

for(_ <- 1..10_000, do: Nx.broadcast(0, {1024}))
|> Nx.stack(name: :articles)

the GPU memory stays near full, and I killed that run after about 2 minutes.

Here’s what $ nvidia-smi showed:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8              3W /   35W |    7213MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      5627      C   ....17.0-otp-26/.mix/escripts/livebook       7204MiB |
+-----------------------------------------------------------------------------------------+

Then I completely restarted Livebook and ran

zero = Nx.tensor([0], backend: Nx.BinaryBackend)
for(_ <- 1..10_000, do: Nx.broadcast(zero, {1024}))
|> Nx.stack(name: :articles)
|> Nx.backend_transfer(EXLA.Backend)

It finished within 3 seconds.

Elixir Version: 1.17.0 OTP 26
Livebook Version: 0.12.1
Nx Version: 0.7.2
EXLA Version: 0.7.2

Thank you all. Fixed in main: Do not compile stack/concatenate expressions · elixir-nx/nx@12491ab · GitHub


Wow! Thanks to all for the verification and for the fix!