Torchx: OOM (out of memory) on Windows

Hi!

I am using Elixir 1.18.3 on OTP 27.
I am using Torchx 0.10, compiled with LIBTORCH_TARGET=cu128 and LIBTORCH_VERSION=2.9.0.

I am trying to hunt down a serious memory leak / no-free situation when using Torchx.
This is very easy to reproduce using the code below.

The issue is that the process's virtual memory keeps growing without bound.

GPU memory is released at intervals, as I would expect libtorch to do, which keeps the load at an acceptable level.

I originally thought I was hunting down a memory leak in the Ortex library, but it turned out the problem was in Torchx, which I used to prepare some data that I sent into Ortex.

defmodule Mix.Tasks.Simple do
  use Mix.Task

  @impl Mix.Task
  def run(_args) do
    runner(10_000)
  end

  # Stop once all iterations have run.
  defp runner(0), do: :ok

  defp runner(ittr) do
    # Allocate one tensor per iteration inside a short-lived Task.
    Task.async(fn ->
      Nx.broadcast(0, {1, 3, 640, 640})
      |> Nx.backend_transfer(Nx.default_backend())
    end)
    |> Task.await()

    Process.sleep(50)

    runner(ittr - 1)
  end
end

I have also tried calling :erlang.garbage_collect to release the memory, but that does nothing.
As you can see, I also put the code into a Task, as I thought the memory might be “trapped” in my parent process.
That did nothing either.

Anyone have any ideas?

I would really like to run this in WSL and use EXLA instead, but due to peripherals I have to stay in Windows.

Please report this as an issue on the Nx repository.

Do you know if this bug happens on Linux or Mac too?

edit:

After re-reading your code, I have a follow-up question: Where did you call :erlang.garbage_collect? What happens if you call it right after Task.await?

I will try to build for Linux and check.

Will create an issue.

Thanks for the reply!

I tried both inside the Task and also after the Task.await
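
For reference, the two placements discussed here can be sketched roughly as follows. This is only an illustration: the module name GcPlacement and the `work` callback are hypothetical and stand in for the tensor allocation from the original post. The result of `work` is deliberately discarded before collecting, so nothing in the Task or the parent still references it.

```elixir
defmodule GcPlacement do
  # Hypothetical helper, not from the thread: runs `work` in a Task and
  # forces a garbage collection in both places mentioned above.
  def run_once(work) when is_function(work, 0) do
    task =
      Task.async(fn ->
        # Discard the result so the Task holds no reference to it.
        _ = work.()
        # Placement 1: inside the Task, before it exits.
        :erlang.garbage_collect()
        :ok
      end)

    :ok = Task.await(task)
    # Placement 2: in the parent process, right after Task.await.
    :erlang.garbage_collect()
    :ok
  end
end
```

Calling `GcPlacement.run_once(fn -> Nx.broadcast(0, {1, 3, 640, 640}) end)` in a loop would mirror the reproduction, assuming Nx is available in the project.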

Thanks. I’d have expected either to have worked.
Also try to use Nx.backend_deallocate inside the task, so we can see if there’s a chance that function itself is busted.

I have also tried using Nx.backend_deallocate, with the same result :slight_smile:
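
A minimal sketch of what deallocating inside the task might look like is shown below. The module name DeallocSketch is hypothetical, and the code assumes :nx (and whichever backend is configured as the default) is a dependency of the project. It is not the exact code that was run in this thread.

```elixir
defmodule DeallocSketch do
  # Hypothetical module; assumes :nx is a project dependency.
  def allocate_and_free do
    Task.async(fn ->
      tensor =
        Nx.broadcast(0, {1, 3, 640, 640})
        |> Nx.backend_transfer(Nx.default_backend())

      # Explicitly release the backend buffer instead of waiting for GC.
      # Nx.backend_deallocate/1 returns :ok, or :already_deallocated if
      # the data had been freed before.
      Nx.backend_deallocate(tensor)
    end)
    |> Task.await()
  end
end
```

Because the task returns only the atom from Nx.backend_deallocate/1, no tensor is sent back to the parent process.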

Currently trying to build torchx for WSL to check for same problem there.

So far in my tests, this does not behave the same way on Linux.
I will try to downgrade libtorch in the Windows test, as the latest version available for Linux is 2.7.1.

Using htop, VIRT, RES and SHR all stayed stable during the test.

I can confirm that this only happens on Windows if my understanding of Linux htop is correct.
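
To compare OS-level numbers like these with what the BEAM itself tracks, a small logging helper could be called from the loop. The sketch below is hypothetical (the MemLog module is not part of the thread). As far as I know, :erlang.memory/1 only covers memory managed by the BEAM's own allocators, so memory that a native library such as libtorch allocates directly would not show up in it. If the OS figure grows while this one stays flat, that would point at the native side.

```elixir
defmodule MemLog do
  # Hypothetical helper: total memory tracked by the BEAM, in MiB.
  def beam_total_mib do
    :erlang.memory(:total) / (1024 * 1024)
  end

  # Prints one line per iteration for comparison with htop or Task Manager.
  def log(iteration) do
    IO.puts("iteration #{iteration}: BEAM total #{Float.round(beam_total_mib(), 1)} MiB")
  end
end
```

For example, calling `MemLog.log(ittr)` next to the `Process.sleep(50)` in the reproduction would print one reading per iteration.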

I have now also tried libtorch 2.7.1, and the result was the same as with 2.8.0.

It’s very likely this is a Windows-only bug. Please report this as an issue with a summary of these findings!

Issue created here:

Maybe someone else on Windows could verify my test? :slight_smile:

Just to update the thread here, Torchx on main has been refactored to use elixir-nx/fine for the NIFs. This both fixes the bug and makes it easier to maintain the NIFs!

Great @polvalente ! :partying_face::partying_face:
