I was hunting down a memory leak in the Ortex library (or so I thought), but it turned out the problem was in Torchx, which I used to prepare some data I sent into Ortex.
```elixir
defmodule Mix.Tasks.Simple do
  use Mix.Task

  @impl Mix.Task
  def run(_args) do
    10_000
    |> runner()
  end

  defp runner(0), do: :ok

  defp runner(iter) do
    # Allocate the tensor inside a short-lived task so any process-local
    # memory should be reclaimed when the task exits.
    Task.async(fn ->
      Nx.broadcast(0, {1, 3, 640, 640})
      |> Nx.backend_transfer(Nx.default_backend())
    end)
    |> Task.await()

    Process.sleep(50)
    runner(iter - 1)
  end
end
```
I have also tried calling :erlang.garbage_collect to release the memory, but that does nothing.
As you can see, I also put the code into a Task, since I thought the memory might be “trapped” in my parent process. That also did nothing.
Anyone have any ideas?
I would really like to run this under WSL and use EXLA instead, but due to peripherals I have to stay on Windows.
Please report this as an issue on the Nx repository.
Do you know if this bug happens on Linux or Mac too?
edit:
After re-reading your code, I have a follow-up question: Where did you call :erlang.garbage_collect? What happens if you call it right after Task.await?
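For reference, a placement sketch (reusing the runner/1 from the original post) — the idea is to collect the parent process right after the awaited tensor reference could have become garbage:

```elixir
defp runner(iter) do
  Task.async(fn ->
    Nx.broadcast(0, {1, 3, 640, 640})
    |> Nx.backend_transfer(Nx.default_backend())
  end)
  |> Task.await()

  # The awaited result is a tensor struct holding a reference to a NIF
  # resource; collecting the parent right here lets that reference be
  # dropped before the next iteration.
  :erlang.garbage_collect()

  Process.sleep(50)
  runner(iter - 1)
end
```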
Thanks. I’d have expected either to have worked.
Also try to use Nx.backend_deallocate inside the task, so we can see if there’s a chance that function itself is busted.
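Something along these lines (a sketch based on the original loop body; Nx.backend_deallocate/1 frees the backend-managed memory explicitly instead of waiting for the garbage collector):

```elixir
Task.async(fn ->
  tensor =
    Nx.broadcast(0, {1, 3, 640, 640})
    |> Nx.backend_transfer(Nx.default_backend())

  # Explicitly free the native memory behind the tensor. If memory still
  # grows after this, the deallocation path in the NIF itself is suspect.
  Nx.backend_deallocate(tensor)
end)
|> Task.await()
```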
So far, in my tests this does not behave the same way on Linux.
I will try downgrading libtorch in the Windows test, since the latest version for Linux is 2.7.1.
Using htop, VIRT, RES, and SHR all stayed stable during the test.
Just to update the thread here, Torchx on main has been refactored to use elixir-nx/fine for the NIFs. This both fixes the bug and makes it easier to maintain the NIFs!