TorchX backend crashing against PyTorch 2.0

I compiled PyTorch 2.0.1 and am trying to run the notebook supplied with Axon from its hex docs:
https://hexdocs.pm/axon/accelerating_axon.html

The TorchX backend crashes here after running for a while, whereas the EXLA compiler and backend work very fast.
Is there a chance this could be fixed by downgrading to PyTorch 1.x?

Erlang 26.0.2, Elixir 1.15.4, Livebook 0.10.0

It seems it was a memory issue. I had to run the EXLA and TorchX backends in different notebooks.
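To keep the two native runtimes from competing for the same memory, each notebook can be pinned to a single backend. This is a hypothetical sketch, not the exact setup from the post; the version requirements are assumptions, and `Nx.default_backend/1` is the standard way to select the backend for tensors created afterwards:

```elixir
# Hypothetical sketch: this Livebook notebook installs and uses EXLA only.
Mix.install([
  {:nx, "~> 0.5"},
  {:exla, "~> 0.5"}
])

# All tensors created from here on are allocated on the EXLA backend.
Nx.default_backend(EXLA.Backend)

# In a *separate* notebook, do the equivalent with TorchX instead,
# so libtorch and XLA never share one runtime:
#   Mix.install([{:nx, "~> 0.5"}, {:torchx, "~> 0.5"}])
#   Nx.default_backend(Torchx.Backend)
```

Keeping the installs separate means only one native library is loaded per notebook process.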


Unfortunately, I had to downgrade my PyTorch installation to 1.13.1 because of this line in the 2.0.1 build log:

-- USE_CUDNN is set to 0. Compiling without cuDNN support

while 1.13.1 produces a much clearer log:
-- Found CUDNN: /usr/lib/aarch64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.6.0 (include: /usr/include, library: /usr/lib/aarch64-linux-gnu/libcudnn.so)

I’m 100% sure the issue is on the PyTorch side, but even when I tried to build 2.0.1 with these flags I still ran into the cuDNN problem:
-DUSE_CUDNN=1 -DCAFFE2_USE_CUDNN=1
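An alternative to passing CMake defines directly is setting PyTorch's build environment variables before running `setup.py`. This is a sketch under assumptions: `USE_CUDNN`, `CUDNN_LIB_DIR`, and `CUDNN_INCLUDE_DIR` are documented PyTorch build variables, but the paths below are taken from the 1.13.1 log above and must be adjusted for your system:

```shell
# Hypothetical build sketch: force cuDNN on and point the build at the
# library/include paths reported by the working 1.13.1 configure step.
export USE_CUDNN=1
export CUDNN_LIB_DIR=/usr/lib/aarch64-linux-gnu
export CUDNN_INCLUDE_DIR=/usr/include

# Then rebuild from a PyTorch source checkout:
cd pytorch
python setup.py install
```

Whether this works around the 2.0.1 detection issue is not confirmed; it only changes how cuDNN is located, not the detection logic itself.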

But someone may be fine using PyTorch without cuDNN; that’s purely a matter of preference.