For the last year, I've been running Nx/EXLA on an RTX 3060 eGPU without problems. This week I had to reinstall my OS (Ubuntu 22.04), and now I can't get Nx working!
Other GPU programs (ollama, nvtop) work fine, but Nx/EXLA does not.
I'm not sure what I'm doing wrong here, and I'd appreciate any help or clues!
nvidia-smi output:
Sat Nov 9 12:40:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   39C    P8             14W /  170W |       4MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
My driver install script:
#!/usr/bin/env bash
vsn="550" # 550, 565
# Add Nvidia package repository
# sudo apt update
# sudo apt install -y software-properties-common
# sudo add-apt-repository ppa:graphics-drivers/ppa -y
# sudo apt update
# Install Packages
# Install the driver
sudo apt install -yq nvidia-driver-$vsn
# Install CUDA and related packages
sudo apt install -yq \
  nvidia-compute-utils-$vsn \
  nvidia-utils-$vsn \
  nvidia-cuda-toolkit \
  nvidia-cuda-dev \
  nvidia-gds
# Install cuDNN (verify version compatibility)
sudo apt install -yq \
  libcudnn9-cuda-12 \
  libcudnn9-dev-cuda-12 \
  libcudnn9-samples
# Monitoring tools
sudo snap install nvtop
sudo apt install -y pciutils usbutils
My Elixir test script:
#!/usr/bin/env elixir
IO.puts "---"
Mix.install(
  [
    {:nx, "~> 0.6.1"},
    {:exla, "~> 0.6.1"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda120"
  ]
)
IO.puts("AAA")
# Test with a very simple operation
a = Nx.tensor([1, 2, 3])
IO.puts "CCC"
b = Nx.tensor([4, 5, 6])
IO.puts "DDD"
Nx.add(a, b)
IO.puts "EEE"
The script output:
---
2024-11-09 12:43:02.758296: I external/tsl/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
AAA
12:43:02.851 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
12:43:02.854 [info] XLA service 0x7fa7f8352ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
12:43:02.854 [info] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
12:43:02.855 [info] Using BFC allocator.
12:43:02.855 [info] XLA backend allocating 11359951257 bytes on device 0 for BFCAllocator.
CCC
DDD
12:43:03.025 [error] There was an error before creating cudnn handle (302): cudaGetErrorName symbol not found. : cudaGetErrorString symbol not found.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
(exla 0.6.4) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
(exla 0.6.4) lib/exla/computation.ex:61: EXLA.Computation.compile/4
(stdlib 6.1.1) timer.erl:590: :timer.tc/2
(exla 0.6.4) lib/exla/defn.ex:430: anonymous fn/11 in EXLA.Defn.compile/8
(exla 0.6.4) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
(stdlib 6.1.1) timer.erl:590: :timer.tc/2
(exla 0.6.4) lib/exla/defn.ex:406: EXLA.Defn.compile/8
(exla 0.6.4) lib/exla/defn.ex:270: EXLA.Defn.__compile__/4
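
A diagnostic sketch that might help narrow this down. It rests on an unconfirmed assumption: that the "symbol not found" errors come from the CUDA runtime or cuDNN libraries visible to the dynamic linker not matching what the `cuda120` build expects. The function names `cuda_major_from_target` and `visible_cuda_libs` are made up for illustration and are not part of Nx, EXLA, or any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CUDA major version implied by an XLA_TARGET
# value such as "cuda120" or "cuda118". Prints nothing for non-CUDA targets.
cuda_major_from_target() {
  local digits="${1#cuda}"
  [[ "$1" == cuda* && "$digits" =~ ^[0-9]{2} ]] || return 0
  echo "${digits:0:2}"
}

# Hypothetical helper: list the CUDA runtime, cuBLAS, and cuDNN shared
# libraries that the dynamic linker currently knows about.
visible_cuda_libs() {
  ldconfig -p 2>/dev/null | grep -E 'libcudart|libcublas|libcudnn' || true
}

# Falls back to the target used in the test script if XLA_TARGET is unset.
target="${XLA_TARGET:-cuda120}"
echo "XLA_TARGET=$target implies CUDA major version: $(cuda_major_from_target "$target")"
echo "CUDA libraries visible to the linker:"
visible_cuda_libs
```

If the `.so` version suffixes in the listing do not line up with the major version printed on the first line, or if `libcudart` does not appear at all, that would be one place to look. It does not by itself prove that this is the cause.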