CUDA/cuDNN installation: had to reinstall Ubuntu 22.04, now I can’t get Nx working (cudaGetErrorName)

For the last year, I’ve been running Nx/EXLA with a 3060 eGPU very nicely. This week I had to reinstall my OS (Ubuntu 22.04), and now I can’t get Nx working!!!

Other GPU programs (ollama, nvtop) are working fine. But not Nx/EXLA.

I’m not sure what I’m doing wrong here, and I’d appreciate any help or clues!

nvidia-smi output:

Sat Nov  9 12:40:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   39C    P8             14W /  170W |       4MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My driver install script:

#!/usr/bin/env bash

vsn="550"  # 550, 565

# Add Nvidia package repository

# sudo apt update
# sudo apt install -y software-properties-common
# sudo add-apt-repository ppa:graphics-drivers/ppa -y
# sudo apt update

# Install Packages

# Install the driver
sudo apt install -yq nvidia-driver-$vsn

# Install CUDA and related packages
sudo apt install -yq \
    nvidia-compute-utils-$vsn \
    nvidia-utils-$vsn \
    nvidia-cuda-toolkit \
    nvidia-cuda-dev \
    nvidia-gds

# Install cuDNN (verify version compatibility)
# -yq is passed once; repeating it after each package name is redundant
sudo apt install -yq \
    libcudnn9-cuda-12 \
    libcudnn9-dev-cuda-12 \
    libcudnn9-samples

# Monitoring tools
sudo snap install nvtop
sudo apt install -y pciutils usbutils
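
Since the install script pulls the toolkit (nvidia-cuda-toolkit) and cuDNN (libcudnn9-cuda-12) from separate packages, it may be worth confirming which versions actually ended up installed. The sketch below lists installed packages whose names mention CUDA, cuDNN, or the NVIDIA driver. The helper name filter_cuda_packages is hypothetical, and the grep pattern is only a guess at which package names are relevant.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep only CUDA/cuDNN/driver-related lines from
# "package version" pairs read on stdin. `|| true` avoids a non-zero
# exit status when nothing matches.
filter_cuda_packages() {
  grep -Ei '(cuda|cudnn|nvidia-driver)' || true
}

# Only query dpkg when it is available (i.e. on Debian/Ubuntu systems).
if command -v dpkg-query >/dev/null 2>&1; then
  dpkg-query -W -f='${Package} ${Version}\n' | filter_cuda_packages
fi
```

If the toolkit package reports a different CUDA major version than the one the cuDNN packages were built for, that mismatch would be a reasonable first thing to investigate.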

My Elixir test script:

#!/usr/bin/env elixir

IO.puts "---"

Mix.install(
  [
    {:nx, "~> 0.6.1"},
    {:exla, "~> 0.6.1"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda120"
  ]
)

IO.puts("AAA")

# Test with a very simple operation
a = Nx.tensor([1, 2, 3])
IO.puts "CCC"
b = Nx.tensor([4, 5, 6])
IO.puts "DDD"
Nx.add(a, b)

IO.puts "EEE"

The script output:

---
2024-11-09 12:43:02.758296: I external/tsl/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
AAA

12:43:02.851 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

12:43:02.854 [info] XLA service 0x7fa7f8352ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

12:43:02.854 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

12:43:02.855 [info] Using BFC allocator.

12:43:02.855 [info] XLA backend allocating 11359951257 bytes on device 0 for BFCAllocator.
CCC
DDD

12:43:03.025 [error] There was an error before creating cudnn handle (302): cudaGetErrorName symbol not found. : cudaGetErrorString symbol not found.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.6.4) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
    (exla 0.6.4) lib/exla/computation.ex:61: EXLA.Computation.compile/4
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.6.4) lib/exla/defn.ex:430: anonymous fn/11 in EXLA.Defn.compile/8
    (exla 0.6.4) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.6.4) lib/exla/defn.ex:406: EXLA.Defn.compile/8
    (exla 0.6.4) lib/exla/defn.ex:270: EXLA.Defn.__compile__/4

I don’t know if this solves the problem, but here it says XLA_TARGET should be cuda or cuda12; in your script it’s cuda120.

Thanks for having a look!! Yes, the latest version of Nx wants cuda12. Here’s the revised script:

#!/usr/bin/env elixir

IO.puts "---"

Mix.install(
  [
    # {:nx, "~> 0.6.1"},
    # {:exla, "~> 0.6.1"}
    {:nx, "~> 0.9"},
    {:exla, "~> 0.9"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda12"
  ]
)

IO.puts("AAA")

# Test with a very simple operation
a = Nx.tensor([1, 2, 3])
IO.puts "BBB"
b = Nx.tensor([4, 5, 6])
IO.puts "CCC"
Nx.add(a, b)

IO.puts "DDD"

Here’s the error I get with that configuration…

---
2024-11-09 13:42:25.473674: I xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
AAA
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731188545.569804    9379 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1731188545.570011    9351 service.cc:146] XLA service 0x7f81404da590 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1731188545.570029    9351 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
I0000 00:00:1731188545.570245    9351 se_gpu_pjrt_client.cc:889] Using BFC allocator.
I0000 00:00:1731188545.570270    9351 gpu_helpers.cc:114] XLA backend allocating 11359951257 bytes on device 0 for BFCAllocator.
I0000 00:00:1731188545.570287    9351 gpu_helpers.cc:154] XLA backend will use up to 1262216806 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1731188545.570378    9351 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
BBB
CCC

13:42:25.746 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.

13:42:25.748 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.9.1) lib/exla/mlir/module.ex:147: EXLA.MLIR.Module.unwrap!/1
    (exla 0.9.1) lib/exla/mlir/module.ex:124: EXLA.MLIR.Module.compile/5
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.9.1) lib/exla/defn.ex:432: anonymous fn/14 in EXLA.Defn.compile/8
    (exla 0.9.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    (exla 0.9.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
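
The "Error loading CUDA libraries" message may mean the dynamic loader could not open one of the CUDA shared libraries, although that is an assumption and not something the log confirms. A minimal sketch for checking what the loader cache knows about is shown below. The helper name check_libs is hypothetical, and the three library names are guesses at what EXLA needs at this point.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `ldconfig -p` output on stdin and report, for
# each library name given as an argument, whether a matching .so entry exists.
check_libs() {
  local listing lib
  listing="$(cat)"
  for lib in "$@"; do
    if printf '%s\n' "$listing" | grep -q "${lib}\.so"; then
      echo "found:   $lib"
    else
      echo "missing: $lib"
    fi
  done
}

# Query the loader cache only when ldconfig is on PATH.
if command -v ldconfig >/dev/null 2>&1; then
  ldconfig -p | check_libs libcudart libcublas libcudnn
fi
```

A "missing" line only says the library is not in the loader cache. It could still be installed in a directory that is not on the loader's search path.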

Oh right, the previous script wasn’t on the latest Nx version.

Did you try running nvcc --version, as suggested in the xla README, to check which CUDA version is actually installed?
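
Note that the "CUDA Version" shown by nvidia-smi is the highest version the driver supports, not the version of the installed toolkit, so nvcc is the better source here. The sketch below pulls the major version out of the nvcc --version output and maps it to the cuda12 target mentioned earlier in the thread. The function names cuda_major and xla_target_for are hypothetical, and the mapping covers only the one value confirmed in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: extract the CUDA major version from `nvcc --version`
# output on stdin, e.g. from "Cuda compilation tools, release 12.4, V12.4.131".
cuda_major() {
  sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p' | head -n 1
}

# Hypothetical helper: map a CUDA major version to an XLA_TARGET value.
# Only the cuda12 case is taken from this thread; anything else is deferred
# to the xla README.
xla_target_for() {
  case "$1" in
    12) echo "cuda12" ;;
    *)  echo "check the xla README" ;;
  esac
}

if command -v nvcc >/dev/null 2>&1; then
  major="$(nvcc --version | cuda_major)"
  echo "nvcc reports CUDA major version: ${major:-unknown}"
  echo "XLA_TARGET: $(xla_target_for "$major")"
else
  echo "nvcc not found on PATH"
fi
```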

Sorry, I don’t think I can really help you. Did you reinstall Ubuntu 22.04? So it worked for you before with Ubuntu 22.04, or did you upgrade from an earlier version of Ubuntu?

Same OS before and after the reinstall: Ubuntu 22.04. Thanks very much for reading and giving feedback!!!