CUDA/cuDNN installation: had to reinstall Ubuntu 22.04, now I can’t get Nx working (cudaGetErrorName)

AndyL · November 9, 2024, 8:45pm

For the last year, I’ve been running Nx/EXLA with a 3060 eGPU very nicely. This week I had to reinstall my OS (Ubuntu 22.04), and now I can’t get Nx working!!!

Other GPU programs (ollama, nvtop) are working fine. But not Nx/EXLA.

Not sure what I’m doing wrong here & appreciate any help or clues!!!

nvidia-smi output:

Sat Nov  9 12:40:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   39C    P8             14W /  170W |       4MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My driver install script:

#!/usr/bin/env bash

vsn="550"  # 550, 565

# Add Nvidia package repository

# sudo apt update
# sudo apt install -y software-properties-common
# sudo add-apt-repository ppa:graphics-drivers/ppa -y
# sudo apt update

# Install Packages

# Install the driver
sudo apt install -yq nvidia-driver-$vsn

# Install CUDA and related packages
sudo apt install -yq \
    nvidia-compute-utils-$vsn \
    nvidia-utils-$vsn \
    nvidia-cuda-toolkit \
    nvidia-cuda-dev \
    nvidia-gds

# Install cuDNN (verify version compatibility)
sudo apt install -yq \
    libcudnn9-cuda-12 \
    libcudnn9-dev-cuda-12 \
    libcudnn9-samples

# Monitoring tools
sudo snap install nvtop
sudo apt install -y pciutils usbutils

My Elixir test script:

#!/usr/bin/env elixir

IO.puts "---"

Mix.install(
  [
    {:nx, "~> 0.6.1"},
    {:exla, "~> 0.6.1"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda120"
  ]
)

IO.puts("AAA")

# Test with a very simple operation
a = Nx.tensor([1, 2, 3])
IO.puts "CCC"
b = Nx.tensor([4, 5, 6])
IO.puts "DDD"
Nx.add(a, b)

IO.puts "EEE"

The script output:

---
2024-11-09 12:43:02.758296: I external/tsl/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
AAA

12:43:02.851 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

12:43:02.854 [info] XLA service 0x7fa7f8352ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

12:43:02.854 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

12:43:02.855 [info] Using BFC allocator.

12:43:02.855 [info] XLA backend allocating 11359951257 bytes on device 0 for BFCAllocator.
CCC
DDD

12:43:03.025 [error] There was an error before creating cudnn handle (302): cudaGetErrorName symbol not found. : cudaGetErrorString symbol not found.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.6.4) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
    (exla 0.6.4) lib/exla/computation.ex:61: EXLA.Computation.compile/4
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.6.4) lib/exla/defn.ex:430: anonymous fn/11 in EXLA.Defn.compile/8
    (exla 0.6.4) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.6.4) lib/exla/defn.ex:406: EXLA.Defn.compile/8
    (exla 0.6.4) lib/exla/defn.ex:270: EXLA.Defn.__compile__/4

joelpaulkoch · November 9, 2024, 9:30pm

I don’t know if that solves the problem, but here it says XLA_TARGET should be cuda or cuda12; in your script it’s cuda120.

AndyL · November 9, 2024, 9:46pm

Thanks for having a look!! Yes, the latest version of Nx wants cuda12. Here’s the revised script:

#!/usr/bin/env elixir

IO.puts "---"

Mix.install(
  [
    # {:nx, "~> 0.6.1"},
    # {:exla, "~> 0.6.1"}
    {:nx, "~> 0.9"},
    {:exla, "~> 0.9"}
  ],
  config: [
    nx: [
      default_backend: EXLA.Backend,
      default_defn_options: [compiler: EXLA]
    ],
    exla: [
      default_client: :cuda,
      clients: [
        host: [platform: :host],
        cuda: [platform: :cuda]
      ]
    ]
  ],
  system_env: [
    XLA_TARGET: "cuda12"
  ]
)

IO.puts("AAA")

# Test with a very simple operation
a = Nx.tensor([1, 2, 3])
IO.puts "BBB"
b = Nx.tensor([4, 5, 6])
IO.puts "CCC"
Nx.add(a, b)

IO.puts "DDD"

Here’s the error I get with that configuration…

---
2024-11-09 13:42:25.473674: I xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
AAA
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731188545.569804    9379 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1731188545.570011    9351 service.cc:146] XLA service 0x7f81404da590 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1731188545.570029    9351 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
I0000 00:00:1731188545.570245    9351 se_gpu_pjrt_client.cc:889] Using BFC allocator.
I0000 00:00:1731188545.570270    9351 gpu_helpers.cc:114] XLA backend allocating 11359951257 bytes on device 0 for BFCAllocator.
I0000 00:00:1731188545.570287    9351 gpu_helpers.cc:154] XLA backend will use up to 1262216806 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1731188545.570378    9351 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
BBB
CCC

13:42:25.746 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.

13:42:25.748 [error] There was an error before creating cudnn handle (302): Error loading CUDA libraries. GPU will not be used. : Error loading CUDA libraries. GPU will not be used.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.9.1) lib/exla/mlir/module.ex:147: EXLA.MLIR.Module.unwrap!/1
    (exla 0.9.1) lib/exla/mlir/module.ex:124: EXLA.MLIR.Module.compile/5
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
    (exla 0.9.1) lib/exla/defn.ex:432: anonymous fn/14 in EXLA.Defn.compile/8
    (exla 0.9.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    (exla 0.9.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 6.1.1) timer.erl:590: :timer.tc/2
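
One way to narrow down an "Error loading CUDA libraries" message is to ask the dynamic linker cache whether the runtime libraries are visible at all. This is only a rough sketch: the helper name has_lib is made up, and the sonames libcudart.so.12 and libcudnn.so.9 are assumptions for a CUDA 12 / cuDNN 9 setup, so adjust them to whatever your XLA build expects.

```shell
#!/usr/bin/env bash
# Sketch only: has_lib is a hypothetical helper, not part of any NVIDIA or Nx tooling.

# Reads `ldconfig -p` output on stdin; succeeds if the given soname is listed.
has_lib() {
  awk -v lib="$1" '$1 == lib { found = 1 } END { exit !found }'
}

# Assumed sonames for CUDA 12 and cuDNN 9; change them to match your install.
for lib in libcudart.so.12 libcudnn.so.9; do
  if command -v ldconfig >/dev/null 2>&1 && ldconfig -p | has_lib "$lib"; then
    echo "found   $lib"
  else
    echo "missing $lib"
  fi
done
```

A "missing" line means the library is not in the linker cache. It can still be picked up through LD_LIBRARY_PATH, so treat the result as a hint rather than a verdict.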

joelpaulkoch · November 9, 2024, 10:19pm

Oh right, the previous script wasn’t on the latest Nx version.

Did you try running nvcc --version, as suggested in the xla README, to check which version is actually installed?

Sorry, I don’t think I can really help you. Did you reinstall Ubuntu 22.04? So it worked for you before with Ubuntu 22.04, or did you upgrade from an earlier version of Ubuntu?
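
For anyone who wants to script that check, here is a minimal sketch. Keep in mind that the "CUDA Version" shown by nvidia-smi is the highest version the driver supports, not the installed toolkit, so it can say 12.x while nvcc is older. The helper names (nvcc_release, driver_cuda, report) are made up for illustration, and the sed patterns assume the usual "release X.Y" and "CUDA Version: X.Y" wording in the tool output.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any NVIDIA or Nx tooling.

# Pull the "X.Y" release out of `nvcc --version` output read from stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Pull the "X.Y" after "CUDA Version:" out of the nvidia-smi banner on stdin.
driver_cuda() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# $1 = toolkit version from nvcc (may be empty), $2 = wanted CUDA major version.
report() {
  if [ -z "$1" ]; then
    echo "nvcc not found"
  elif [ "${1%%.*}" != "$2" ]; then
    echo "mismatch: nvcc is $1, wanted CUDA $2"
  else
    echo "ok: nvcc is $1"
  fi
}

toolkit=""
if command -v nvcc >/dev/null 2>&1; then
  toolkit="$(nvcc --version | nvcc_release)"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "driver supports up to CUDA $(nvidia-smi | driver_cuda)"
fi
report "$toolkit" 12   # 12 corresponds to XLA_TARGET=cuda12
```

A "mismatch" line would suggest the toolkit on the PATH does not line up with the XLA_TARGET the script asks for.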

AndyL · November 9, 2024, 11:20pm

Same OS before and after the reinstall: Ubuntu 22.04. Thanks very much for reading and giving feedback!!!

AndyL · November 16, 2024, 11:57pm

The problem was that the Nvidia installer had installed an incompatible version of nvcc (version 11 instead of version 12). Now everything is sorted out. Wading into the Nvidia dependencies, the multitude of repositories and libraries, and the terrible documentation and support sites has got to be one of the most awful tasks in computing.

> aptitude search nvidia | wc -l
16684

That’s just for the drivers, not including the userspace modules (CUDA, cuDNN).
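
In case it helps someone hitting the same thing, this is roughly how one could track down where a stray nvcc came from. It is a sketch rather than the exact steps used here: filter_cuda_pkgs is a made-up helper, and it assumes a dpkg-based system such as Ubuntu. A likely suspect is the distribution's own nvidia-cuda-toolkit package from the install script, which on Ubuntu 22.04 is, as far as I know, a CUDA 11.x build.

```shell
#!/usr/bin/env bash
# Sketch only: filter_cuda_pkgs is a hypothetical helper name.

# Keep "name version" lines whose package name mentions cuda, cudnn or nvcc.
filter_cuda_pkgs() {
  grep -E '^[^ ]*(cuda|cudnn|nvcc)' || true
}

# List installed CUDA-related packages with their versions.
if command -v dpkg-query >/dev/null 2>&1; then
  dpkg-query -W -f='${Package} ${Version}\n' | filter_cuda_pkgs
fi

# Show which package (if any) owns the nvcc binary found on the PATH.
if command -v nvcc >/dev/null 2>&1; then
  nvcc_path="$(readlink -f "$(command -v nvcc)")"
  dpkg -S "$nvcc_path" 2>/dev/null || echo "$nvcc_path is not owned by a dpkg package"
fi
```

If the owning package turns out to be an 11.x toolkit, removing it and installing a CUDA 12 toolkit that matches the cuDNN packages is the direction to look in.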