Error when loading model with cuda EXLA client using Bumblebee

Hi everyone!

I managed to get Bumblebee up and running on WSL2 using my CPU, and decided to try using my GPU instead. I got everything installed according to NVIDIA’s tutorials, but when I try to load the model used in Bumblebee’s example I get the following error:

iex(1)> {:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"})

22:25:13.622 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:0a:00.0/numa_node
Your kernel may have been built without NUMA support.

22:25:13.622 [info] XLA service 0x7fb97456c520 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

22:25:13.622 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3080, Compute Capability 8.6

22:25:13.622 [info] Using BFC allocator.

22:25:13.622 [info] XLA backend will use up to 8589515161 bytes on device 0 for BFCAllocator.

22:25:13.950 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

22:25:13.951 [error] Memory usage: 9446621184 bytes free, 10736893952 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.6.1) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
    (exla 0.6.1) lib/exla/computation.ex:61: EXLA.Computation.compile/4
    (stdlib 5.0.2) timer.erl:270: :timer.tc/2
    (exla 0.6.1) lib/exla/defn.ex:430: anonymous fn/11 in EXLA.Defn.compile/8
    (exla 0.6.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 5.0.2) timer.erl:270: :timer.tc/2
    (exla 0.6.1) lib/exla/defn.ex:406: EXLA.Defn.compile/8
    iex:1: (file)

I did some research suggesting it might be an OOM error, and I tried playing with the preallocate and memory_fraction options for the CUDA EXLA client, but alas nothing worked.
I also found some TensorFlow issues mentioning an allow_growth option, but I don’t think that’s relevant here.
Has anyone run into something similar?
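For context, here is roughly the client config I was experimenting with (just a sketch; the exact values are illustrative):

# config/config.exs
# preallocate/memory_fraction control how the CUDA client claims GPU memory
config :exla, :clients,
  cuda: [platform: :cuda, preallocate: true, memory_fraction: 0.7]

# route Nx computations through EXLA
config :nx, default_backend: EXLA.Backend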

Sorry for the late reply.

[error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Do you have cuDNN installed? What CUDA and cuDNN versions do you use?

Sorry, I forgot to reply.

I have cuDNN 8.9.5 and CUDA 12.2 installed on WSL2, following NVIDIA’s guides.

Just to close the loop on this: I misunderstood the NVIDIA instructions on installing cuDNN, so that was the problem :sweat: Thanks everyone!

NVIDIA’s deb package only creates a local apt repo to install cuDNN from, so you still need to install the packages themselves: apt-get install libcudnn8, apt-get install libcudnn8-dev, and apt-get install libcudnn8-samples.
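In other words, after installing the deb (the repo package name varies by version and distro), something like this finishes the job:

# The deb only registers a local apt repository; the libraries still need installing
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples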

Ohh, that makes sense. I couldn’t really work out what else it could be :)

FTR, if someone runs into this, here are a couple of checks for Ubuntu/Debian:

# Verify CUDA version
nvcc --version
# Verify cuDNN is installed and that the package version matches your CUDA version
apt-cache policy libcudnn8 | head -n 3
# Check drivers and CUDA support
nvidia-smi
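You can also confirm that the dynamic linker can actually find the cuDNN shared library, which is what XLA loads at runtime:

# Empty output here means cuDNN isn't visible to the dynamic linker
ldconfig -p | grep libcudnn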