Error when loading model with cuda EXLA client using Bumblebee

Hi everyone!

I managed to get Bumblebee up and running on WSL2 using my CPU, and decided to try using my GPU instead. I got everything installed according to NVIDIA’s tutorials, but when I try to load the model used in Bumblebee’s example I get the following error:

iex(1)> {:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"})

22:25:13.622 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:0a:00.0/numa_node
Your kernel may have been built without NUMA support.

22:25:13.622 [info] XLA service 0x7fb97456c520 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

22:25:13.622 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3080, Compute Capability 8.6

22:25:13.622 [info] Using BFC allocator.

22:25:13.622 [info] XLA backend will use up to 8589515161 bytes on device 0 for BFCAllocator.

22:25:13.950 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

22:25:13.951 [error] Memory usage: 9446621184 bytes free, 10736893952 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.6.1) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
    (exla 0.6.1) lib/exla/computation.ex:61: EXLA.Computation.compile/4
    (stdlib 5.0.2) timer.erl:270: :timer.tc/2
    (exla 0.6.1) lib/exla/defn.ex:430: anonymous fn/11 in EXLA.Defn.compile/8
    (exla 0.6.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 5.0.2) timer.erl:270: :timer.tc/2
    (exla 0.6.1) lib/exla/defn.ex:406: EXLA.Defn.compile/8
    iex:1: (file)

I did some research suggesting it might be an OOM error, and I tried playing with the preallocate and memory_fraction options for the CUDA EXLA client, but alas nothing worked.
I also found some TensorFlow issues mentioning an allow_growth option, but I don’t think that’s relevant here.
Has anyone run into something similar?
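For context, here is roughly the client config I was experimenting with (just a sketch; the exact values are illustrative):

# config/config.exs
# preallocate/memory_fraction control how the CUDA client claims GPU memory
config :exla, :clients,
  cuda: [platform: :cuda, preallocate: true, memory_fraction: 0.7]

# route Nx computations through EXLA
config :nx, default_backend: EXLA.Backend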

Sorry for the late reply.

[error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Do you have cuDNN installed? What CUDA and cuDNN versions do you use?

Sorry, I forgot to reply.

I have cuDNN 8.9.5 and CUDA 12.2 installed on WSL2, following NVIDIA’s guides.

Just to close the loop on this: I misunderstood the NVIDIA instructions on installing cuDNN, so that was the problem :sweat: Thanks everyone!

NVIDIA’s deb package only creates a local apt repo to install cuDNN from, so you still need to install the packages themselves: apt-get install libcudnn8, apt-get install libcudnn8-dev, and apt-get install libcudnn8-samples.
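In other words, after installing the deb (the repo package name varies by version and distro), something like this finishes the job:

# The deb only registers a local apt repository; the libraries still need installing
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples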

Ohh, that makes sense. I couldn’t really work out what else it could be :)

FTR, if someone runs into this, here are a couple of checks for Ubuntu/Debian:

# Verify CUDA version
nvcc --version
# Verify cuDNN is installed and that the package version matches your CUDA version
apt-cache policy libcudnn8 | head -n 3
# Check drivers and CUDA support
nvidia-smi
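You can also confirm that the dynamic linker can actually find the cuDNN shared library, which is what XLA loads at runtime:

# Empty output here means cuDNN isn't visible to the dynamic linker
ldconfig -p | grep libcudnn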