Creating whisper model with Smart Cell fails

Livebook reports the following:

06:26:28.007 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

06:26:28.020 [error] Memory usage: 327155712 bytes free, 3901685760 bytes total.

** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.9.1) lib/exla/mlir/module.ex:147: EXLA.MLIR.Module.unwrap!/1
    (exla 0.9.1) lib/exla/mlir/module.ex:124: EXLA.MLIR.Module.compile/5
    (stdlib 6.1.2) timer.erl:590: :timer.tc/2
    (exla 0.9.1) lib/exla/defn.ex:432: anonymous fn/14 in EXLA.Defn.compile/8
    (exla 0.9.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.1.0) lib/nimble_pool.ex:462: NimblePool.checkout!/4
    (exla 0.9.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    #cell:grt4sk6uht7mjljj:1: (file)

The terminal running Livebook reports the following:

06:24:02.969 [info] Downloading a precompiled XLA archive for target x86_64-linux-gnu-cuda12

06:24:45.488 [info] Successfully downloaded the XLA archive

06:25:25.831 [debug] Downloading NIF from https://github.com/elixir-nx/tokenizers/releases/download/v0.5.1/libex_tokenizers-v0.5.1-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz

06:25:26.877 [debug] NIF cached at /home/geegee/.cache/rustler_precompiled/precompiled_nifs/libex_tokenizers-v0.5.1-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/geegee/.cache/mix/installs/elixir-1.17.3-erts-15.1.2/f93a18a2a17ed8ae75e079a1309969e9/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.5.1-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731734769.813825 1168221 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1731734769.815001 1165283 service.cc:146] XLA service 0x7c1eb4062aa0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1731734769.815020 1165283 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce GTX 1650 Ti, Compute Capability 7.5
I0000 00:00:1731734769.815287 1165283 se_gpu_pjrt_client.cc:889] Using BFC allocator.
I0000 00:00:1731734769.815320 1165283 gpu_helpers.cc:114] XLA backend allocating 3511517184 bytes on device 0 for BFCAllocator.
I0000 00:00:1731734769.815342 1165283 gpu_helpers.cc:154] XLA backend will use up to 390168575 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1731734769.815417 1165283 cuda_executor.cc:1040] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

06:26:09.840 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

06:26:09.840 [error] Memory usage: 327155712 bytes free, 3901685760 bytes total.

06:26:09.840 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

06:26:09.840 [error] Memory usage: 327155712 bytes free, 3901685760 bytes total.

06:26:28.007 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

06:26:28.020 [error] Memory usage: 327155712 bytes free, 3901685760 bytes total.

06:26:28.020 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

06:26:28.020 [error] Memory usage: 327155712 bytes free, 3901685760 bytes total.

I am running Arch Linux. Running lspci -k -d ::03xx gives the following
information about the NVIDIA card:

01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Ti Mobile] (rev a1)
	Subsystem: Dell Device 097d
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

Maybe it is just not possible with this type of graphics card?
Let me know if I should add any more information; I am rather new to this.

@Geegee which CUDA and cuDNN versions do you have, and which :exla version are you using?

The notebook dependencies are:

{:exla, ">= 0.0.0"}
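For context, a dependency line like the one above resolves to the latest published :exla. A fuller Livebook setup that requests the CUDA build explicitly might look like the sketch below (the surrounding Mix.install options are illustrative, not taken from the original notebook):

```elixir
Mix.install(
  [
    {:exla, ">= 0.0.0"}
  ],
  system_env: [
    # Ask for the precompiled XLA archive built against CUDA 12,
    # matching the x86_64-linux-gnu-cuda12 target seen in the log.
    {"XLA_TARGET", "cuda12"}
  ]
)

# Make EXLA the default Nx backend so tensors live on the GPU.
Nx.default_backend(EXLA.Backend)
```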

Just from reading the terminal in which I start the notebook, I cannot figure out
exactly which :exla version it downloads, but it says the target is x86_64-linux-gnu-cuda12.
I infer from that that the CUDA version is 12. What other ways can I use to get more insight into what is happening?

You can run Application.spec(:exla)[:vsn] to check.

Regarding CUDA and cuDNN, I meant the versions installed in your OS. For CUDA you can run nvcc --version; it should say something like release x.y. For cuDNN you may need to consult your system package manager to see which version is installed.
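On an Arch Linux system like the poster's, the checks above can be run in one go from a shell. This is only a sketch: cudnn is the Arch package name, and the commands are guarded so they degrade gracefully when a tool is not installed. Adjust the package query for other distributions.

```shell
# CUDA toolkit version (prints a "release X.Y" line if nvcc is installed)
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version | grep release
else
  echo "nvcc not found - CUDA toolkit is not on PATH"
fi

# cuDNN: on Arch, query pacman for the cudnn package
if command -v pacman >/dev/null 2>&1; then
  pacman -Q cudnn 2>/dev/null || echo "cudnn package not installed"
fi

# Driver-level CUDA support reported by the NVIDIA driver, if present
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
fi
```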


Okay, perfect. It turns out I didn't have cuDNN installed. Now it works wonderfully!