Bumblebee Stable Diffusion example displayed two black images when running on Kaggle P100 GPU

androidpdn · November 3, 2023, 1:14am

I created a Jupyter Notebook to run Livebook on Kaggle. The Livebook app came up fine. However, when running the Stable Diffusion example from the Bumblebee documentation, I got the following error output. Similar errors were repeated multiple times. After some time, Livebook rendered two images, both of which were completely black.

00:53:43.947 [error] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f32[4,320,64,64]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,4,64,64]{3,2,1,0}, f32[320,4,3,3]{3,2,1,0}, f32[320]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0} for eng15{k5=1,k6=0,k7=1,k10=4} vs eng15{k5=1,k6=0,k7=1,k10=1}

00:53:43.947 [error] Device: Tesla P100-PCIE-16GB
00:53:43.947 [error] Platform: Compute Capability 6.0
00:53:43.947 [error] Driver: 11040 (470.161.3)
00:53:43.947 [error] Runtime:
00:53:43.947 [error] cudnn version: 8.9.0
00:53:44.076 [error] Difference at 0: 1275.08, expected 319.329
00:53:44.076 [error] Difference at 1: 1930.93, expected 483.046
00:53:44.076 [error] Difference at 2: 1919.27, expected 480.283
00:53:44.076 [error] Difference at 3: 1925.31, expected 481.502
00:53:44.076 [error] Difference at 4: 1925.62, expected 481.76
00:53:44.076 [error] Difference at 5: 1905.29, expected 476.723
00:53:44.076 [error] Difference at 6: 1934.25, expected 484.341
00:53:44.076 [error] Difference at 7: 1917.44, expected 479.993
00:53:44.077 [error] Difference at 8: 1917.32, expected 480.025
00:53:44.077 [error] Difference at 9: 1957.9, expected 489.557
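
One pattern can be read directly off the "Difference at" lines above: each reported value is close to four times the expected one. The snippet below is a minimal sketch for checking that. It is not part of Bumblebee or XLA, and the names `LOG`, `PATTERN` and `ratios` are made up for illustration. It only quantifies the mismatch and does not explain its cause.

```python
import re

# The ten "Difference at" lines from the log above
# (timestamp and "[error]" prefix dropped for brevity).
LOG = """\
Difference at 0: 1275.08, expected 319.329
Difference at 1: 1930.93, expected 483.046
Difference at 2: 1919.27, expected 480.283
Difference at 3: 1925.31, expected 481.502
Difference at 4: 1925.62, expected 481.76
Difference at 5: 1905.29, expected 476.723
Difference at 6: 1934.25, expected 484.341
Difference at 7: 1917.44, expected 479.993
Difference at 8: 1917.32, expected 480.025
Difference at 9: 1957.9, expected 489.557
"""

# Hypothetical helper pattern: captures index, reported value, expected value.
PATTERN = re.compile(r"Difference at (\d+): ([\d.]+), expected ([\d.]+)")


def ratios(log):
    """Return (index, reported / expected) for every 'Difference at' line."""
    return [
        (int(index), float(got) / float(expected))
        for index, got, expected in PATTERN.findall(log)
    ]


for index, ratio in ratios(LOG):
    print(f"{index}: {ratio:.3f}")
```

Every ratio comes out within 0.01 of 4, so the two convolution algorithms disagree by a roughly constant factor rather than by random noise.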

Here is the CUDA and cuDNN related information:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
libcudnn8:
Installed: 8.9.0.131-1+cuda11.8
Candidate: 8.9.4.25-1+cuda12.2

Any idea on what went wrong?

robinmonjo · December 4, 2023, 4:51pm

Hey,

I get a similar error when trying to run the UNet model from the Stable Diffusion repository. It works fine on my computer, but when run on Kaggle with XLA_TARGET=cuda118, I get the same errors.
Did you find out what is wrong?

robinmonjo · December 4, 2023, 5:22pm

Seems to work with the T4 GPU

androidpdn · December 5, 2023, 2:34am

It’s great to know that it works on the T4 GPU. No, I haven’t figured out why it doesn’t work on the P100.