Compiling XLA with CUDA 11.2 support needed


Hello, I have been trying to compile EXLA to use it with Nx and Axon. Unfortunately, I’ve been having trouble getting it to compile against CUDA and the cuDNN SDK.

I installed the NVIDIA drivers, CUDA, and cuDNN following this tutorial:
TUTORIAL
and I’ve already verified that the installation works with their reference Python code.

Now I’m trying to compile XLA against my machine’s current CUDA version (11.2), but despite my best efforts I’m still running into issues. Currently, I’m getting the following error:

tavano@tavano-os:~/git/xla$ iex -S mix
Erlang/OTP 24 [erts-12.0.4] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

mkdir -p /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82 && \
        cd /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82 && \
        git init && \
        git remote add origin https://github.com/tensorflow/tensorflow.git && \
        git fetch --depth 1 origin 54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82 && \
        git checkout FETCH_HEAD
Initialized empty Git repository in /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.git/
From https://github.com/tensorflow/tensorflow
 * branch              54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82 -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 54dee6dd Fix shape arguments passed in local_client.
rm -f /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/tensorflow/compiler/xla/extension && \
        ln -s "/home/tavano/git/xla/extension" /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/tensorflow/compiler/xla/extension && \
        cd /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82 && \
        bazel build --define "framework_shared_object=false" -c opt   --config=cuda //tensorflow/compiler/xla/extension:xla_extension && \
        mkdir -p /home/tavano/git/xla/cache/build/ && \
        cp -f /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/bazel-bin/tensorflow/compiler/xla/extension/xla_extension.tar.gz /home/tavano/git/xla/cache/build/xla_extension-x86_64-linux-cuda111.tar.gz
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc:
  'build' options: --define framework_shared_object=true --java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --host_java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --noincompatible_prohibit_aapt1 --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true
INFO: Found applicable config definition build:short_logs in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:cuda in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:linux in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels
INFO: Found applicable config definition build:dynamic_kernels in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
DEBUG: /home/tavano/.cache/bazel/_bazel_tavano/80b3fb99ea1bab987a9581bca23d819b/external/tf_runtime/third_party/cuda/dependencies.bzl:51:10: The following command will download NVIDIA proprietary software. By using the software you agree to comply with the terms of the license agreement that accompanies the software. If you do not agree to the terms of the license agreement, do not use the software.
INFO: Repository local_config_cuda instantiated at:
  /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/WORKSPACE:15:14: in <toplevel>
  /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/tensorflow/workspace2.bzl:1099:19: in workspace
  /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/tensorflow/workspace2.bzl:90:19: in _tf_toolchains
Repository rule cuda_configure defined at:
  /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl:1443:33: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl", line 1396, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl", line 977, column 35, in _create_local_cuda_repository
                cuda_config = _get_cuda_config(repository_ctx, find_cuda_config_script)
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl", line 666, column 30, in _get_cuda_config
                config = find_cuda_config(repository_ctx, find_cuda_config_script, ["cuda", "cudnn"])
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl", line 643, column 41, in find_cuda_config
                exec_result = _exec_find_cuda_config(repository_ctx, script_path, cuda_libraries)
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/gpus/cuda_configure.bzl", line 637, column 19, in _exec_find_cuda_config
                return execute(repository_ctx, [python_bin, "-c", decompress_and_execute_cmd])
        File "/home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Could not find any cudnn.h, cudnn_version.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/lib'
        '/lib/i386-linux-gnu'
        '/lib/x86_64-linux-gnu'
        '/usr'
        '/usr/lib/x86_64-linux-gnu/libfakeroot'
        '/usr/local/cuda'
        '/usr/local/cuda-11.2/targets/x86_64-linux/lib'
INFO: Found applicable config definition build:cuda in file /home/tavano/.cache/xla_extension/tf-54dee6dd8d47b6e597f4d3f85b6fb43fd5f50f82/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
Could not find any cudnn.h, cudnn_version.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
of:
        '/lib'
        '/lib/i386-linux-gnu'
        '/lib/x86_64-linux-gnu'
        '/usr'
        '/usr/lib/x86_64-linux-gnu/libfakeroot'
        '/usr/local/cuda'
        '/usr/local/cuda-11.2/targets/x86_64-linux/lib'

make: *** [Makefile:28: /home/tavano/git/xla/cache/build/xla_extension-x86_64-linux-cuda111.tar.gz] Error 2
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

It seems it cannot find cudnn.h on my system, which is located at /usr/local/cuda/include/cudnn.h. I’ve already tried creating a soft-link from cudnn.h into /usr/local/cuda, but that made no difference.
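
For reference, my soft-link attempt was roughly this (reconstructed; the paths come from the search list in the error above):

ls /usr/local/cuda/include/cudnn.h   # cudnn.h is already here
sudo ln -s /usr/local/cuda/include/cudnn.h /usr/local/cuda/cudnn.h   # the link I tried

One thing I noticed: the error also searches for cudnn_version.h, which cuDNN 8 ships as a separate header next to cudnn.h, so that file needs to be discoverable too.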

OS: Ubuntu
OS version: 20.04.2 LTS (Focal Fossa)
.tool-versions:

erlang 24.0.6
elixir 1.12.3-otp-24
python 3.8.0
bazel 3.7.2

My gcc version: gcc (Ubuntu 8.4.0-3ubuntu2) 8.4.0

NVIDIA driver info:

NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2

cuDNN version: 8.1.1
Video card: GeForce GTX 1050 Ti

Currently, I also have these XLA/EXLA env vars set:

XLA_BUILD=true
XLA_TARGET=cuda111
EXLA_TARGET=cuda
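
For completeness, I export them in my shell before starting the build, roughly like this (bash syntax; adjust for your shell):

export XLA_BUILD=true
export XLA_TARGET=cuda111
export EXLA_TARGET=cuda
iex -S mix   # kicks off the source build through make/Bazel, as in the log above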

Could someone please help me? I’ve been really struggling to find guides on compiling XLA for Elixir and to debug this issue on my own.

Hi @sallaumen!

The XLA project just compiles XLA through Bazel, so when debugging you can also search for general tips on compiling XLA with Bazel.

Quick question: we do ship precompiled binaries for CUDA 11.1+. Do those work for you? You can try them out by removing the XLA_BUILD env var.
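
Roughly something like this (a sketch; it assumes the standard xla and exla deps in your mix.exs):

unset XLA_BUILD                    # skip building from source
export XLA_TARGET=cuda111          # fetch the precompiled CUDA 11.1+ archive
mix deps.clean xla exla            # drop previously built artifacts
mix deps.get && mix deps.compile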

Hi @sallaumen, can you try setting TF_CUDA_VERSION='11.2' or TF_CUDA_VERSION='11' and see if that helps?
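
For example (bash; the version string is what find_cuda_config will try to match):

export TF_CUDA_VERSION='11.2'
iex -S mix   # then re-run the build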

Hey guys, thank you for your help. It seems to have worked: I’ve already run some algorithms like MNIST and CIFAR10 on my machine, using EXLA as the compiler with Axon’s code examples, and it worked.

Unfortunately, it seems it is not running on my GPU and is still using the CPU only, but that is already an improvement.

Tomorrow I’ll investigate further and check whether it is actually not using the GPU or whether it is working as expected, and I’ll come back with more info about this other issue, OK?

Please don’t close this thread yet; I’d like to share my implementation issues and help improve the EXLA/XLA documentation, if you don’t mind.

Thanks a lot for now, @seanmor5 and @josevalim; I’ve made more progress today than in the past months of trying to run EXLA with Nx.

Note that EXLA_TARGET applies only to EXLA development, so you can safely remove it. Instead, you need to change config/config.exs and set the default platform to cuda:

config :exla, :clients, default: [platform: :cuda]

Hey @josevalim, your config suggestion was exactly what my code needed; setting it made my dev environment work with the cuda platform.

Just to sum up, the two modifications my code needed to work as expected with CUDA were:

@seanmor5’s advice to set TF_CUDA_VERSION='11.2', and @josevalim’s advice to set config :exla, :clients, default: [platform: :cuda] in my config/config.exs.
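
In shell form, the whole fix was roughly (the config line lives in config/config.exs, shown here as a comment):

export TF_CUDA_VERSION='11.2'
# plus, in config/config.exs:
#   config :exla, :clients, default: [platform: :cuda]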

Thank you again for your help. I’ve been studying Nx for some time now, and I’m writing my university final paper on it. In case you’re interested, these are my project’s repos:

I’ll soon fix my project’s README to fully explain how to make EXLA work with CUDA.
Also, I really missed having more documentation on this XLA + CUDA process; it certainly was not easy, but now, with your help and my own research, it looks easy to get working. IMHO it was just a small lack of documentation :frowning_face:
If there’s anything I can do to contribute to Nx or XLA, just contact me; at least for documentation I may have some good ideas based on my own experience and what I missed during my research.

Thanks for everything :rocket:

Obliquely related question: will EXLA compile with a CUDA 10.x version? That’s the latest version the Jetson Nano supports, so I thought I’d give compilation a try. If so, is TF_CUDA_VERSION the environment variable that needs to be set, and is that all I need to do other than setting XLA_BUILD=true?

Here is my CUDA version on the Nano:

tbrowne@nano:~$ cat /usr/local/cuda/version.txt
CUDA Version 10.2.300

You can try to compile with CUDA 10.2, but I believe the version of TF we use dropped support for anything before CUDA 11.x. There’s a chance it still works, just with some unsupported ops or other issues.
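
If you do attempt it, the setup would be roughly this (an untested sketch; the XLA_TARGET value here is an assumption, so check the xla README for the valid targets):

export XLA_BUILD=true
export TF_CUDA_VERSION='10.2'
export XLA_TARGET=cuda   # assumed generic CUDA target; cuda111 maps to 11.1+
mix deps.get && mix deps.compile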