Trying to get EXLA working on a Jetson Orin AGX box with CUDA 11.8

I’ve been trying for the past week to get a running version of Livebook on a Jetson Orin AGX box (integrated GPU). The box only has CUDA versions 11.8 and 11.4 installed.

Whenever I try to run the EXLA tests just to check that basic CUDA works, I get this error:

dave@CHCHE-ORIN-01:~/work/nx/exla$ mix test
Using libexla.so from /home/dave/.cache/xla/exla/elixir-1.17.2-erts-14.2.5-xla-0.5.1-exla-0.6.4-6c7e3kyqmrq4l2ogbwoouzxmw4/libexla.so
make: '/home/dave/work/nx/exla/_build/test/lib/exla/priv/libexla.so' is up to date.

08:37:06.137 [info] domain=elixir.xla file=xla/stream_executor/cuda/cuda_gpu_executor.cc line=880  could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.

08:37:06.145 [info] domain=elixir.xla file=xla/service/service.cc line=168  XLA service 0xffff54002c80 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

08:37:06.146 [info] domain=elixir.xla file=xla/service/service.cc line=176    StreamExecutor device (0): Orin, Compute Capability 8.7

08:37:06.146 [info] domain=elixir.xla file=xla/pjrt/gpu/se_gpu_pjrt_client.cc line=633  Using BFC allocator.

08:37:06.146 [info] domain=elixir.xla file=xla/pjrt/gpu/gpu_helpers.cc line=105  XLA backend allocating 25646193049 bytes on device 0 for BFCAllocator.
Running ExUnit with seed: 164975, max_cases: 16
Excluding tags: [:platform, :integration, :multi_device, :conditional_inside_map_reduce]
Including tags: [platform: :cuda]
  2) test range randint (EXLA.NxRandomTest)
     test/exla/random_test.exs:10
     ** (RuntimeError) Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: Failed to get stream's capture status: the provided PTX was compiled with an unsupported toolchain.; current tracing scope: fusion; current profiling annotation: XlaModule:#hlo_module=_Function_20.55097802_1_in_Nx.Random.___defn_key____.12,program_id=5#.
     code: key = Nx.Random.key(127)
     stacktrace:
       (exla 0.6.4) lib/exla/executable.ex:56: EXLA.Executable.unwrap!/1
       (exla 0.6.4) lib/exla/executable.ex:19: EXLA.Executable.run/3
       (exla 0.6.4) lib/exla/defn.ex:346: EXLA.Defn.maybe_outfeed/7
       (stdlib 5.2.3) timer.erl:270: :timer.tc/2
       (exla 0.6.4) lib/exla/defn.ex:283: anonymous fn/7 in EXLA.Defn.__compile__/4
       (nx 0.6.4) lib/nx/defn.ex:443: Nx.Defn.do_jit_apply/3
       test/exla/random_test.exs:11: (test)


08:37:14.783 [warning] domain=elixir.xla file=xla/service/gpu/runtime/support.cc line=58  Intercepted XLA runtime error:
INTERNAL: Failed to get stream's capture status: the provided PTX was compiled with an unsupported toolchain.

08:37:14.783 [error] domain=elixir.xla file=xla/pjrt/pjrt_stream_executor_client.cc line=2614  Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: Failed to get stream's capture status: the provided PTX was compiled with an unsupported toolchain.; current tracing scope: add; current profiling annotation: XlaModule:#hlo_module=test.5,program_id=0#.

I set XLA_BUILD=true so that XLA is actually built from source first.
I managed to get XLA to build (with Bazel), but that doesn’t seem to fix the “PTX” mismatch issue.
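My understanding of that error is that the PTX embedded in the compiled module was produced by a newer CUDA toolkit than the driver on the device can JIT-compile. A quick sanity check for the mismatch (nvidia-smi is limited on Jetson, but where available it reports the highest CUDA version the driver supports):

nvcc --version   # toolkit that produced the PTX at build time
nvidia-smi       # driver version and the max CUDA version it can handle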

I’m hoping it’s just some config or build env var that I’m setting wrong that’s causing this.

For context, here are the relevant env vars I have set:

export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda-11.8
export XLA_TARGET=cuda
export EXLA_TARGET=cuda
export XLA_BUILD=true
export TMP=/var/tmp
export TF_CUDA_VERSION='11.8'
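
One sanity check, assuming ldd resolves libraries the same way the runtime loader does, is to confirm the cached extension actually links against the 11.8 libraries from that LD_LIBRARY_PATH rather than the 11.4 install:

ldd ~/.cache/xla/exla/elixir-*/libexla.so | grep -iE 'cuda|cudnn'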

Please try nx/exla main; it uses a much more recent version of XLA : )
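Assuming you’re building from a local checkout (your ~/work/nx/exla path suggests so), updating is roughly:

cd ~/work/nx && git checkout main && git pull
cd exla && mix deps.get && mix test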

Oh man, thanks! I totally missed that detail; I arrived at that repo via a link somewhere and didn’t think to check whether it was the latest.

BTW, the Docker build method uses CUDA 10.2 in the container. Is that expected, even though the host is running CUDA 11.8?

Building (via mix test) on the main branch now gives this error.

  201 | constexpr FormatConversionCharSet ExtractCharSet(ArgConvertResult<C>) {
      | ^~~~~~~~~~~~~~
external/com_google_absl/absl/strings/internal/str_format/arg.h:201:1: note:   template argument deduction/substitution failed:
external/com_google_absl/absl/strings/internal/str_format/arg.h:403:43: note:   couldn’t deduce template parameter ‘C’
  403 |   return absl::str_format_internal::ExtractCharSet(ConvResult{});
      |        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
./xla/stream_executor/gpu/gpu_executor.h: In member function ‘virtual absl::lts_20230802::StatusOr<std::unique_ptr<stream_executor::MemoryAllocation> > stream_executor::gpu::GpuExecutor::HostMemoryAllocate(uint64_t)’:
./xla/stream_executor/gpu/gpu_executor.h:190:93: error: no matching function for call to ‘StrFormat(const char [41], uint64_t&)’
  190 |       return absl::InternalError(
      |                                                                                             ^
external/com_google_absl/absl/strings/str_format.h:354:1: note: candidate: ‘template<class ... Args> std::string absl::lts_20230802::StrFormat(absl::lts_20230802::FormatSpec<Args ...>&, const Args& ...)’
  354 | ABSL_MUST_USE_RESULT std::string StrFormat(const FormatSpec<Args...>& format,
      | ^~~~~~~~~
external/com_google_absl/absl/strings/str_format.h:354:1: note:   substitution of deduced template arguments resulted in errors seen above
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 1284.415s, Critical Path: 108.17s
INFO: 6623 processes: 3577 internal, 3046 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /home/dave/.cache/xla/0.7.1/cache/build/xla_extension-aarch64-linux-gnu-cuda.tar.gz] Error 1

How can I pass the --verbose_failures flag the XLA build suggests, so I can get more info? The build is triggered via mix test/compile, so I’m not sure how to pass args to a dependency’s build.
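One workaround I’m considering, assuming Bazel’s usual rc-file lookup applies to this vendored build: Bazel also reads ~/.bazelrc, so the flag can be injected without touching the dependency’s build scripts.

echo 'build --verbose_failures' >> ~/.bazelrc
mix deps.compile xla --force   # or mix test, which retriggers the build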

BTW, the Docker build method uses CUDA 10.2

Oh, the images in the nx/exla repo haven’t been updated; we actually no longer support CUDA 10.2.

even though the host is running CUDA 11.8

FTR, in case you run the computation in a container, I believe the host CUDA version doesn’t matter; the libraries inside the container (i.e. CUDA/cuDNN) are the ones used.
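For example (a sketch using the stock x86 images; on Jetson you’d typically use an l4t-based image and the NVIDIA container runtime instead), only the driver comes from the host, while the toolkit the code sees is the one baked into the image:

docker run --rm --gpus all nvidia/cuda:11.8.0-devel-ubuntu22.04 nvcc --version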

Building (via mix test) on main branch now gives this error.

Actually, I’ve just realised that on nx/exla main we already bumped the requirement to CUDA 12 (to be more precise, the relevant package is actually elixir-nx/xla). Is there a reason you can’t use CUDA 12? If so, you could try nx/exla 0.7.2 (the latest released version).
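If you’re unsure what your JetPack provides, the toolkits are installed side by side under /usr/local, so listing them shows whether a CUDA 12 toolchain is available (if I remember correctly, JetPack 5 ships CUDA 11.x and JetPack 6 moves to CUDA 12.x):

ls /usr/local | grep cuda
# cuda-11.4 / cuda-11.8 only -> stay on exla 0.7.2
# cuda-12.x present          -> main should work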