Getting errors for unknown :rocm client passed, despite no explicit reference anywhere

Hey everyone. I’m trying to get a simple EXLA app going and running into issues with EXLA recognizing CUDA. The standard nvcc, nvidia-smi, and cuDNN checks all come back fine, and I can run Python apps against the same setup, so it’s wired up correctly somewhere down there.

But when I try to compile, I keep getting this error, and it is really confusing me. I have only a slight understanding of what :rocm even is, and I never chose to supply it anywhere, yet here I am getting it in an error. I was hoping that someone with deeper knowledge might be able to offer an explanation or a suggestion as to what to do next.

I will post some relevant config and can always post more, but I am also just curious why this is even happening at all.

Thanks all

config :nx, default_backend: EXLA.Backend
config :nx, :default_defn_options, [compiler: EXLA, client: :cuda]
config :exla, :clients, cuda: [platform: :cuda], default: [platform: :cuda]
XLA_TARGET=cuda120
EXLA_TARGET=cuda

PYTHON_BIN_PATH=/usr/bin/python3.10
TF_CUDA_VERSION='12.2'
ELIXIR_ERL_OPTIONS="+sssdio 128"

& the error message reads:

** (Mix) Could not start application emer_phx: exited in: EmerPhx.Application.start(:normal, [])
    ** (EXIT) an exception was raised:
        ** (RuntimeError) unknown client :rocm given as :preferred_clients. If you plan to use :cuda or :rocm, make sure the XLA_TARGET environment variable is appropriately set. Currently it is set to "cuda120"
            (exla 0.6.1) lib/exla/client.ex:34: anonymous fn/3 in EXLA.Client.default_name/0
            (elixir 1.15.6) lib/enum.ex:4279: Enum.find_list/3
            (exla 0.6.1) lib/exla/client.ex:31: EXLA.Client.default_name/0
            (exla 0.6.1) lib/exla/backend.ex:154: EXLA.Backend.client_and_device_id/1
            (exla 0.6.1) lib/exla/backend.ex:44: EXLA.Backend.from_binary/3
            (bumblebee 0.4.2) lib/bumblebee/conversion/pytorch/loader.ex:79: Bumblebee.Conversion.PyTorch.Loader.object_resolver/1
            (unpickler 0.1.0) lib/unpickler.ex:828: Unpickler.resolve_object/2
            (unpickler 0.1.0) lib/unpickler.ex:818: anonymous fn/2 in Unpickler.finalize_stack_items/2
            (elixir 1.15.6) lib/map.ex:957: Map.get_and_update/3
            (elixir 1.15.6) lib/map.ex:999: Map.get_and_update!/3
            (unpickler 0.1.0) lib/unpickler.ex:818: anonymous fn/2 in Unpickler.finalize_stack_items/2
            (elixir 1.15.6) lib/enum.ex:1819: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
            (unpickler 0.1.0) lib/unpickler.ex:525: Unpickler.load_op/3
            (bumblebee 0.4.2) lib/bumblebee/conversion/pytorch/loader.ex:53: Bumblebee.Conversion.PyTorch.Loader.load_zip!/1
            (bumblebee 0.4.2) lib/bumblebee/conversion/pytorch.ex:48: anonymous fn/2 in Bumblebee.Conversion.PyTorch.load_params!/4
            (elixir 1.15.6) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
            (bumblebee 0.4.2) lib/bumblebee/conversion/pytorch.ex:47: anonymous fn/4 in Bumblebee.Conversion.PyTorch.load_params!/4
            (bumblebee 0.4.2) lib/bumblebee.ex:518: Bumblebee.load_params/5
            (bumblebee 0.4.2) lib/bumblebee.ex:490: Bumblebee.load_model/2
            (emer_phx 0.1.0) lib/emer_phx/application.ex:16: EmerPhx.Application.start/2

It’s kind of funny how just asking can sometimes reveal things you missed. I did finally home in on that second line of the trace, where it seems it is just using :rocm as the default client? If so, any idea how I can change that? Also, wouldn’t the default just be :host?

That message also makes me want to set XLA_TARGET to :cuda but I am pretty sure that is not right either.

Ugh

Try removing the line where you configure the clients. It seems that line is having some weird interaction with other default configuration

Yeah, I think that was one of my first inclinations, but it ended up getting me here; I had thought that line was the solution to the first issue :slight_smile:

It’s been a mess of a fun time.

Here is the error I get with that change.

** (Mix) Could not start application emer_phx: EmerPhx.Application.start(:normal, []) returned an error: shutdown: failed to start child: EmerPhx.Serving
    ** (EXIT) shutdown: failed to start child: Nx.Serving
        ** (EXIT) exited in: GenServer.call(EXLA.Client, {:client, :cuda, [platform: :cuda]}, :infinity)
            ** (EXIT) an exception was raised:
                ** (RuntimeError) Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
                    (exla 0.6.1) lib/exla/client.ex:195: EXLA.Client.unwrap!/1
                    (exla 0.6.1) lib/exla/client.ex:176: EXLA.Client.build_client/2
                    (exla 0.6.1) lib/exla/client.ex:136: EXLA.Client.handle_call/3
                    (stdlib 5.1) gen_server.erl:1113: :gen_server.try_handle_call/4
                    (stdlib 5.1) gen_server.erl:1142: :gen_server.handle_msg/6
                    (stdlib 5.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

Ensure that XLA_TARGET is actually being read. You can check this by looking at the download URL for :xla
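
One quick way to see which value the running BEAM actually has is a small helper like the one below. This is only a sketch: XlaTargetCheck is a hypothetical module name (not part of EXLA or the xla package), and the "cpu" fallback assumes the xla package's documented default when XLA_TARGET is unset. It only shows the runtime environment; the archive is picked when the xla dependency is fetched and compiled, so the variable has to be visible at that point too.

```elixir
defmodule XlaTargetCheck do
  # Hypothetical helper for illustration; not part of EXLA or the xla package.

  # Returns the XLA_TARGET value found in `env` (a map of variable names to
  # values). Falls back to "cpu", which is assumed here to be the target the
  # xla package uses when the variable is not set.
  def target(env \\ System.get_env()) do
    Map.get(env, "XLA_TARGET", "cpu")
  end
end

IO.puts("XLA_TARGET seen by this process: " <> XlaTargetCheck.target())
```

If this prints cpu inside iex -S mix while echo $XLA_TARGET prints cuda120 in your shell, the variable is not reaching the process. Inside a project where EXLA is already compiled, EXLA.Client.get_supported_platforms() should also list which platforms the loaded build actually supports.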

Will do. The earlier error message says it is set to cuda120 but makes it seem like that is the wrong input, or maybe even the wrong format (does it want an atom?)

How might I go about checking the URL of :xla?

Thanks

Actually, let’s just assume it’s not being set or is being set incorrectly. If that were so, how might I set it in the correct context, or do whatever is needed to override the other places I have it set?

Thanks!!

You can see the URL while mix deps.get or Mix.install does its work.

For mix deps.get or Mix.install in plain iex, just exporting the env vars in your shell should work.

For Mix.install in a Livebook notebook, you can set the env var on the Livebook main settings page OR use the system_env option

➜  emer_phx git:(master) ✗ elixir mix_test.exs 
Resolving Hex dependencies...
Resolution completed in 0.068s
New:
  elixir_make 0.7.7
  xla 0.5.1
* Getting xla (Hex package)
* Getting elixir_make (Hex package)
==> elixir_make
Compiling 6 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
done
  "exla": {:hex, :exla, "0.6.1", "a4400933a04d018c5fb508c75a080c73c3c1986f6c16a79bbfee93ba22830d4d", [:make, :mix], [{:elixir_make, "~> 0.6", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:nx, "~> 0.6.1", [hex: :nx, repo: "hexpm", optional: false]}, {:telemetry, "~> 0.4.0 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}, {:xla, "~> 0.5.0", [hex: :xla, repo: "hexpm", optional: false]}], "hexpm", "f0e95b0f91a937030cf9fcbe900c9d26933cb31db2a26dfc8569aa239679e6d4"},

I’m not even sure we’re looking at the same thing anymore :frowning:

I don’t see any URLs anywhere

Here is my shell

➜  emer_phx git:(master) ✗ echo $XLA_TARGET 
cuda120

I’m just kinda at my wits’ end here. Any gut feeling or indication of what the main issue is? Is this an Erlang thing? WSL?

We are. Looks like you’re using Mix.install. Try setting the force: true option so that it bypasses cache
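
Something along these lines should do it. This is a minimal sketch, assuming the script installs only :nx and :exla (the contents of mix_test.exs are not shown here, so the dependency list is an assumption) and that cuda120 is the target you want:

```elixir
Mix.install(
  # Assumed dependency list; adjust to whatever the script really installs.
  [:nx, :exla],
  # Make XLA_TARGET visible while the xla dependency is fetched and compiled.
  system_env: [XLA_TARGET: "cuda120"],
  # Ignore the cached install directory so the dependencies are rebuilt.
  force: true
)
```

With the cache bypassed, the log should name the archive it picks, which tells you whether it is the CUDA build or the CPU one.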

Yeah I’m just trying everything to try and see a URL. Let me see. Thanks!

➜  elixir elixir mix_test.exs
Mix.install/2 using /home/ar3rz/.cache/mix/installs/elixir-1.15.6-erts-14.1/89026495d0ad6015ebac9fc2a1805bc7
Resolving Hex dependencies...
Resolution completed in 0.059s
New:
  elixir_make 0.7.7
  xla 0.5.1
* Getting xla (Hex package)
* Getting elixir_make (Hex package)
==> elixir_make
Compiling 6 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
Done
➜  elixir

Is the idea here that it is downloading an incorrect version because it cannot see the XLA_TARGET value? Just trying to follow. Thanks!

I ran this and it seems to be downloading a different version, so this might be the right track

Mix.install(
  [:nx, :exla],
  config: [
    nx: [default_backend: EXLA.Backend]
  ],
  system_env: [
    XLA_TARGET: "cuda120"
  ]
)

Yeah, by looking at the URL, which should show up in those logs on a clean install (i.e. force: true), we’ll be able to see if you’re actually getting the xla cuda artifact

This last config looks right to me.

➜  elixir elixir mix_test.exs
Resolving Hex dependencies...
Resolution completed in 0.167s
New:
  complex 0.5.0
  elixir_make 0.7.7
  exla 0.6.1
  nx 0.6.2
  telemetry 1.2.1
  xla 0.5.1
* Getting nx (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting telemetry (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
===> Analyzing applications...
===> Compiling telemetry
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 32 files (.ex)
Generated nx app
==> elixir_make
Compiling 6 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app

10:51:08.471 [info] Found a matching archive (xla_extension-x86_64-linux-gnu-cuda120.tar.gz), going to download it
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  267M  100  267M    0     0  4680k      0  0:00:58  0:00:58 --:--:-- 3474k

10:52:07.094 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/ar3rz/.cache/xla/0.5.1/cache/download/xla_extension-x86_64-linux-gnu-cuda120.tar.gz into /home/ar3rz/.cache/mix/installs/elixir-1.15.6-erts-14.1/98ded4499e8f54890afdeefa2c09b7f8/deps/exla/cache
g++ -fPIC -I/home/ar3rz/.asdf/installs/erlang/26.1/erts-14.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_VERSION_STRING= c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Caching libexla.so at /home/ar3rz/.cache/xla/exla/elixir-1.15.6-erts-14.1-xla-0.5.1-exla-0.6.1-hx7mdmwou7dt2g5ikcpteeiiqa/libexla.so
Compiling 21 files (.ex)
Generated exla app
Done

You can set default_backend: {EXLA.Backend, client: :cuda} to default to the GPU too
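
In a Mix project, that tuple form would go in config/config.exs. A minimal sketch, assuming the :cuda client name that EXLA defines by default and that the CUDA-enabled XLA archive is the one that got compiled in:

```elixir
import Config

# Allocate tensors on the EXLA backend, using the :cuda client by default.
config :nx, default_backend: {EXLA.Backend, client: :cuda}

# Compile defn functions with EXLA on the same client
# (this line is taken from the config in the original post).
config :nx, :default_defn_options, compiler: EXLA, client: :cuda
```

Note there is no config :exla, :clients override here, so EXLA falls back to its own default client definitions.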

And that’s the correct download!

Oh I didn’t know that. That could prove useful!

Just glad it’s at least starting to make a little more sense

Arghh

I was back to the original error. It kept unpacking a different XLA archive:

Unpacking /home/ar3rz/.cache/xla/0.5.1/cache/download/xla_extension-x86_64-linux-gnu-cpu.tar.gz into /home/ar3rz/elixir/...

I am guessing that should match what I got during the Mix.install output, correct? That was

Unpacking /home/ar3rz/.cache/xla/0.5.1/cache/download/xla_extension-x86_64-linux-gnu-cuda120.tar.gz into  ...

So I erased my XLA cache, removed all the deps, and reran mix deps.get, mix deps.clean exla, etc.

Now I’m running into this. I have not looked into this one much yet because, quite frankly, I’m kinda irritated with the whole process. I’ll be back after I poke around a bit, hopefully to post a solution, and if not, to ask another round of annoying questions :slight_smile:

I really do appreciate your help though @polvalente

Here is what I am facing now:

➜  emer_phx git:(master) ✗ mix deps.compile exla --force
==> exla
g++ -fPIC -I/home/ar3rz/.asdf/installs/erlang/26.1/erts-14.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_VERSION_STRING= c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/xla_extension/lib'
In file included from c_src/exla/exla.cc:3:
c_src/exla/exla_nif_util.h:12:10: fatal error: xla/xla_data.pb.h: No such file or directory
   12 | #include "xla/xla_data.pb.h"
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from c_src/exla/exla_nif_util.cc:1:
c_src/exla/exla_nif_util.h:12:10: fatal error: xla/xla_data.pb.h: No such file or directory
   12 | #include "xla/xla_data.pb.h"
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from c_src/exla/exla_client.h:8,
                 from c_src/exla/exla_client.cc:1:
c_src/exla/exla_nif_util.h:12:10: fatal error: xla/xla_data.pb.h: No such file or directory
   12 | #include "xla/xla_data.pb.h"
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:57: cache/libexla.so] Error 1
could not compile dependency :exla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile exla --force", update it with "mix deps.update exla" or clean it with "mix deps.clean exla"
==> emer_phx
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

Just taking a stab at getting a fresh XLA version

Thanks again!!!