Installing EXLA on Ubuntu 20

Hello, I have been trying to get EXLA installed for use with Nx and Axon. I am running Ubuntu 20 and I have read through the instructions and installed EXLA’s system dependencies (build-essential, erlang-dev, Bazel 3.7.2, python3 with numpy, direnv), but when I compile EXLA I get this error:

ERROR: /home/karang/.cache/bazel/_bazel_karang/0cbb144c3d68d1f180f564ed331d591d/external/llvm-project/llvm/BUILD:46:18: Executing genrule @llvm-project//llvm:config_gen failed (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped)
unknown command: python3. Perhaps you have to reshim?
----------------
Note: The failure of target //third_party/llvm:expand_cmake_vars (with exit code 1) may have been caused by the fact that it is running under Python 3 instead of Python 2. Examine the error to determine if that appears to be the problem. Since this target is built in the host configuration, the only way to change its version is to set --host_force_python=PY2, which affects the entire build.

If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
Target //tensorflow/compiler/xla/exla:libexla.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 9.398s, Critical Path: 8.88s
INFO: 96 processes: 16 internal, 80 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
make: *** [Makefile:32: all] Error 1
could not compile dependency :exla, "mix compile" failed. You can recompile this dependency with "mix deps.compile exla", update it with "mix deps.update exla" or clean it with "mix deps.clean exla"
==> ml_test
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

At first it couldn’t find python, so I used the asdf direnv trick, but now it says it can’t find python3.

Here is my .tool-versions:

erlang 24.0.1
elixir 1.12.1-otp-24
python 3.9.5
bazel 3.7.2
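
and my .envrc in the project root is just the one line the asdf-direnv plugin expects:

```shell
# .envrc -- picked up by direnv; exposes the asdf-managed tools on PATH
use asdf
```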

Has anyone managed to get EXLA working on Ubuntu? Thank you!

How did you install python3? The error message says it is not available and that you may have to reshim it, which can be necessary depending on the tool you used to install it.

@karang it can’t find python3. Maybe this section of the EXLA README can help: nx/exla at main · elixir-nx/nx · GitHub

Python and asdf

Bazel cannot find Python installed via the asdf version manager by default. asdf uses a shim function to look up the specified version of a given binary, and this approach prevents Bazel from correctly building EXLA. The error is unknown command: python. Perhaps you have to reshim?. There are two known workarounds:

  1. Use a separate installer, or explicitly change your $PATH to point to a Python installation (note the build process looks for python, not python3). For example, with Homebrew on macOS, you would do:
export PATH=/usr/local/opt/python@3.9/libexec/bin:/usr/local/bin:$PATH
mix deps.compile
  2. Use the asdf direnv plugin to install direnv 2.20.0. direnv, together with the asdf-direnv plugin, will explicitly set the paths for every binary specified in your project’s .tool-versions file.

After doing either of the steps above, it may be necessary to clear the build cache by removing ~/.cache/exla.
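
Since the error mentions reshimming, it may also be worth regenerating the shims and checking what the build actually sees on $PATH. A quick sanity check, assuming an asdf-managed python:

```shell
# Regenerate asdf's shims after (re)installing python (skipped when
# asdf is not on PATH), then confirm which interpreters Bazel will find:
if command -v asdf >/dev/null; then asdf reshim python; fi
command -v python3 && python3 --version
```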

Do keep us posted on whether it works and/or the README needs updating…

Thanks guys! I used asdf to install python, and method 2 above to deal with it: I installed direnv and set the tool versions in my mix project. I just cleared the cache and tried again, and it gave me the same error. Then I tried to remove the python plugin from asdf, but that does not remove the shims, and then Bazel complains that the shims don’t exist. So I reinstalled python with asdf and double-checked my direnv setup to make sure the PATH is set correctly, and Bazel still can’t find python3. I’ll let you know if I get it to work, but so far method 2 has not worked for me, and I am not sure which directory I would add to my path if I wanted to use method 1.


I’m running into the same errors on Ubuntu 20.04, with ROCm + TensorFlow and asdf-managed Python and Bazel as described above. I’ve also tried aliasing ‘python’ to ‘python3’, but same scenario.

I’m on Ubuntu 20.04 too. I think I’ve progressed further than @MrDoops, but the behaviour did surprise me a little. mix deps.compile on a new project with Nx, EXLA and Axon as dependencies generated this output. I’m not sure if it finished; I let it run for about 4 hours!

Just not sure what I should have expected to happen.

Oh, I can’t attest to this being helpful but I did export PYTHON_BIN_PATH=/usr/bin/python3.8

[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[2 / 5,456] Linking external/com_google_protobuf/protoc [for host]; 0s local … (4 actions, 3 running)
[39 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 0s local … (3 actions, 2 running)
[78 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 1s local … (4 actions, 3 running)
[94 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 3s local … (4 actions running)
[95 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 4s local … (4 actions, 3 running)
[102 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 6s local … (4 actions, 3 running)
[103 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 7s local … (4 actions running)
[103 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 9s local … (4 actions running)
[104 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 11s local … (4 actions running)
[105 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 13s local … (4 actions running)
[107 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 16s local … (4 actions running)
[108 / 5,456] Compiling tensorflow/core/util/test_log.pb.cc [for host]; 18s local … (4 actions running)
[114 / 5,456] Compiling tensorflow/core/framework/graph_transfer_info.pb.cc [for host]; 6s local … (4 actions running)
[129 / 5,456] Compiling tensorflow/core/framework/summary.pb.cc [for host]; 2s local … (4 actions running)
[131 / 5,456] Compiling tensorflow/core/example/example_parser_configuration.pb.cc [for host]; 5s local … (4 actions running)
[135 / 5,456] Compiling tensorflow/core/protobuf/error_codes.pb.cc [for host]; 1s local … (4 actions running)
[138 / 5,456] Compiling tensorflow/core/framework/step_stats.pb.cc [for host]; 6s local … (4 actions running)
[150 / 5,456] Compiling tensorflow/core/framework/node_def.pb.cc [for host]; 4s local … (4 actions running)
[168 / 5,456] Compiling tensorflow/core/framework/function.pb.cc [for host]; 8s local … (4 actions running)
[170 / 5,456] Compiling tensorflow/core/framework/function.pb.cc [for host]; 16s local … (4 actions running)
[174 / 5,456] Compiling tensorflow/core/protobuf/config.pb.cc [for host]; 20s local … (4 actions running)
[183 / 5,456] Compiling tensorflow/core/protobuf/rewriter_config.pb.cc [for host]; 9s local … (4 actions running)
[188 / 5,456] Compiling tensorflow/core/protobuf/meta_graph.pb.cc [for host]; 14s local … (4 actions running)
[203 / 5,456] Compiling tensorflow/core/profiler/protobuf/xplane.pb.cc [for host]; 10s local … (4 actions running)
[221 / 5,456] Generating code from table: lib/Target/X86/X86.td @llvm-project//llvm:X86CommonTableGen__gen_dag_isel_genrule; 5s local … (4 actions running)
[548 / 5,523] Compiling llvm-project/mlir/lib/IR/BuiltinTypes.cpp [for host]; 10s local … (4 actions running)
[560 / 5,523] Compiling llvm-project/mlir/lib/IR/Operation.cpp [for host]; 10s local … (4 actions running)
[576 / 5,523] Compiling llvm-project/mlir/lib/IR/AsmPrinter.cpp [for host]; 14s local … (4 actions running)
[785 / 5,872] Compiling llvm-project/mlir/tools/mlir-linalg-ods-gen/mlir-linalg-ods-yaml-gen.cpp [for host]; 14s local … (4 actions running)
[811 / 5,872] Compiling tensorflow/core/framework/lookup_interface.cc [for host]; 7s local … (4 actions running)
[821 / 5,872] Compiling tensorflow/core/framework/tensor_util.cc [for host]; 9s local … (4 actions running)
[836 / 5,872] Compiling tensorflow/core/util/batch_util.cc [for host]; 12s local … (4 actions running)
[848 / 5,872] Compiling tensorflow/core/util/batch_util.cc [for host]; 66s local … (4 actions running)
[868 / 5,872] Compiling tensorflow/core/framework/common_shape_fns.cc [for host]; 11s local … (4 actions running)
[894 / 5,872] Compiling tensorflow/core/lib/io/record_reader.cc [for host]; 7s local … (4 actions running)
[928 / 5,874] Compiling tensorflow/core/framework/device_factory.cc [for host]; 14s local … (4 actions running)
[1,033 / 5,989] Compiling tensorflow/core/ops/nn_ops.cc [for host]; 17s local … (4 actions running)
[1,135 / 6,076] Compiling tensorflow/core/ops/math_ops.cc [for host]; 25s local … (4 actions running)
[1,193 / 6,076] Compiling tensorflow/core/platform/default/env.cc; 12s local … (4 actions, 3 running)
[1,257 / 6,076] Compiling tensorflow/stream_executor/stream.cc; 24s local … (4 actions, 3 running)
[1,308 / 6,076] Compiling tensorflow/core/framework/shape_inference.cc; 15s local … (4 actions, 3 running)
[1,359 / 6,076] Compiling tensorflow/core/util/batch_util.cc; 91s local … (4 actions, 3 running)
[1,406 / 6,076] Compiling tensorflow/compiler/xla/service/hlo_computation.cc; 15s local … (4 actions, 3 running)
[1,448 / 6,076] Compiling tensorflow/compiler/xla/service/compiler.cc; 27s local … (4 actions, 3 running)
[1,549 / 6,076] Compiling llvm-project/llvm/lib/ProfileData/InstrProf.cpp; 7s local … (4 actions, 3 running)
[1,662 / 6,076] Compiling llvm-project/llvm/lib/Analysis/InstCount.cpp; 9s local … (4 actions, 3 running)
[1,774 / 6,076] Compiling llvm-project/llvm/lib/Transforms/Utils/CodeExtractor.cpp; 21s local … (4 actions, 3 running)
[1,880 / 6,076] Compiling llvm-project/llvm/lib/Transforms/Scalar/LoopLoadElimination.cpp; 16s local … (4 actions running)
[1,985 / 6,076] Compiling tensorflow/compiler/mlir/tensorflow/transforms/rewrite_tpu_embedding_ops.cc; 85s local … (4 actions running)
[2,070 / 6,076] Compiling tensorflow/compiler/mlir/tensorflow/transforms/cluster_ops_by_policy_pass.cc; 16s local … (4 actions, 3 running)
[2,173 / 6,076] Compiling llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp; 84s local … (4 actions running)
[2,294 / 6,076] Compiling llvm-project/llvm/lib/Target/X86/X86FastISel.cpp; 29s local … (4 actions running)
[2,438 / 6,076] Compiling tensorflow/compiler/tf2xla/cc/ops/xla_ops.cc; 13s local … (4 actions, 3 running)
[2,672 / 6,077] Compiling tensorflow/compiler/xla/service/dynamic_dimension_inference.cc; 23s local … (4 actions, 3 running)
[2,841 / 6,077] Compiling llvm-project/llvm/lib/Target/PowerPC/PPCFastISel.cpp; 16s local … (4 actions, 3 running)
[3,279 / 6,077] Compiling tensorflow/core/graph/algorithm.cc; 12s local … (4 actions, 3 running)
[3,992 / 6,077] Compiling tensorflow/core/kernels/list_kernels.cc; 48s local … (4 actions running)
[4,241 / 6,077] Compiling mkl_dnn_v1/src/cpu/x64/jit_sse41_1x1_convolution.cpp; 8s local … (4 actions, 3 running)
[4,521 / 6,077] Compiling tensorflow/core/kernels/mkl/mkl_conv_ops.cc; 87s local … (4 actions, 3 running)

That’s the correct output; EXLA takes a really long time to compile (it has to compile a lot of TensorFlow). When it finishes you’ll see output with something like Linking libexla.so.


Hey, that’s good to know, finishing off this morning 🙂

This really helped me! It is now compiling! All the fans are running!


It seems Bazel can find python now after running the above command, but a new problem has arisen. When EXLA starts compiling, I start up htop to watch my memory and CPU usage, and I see all cores running at close to 100%; then at some point (always different) RAM fills up and the compilation crashes. I have 16 GB of RAM and 16 cores on my machine; that should be enough, right? How can I avoid running out of memory during EXLA compilation? Thank you!!

Check here: Memory-saving Mode - Bazel main

EXLA looks for BAZEL_FLAGS, so you can set any of those flags via the BAZEL_FLAGS environment variable.


You can also lower the number of jobs using the --jobs= flag.
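
For example, something along these lines (a sketch, assuming EXLA’s Makefile passes BAZEL_FLAGS straight through to bazel build, so the flags are space-separated; note also that --host_jvm_args is a Bazel startup option that must come before the command, so it doesn’t belong among the build flags):

```shell
# Bazel flags are space-separated; comma-joining them into one token
# makes Bazel see a single unknown option. Adjust the numbers to your
# machine: --jobs caps parallel actions, --local_ram_resources (in MB)
# caps how much RAM Bazel schedules against.
export BAZEL_FLAGS="--jobs=4 --local_ram_resources=8192"
# then recompile:
#   mix deps.clean exla && mix deps.compile exla
```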


Thanks for the tip. I think now the problem is I am not setting the flags properly. I tried this:

$ export BAZEL_FLAGS=--discard_analysis_cache,--nokeep_state_after_build,--notrack_incremental_state,--jobs=8,--host_jvm_args=-Xmx8g

But it did not seem to make a difference. I tried several combinations of the above, all to no avail. All cores were running (I thought only 8 should be active) and it still ran out of memory. I also tried to set it like this:

$ BAZEL_FLAGS=--discard_analysis_cache,--nokeep_state_after_build,--notrack_incremental_state,--jobs=8,--host_jvm_args=-Xmx8g mix deps.compile exla

But that didn’t seem to work either. Is this the correct way to set the BAZEL_FLAGS variable? Thanks!

Is it possible to use a precompiled XLA library? I’ve found this repo, but for some reason I still cannot run my models on the GPU. When I try to explicitly specify the EXLA compiler in the code, I get the following error:

01:01:51.192 [error] GenServer EXLA.Client terminating
** (RuntimeError) Could not find registered platform with name: "cuda". Available platform names are: Host
    (exla 0.1.0-dev) lib/exla/client.ex:153: EXLA.Client.unwrap!/1
    (exla 0.1.0-dev) lib/exla/client.ex:134: EXLA.Client.build_client/2
    (exla 0.1.0-dev) lib/exla/client.ex:94: EXLA.Client.handle_call/3
    (stdlib 3.16.1) gen_server.erl:721: :gen_server.try_handle_call/4
    (stdlib 3.16.1) gen_server.erl:750: :gen_server.handle_msg/6
    (stdlib 3.16.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

I have the exla dependency in my mix.exs:

{:exla, "~> 0.1.0-dev", github: "elixir-nx/nx", sparse: "exla"}

I’ve set up configuration in the config/config.exs:

config :nx, :default_defn_options, [compiler: EXLA, client: :cuda]
config :exla, :clients, cuda: [platform: :cuda], default: [platform: :cuda]

I have the following environment variables set as well:

XLA_BUILD=true
XLA_TARGET=cuda111
EXLA_TARGET=cuda
TF_CUDA_VERSION='11.2'

And of course I have installed CUDA and cuDNN so they are recognized by the TensorFlow library. The command python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" produces the following output:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Am I missing something?

Remove XLA_BUILD=true and it should pick up a precompiled version based on your XLA_TARGET. If it doesn’t work, please post the full output of mix deps.get plus mix compile. 🙂
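
Concretely, something like this (cuda111 taken from your post; the clean step is a hedged suggestion to force a fresh fetch):

```shell
# Drop the source-build switch so a precompiled XLA archive can be used:
unset XLA_BUILD                # when set, XLA is always built from source
export XLA_TARGET=cuda111      # match your installed CUDA version
export EXLA_TARGET=cuda
# then refetch/recompile:
#   mix deps.clean exla && mix deps.get && mix deps.compile exla
```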