I had Exla working at one point, but now it keeps crashing, and I can't figure out why

So I was trying to migrate some Python code I had running into my Elixir code. My setup is that Elixir runs most of the code, Rustler/Rust bindings run rust-bert for one model, and some Python code runs another one. My goal was to bring as much as possible into Elixir. I found a model I could run in Elixir that performed better than the one in Python, but there seemed to be a crossing of wires, since both the Rust code and the Elixir Nx code were using PyTorch. Whichever one ran first would get the lock on PyTorch, and the second one wouldn’t work (on the Rust side I’d see an error like this: Nif not loaded ... symbol not found in flat namespace '__ZN3c1019UndefinedTensorImpl10_singletonE').

When I instead tried to run the Elixir side on EXLA, I could get it to work briefly in a completely clean project, but not together with everything else. When I load the model I get this output:

[info] TfrtCpuClient created.
[debug] the following PyTorch parameters were unused:

  * bert.embeddings.position_ids
  * bert.pooler.dense.bias
  * bert.pooler.dense.weight

Which seems strange, because I’m trying to target EXLA, not PyTorch. To confirm, I ran the following right before this operation:

iex(3)> Nx.default_backend
{EXLA.Backend, [client: :host]}

The tensors themselves do report EXLA, though:

f32[768]
EXLA.Backend<host:0,...

The model is started as follows:

{:ok, model} = Bumblebee.load_model({:hf, "vblagoje/bert-english-uncased-finetuned-pos"})
# This is only coming from local because there's no fast tokenizer "tokenizer.json" file on HF,
# but loading it in Python and then dumping it to a file created one.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:local, "priv/tokenizer"})
model = Bumblebee.Text.token_classification(model, tokenizer, aggregation: :same)

And called like:

Nx.Serving.run(model, "This is a test")

Which gives me the following:

libc++abi: terminating with uncaught exception of type std::out_of_range: Span::at failed bounds check
[1]    67025 abort      iex --erl "-kernel shell_history enabled" -S mix phx.server

I have tried:

  • Trying a different model, with both the model and tokenizer loaded directly from HF; same result.
  • Scorched-earth removal of PyTorch, TensorFlow, all deps, _build, the works, and then rebuilding; same result.
  • Completely removing Rustler and everything else, even trying a completely clean project with just nx/bumblebee/axon/exla/jason; same result.
  • Restarting the computer; same result.

I’m not even sure how to debug where it’s crashing from.
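
About the only isolation step I can think of is compiling a trivial numerical function through EXLA directly, to see whether plain EXLA compilation crashes on its own or whether it only blows up once the Bumblebee model is involved. This is just a sketch of what I mean, assuming Nx.Defn.jit/2 takes the compiler option the way I think it does:

# Sanity-check sketch (untested): does a trivial EXLA-compiled function work at all?
add = Nx.Defn.jit(fn a, b -> Nx.add(a, b) end, compiler: EXLA)
add.(Nx.tensor([1, 2, 3]), Nx.tensor([4, 5, 6])) |> IO.inspect()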

Here are my mix.exs deps:

      {:axon, "~> 0.5.1"},
      {:bumblebee, "~> 0.2.0"},
      {:nx, "~> 0.5.1"},
      {:exla, "~> 0.5.1", sparse: "exla"},
      {:jason, "~> 1.0"}

Anyone have an insight or tips of what I could try?

Can you provide a Livebook or a single .exs file that reproduces this? Then I can at least run it here and let you know if it is something specific to your machine or not.

Also, double check your ENV vars just in case.

Hey @jose! I’ll take a look and see if I can get something running in a Livebook or a single file to reproduce the issue. I suspect it’s a “my machine” problem rather than a library issue, so it may be hard to reproduce; I was mostly wondering if this was an error anyone had seen before.
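
Roughly, I’m picturing a single-file repro along these lines (just a sketch I haven’t run yet; the deps are pinned to the same versions as my mix.exs above, and the tokenizer path is a placeholder for my local dump):

# repro.exs: minimal reproduction attempt (sketch, not yet verified)
Mix.install([
  {:axon, "~> 0.5.1"},
  {:bumblebee, "~> 0.2.0"},
  {:nx, "~> 0.5.1"},
  {:exla, "~> 0.5.1"},
  {:jason, "~> 1.0"}
])

Nx.default_backend({EXLA.Backend, client: :host})

{:ok, model} = Bumblebee.load_model({:hf, "vblagoje/bert-english-uncased-finetuned-pos"})
# Placeholder path: the local tokenizer dump created from Python's `tokenizer.save(...)`
{:ok, tokenizer} = Bumblebee.load_tokenizer({:local, "priv/tokenizer"})

serving = Bumblebee.Text.token_classification(model, tokenizer, aggregation: :same)
Nx.Serving.run(serving, "This is a test") |> IO.inspect()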

But for now, I’ve moved to a production server: an AWS G4dn.xl with an Nvidia T4. I think I must be doing something wrong with my config.

I have 5 nonsense sentences I copied from a Reddit post. Their byte sizes are sents = [1079, 131, 109, 120, 130].

Yes, these are certainly not the most rigorous or fair benchmarks, but the goal was just to get a rough idea of whether my config/settings in Elixir are good. They appear not to be, so I’m working on getting them dialed in; hopefully someone can at least point me in a good direction.

Basically, I’m just running through the 5 sentences 10 times in each framework, both on the same machine.

In Python:

import time

import torch
from transformers import AutoTokenizer, BertForTokenClassification

# `device` (the CUDA device) and `sents` (the 5 test sentences) are defined earlier and omitted here
model = BertForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = model.to(device)

def test():
    start = time.time()
    with torch.no_grad():
        for i in range(10):
            for s in sents:
                inputs = tokenizer(s, return_tensors="pt").to(device)
                outputs = model(**inputs)
                logits = outputs.logits
                predicted_token_class_ids = logits.argmax(-1)
                predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
    return time.time() - start

Calling test() here gives me 0.5248661041259766 seconds. I logged the output on a separate run to make sure it wasn’t only fast because it was throwing an error or something:

['PRON', 'ADV', 'PUNCT', 'SCONJ', 'DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'ADP', 'NOUN', 'NOUN', 'ADP', 'PRON', 'PUNCT', 'CCONJ', 'DET', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'ADJ', 'ADV', 'ADJ', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT', 'CCONJ', 'CCONJ', 'DET', 'ADJ', 'ADJ', 'NOUN', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'PRON', 'VERB', 'PRON', 'ADV', 'ADP', 'DET', 'ADJ', 'NOUN', 'ADP', 'DET', 'VERB', 'VERB', 'NOUN', 'PUNCT', 'CCONJ', 'PUNCT', 'SCONJ', 'PRON', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN', 'PUNCT', 'DET', 'NUM', 'ADJ', 'NOUN', 'AUX', 'VERB', 'ADP', 'PRON', 'PUNCT', 'ADV', 'PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'VERB', 'ADJ', 'ADP', 'DET', 'ADJ', 'ADJ', 'ADV', 'ADV', 'ADV', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'CCONJ', 'NOUN', 'PUNCT', 'ADV', 'PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT', 'PRON', 'VERB', 'PRON', 'ADP', 'PRON', 'ADJ', 'NOUN', 'PUNCT', 'CCONJ', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'PRON', 'VERB', 'CCONJ', 'VERB', 'VERB', 'PRON', 'PUNCT', 'SCONJ', 'PRON', 'VERB', 'ADP', 'PRON', 'ADP', 'DET', 'NOUN', 'ADP', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'PUNCT', 'PRON', 'NOUN', 'PUNCT', 'ADV', 'NOUN', 'VERB', 'ADP', 'VERB', 'PRON', 'NOUN', 'PUNCT', 'CCONJ', 'NOUN', 'CCONJ', 'NOUN', 'VERB', 'PART', 'VERB', 'ADP', 'PRON', 'NOUN', 'CCONJ', 'VERB', 'PRON', 'NOUN', 'PUNCT', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'ADV', 'PRON', 'ADV', 'VERB', 'ADP', 'NOUN', 'PUNCT', 'INTJ', 'PUNCT', 'AUX', 'PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'NOUN', 'PUNCT', 'AUX', 'VERB', 'ADP', 'NOUN', 'DET', 'PRON', 'AUX', 'VERB', 'ADV', 'ADJ', 'CCONJ', 'ADJ', 'ADP', 'PRON', 'PUNCT', 'SCONJ', 'PRON', 'AUX', 'AUX', 'DET', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT', 'SCONJ', 'PRON', 'NOUN', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'PROPN', 'PUNCT', 'PUNCT']

But it’s definitely generating predictions.

In Elixir:

> Nx.default_backend({EXLA.Backend, client: :cuda})
> Nx.default_backend                               
{EXLA.Backend, [client: :cuda]}
> Nx.Defn.default_options
[compiler: EXLA, client: :cuda]
defmodule Test do
  @sents [
    # same as before, cut for brevity, but can add them if wanted
  ]

  def run do
    {:ok, model} = Bumblebee.load_model({:hf, "vblagoje/bert-english-uncased-finetuned-pos"})
    # The only reason this comes from a local file instead of :hf is that the fast tokenizer is not on the HF hub.
    # The local file is just a dump from the HF Python lib: `tokenizer.save('path')`
    # Contents: config.json  special_tokens_map.json  tokenizer.json  tokenizer_config.json  vocab.txt
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:local, "./tokenizer"})
    model = Bumblebee.Text.token_classification(model, tokenizer, aggregation: :same)

    :timer.tc(fn ->
      Enum.each(1..10, fn _ ->
        Enum.each(@sents, fn sent ->
          Nx.Serving.run(model, sent)
        end)
      end)
    end)
    |> elem(0)
    |> Kernel./(1_000_000)
  end
end
iex(7)> Test.run()

02:00:52.993 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
 
02:00:52.993 [info] XLA service 0x7f81a80d9a00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

02:00:52.993 [info]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5

02:00:52.993 [info] Using BFC allocator.

02:00:52.993 [info] XLA backend allocating 14019467673 bytes on device 0 for BFCAllocator.
 
02:00:53.785 [debug] the following PyTorch parameters were unused:

  * bert.embeddings.position_ids
  * bert.pooler.dense.bias
  * bert.pooler.dense.weight

23.547926

My ~/.bashrc (after editing it I ran both source ~/.bashrc and exec $SHELL, and confirmed the config with echo $XLA_TARGET):

export LIBTORCH_TARGET="cu116"
export ELIXIR_ERL_OPTIONS="+sssdio 128"
export XLA_TARGET="cuda118"
export EXLA_TARGET="cuda"

I’m not sure if the cu116 is bad in there. I’m running CUDA 11.8 with driver 520 and cuDNN 8.7.0.84; cu116 was the highest version listed for Torchx. I’m not sure that matters if I’m using EXLA, but I notice the "the following PyTorch parameters..." line in the logs, so I assume PyTorch is still involved somehow.
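
Related to that, I’m also experimenting with pinning the backend and compiler selection in config rather than setting it in iex each session. This is only my assumption of what the equivalent config entries look like for these versions, so double-check the key names against the Nx/EXLA docs:

# config/config.exs (sketch; exact keys may differ across Nx/EXLA versions)
import Config

# Back tensors with the EXLA CUDA client by default
config :nx, default_backend: {EXLA.Backend, client: :cuda}

# Compile defn-based code with the EXLA compiler on the CUDA client by default
config :nx, :default_defn_options, compiler: EXLA, client: :cuda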

I’m trying to figure out what could be missing on the Elixir side to get it closer to the Python numbers.
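
The main thing I’m suspicious of is that each Nx.Serving.run call may be paying compilation overhead, since I’m not passing any compilation options when building the serving. If the serving accepts :compile and :defn_options the way I believe it does (an assumption on my part for this Bumblebee version, so check the docs), the idea would be something like:

# Sketch (assuming :compile/:defn_options are supported by this Bumblebee version):
# compile once up front for fixed shapes with the EXLA compiler,
# rather than compiling lazily for every request.
model = Bumblebee.Text.token_classification(model, tokenizer,
  aggregation: :same,
  compile: [batch_size: 1, sequence_length: 512],
  defn_options: [compiler: EXLA]
)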

I would also love to write up a blog post after I’ve been through this process of launching in production. The docs, as always with Elixir, have been super helpful and accessible for an ML newbie like me, but I’ve found them, and the blogosphere in general, still thin on this kind of info, which is to be expected at this early stage. I saw there was a note to look at the EXLA compiler flags and defn options, but there wasn’t much general advice on a good starting place, and the list of EXLA flags is quite long, so I didn’t know where to start.