Nx vs. Python performance for sentence-transformer encoding

Thank you for the suggestion!

I think Nx.global_default_backend(EXLA.Backend) might already do this? At least I don’t measure any real difference when setting it on my serving. In general, I don’t think the serving is the limiting factor: I added a script (nx_axon.exs) that doesn’t use Nx.Serving at all and basically just calls Axon.predict, and the performance is very similar.
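For reference, the direct-call variant looks roughly like this — a sketch, not the actual nx_axon.exs; the model name and variable names here are placeholders, assuming the model is loaded via Bumblebee:

```elixir
# Set EXLA as the default backend before loading anything
Nx.global_default_backend(EXLA.Backend)

# Load a sentence-transformer model and tokenizer (placeholder model name)
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

# The benchmark always encodes the same sentence
inputs = Bumblebee.apply_tokenizer(tokenizer, ["the same sentence every time"])

# Direct predict call, bypassing Nx.Serving entirely
outputs = Axon.predict(model, params, inputs, compiler: EXLA)
```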

I also tried with different batch sizes and batch timeouts, but again without any measurable differences.
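For completeness, the knobs I varied are along these lines (the values and the `MyApp.Serving` name are just examples, assuming a Bumblebee text-embedding serving):

```elixir
# Serving with an explicit compiled batch size and sequence length
serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    compile: [batch_size: 16, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )

# Batching options live on the Nx.Serving process itself
{Nx.Serving,
 serving: serving,
 name: MyApp.Serving,
 batch_size: 16,
 batch_timeout: 50}
```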

Concerning sequence lengths: that shouldn’t be an issue here as the benchmark is always encoding the same sentence, but good to know!

The main question I have is whether there might be some bottleneck between EXLA and the dirty NIF schedulers?

Here you can see the scheduler usage while running the benchmark three times for 10 seconds each. It looks like only one dirty CPU scheduler is used at a time, although which one changes. I’m no expert on NIFs at all, so maybe this is common knowledge, but if Nx can only use one dirty scheduler at a time, this might become a bottleneck in other cases as well? That’s only speculation on my side, though.
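If anyone wants to quantify this rather than eyeball the observer, Erlang’s microstate accounting (`:msacc` from `runtime_tools`) gives a per-thread breakdown that includes the dirty CPU schedulers — a sketch to run in iex while the benchmark is going:

```elixir
# Microstate accounting: collects per-scheduler-thread utilization,
# including the dirty_cpu_ threads. Requires :runtime_tools.
:msacc.start(10_000)  # collect stats for 10 seconds (blocks while collecting)
:msacc.print()        # prints the per-thread breakdown
```

If only one `dirty_cpu_` row shows significant non-sleep time, that would match what the observer screenshot suggests.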
