I’ve got a project where I need to encode sentences using a sentence-transformer model. Currently I’m using Python and the sentence-transformers package, but since the rest of the project is in Elixir, I’d like to switch to Nx instead.
Using Bumblebee and Axon, I already built a small proof of concept, and with the recent addition of a text embedding serving to Bumblebee, I wanted to run a quick benchmark to see how many encodes per second I can achieve on my CPU.
tl;dr: with a simple Python script I can achieve ~115 encodes per second at ~350% CPU load (roughly 4 cores) on my MacBook Pro (M1 Max), and ~190 encodes per second when starting two separate Python processes (nearly full CPU utilization). Using Elixir and Nx I only achieve ~55 encodes per second, while the average latency is more than double; Elixir also only reaches ~300% CPU usage. Starting multiple BEAM instances gets me to ~95 encodes per second with full CPU utilization.
The last point is the main one I’m interested in: there seems to be some kind of bottleneck that prevents me from achieving performance similar to Python using only a single BEAM process. Does anyone have an idea why that’s the case? (It’s very possible that I’m just doing something wrong!) I expected the BEAM to be able to use all cores for encoding.
Apart from that, it seems like even with full CPU utilization I can only achieve half of the encode performance of Python using Nx, so there seem to be other factors at play too.
One thing I noticed is that it does not look like you are setting the compiler for your Nx serving, so you are losing a lot of optimizations there. Try setting defn_options: [compiler: EXLA] when creating the serving.
I’m also not sure what batch size you’re setting. You can fiddle with higher and lower batch sizes to see if it improves latency.
Servings also have some built-in latency; I’m not familiar with how the benchmark works, but you can fiddle with the batch timeout settings to achieve better latency as well.
Finally, if you send sequences of different lengths to the serving, you eat a compilation cost whenever a new length shows up. You should set a static sequence length.
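For reference, a minimal sketch of how these options could be set, using the all-MiniLM-L6-v2 model mentioned further down (the exact batch size, timeout, and serving name are placeholders):

    repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
    {:ok, model_info} = Bumblebee.load_model(repo)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

    # Compile once for a fixed shape and run the computation through EXLA.
    serving =
      Bumblebee.Text.text_embedding(model_info, tokenizer,
        compile: [batch_size: 32, sequence_length: 128],
        defn_options: [compiler: EXLA]
      )

    # In the supervision tree; batch_timeout trades latency for batching.
    children = [
      {Nx.Serving, serving: serving, name: MyApp.EmbeddingServing, batch_timeout: 100}
    ]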
I think Nx.global_default_backend(EXLA.Backend) might already do this? At least I don’t measure any real difference when setting it on my serving. In general, I don’t think the serving is the limiting factor: I added a script that does not use Nx.Serving at all and basically just calls Axon.predict, and the performance is very similar (nx_axon.exs).
I also tried with different batch sizes and batch timeouts, but again without any measurable differences.
Concerning sequence lengths: that shouldn’t be an issue here as the benchmark is always encoding the same sentence, but good to know!
The main question I have is whether there is maybe some bottleneck with EXLA and the dirty NIF schedulers?
Here you can see the scheduler usage while running the benchmark 3 times for 10 seconds each. It looks like only one dirty CPU scheduler is used at a time, although which one changes. I’m no expert on NIFs at all, so maybe that’s common knowledge, but if Nx can only use one dirty scheduler at a time, this might become a bottleneck in other cases as well? Only speculation on my side though.
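For anyone who wants to reproduce this without the observer GUI, the scheduler utilization can also be sampled programmatically with the :scheduler module from runtime_tools (just a sketch; the 10-second sleep stands in for the benchmark window):

    # Sample scheduler wall-time utilization around a benchmark run.
    sample = :scheduler.sample_all()
    Process.sleep(10_000)
    # Per-scheduler utilization, including the dirty schedulers.
    :scheduler.utilization(sample)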
Running 1m test @ http://127.0.0.1:5001
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    25.69ms    2.21ms  91.61ms   95.06%
    Req/Sec   156.34     12.48   202.00     86.03%
  74832 requests in 1.00m, 9.21MB read
Requests/sec:   1245.80
Transfer/sec:    156.94KB
So, a pretty significant speed-up just from compiling the serving. There are some other config options you can mess with, but you probably won’t get much more of a speed-up than that.
Oh wow, that’s indeed a very significant difference. I now also know what I did wrong: I tried to set the defn_options in the child specification instead of the serving function…
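In other words, the compiler option was not being applied because it sat in the wrong place (a sketch; the serving name is a placeholder):

    # Where I had put it, in the Nx.Serving child spec, it was not picked up:
    {Nx.Serving, serving: serving, name: MyApp.EmbeddingServing, defn_options: [compiler: EXLA]}

    # Where it belongs: the options of the function that builds the serving.
    serving =
      Bumblebee.Text.text_embedding(model_info, tokenizer,
        defn_options: [compiler: EXLA]
      )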
I’ll update the repo with the updated results later. Thank you!
To be fair to Python, I realized that this is probably just because the sequence length was limited to 8. I am pretty sure that the sequence length the Python library uses is 128 (see sentence-transformers/all-MiniLM-L6-v2 · Hugging Face):
The sequence length was limited to 128 tokens.
When using a sequence length of 128, Bumblebee+EXLA achieves ~120 encodes per second, which is basically the same performance as the Python server.
This brings us back to what I was wondering in Nx vs. Python performance for sentence-transformer encoding - #3 by steffend: the BEAM with Bumblebee+EXLA does not seem to be able to fully utilize all CPU cores. If I start two instances of the nx_serving script on different ports and then run two instances of the benchmark, I can achieve ~190 encodes per second (behind a reverse proxy it strangely gets slower).
Optimizing the Python code by using a dedicated WSGI server with 8 worker processes (gunicorn -w 8 -b 0.0.0.0:5001 simple:app) instead of the Flask development server:
batch_timeout and batch_size are going to impact latency and memory usage, so I recommend playing with those numbers if you haven’t yet. Does the Python version have anything along those lines?
EXLA assumes a computation will use all cores and it puts a lock around it. You can set XLA_FLAGS=--xla_force_host_platform_device_count=8 and it will start several CPU devices. You can then pass partitions: true to your Nx.Serving (in the child spec/sup tree). I am hoping this will at least allow you to use all cores within a single BEAM instance.
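In child-spec form the suggestion looks roughly like this (the serving name and batch_timeout are placeholders):

    # Start the node with:
    #   XLA_FLAGS=--xla_force_host_platform_device_count=8
    # so EXLA sees several host devices, then partition the serving across them:
    children = [
      {Nx.Serving,
       serving: serving,
       name: MyApp.EmbeddingServing,
       batch_timeout: 100,
       partitions: true}
    ]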
Yes, I already played with the batch settings and 32 seems to be a good batch size for the sequence length of 128. I did not play with the batch timeout yet, but latency is not my focus currently.
Also, please double-check that both operations return the final data, as frameworks (both Elixir and Python) can return the output tensors without the computation having fully concluded.
Finally, please double-check whether the SentenceTransformer is indeed padding. IIRC padding is not applied in PyTorch if you are not batching.
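One way to force the result to be fully realized on the Elixir side before stopping the timer is to transfer it to the binary backend (a sketch; embedding stands in for the serving output):

    # Transferring to the binary backend requires the computation to have
    # produced the actual data on the host, so the timing includes it.
    embedding = Nx.backend_transfer(embedding, Nx.BinaryBackend)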
Yes, indeed that fixes the particular error. Thank you for looking into this!
Interestingly, the performance is still the same with 8 local devices (~117 encodes/second), though the scheduler usage in the observer looks much messier:
When I find the time, I will also try to compare the results of the Python and Elixir code. I have a Livebook that computes the same cosine similarities as Python using Bumblebee+Axon (no serving, as the mean pooling of the serving has some issues: Bumblebee.Text.TextEmbedding output_pool crashes · Issue #216 · elixir-nx/bumblebee · GitHub). When I have more results, I’ll update the repo and this thread.
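The pooling and similarity in that Livebook boil down to something like the following (a sketch, not the exact notebook code; the assumed tensor shapes are noted in the comments):

    defmodule PoolingSketch do
      import Nx.Defn

      # hidden_state: {batch, seq_len, hidden}, attention_mask: {batch, seq_len}
      defn mean_pool(hidden_state, attention_mask) do
        mask = Nx.new_axis(attention_mask, -1)
        summed = Nx.sum(hidden_state * mask, axes: [1])
        counts = Nx.sum(mask, axes: [1])
        summed / counts
      end

      # Cosine similarity along the last axis for two {batch, hidden} tensors.
      defn cosine_similarity(a, b) do
        dot = Nx.sum(a * b, axes: [-1])
        norm_a = Nx.sqrt(Nx.sum(a * a, axes: [-1]))
        norm_b = Nx.sqrt(Nx.sum(b * b, axes: [-1]))
        dot / (norm_a * norm_b)
      end
    end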
@steffend @jonatanklosko @seanmor5 I have been thinking about this, and it is clear that we are more performant, but forcing a certain sequence length is going to be an issue because we are always working with the worst case.
I can think of two solutions to the problem. Both are based on allowing multiple sequence lengths. For example, instead of 128, we could say 16, 32, 64, 96, and 128. If we do so, we have two options:
1. Allow multiple sequence lengths in the same batch and then pad to the highest. For example, if we get 18, 23, 42, 55, and 90 in a batch, we pad to 96.
2. Allow multiple batch keys. In the example above, 18 and 23 go to the “32-padding batch”, 42 and 55 go to the “64-padding batch”, and 90 goes to the “96-padding batch”. Each batch key has its own batch size and individual timeout. This means better performance, but you will need to balance the batch size and batch timeout accordingly (if the timeout is high, it is more likely you will always hit the timeout).
I am thinking the batch keys approach makes the most sense but I would love to hear your thoughts.
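To make the batch-key idea concrete, the routing could be as simple as picking the smallest configured length that fits (a sketch using the bucket sizes from the example above, not an actual Nx.Serving API):

    defmodule BucketSketch do
      @buckets [16, 32, 64, 96, 128]

      # Pick the smallest bucket that fits the input; fall back to the largest.
      def bucket_for(sequence_length) do
        Enum.find(@buckets, List.last(@buckets), &(&1 >= sequence_length))
      end
    end

    BucketSketch.bucket_for(18)
    #=> 32
    BucketSketch.bucket_for(90)
    #=> 96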
I’ve been running some tests comparing the results more thoroughly this week and will probably post an update tomorrow. I can confirm that EXLA performs better than Python when using the full sequence length. I also started playing with CUDA on AWS, but there I still need to run some more tests.
To measure the impact of the sequence length, I adapted my serving to always tokenize twice: once with the full sequence length and then again limited to the actual sequence length of the input (sketched below). The encodes/second graph for EXLA looks like this (x-axis: sequence length, y-axis: encodes/sec):
And finally I’m attaching the Livebook I used to generate these graphs.
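The “tokenize twice” step mentioned above boils down to roughly this (a sketch; the :length option name matches the Bumblebee version I was using, so treat it as an assumption):

    # Once padded to the compiled length, once without a fixed length so the
    # actual token count of the input can be read off the tensor shape.
    padded = Bumblebee.apply_tokenizer(tokenizer, sentence, length: 128)
    unpadded = Bumblebee.apply_tokenizer(tokenizer, sentence)
    actual_length = Nx.axis_size(unpadded["input_ids"], 1)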
All in all, Elixir and EXLA perform well. The only thing remaining is that I could not get the CPU to be fully loaded with EXLA (the same goes for CUDA).
The first one seems similar to what Python does, always using the longest input sequence length, if I understood that right.
I’ve been thinking about the following: couldn’t we also allow a dynamic sequence length and just-in-time compile when we first get an input with a specific sequence length? Further requests with that length would then hit the already-compiled program. As the sequence length is finite, this would mean that one could either pre-compile every sequence length or “warm up” the serving.
We can do that for sure, but it means you may compile the program several times. It is something I will consider while exploring these ideas.
Having multiple variants sounds great! Both 1. and 2. make certain trade-offs, and which is better depends on the length distribution. If longer inputs are rare, then with 2. we will often hit the batch timeout and pad with empty batch items, while we may as well have put some shorter inputs there. But note that we pad on the client as part of tokenization, and it impacts all of the input tensors (input ids, attention mask); padding to a higher length later means we need to pad on the server. With 2. we always know what length to pad to.