I have recently been investigating the possibility of building out a LLaMa2 70B system that needs to scale to thousands of requests per minute. I realize this will require a lot of hardware, but for now I’m just exploring. I’ve tried using HuggingFace’s Text Generation Inference (aka TGI) to serve the model requests and have been able to get decent performance on the right kind of hardware thanks to its continuous batching of requests. I have also been able to test on smaller machines by using quantization (even though this seems to be not recommended for inference, as it’s much slower).
Since I don’t see any issues on the Bumblebee GitHub about the kinds of features offered by TGI, I am wondering whether any of them are implemented or planned. The features I’d be looking for are things like:
- FlashAttention v2 & PagedAttention
- Continuous Batching
- Token Streaming
- Sharding a single model across multiple GPUs (this is probably an Nx thing, though I haven’t tried it)
- Quantization for running on smaller machines (e.g. a dev environment)
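To make the continuous batching item concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions: it is not TGI or Bumblebee code, the function names and request lengths are made up for illustration, and it counts decode steps only (one token per running request per step), ignoring prefill and memory limits. It compares running fixed batches to completion against refilling a slot as soon as a request finishes.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step yields one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six queued requests.
lengths = [2, 8, 2, 8, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

With these made-up lengths, the short requests no longer wait for the long ones sharing their batch, so the same 24 tokens are generated in 12 steps instead of 18.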
Some of these concepts are mentioned here for example:
Hey @mgwidmann! Many deployment considerations are built into Nx.Serving. A couple of things it does for you:
- batching inference requests
- if you have multiple hosts in a connected cluster, the work is load balanced transparently
- if you have multiple GPUs on the given host, you can easily load balance the work between them too
- it supports streaming (for example, Bumblebee.Text.generation builds a serving instance and all you need to do is pass
- encapsulates model pre/post processing (so it can be treated as an end-to-end pipeline in a way)
You can look at the Llama example in the Bumblebee v0.4.2 docs for an example of using an LLM with Bumblebee.
As for the other points:
- optimised attention - we don’t have it yet, but that’s something we plan on supporting in Bumblebee
- sharding a single model across GPUs - for that we don’t have an abstraction yet
- quantization - there is ongoing work in the EXLA compiler to use MLIR, which should enable quantization
Thanks a lot! The use case I was looking to compare with would be something like LLaMa2 70B sharded across 8 A100 GPUs, since that model cannot fit within a single GPU, especially unquantized.
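As a rough back-of-the-envelope check on why a single GPU is not enough, the weight memory can be estimated in a few lines of Python. This is a sketch under assumptions: it uses a nominal 70e9 parameters, counts the weights only (no KV cache, activations, or framework overhead), takes 80 GB as the largest A100 variant, and assumes the weights shard evenly across GPUs. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 70e9        # nominal parameter count for a "70B" model (assumption)
A100_MEMORY_GB = 80  # largest A100 variant; a 40 GB variant also exists
NUM_GPUS = 8

fp16_total = weight_memory_gb(PARAMS, 16)  # unquantized, 16-bit weights
int4_total = weight_memory_gb(PARAMS, 4)   # 4-bit quantized weights
fp16_per_gpu = fp16_total / NUM_GPUS       # even split across 8 GPUs

print(f"fp16 weights:        {fp16_total:.1f} GB")    # 140.0 GB
print(f"4-bit weights:       {int4_total:.1f} GB")    # 35.0 GB
print(f"fp16 per GPU (of 8): {fp16_per_gpu:.1f} GB")  # 17.5 GB
print("fits on one A100 in fp16:", fp16_total <= A100_MEMORY_GB)  # False
```

Under these assumptions, the 16-bit weights alone need about 140 GB, well over a single 80 GB A100, while an even 8-way split leaves roughly 17.5 GB of weights per GPU and the rest of each card for the KV cache and activations.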