High Scale Performance of LLMs -- Needed features?

mgwidmann · September 29, 2023, 3:47am

I have recently been investigating the possibility of building a LLaMa2 70B system that can scale to thousands of requests per minute. I realize this will require a lot of hardware, but for now I’m just exploring. I’ve tried using HuggingFace’s Text Generation Inference (aka TGI) to serve the model and have been able to get decent performance on the right kind of hardware, thanks to its continuous batching of requests. I have also been able to test on smaller machines by using quantization (even though this seems not to be recommended for inference, as it’s much slower).

I am wondering, since I don’t see issues on the Bumblebee GitHub regarding the kinds of features offered by TGI, are any of them implemented or planned to be implemented? Features I’d be looking for are things like:

FlashAttention v2 & PagedAttention
Continuous Batching
Token Streaming
Sharding a single model across multiple GPUs (this is probably an Nx thing, though I haven’t tried it)
Quantization for running on smaller machines (i.e. a dev environment)

Some of these concepts are mentioned here for example:

https://vilsonrodrigues.medium.com/serving-falcon-models-with-text-generation-inference-tgi-5f32005c663b

jonatanklosko · October 2, 2023, 8:22am

Hey @mgwidmann! Many deployment considerations are built into Nx, specifically Nx.Serving. A couple of things it does for you:

batching inference requests
if you have multiple hosts in a connected cluster, the work is load balanced transparently
if you have multiple GPUs on the given host, you can easily load balance the work between them too
it supports streaming (for example, Bumblebee.Text.generation builds a serving instance and all you need to do is pass stream: true)
encapsulates model pre/post processing (so it can be treated as an end-to-end pipeline in a way)
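
To make the first point concrete, here is a minimal conceptual sketch of request batching, written in Python purely for illustration. It is not Nx.Serving's implementation: the function name `form_batches` and its parameters are hypothetical, loosely modelled on the idea of a maximum batch size combined with a maximum wait time. It assumes a batch is closed either when it is full or when a new request arrives too long after the first request in the batch.

```python
def form_batches(arrival_times_ms, batch_size, batch_timeout_ms):
    """Group request arrival times (in ms) into batches.

    Hypothetical helper for illustration only. A batch is closed when it
    holds `batch_size` requests, or when the next request arrives more than
    `batch_timeout_ms` after the first request in the current batch.
    """
    batches = []
    current = []
    for t in sorted(arrival_times_ms):
        # The waiting window of the current batch has expired: close it.
        if current and t - current[0] > batch_timeout_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch it immediately.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    # Flush whatever is left once the input is exhausted.
    if current:
        batches.append(current)
    return batches


busy = form_batches([0, 10, 20, 30, 200, 210], batch_size=4, batch_timeout_ms=100)
quiet = form_batches([0, 50, 150, 160], batch_size=4, batch_timeout_ms=100)
```

Under these assumptions, `busy` groups the first four requests into one full batch and the last two into a partial one, while `quiet` never fills a batch and is split by the timeout alone. A real serving layer does this concurrently as requests arrive rather than over a pre-sorted list.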

You can look at the Llama page in the Bumblebee v0.4.2 documentation for an example of using an LLM with Bumblebee.

As for the other points:

optimised attention - we don’t have it yet, but that’s something we plan on supporting in Bumblebee
sharding a single model across GPUs - for that we don’t have an abstraction yet
quantization - there is ongoing work in the EXLA compiler to use MLIR, which should enable quantization

mgwidmann · October 5, 2023, 7:40pm

Thanks a lot! The use case I was looking to compare against is something like LLaMa2 70B sharded across 8 A100 GPUs, since that model cannot fit on a single GPU, especially unquantized.
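
A rough back-of-the-envelope check of that claim, as a Python sketch. It counts model weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The parameter count is the nominal 70 billion, the 80 GB figure assumes the larger A100 variant (a 40 GB variant also exists), and the helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 70e9        # nominal parameter count of a 70B model (assumption)
A100_MEMORY_GB = 80    # assumes the 80 GB A100 variant
NUM_GPUS = 8

fp16_gb = weight_memory_gb(N_PARAMS, 2)     # 16-bit weights: 2 bytes each
int4_gb = weight_memory_gb(N_PARAMS, 0.5)   # 4-bit quantized: half a byte each
fp16_per_gpu_gb = fp16_gb / NUM_GPUS        # even split across the GPUs

fits_unquantized_on_one_gpu = fp16_gb <= A100_MEMORY_GB
fits_sharded = fp16_per_gpu_gb <= A100_MEMORY_GB
```

With these numbers the 16-bit weights come to 140 GB, which exceeds a single 80 GB card, while an even split across 8 GPUs leaves 17.5 GB of weights per device. At 4 bits per weight the total drops to 35 GB, which is why quantization makes single-GPU experiments feasible.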