Understanding Nx.Serving partitions and distribution

First, thank you to the maintainers: this is an awesome project. I'm interested in using Nx.Serving for production deployments, but I'm having trouble understanding the part of the documentation that explains how partitions and distribution work. I'm hoping someone can clarify.

First, distribution makes more sense to me. I can start a node in my cluster and configure it to serve a model. I can then set up a second node that is configured not to run the model. As long as the two nodes are connected (via libcluster or similar), the second node will be able to make requests to the model process on the first node automatically. This makes sense in a scenario where I want to run my model on GPU nodes and some worker (an HTTP server, a Kafka consumer) on CPU nodes. Is this understanding correct? What other use cases are there for this?
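To make my mental model concrete, here is roughly how I imagine wiring this up. All names (`MyApp.Serving`, the doubling function, the batch options) are placeholders I made up, and I'm going from my reading of the docs, so please correct me if the calls are wrong:

```elixir
# On the GPU node: build a serving and supervise it under a name.
# The function passed to Nx.Serving.new/1 receives compiler options and
# returns the compiled computation that will run on each batch.
serving =
  Nx.Serving.new(fn opts ->
    Nx.Defn.jit(&Nx.multiply(&1, 2), opts)
  end)

children = [
  {Nx.Serving,
   serving: serving,
   name: MyApp.Serving,
   batch_size: 8,
   batch_timeout: 100}
]

Supervisor.start_link(children, strategy: :one_for_one)

# On a CPU-only node in the same cluster, my understanding is that I can call:
#
#   Nx.Serving.batched_run(MyApp.Serving, Nx.Batch.concatenate([Nx.tensor([1, 2, 3])]))
#
# and the request gets routed to the serving process running on the GPU node.
```

Is that the intended shape, i.e. the CPU nodes never start the serving themselves and just call `batched_run/2` with the shared name?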

Second, partitions make less sense to me. Does this mean that if I have N GPUs available, it will load N instances of the model, one per GPU, and then load balance between them? It would be helpful if someone could share concrete examples of when this option is useful and when it's not.
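Again to be concrete, here is the sketch of what I think `partitions: true` does; the serving definition and option values are placeholders, and the comments describe my guess rather than anything I've verified:

```elixir
# My reading: with partitions: true, Nx.Serving asks the compiler backend
# (e.g. EXLA) how many devices are available, loads one copy of the
# computation per device, and spreads incoming batches across them.
serving =
  Nx.Serving.new(fn opts ->
    Nx.Defn.jit(&Nx.multiply(&1, 2), opts)
  end)

children = [
  {Nx.Serving,
   serving: serving,
   name: MyApp.PartitionedServing,
   batch_size: 8,
   partitions: true}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

If that's right, is partitioning purely a single-node, multi-device concern, while distribution is the multi-node concern, and the two compose?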

Thank you in advance for any feedback.


Reading back my own post, I felt that I was rambling a bit. Can anyone explain how partitioning works and what it's used for?