I have a service that receives millions of requests per second and I need to answer each request concurrently.
To achieve this I have a pool of workers where each worker holds on to a permanent connection to a server - this way the connections are always ready to use.
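To make the setup concrete, here is a minimal sketch of one such worker. `MyApp.Conn` is a hypothetical stand-in for whatever connection module is actually used; the point is only that the connection is opened once in `init/1` and reused for every request.

```elixir
defmodule MyApp.Conn do
  # Stand-in for a real connection module (e.g. a TCP client).
  # Returns a fake handle so the sketch runs on its own.
  def connect, do: {:ok, make_ref()}
  def send_request(_conn, request), do: {:ok, request}
end

defmodule MyApp.Worker do
  use GenServer

  def start_link(id) do
    GenServer.start_link(__MODULE__, id, name: :"worker_#{id}")
  end

  @impl true
  def init(id) do
    # Open the permanent connection once, up front.
    {:ok, conn} = MyApp.Conn.connect()
    {:ok, %{id: id, conn: conn}}
  end

  @impl true
  def handle_cast({:handle, request}, state) do
    # Reuse the long-lived connection for every request.
    MyApp.Conn.send_request(state.conn, request)
    {:noreply, state}
  end
end
```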
Challenge
The challenge here is in choosing which worker I want to use.
Balancer
Normally this would be done with a GenServer balancer, like poolboy - a GenServer that picks which worker you can use.
But this approach has a big issue: the balancer itself becomes a bottleneck, and according to our benchmark tests, a serious one. A single balancer process simply cannot keep up with that many requests.
Free for all
So we remove the middleman and let the client pick a worker: the client knows how to generate a worker id and how to reach that worker. This is the approach we are going with right now.
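A sketch of this free-for-all dispatch, assuming workers are registered under `{:worker, id}` in a Registry named `MyApp.WorkerRegistry` (both names are mine, not the author's): the client hashes the request into a worker id, so no central process sits on the hot path.

```elixir
defmodule MyApp.Client do
  @pool_size 32

  # The client derives a worker id from the request itself -
  # :erlang.phash2/2 gives a deterministic index in 0..@pool_size-1 -
  # and casts straight to that worker via the Registry.
  def dispatch(request) do
    worker_id = :erlang.phash2(request, @pool_size)
    name = {:via, Registry, {MyApp.WorkerRegistry, {:worker, worker_id}}}
    GenServer.cast(name, {:handle, request})
  end
end
```

Note that nothing here stops two clients (or one busy client) from hashing onto the same worker, which is exactly the mailbox problem described below.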
But it has a problem - multiple clients can pick the same worker. This in and of itself is not a problem. The real problem is that the worker's mailbox will grow without bound until it explodes - and that is a problem.
Enter locks
To solve the issue with the free-for-all approach, we decided to use locks via the Registry. When a client requests a worker, that worker locks itself against other clients. The lock is not immediate, so this solution is prone to race conditions, but each worker can handle a dozen clients without a problem. The fight here is in making sure the mailbox doesn't grow without bound.
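One way to sketch the lock idea is with a `:unique` Registry used as an advisory lock. The names (`MyApp.LockRegistry`) and the linear-probe retry are my assumptions, not the author's exact implementation, and here the claiming process holds the lock rather than the worker itself. `Registry.register/3` is atomic per key, so only one process wins a given worker - but casts already in flight are unaffected, which is the race the author mentions.

```elixir
defmodule MyApp.Lock do
  @pool_size 32

  # Try to claim worker `id`; on conflict, probe the next worker.
  def checkout(id, attempts \\ @pool_size)
  def checkout(_id, 0), do: {:error, :all_busy}

  def checkout(id, attempts) do
    case Registry.register(MyApp.LockRegistry, {:lock, id}, nil) do
      {:ok, _owner} ->
        {:ok, id}

      {:error, {:already_registered, _pid}} ->
        checkout(rem(id + 1, @pool_size), attempts - 1)
    end
  end

  # Release the lock so other clients can claim this worker.
  def checkin(id), do: Registry.unregister(MyApp.LockRegistry, {:lock, id})
end
```

A nice property of Registry locks is that they clean up automatically: if the lock-holding process dies, the Registry drops its entries, so a crashed client cannot leave a worker locked forever.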
What do you think?
Keeping in mind that the goal is to stop the worker's mailbox from growing without bound, two other solutions occurred to me:
Check the worker's mailbox and decide what to do based on its size. However, I was told that checking another process's mailbox directly is dangerous and ill advised, so I dropped that solution.
Back-pressure control via GenServer.call. Currently the client uses GenServer.cast so it can continue working (the client doesn't really care about a response). However, we also discarded this approach because it would block all clients, all the time. This is not acceptable.
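For reference, both discarded options are short to sketch. The module and threshold are hypothetical. `Process.info/2` with `:message_queue_len` is the peek the author was warned about - reading one key is cheap, but the value can be stale by the time you act on it, which is part of why it's ill advised as a correctness mechanism.

```elixir
defmodule MyApp.Dispatch do
  @max_queue 100

  # Option 1: peek at the worker's mailbox before casting.
  # Returns {:error, :overloaded} if the queue is long or the worker died.
  def cast_if_free(worker_pid, request) do
    case Process.info(worker_pid, :message_queue_len) do
      {:message_queue_len, n} when n < @max_queue ->
        GenServer.cast(worker_pid, {:handle, request})

      _ ->
        {:error, :overloaded}
    end
  end

  # Option 2: back pressure via call - the client blocks until the worker
  # replies, so the mailbox cannot grow without bound, at the cost the
  # author found unacceptable: every client waits on every request.
  def call_with_backpressure(worker_pid, request, timeout \\ 5_000) do
    GenServer.call(worker_pid, {:handle, request}, timeout)
  end
end
```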
What other ideas do you have?
What are your opinions?
I’d consider looking into GenStage if you have problems with keeping up to the workload. Also this has good arguments for proper back-pressure: https://ferd.ca/queues-don-t-fix-overload.html
Do you have any article recommendations for GenStage?
I quite like the idea of defining a system's operational limits, but when you are the one responsible for those limits (because you have a worker balancer that is a bottleneck and can be removed), I am not sure the article applies. It makes perfect sense when the bottleneck is not under your control - an external service, for example.
I'm not sure why your balancer alone is the bottleneck, though. Why would its message queue grow if your workers are fast enough? For GenStage, I suggest starting with its docs.
That's because you have a single process that must pick and redirect a workload meant for thousands of processes. One process alone cannot handle it, even if the workers themselves are fast enough.
GenStage is simply about the last (or any other) Consumer in the pipeline being able to propagate back pressure all the way back to the Producer at the beginning of the pipeline so that no single stage in the pipeline gets overwhelmed.
The chain is only as strong as its weakest link, so the design of the pipeline and of each stage still constrains its performance limits. There is no guarantee that events (requests) won't just pile up in front of the Producer at the beginning of the pipeline.
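GenStage's demand contract can be sketched with plain processes, no `gen_stage` dependency (module and message names here are mine): the consumer asks for work, and the producer only sends what was asked for, so no stage's mailbox outruns its owner - which is the whole point of the back-pressure argument above.

```elixir
defmodule Demand do
  # Producer: waits for an {:ask, n, from} message, then sends at most
  # n events back. It never pushes unrequested work.
  def producer(events) do
    receive do
      {:ask, n, from} ->
        {batch, rest} = Enum.split(events, n)
        send(from, {:events, batch})
        producer(rest)
    end
  end

  # Consumer: asks for `demand` events, processes the batch, asks again.
  # Stops (returning everything seen, in order) when the producer runs dry.
  def consume(producer_pid, demand, acc \\ []) do
    send(producer_pid, {:ask, demand, self()})

    receive do
      {:events, []} ->
        Enum.reverse(acc)

      {:events, batch} ->
        consume(producer_pid, demand, Enum.reverse(batch) ++ acc)
    end
  end
end
```

Real GenStage adds subscriptions, dispatchers, and buffering on top, but the flow of demand upstream and events downstream is exactly this shape - and, as noted, it protects the stages, not whatever is piling up in front of the first producer.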