Strategies to keep process mailboxes from exploding?

Background

I have a service that receives millions of requests per second and I need to answer each request concurrently.

To achieve this I have a pool of workers where each worker holds on to a permanent connection to a server - this way the connections are always ready to use.

Challenge

The challenge here is in choosing which worker I want to use.

Balancer

Normally this would be done with a GenServer balancer, like poolboy - a GenServer that picks which worker you can use.

But this approach has a big issue - the balancer itself becomes a bottleneck, and according to our benchmark tests it is a serious one. A single balancer process simply cannot keep up with that many requests.

Free for all

So remove the middle man and let the client pick a worker. The client knows how to generate a worker id and knows how to reach that worker directly. This is the approach we are going with right now.
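Roughly along these lines - a minimal sketch assuming the workers are registered in a Registry under integer ids (all names here are illustrative, not our real code):

```elixir
defmodule MyApp.Client do
  @pool_size 100

  def request(payload) do
    # Each client picks a worker id on its own - no central balancer involved.
    worker_id = :rand.uniform(@pool_size)

    GenServer.cast(
      {:via, Registry, {MyApp.WorkerRegistry, worker_id}},
      {:request, payload}
    )
  end
end
```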

But it has a problem - multiple clients can pick the same worker. That in and of itself is not an issue. The problem is that the worker’s mailbox will grow without bound until it explodes - and that is a problem.

Enter locks

To solve the issue with the free-for-all approach, we decided to use locks via the Registry. When a client requests a worker, the worker locks itself away from other clients. The lock is not immediate, so this solution is prone to race conditions, but each worker can handle a dozen clients without a problem. The fight here is in making sure the mailbox doesn’t grow without bound.
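Conceptually it looks something like this - a simplified sketch with illustrative names (the real code differs), assuming a duplicate-key Registry where workers advertise themselves under an `:available` key and withdraw while busy:

```elixir
defmodule MyApp.Worker do
  use GenServer

  # Assumes: Registry.start_link(keys: :duplicate, name: MyApp.AvailableWorkers)

  def init(state) do
    # Advertise this worker as free to take requests.
    {:ok, _} = Registry.register(MyApp.AvailableWorkers, :available, nil)
    {:ok, state}
  end

  def handle_cast({:request, payload}, state) do
    # "Lock": stop advertising so new clients are less likely to pick this worker.
    Registry.unregister(MyApp.AvailableWorkers, :available)

    state = do_work(payload, state)

    # "Unlock": advertise again once we are ready for more load.
    {:ok, _} = Registry.register(MyApp.AvailableWorkers, :available, nil)
    {:noreply, state}
  end

  defp do_work(_payload, state), do: state
end
```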

What do you think?

Keeping in mind that the goal here is to keep the worker’s mailbox from growing without bound, two other solutions occurred to me:

  1. Check the worker’s mailbox and decide what to do (see the sketch after this list). However, I was told that checking another process’s mailbox directly is dangerous and ill advised, so I dropped that solution.
  2. Back-pressure control via GenServer.call. Currently the client uses GenServer.cast so it can keep working (the client doesn’t really care about a response). We discarded this approach as well because it would block all clients, all the time. This is not acceptable.
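For reference, option 1 would have looked roughly like this - a sketch with made-up names and an arbitrary threshold, kept here only for illustration (and, as noted above, peeking at another process’s mailbox is considered ill advised):

```elixir
defmodule MyApp.MailboxCheck do
  @max_queue 1_000

  def cast_if_not_overloaded(worker_pid, msg) do
    case Process.info(worker_pid, :message_queue_len) do
      {:message_queue_len, len} when len < @max_queue ->
        GenServer.cast(worker_pid, msg)

      _ ->
        # Worker is overloaded (or dead) - shed the load instead of queueing more.
        {:error, :overloaded}
    end
  end
end
```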

What other ideas do you have?
What are your opinions?

I’d consider looking into GenStage if you have problems keeping up with the workload. This article also makes good arguments for proper back-pressure: https://ferd.ca/queues-don-t-fix-overload.html


Do you have any article recommendations for GenStage?

I quite like the idea of defining a system’s operational limits, but when you are the one responsible for those limits (because you have a worker balancer that is a bottleneck and can be removed), I am not sure the article applies. It makes perfect sense when the bottleneck is not under your control, like an external service, for example.

I’m not sure why your balancer would be the only bottleneck, though. Why would its message queue grow if your workers are fast enough? For GenStage I suggest starting with its docs.

That’s because you have 1 process that needs to pick and redirect a workload meant for thousands of processes. 1 process alone cannot handle it, even if the workers are fast enough.

Handling Overload (2016-11-24)


GenStage is simply about the last (or any other) Consumer in the pipeline being able to propagate back pressure all the way back to the Producer at the beginning of the pipeline so that no single stage in the pipeline gets overwhelmed.

The chain is only as strong as its weakest link. So the design of the pipeline and any one of its stages will still constrain its performance limits. So there is no guarantee that events (requests) won’t just pile up in front of the Producer at the beginning of the pipeline.
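For a feel of the mechanics, here is a minimal sketch along the lines of the GenStage docs (names are illustrative): a producer that only emits events when demand arrives, and a consumer that asks for at most 10 events at a time - that demand is what propagates back pressure upstream.

```elixir
defmodule Counter do
  use GenStage

  def start_link(initial), do: GenStage.start_link(__MODULE__, initial, name: __MODULE__)

  def init(counter), do: {:producer, counter}

  # Events are only produced when a consumer asks for them (demand),
  # which is how back pressure reaches the head of the pipeline.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok)

  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Counter, max_demand: 10}]}

  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end
```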


An example of checking the mailbox is lager, which switches between sync and async sends to act as backpressure:

Another solution is Fred’s pobox library https://github.com/ferd/pobox

As you mention, a pool isn’t a fix in and of itself, but I figured I’d mention there is a fast pool option now with persistent_term. I’ve been playing with this for OpenCensus stats recording: https://github.com/census-instrumentation/opencensus-erlang/blob/de37fc236febfd7d3c79b14d33974a3f09474176/src/oc_stat_collectors.erl#L38. The main idea there is to spread load across schedulers to reduce lock contention on the mailboxes when recording a stat - not so much to prevent overflow of a single process, though it does help with scale.
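A rough sketch of that idea in Elixir (module and function names are made up): start one worker per scheduler, store the pids in persistent_term, and have callers pick the worker bound to whichever scheduler they are currently running on.

```elixir
defmodule SchedulerPool do
  # Sketch only: in a real app the workers would live under a supervisor.
  def start_workers(worker_module) do
    for id <- 1..:erlang.system_info(:schedulers) do
      {:ok, pid} = worker_module.start_link([])
      # persistent_term reads are cheap and lock-free; writes happen once at startup.
      :persistent_term.put({__MODULE__, id}, pid)
    end

    :ok
  end

  # Pick the worker tied to the caller's current scheduler,
  # spreading load (and mailbox pressure) across schedulers.
  def pick do
    :persistent_term.get({__MODULE__, :erlang.system_info(:scheduler_id)})
  end
end
```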
