What do you think could be the best way to garbage collect in Bandit?

Bandit author here.

We’re currently working up a solution to the oft-reported issue whereby Bandit’s memory consumption increases over time when HTTP/1 connections are kept open via keepalive (an issue exacerbated by load balancers, which will reuse a single connection for a LONG time). Because Bandit uses the same process that handles the TCP connection to run the Plug stack for each subsequent HTTP request, a single process may end up serving any number of HTTP requests, and its memory usage balloons as a result.

Thanks to the tireless efforts of @ianko, we’ve managed to isolate a couple of approaches, which we’re discussing here. The gist of it is that we have two solutions which are roughly equivalent in terms of efficacy:

  1. Explicitly call :erlang.garbage_collect() between every separate HTTP request on a single connection.
  2. Set fullsweep_after: 0 on the handler process, so that every minor collection becomes a full sweep of the old heap.
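
To make the two options concrete, here’s a rough sketch of what each would look like on a simplified handler loop (illustrative only; the module names and serve_one_request/1 are placeholders, not Bandit’s actual code):

```elixir
# Approach 1: explicitly GC the handler process between requests.
defmodule ExplicitGC do
  def loop(socket) do
    serve_one_request(socket)   # hypothetical per-request Plug work
    :erlang.garbage_collect()   # reclaim that request's garbage right away
    loop(socket)
  end

  defp serve_one_request(_socket), do: :ok
end

# Approach 2: start the handler with fullsweep_after: 0, so every minor GC
# the VM schedules is also a full sweep of the old heap.
defmodule FullsweepZero do
  def start(socket) do
    :erlang.spawn_opt(__MODULE__, :loop, [socket], fullsweep_after: 0)
  end

  def loop(socket) do
    serve_one_request(socket)
    loop(socket)
  end

  defp serve_one_request(_socket), do: :ok
end
```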

@ianko has provided pretty exhaustive evidence that these two approaches solve the problem, and that both have a negligible effect on performance. However, if you look towards the end of the discussion, you’ll see something that surprised me: the fullsweep_after approach actually ended up using significantly more CPU to accomplish the same outcome.

So, my questions to any VM wizards in the audience are these:

  1. Do you have any guidance about preferring explicit GC calls vs. tuning fullsweep_after and letting the VM figure it out? Explicitly trying to do the VM’s job for it seems heavy-handed, but the evidence seems to suggest it’s the more performant option in this case. Advice welcome.

  2. Any advice about how to tune either approach? On the explicit GC side, I’ve defined a config option to specify that we should only GC every ‘n’ requests, but this again feels like a pretty coarse way to do the VM’s job. On the fullsweep_after side, we could use values other than 0, but picking values for this feels like stabbing in the dark.

I’d love any advice y’all are able to provide on this.

15 Likes

Have you considered using Process.hibernate rather than triggering an explicit sweep?

Bump the “GC after” config to a less frequent value, say 50-100, and trigger a hibernate. That will still cause a GC and minimize memory if the connection goes unused briefly.

A similar approach has worked well for pubsub heavy processes that accumulated binary garbage in Oban.
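
A minimal sketch of that idea, assuming the handler keeps its own request counter and the socket is in active mode so incoming data wakes the hibernated process (the names and threshold are illustrative, not Bandit internals):

```elixir
defmodule HibernateEveryN do
  @hibernate_every 50   # the less-frequent "GC after" value suggested above

  def loop(socket, count \\ 0) do
    serve_one_request(socket)   # hypothetical per-request work
    count = count + 1

    if rem(count, @hibernate_every) == 0 do
      # Hibernation garbage collects and shrinks the heap; the next incoming
      # message resumes the process in loop/2.
      Process.hibernate(__MODULE__, :loop, [socket, count])
    else
      loop(socket, count)
    end
  end

  defp serve_one_request(_socket), do: :ok
end
```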

8 Likes

Not as such, no. The reason is that there’s another wrinkle I forgot to mention: I’m trying to minimize the time spent between subsequent requests. For load balancers it’s not too big of a deal (they don’t generally block on a single upstream connection, at least not in the typical case), but for browser clients making a bunch of queued-up requests on a single connection, any time spent between requests is more or less directly visible as latency on the subsequent request, so hibernation is very much not the right thing to do there (unless I’m mistaken?).

That being said, there are many keepalive cases where hibernation makes sense, though likely only after a certain delay (i.e. if the client hasn’t sent a subsequent request in, say, 5 seconds, it’s likely that the browser is just holding the connection open for possible later use, but nothing is currently being requested; hibernation makes perfect sense in that case). I’ve added this to my plans for Bandit after the next protocol refactor (so it’ll likely land later this year). Thanks for the idea!
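
Roughly, the “hibernate only after an idle delay” version could look like this (a sketch under the assumption that the socket is in active mode, so inbound data arrives as messages; names and timeouts are illustrative):

```elixir
defmodule IdleHibernate do
  # If no request arrives for ~5s, assume the browser is just holding the
  # connection open and hibernate; this GCs and shrinks the heap.
  @idle_ms 5_000

  def loop(socket) do
    receive do
      {:tcp, ^socket, data} ->
        serve_request(socket, data)   # hypothetical per-request work
        loop(socket)

      {:tcp_closed, ^socket} ->
        :ok
    after
      @idle_ms ->
        # The next message wakes the process back up into loop/1.
        Process.hibernate(__MODULE__, :loop, [socket])
    end
  end

  defp serve_request(_socket, _data), do: :ok
end
```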

3 Likes

How about just quitting after serving a preset number of requests, like a thousand? The reverse proxy or load balancer will reconnect, giving birth to a new process. Calling :erlang.garbage_collect() feels like tuning to a particular behavior of the VM. On the other hand, quitting after serving a fixed term has been used since forever; Apache still does this.

4 Likes

There’s been an option for this for a while (http_1_options: [max_requests: 1000]), but you’ll still see stair-step memory usage.
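
For reference, that looks something like this in a typical Bandit child spec (MyPlug is a placeholder):

```elixir
children = [
  {Bandit, plug: MyPlug, port: 4000, http_1_options: [max_requests: 1000]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```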

1 Like

Stair steps that keep going up, or more like a sawtooth? If it’s a sawtooth, I wouldn’t be worried; at least the user has a way to trade the height of the sawtooth against performance by tuning this number.

1 Like

Sawtooth, sorry.

To be clear, I’m not personally worried - memory use that ends up being resolved by a full-sweep GC is totally fine and not a performance issue. I know that this isn’t actually solving much of anything. However, it does show up as memory bloat on people’s telemetry charts, and that will end up being a support burden (as well as being the root of inevitable myths about how ‘Bandit isn’t as good with memory’).

2 Likes

My 2¢: clearing memory after each request does not seem like a bad option to have. It’s very “C-like” to make the process owner accountable for memory (in a way).

That said, I haven’t researched :erlang.garbage_collect and whether it affects only the current process or is a global request. If global, maybe not the best idea.

1 Like

I am against pokes in 99% of cases, but this time I think it’s justified if we summon @rvirding, @garazdawi and @bjorng. Sincere apologies to them if I am mistaken.

If you can’t make this problem go away, I’d actually think about doing a double fan-out, i.e. having these processes spawn other, much shorter-lived processes, each of which represents a single request, while the spawning processes represent connections – which, as you said, are prone to being long-lived due to the keep-alive policies that, ahem, spawned this problem in the first place.

You can also minimize latency there by keeping a pool of several pre-spawned sub-processes for each connection process, and expanding that pool under heavy load.
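
Something like this, perhaps (just a sketch; handle_request/2 is a placeholder and it assumes an active-mode socket). Each request runs in a short-lived process whose heap, and garbage, disappears when it exits:

```elixir
defmodule FanOutConnection do
  def loop(socket) do
    receive do
      {:tcp, ^socket, data} ->
        # Run the request in its own process; awaiting keeps responses
        # ordered on the connection.
        Task.async(fn -> handle_request(socket, data) end)
        |> Task.await()

        loop(socket)

      {:tcp_closed, ^socket} ->
        :ok
    end
  end

  defp handle_request(_socket, _data), do: :ok
end
```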

BTW, when you said in the OP that both your suggested approaches only introduce minimal latency, what percentage are we talking about? And what absolute numbers? I’ve looked at the graphs in the GitHub thread, but I can’t intuit much from them (in some of them the latency looks roughly 20% higher, but in most it looks like there’s no difference?).

Though I’ll agree with some of the commenters in the GitHub thread that even if explicit GC works, it still feels like a hack / workaround. :confused: But this is the real world, and we have to make compromises. As a guy involved in a greenfield project where I chose Bandit over Cowboy, I wouldn’t be against explicit GC as a final solution if nothing else turns up.

9 Likes

To me, doing a manual garbage collect seems like a good idea for this type of scenario. It will be much easier for the application code to know when it is a good time to do a GC than it is for the system. Another solution would be to spawn a process per request (as mentioned earlier in this thread), but that has other tradeoffs in performance and memory usage.

I’m not surprised that fullsweep_after: 0 is more expensive, as it removes the old generation of the heap, which means that any long-lived data will be copied in every GC, whereas if you GC manually, that data will only be copied when the manual GC is done.

Speaking of the old heap, maybe it would make sense for you to only trigger a minor GC? If you call :erlang.garbage_collect(self(), [{:type, :minor}]) it will only collect the young generation, and maybe that is enough? Or you could try to couple that with setting fullsweep_after to some low value that is not 0. It’s very hard to know what will be effective, as it depends a lot on what the process is doing.
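
As a concrete (purely illustrative) sketch of that combination, with placeholder names and an arbitrary fullsweep_after value:

```elixir
defmodule MinorGCHandler do
  def start(socket) do
    # Low-but-nonzero fullsweep_after: the VM still does occasional full sweeps.
    :erlang.spawn_opt(__MODULE__, :loop, [socket], fullsweep_after: 5)
  end

  def loop(socket) do
    serve_one_request(socket)                        # hypothetical per-request work
    :erlang.garbage_collect(self(), type: :minor)    # collect the young generation only
    loop(socket)
  end

  defp serve_one_request(_socket), do: :ok
end
```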

16 Likes

I don’t think so (though I’m far from well informed on the internals of the VM). The reason I say this is that I understand fullsweep_after: 0 to basically turn every minor GC into a full sweep, without actually inducing any more GC cycles overall. Given that adding fullsweep_after: 0 solves the memory growth, I believe that implies minor GCs aren’t enough to alleviate the issue; major GCs are needed.

1 Like

In Spawn we check the mailbox size before hibernating, in order to minimize the effects of extra latency. We also use fullsweep_after: 10 instead of 0. These changes (we also compress our internal state, but that’s beside the point in your scenario) helped us drastically reduce memory usage while keeping performance good enough.
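
A rough sketch of that pattern (illustrative only, not Spawn’s actual code):

```elixir
defmodule MailboxAwareHandler do
  def start(state) do
    :erlang.spawn_opt(__MODULE__, :loop, [state], fullsweep_after: 10)
  end

  def loop(state) do
    receive do
      msg ->
        state = handle(msg, state)   # hypothetical message handling

        # Only hibernate when the mailbox is empty, so queued messages
        # aren't delayed by the extra wake-up cost.
        case Process.info(self(), :message_queue_len) do
          {:message_queue_len, 0} -> Process.hibernate(__MODULE__, :loop, [state])
          _ -> loop(state)
        end
    end
  end

  defp handle(_msg, state), do: state
end
```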

2 Likes

I’m definitely going to be doing a deep pass on long-lived process management, but I have a few more things in the queue ahead of that. I’m hoping to hit hibernation and revisit this GC work with the hindsight of real-world use later this summer. I’m being really careful not to lock down anything in our API in the meantime (specifically, the config flag for this is marked ‘experimental’).

I’ll definitely be looking at your approach in Spawn for inspiration. Thanks for the links!

1 Like

Bandit 1.3.0 just went out with the explicit GC fix mentioned here (and a default of GC’ing every 5 requests). I’m not committing to any stable public interface for this yet; the relevant config option to tune it is marked ‘experimental’, as I reserve the right to change how we accomplish this based on feedback (I’m planning on revisiting this somewhere in the second half of 2024).
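
If you want to experiment with the frequency, it’s tunable under http_1_options; note that the option key shown below is illustrative (it’s the experimental knob mentioned above, so please check the current Bandit docs for its exact name before relying on it):

```elixir
children = [
  # NOTE: the option key here is illustrative / experimental; consult Bandit's docs.
  {Bandit, plug: MyPlug, http_1_options: [gc_every_n_keepalive_requests: 10]}
]
```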

Feedback / real world experience with this change is welcome!

10 Likes