How to troubleshoot increased memory usage in Bandit?

We just put it into production (Bandit 1.2.2) and are also seeing increased memory usage (about 50% more). We upgraded several libraries, so I don’t want to definitively say it was from Bandit, but it does seem likely.

Side note: I think it’s incredibly impressive for mtrudel to build the whole thing and have it be drop-in compatible. Very cool library. That said, we might switch back to cowboy if we can’t resolve the memory issue.

1 Like

We put this in NervesHub and were having issues with websockets in our setup at scale, which ultimately led to reverting to cowboy (big, lengthy breakdown in that PR as well).

However, just today we found that one of the default options in Thousand Island was linger: {true, 30}, which would hold sockets open and increase memory load. That was fixed today in :thousand_island 1.3.3; we deployed and it was all fixed! :tada: It would be worth updating to see if you still have issues.
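For anyone wanting to try that: Bandit pulls :thousand_island in as a dependency, so a plain mix deps.update thousand_island may be all it takes. If you’d rather pin the minimum version explicitly, a mix.exs sketch (versions are just the ones mentioned in this thread) would look like:

# mix.exs (sketch) -- force the transitive :thousand_island dependency to at
# least 1.3.3, which contains the linger fix described above. Bandit already
# depends on :thousand_island, so the explicit entry only pins the version.
defp deps do
  [
    {:bandit, "~> 1.2"},
    {:thousand_island, "~> 1.3.3"}
  ]
end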

Also, we set thousand_island_options: [transport_options: [hibernate_after: 15_000]] in the endpoint, which dropped the memory usage significantly as well.
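For reference, in a Phoenix endpoint using the Bandit adapter that ends up in config roughly like this (a sketch only; :my_app / MyAppWeb are placeholders, certificate options are omitted, and note from a later reply that hibernate_after is an :ssl transport option, so it only applies when the BEAM itself terminates TLS):

# config/runtime.exs (sketch) -- pass Thousand Island transport options
# through Bandit from the endpoint config. hibernate_after is in milliseconds.
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  https: [
    # certfile/keyfile and the rest of the TLS options omitted for brevity
    port: 4001,
    thousand_island_options: [
      # hibernate the socket's :ssl process after 15s of inactivity
      transport_options: [hibernate_after: 15_000]
    ]
  ]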

14 Likes

We rolled it back to Cowboy too. I like the idea, but the nature of the project is such that we get a lot of requests that are cancelled before they finish, and Bandit did not interrupt these at all. I suspect this was contributing to the increased load on the servers and the increased memory usage, but it’s probably also down to the architecture relying on OTP: Cowboy uses its own simplified concurrency primitives, precisely because they were able to make them more efficient.

1 Like

Is this serving primarily HTTP or WS?

If WebSockets, as @jjcarstens says, I’d look at adding hibernate_after.

1 Like

I haven’t forgotten about you @hubertlepicki! I actually thought of a possible quick win for this; are you able to deploy a branch for testing?

1 Like

It’s mainly WS, but I got an error when I added hibernate_after. It seemed like it wasn’t an allowed option, likely because I didn’t change the default TCP adapter.

Yep, sorry. hibernate_after is specific to :ssl; there’s no equivalent for :gen_tcp.

That’s curious that you’re seeing increased memory usage; Bandit’s WS layer is really quite thin (it’s a small bit of mostly bounded state on top of a GenServer). I wonder if it’s lagging process closes, and if Thousand Island 1.3.3 would help?
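If you want a quick way to check for that from a remote console, something like this gives a rough count of live connection processes (a sketch; MyAppWeb.Endpoint is a placeholder for your own endpoint module). A count that keeps climbing under steady traffic would point at connections not going away:

# Count the live Thousand Island connection processes under the Bandit server.
{:ok, server_pid} = Bandit.PhoenixAdapter.bandit_pid(MyAppWeb.Endpoint)
{:ok, connection_pids} = ThousandIsland.connection_pids(server_pid)
length(connection_pids)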

I’m curious whether anybody has noticed this increased memory usage in an API-only project, without views / templates / components etc., and without LiveView / WebSockets?

I upgraded to Thousand Island 1.3.3 based on this thread, but still saw the high memory usage. I just switched back over to Cowboy and should have better comparison numbers tomorrow morning, but it looks like Cowboy uses about half the memory that Bandit does.

Bandit is great and I’m excited to see these issues all get resolved in time – it’s really impressive work!

How is everyone measuring memory usage?

  • Is this just the resident/active memory the OS reports for the BEAM?
  • Is this summing the memory of the processes in Observer?

There’s also :erlang.memory/0 – the numbers it returns are in bytes.
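To make those concrete, here’s roughly what each looks like from inside the node (a sketch; both report bytes):

# What the BEAM thinks it has allocated, broken down by category:
:erlang.memory()
# e.g. [total: ..., processes: ..., processes_used: ..., system: ..., binary: ..., ...]

# Roughly what Observer / live_dashboard sum up: per-process memory across the node.
Process.list()
|> Enum.map(&Process.info(&1, :memory))
|> Enum.reject(&is_nil/1)
|> Enum.map(fn {:memory, bytes} -> bytes end)
|> Enum.sum()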

1 Like

Here’s a graph showing the memory reported by AWS. You can see when we switched back to Bandit around 8:30pm.

For our use case (predominantly WS messages served over http behind an AWS ALB inside AWS ECS), it appears we’re seeing 1/2 the memory usage with cowboy.

We look forward to returning to Bandit when the memory is more comparable!

OK, I’m going to have to ask a really dumb question, because I don’t know what tool you’re using.

Is the memory utilization:

  • RSS => resident memory is definitely used.
  • RSS + shared => this would be really close to total memory used
  • RSS + shared + buff/cache => now we’re getting into a bit of weird territory

If Bandit is increasing the buff/cache memory, then it is not a real memory increase; it’s some artifact of hibernation or something. There is probably a tool to find out the actual used memory, but I’ve forgotten what it is.

$ free
               total        used        free      shared  buff/cache   available
Mem:        16162188     7218888     4205420     1177868     6264756     8943300
Swap:       16777212      260952    16516260
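One quick cross-check from a remote console (a sketch; assumes Linux and reads /proc) is to compare the BEAM’s own total against the OS-level RSS of the same process:

# BEAM's own view of its allocated memory, in MB
beam_total_mb = :erlang.memory(:total) / 1_048_576

# Resident set size of this OS process, read from /proc
os_pid = List.to_string(:os.getpid())

rss_kb =
  File.read!("/proc/#{os_pid}/status")
  |> String.split("\n")
  |> Enum.find(&String.starts_with?(&1, "VmRSS:"))
  |> String.split()
  |> Enum.at(1)
  |> String.to_integer()

IO.puts("BEAM total: #{Float.round(beam_total_mb, 1)} MB")
IO.puts("OS RSS:     #{Float.round(rss_kb / 1024, 1)} MB")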

I compared the memory of the processes in the BEAM via live_dashboard and it looked comparable between cowboy and bandit. Perhaps that’s because the server was idle and not doing whatever it is that increases the memory.

I assume you mean ‘switched back to Cowboy’, since your memory usage decreased there?

A couple of things that would greatly help:

  • Would you be able to disable websocket compression in Bandit and see if that helps? The relevant option is
config :my_app, MyAppWeb.Endpoint,
  http: [
    ...
    websocket_options: [compress: false]
  ]
  • If you have console access to your server, running the following would provide a snapshot of a random process’ state; helpful in seeing where you may be using all that memory:
{:ok, pid} = Bandit.PhoenixAdapter.bandit_pid(YourAppWeb.Endpoint)
{:ok, connection_pids} = ThousandIsland.connection_pids(pid)
# You may need to run the following a few times until you get one back that has `handler_module: Bandit.WebSocket.Handler` listed in its state
:sys.get_state(Enum.random(connection_pids)) 

This will be a raw dump of everything your socket process has in state; feel free to redact as needed (you’ll probably want to elide the whole elem(1).connection.websock_state; that’s all of your Phoenix state and should be identical between Bandit and Cowboy). Also, if you want to DM me here to further limit any exposure, that’s fine too.

Looking forward to hunting this down!

5 Likes

At Supabase we have experienced :long_schedule warnings of over 30s under our production load for our logging service, which would then result in a subsequent crash. I have not yet had the time to open an issue in the Bandit repo and put together a case for it, but the change that made the warnings go away was the cowboy revert here, which makes me feel that Bandit still has some way to go in terms of performance at scale…

Unsure if my particular case is related to the memory issues others have also experienced as per this thread, but just putting it out there in case anyone else has a similar experience.
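For anyone wanting to watch for these themselves: warnings like that come from the runtime’s system monitor, and a minimal way to observe them from a console looks roughly like this (a sketch; the 100 ms threshold is just illustrative):

# Ask the runtime to send a message whenever a process or port is scheduled
# for longer than 100 ms without being descheduled.
:erlang.system_monitor(self(), [{:long_schedule, 100}])

receive do
  {:monitor, pid_or_port, :long_schedule, info} ->
    IO.inspect({pid_or_port, info}, label: "long schedule")
after
  60_000 -> :no_long_schedules_seen
end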

2 Likes

I’d suspect it’s the same issue that @jjcarstens et al saw last week, and which is fixed in Thousand Island 1.3.3+. If you’re able, give Bandit a spin with Thousand Island bumped appropriately and see if it fixes your issue; I’d love to get as many data points as possible for perf issues like this!

5 Likes

:man_facepalming: Yup – I had meant cowboy.

I think we’ve got a fix.

If anyone is able to try the branch referenced here, I’d love to see more evidence that it resolves the issue. The main wrinkle with this solution is that GCing on every request is using a sledgehammer on a nail; I’m hoping I’ll be able to pull the mitigation back to something more reasonable.
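For anyone curious what that mitigation amounts to: conceptually it’s just forcing a collection of the connection process once a request is done. A hypothetical plug-level version (purely an illustration, not the actual branch) would look like:

# Illustration only: force a GC of the connection process as the response is
# being sent, so request-lifetime garbage is reclaimed promptly instead of
# waiting for the next natural collection.
defmodule MyAppWeb.Plugs.GCAfterRequest do
  @behaviour Plug

  @impl true
  def init(opts), do: opts

  @impl true
  def call(conn, _opts) do
    Plug.Conn.register_before_send(conn, fn conn ->
      :erlang.garbage_collect(self())
      conn
    end)
  end
end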

4 Likes

@scottmessinger and others: 1.3.0 just went out with a fix for the long-standing issue of increasing memory use (which wasn’t really a memory issue so much as an issue with how memory use is reported, but anyway), along with a few other fixes. See the CHANGELOG as always. Hopefully this fixes your issue; please report back if so, as I’d love to have more evidence that we’ve finally licked this.

9 Likes