App in production crashing with DBConnection.ConnectionError

lots of great points already here, but not much at the OS level.

Questions

  • any info on what the bottleneck is? storage? network? kernel?
  • is there a load balancer in between you and the users?
  • are there any errors reported in kernel logs at this time?
  • any sysctls or other tunables set for ulimits, max open files, …
  • are you running into tcp ephemeral port exhaustion? this is surprisingly common when running behind load balancers; you should monitor total ports, and what state they are in, at runtime, both in the kernel and in the elixir app (see the sketch after this list for the app side)
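
For the app side of that, a minimal sketch of what I mean, run from a remote iex session attached to the node (these are standard :erlang.system_info / :erlang.port_info calls; the thresholds you alert on are up to you):

:erlang.system_info(:port_count)   # all ports currently open in the VM (sockets, files, drivers)
:erlang.system_info(:port_limit)   # the VM's own ceiling on ports (the +Q emulator flag)

# count just the TCP sockets the VM is holding
:erlang.ports()
|> Enum.count(fn p -> :erlang.port_info(p, :name) == {:name, ~c"tcp_inet"} end)   # ~c is the charlist sigil (Elixir 1.15+)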

I would check the networking side first, before looking at storage. A reasonable hypothesis is this:

  • you are running out of free tcp sockets for inbound and/or queued connections in the kernel, and they’re not being recycled fast enough by the OS
  • this would stop the DB pool from opening new connections, and likewise stop phoenix from accepting new user connections
  • user requests start piling up in process mailboxes, causing memory to balloon
  • everything rapidly turns to custard

TLDR

You need to check/fix all of: ulimits as seen by the beam.smp process, the kernel backlog & accept queues, and phoenix acceptors, backlog, and max connections.

  • check cat /proc/<pid>/limits for the elixir beam.smp pid to ensure ulimits are ok (fix in the systemd unit file, or via ulimit if it’s not a systemd service)
  • check ss -s to see how many “active” connections there are at runtime
  • save full output of ss -tan somewhere for review later
  • check sysctl net.ipv4.ip_local_port_range and increase it with sysctl -w net.ipv4.ip_local_port_range="4096 65535" to give more headroom
  • check ss -lnt to see the listen socket’s queues: for a listening socket, Recv-Q is the current accept queue length and Send-Q is the configured backlog
  • adjust phoenix to handle more active connections if required (a gen_tcp backlog of 1024 is a reasonable number; a config sketch follows this list)
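
For the last point, assuming the stock Phoenix + Cowboy (ranch) adapter, these knobs live in the endpoint’s transport options. MyApp / MyAppWeb.Endpoint are placeholders for your own names, and the numbers are a starting point rather than a recommendation:

# config/prod.exs (or runtime.exs), assuming the default Cowboy adapter
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    transport_options: [
      num_acceptors: 100,           # ranch acceptor processes (100 is the default)
      socket_opts: [backlog: 1024]  # passed through to gen_tcp's listen backlog
    ]
  ]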

Longer

NB many caveats and handwavey inaccuracies, to keep it simple.

Incoming network connections arrive in the kernel, and are queued in several places.

  • the first kernel queue is the SYN backlog, holding connections until the tcp handshake is completed (syn from the client, syn-ack reply, final ack from the client)
  • the next kernel queue is the accept queue, where completed connections wait for your listening app to accept them. The kernel will buffer roughly 1.5x the configured maximum on behalf of your app, and that maximum (net.core.somaxconn on linux) defaults to either 128 or 4096 depending on the OS. Note that 1.5 * 4096 is 6144, which is awfully close to 5500. Could be a coincidence of course!
  • the final stop is the pool of gen_tcp acceptors in the phoenix app. If there are no free acceptors (all busy with existing requests), we “push back”: requests pile up in the accept queue until it overflows, then the backlog queue overflows too, and eventually new connections in both directions get rejected (a quick way to inspect the acceptors at runtime is sketched after this list)
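
To see the phoenix side of that at runtime, you can ask ranch directly. The ref name below is an assumption for a default Phoenix + Cowboy setup; :ranch.info() will list whichever refs actually exist on your node:

# from a remote iex session attached to the running node
ref = MyAppWeb.Endpoint.HTTP

:ranch.procs(ref, :acceptors) |> length()    # acceptor processes (compare with num_acceptors)
:ranch.procs(ref, :connections) |> length()  # connections currently handed off to the app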

If you are using the default port range, you’ll have around 28000 available sockets behind a NAT. If each user request creates 2 tcp connections in, and a further temporary 1 or 2 out to the DB or elsewhere, then 5500 active users (5500 x 3 or 4 is roughly 16500-22000 sockets) bring you pretty close to maxing that out. Also another coincidence possibly!

You may see this info in AWS LB metrics, or in whatever grafana or similar server metrics you collect too.

Probably something like this:

### whats the configured ephemeral port range
# sysctl net.ipv4.ip_local_port_range
### how many sockets are open atm? one of these commands should work for you
### they all give slightly different info
# netstat -nat
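### ss -lnt only shows listening sockets: Recv-Q is the current accept queue, Send-Q the configured backlog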
# ss -lnt
# ss -tan | awk '{print $4}' | cut -d':' -f2 | sort | uniq -c | sort -n

I wrote a bit about this in the past so start with that. Some details on tcp networking in general:

background

finding appropriate tunables / settings

can’t I just query how many ephemeral ports are left?

No. That complexity can wait for another day.
