You could collect this info with a synthetic load test and/or by correlating memory usage with load (e.g. number of connections or requests per second) on a production system.
This is of course always possible, though IME the risk can be mitigated with a combination of disciplined programming, code reviews, and measuring memory usage (synthetic load testing and/or production measurements).
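As a rough illustration of correlating memory with load, here is a minimal sketch. The `MemorySampler` module and its function names are hypothetical; only `:erlang.memory/1` and `:erlang.system_info/1` are real calls. The load figure (connections, requests per second, etc.) has to come from your own app.

```elixir
defmodule MemorySampler do
  @moduledoc "Hypothetical helper: pairs a load figure with BEAM memory usage."

  # Takes one sample. `load` is whatever load metric the app tracks
  # (e.g. open connections); supplying it is left to the caller.
  def sample(load) do
    %{
      load: load,
      total_bytes: :erlang.memory(:total),
      process_bytes: :erlang.memory(:processes),
      process_count: :erlang.system_info(:process_count)
    }
  end

  # Very rough bytes-per-unit-of-load estimate between two samples.
  # Assumes memory grows roughly linearly with load, which may not hold.
  def bytes_per_unit(%{load: l1, total_bytes: b1}, %{load: l2, total_bytes: b2})
      when l2 != l1 do
    (b2 - b1) / (l2 - l1)
  end
end
```

Taking samples at a few different load levels (in a load test or in prod) should give a first idea of how much memory each connection or request costs.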
I suspect this could become tricky. The VM basically doesn’t know anything about the OTP constructs. It considers all processes to be the same. In your proposal you’re already accounting for that with priorities, but I think the problem is more nuanced. You probably want only your app’s workers to be killed, leaving all other processes intact. Another issue is that relying on the OOM killer or the Linux kernel means the solution would not work on other platforms.
However, I feel that an OOM killer could be implemented in a BEAM language (e.g. Elixir). You could use memsup (from the os_mon application) to observe the OS memory usage, and if it goes above some user-defined threshold, you could collect workers from the supervision tree of the OTP app, and decide which process(es) to kill.
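A minimal sketch of that idea, under a few assumptions: the `OOMSketch` module and all its function names are made up for illustration, os_mon must be started for `:memsup` to work, and the keys returned by `:memsup.get_system_memory_data/0` vary by platform and OTP version. Picking the largest worker by memory is just one possible policy.

```elixir
defmodule OOMSketch do
  # Fraction of OS memory in use, according to memsup.
  # Assumes the :os_mon application is running.
  def os_memory_usage, do: usage_from(:memsup.get_system_memory_data())

  # Pure helper so the calculation can be checked without os_mon.
  # Prefers the system-wide keys when the platform reports them.
  def usage_from(data) do
    total = Keyword.get(data, :system_total_memory) || Keyword.fetch!(data, :total_memory)
    free = Keyword.get(data, :available_memory) || Keyword.fetch!(data, :free_memory)
    (total - free) / total
  end

  # Walks the supervision tree and returns the pids of running workers.
  def workers(supervisor) do
    supervisor
    |> Supervisor.which_children()
    |> Enum.flat_map(fn
      {_id, pid, :supervisor, _mods} when is_pid(pid) -> workers(pid)
      {_id, pid, :worker, _mods} when is_pid(pid) -> [pid]
      _not_running -> []
    end)
  end

  # Workers sorted by memory usage, largest first.
  def largest_workers(supervisor, count) do
    supervisor
    |> workers()
    |> Enum.flat_map(fn pid ->
      case Process.info(pid, :memory) do
        {:memory, bytes} -> [{pid, bytes}]
        # The process died in the meantime.
        nil -> []
      end
    end)
    |> Enum.sort_by(fn {_pid, bytes} -> bytes end, :desc)
    |> Enum.take(count)
  end

  # Kills the largest worker if usage is above `threshold` (e.g. 0.9).
  def check(supervisor, threshold) do
    if os_memory_usage() > threshold do
      for {pid, _bytes} <- largest_workers(supervisor, 1), do: Process.exit(pid, :kill)
    end
  end
end
```

Something like `OOMSketch.check(MyApp.Supervisor, 0.9)` (with `MyApp.Supervisor` standing in for your own top-level supervisor) could then be called periodically from a dedicated process. Since only processes under the given supervisor are considered, the rest of the system is left alone.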
This could be developed as a generic lib. For example, when starting the top-level supervisor, we could do something like OOM.Supervisor.start_link(children, opts), where opts are used to configure the OOM killer params (e.g. threshold). When the threshold is reached, the killer will terminate some worker processes under this supervisor. Each process could set its own kill priority, e.g. by calling OOM.set_priority(priority). This would allow the app developer to tweak the termination list according to the specifics of their system.
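To make the priority part a bit more concrete, here is one way it might look. No such lib exists as far as this post is concerned, so everything in the `OOM` module below is hypothetical, including the choice to store the priority in the process dictionary, the default of 0, and the convention that a higher number means "kill first". OOM.Supervisor itself is left out.

```elixir
defmodule OOM do
  @key :oom_kill_priority

  # Called by a worker on itself. Higher priority = killed earlier
  # (an arbitrary convention chosen for this sketch).
  def set_priority(priority) when is_integer(priority) do
    Process.put(@key, priority)
    :ok
  end

  # Reads the priority of another process; 0 if it never set one,
  # nil if the process is no longer alive.
  def priority(pid) do
    case Process.info(pid, :dictionary) do
      {:dictionary, dict} ->
        case List.keyfind(dict, @key, 0) do
          {@key, value} -> value
          nil -> 0
        end

      nil ->
        nil
    end
  end

  # Orders candidate pids so that the first one should be killed first.
  def kill_order(pids) do
    pids
    |> Enum.map(fn pid -> {pid, priority(pid)} end)
    |> Enum.reject(fn {_pid, prio} -> is_nil(prio) end)
    |> Enum.sort_by(fn {_pid, prio} -> prio end, :desc)
    |> Enum.map(fn {pid, _prio} -> pid end)
  end
end
```

A worker would call `OOM.set_priority(10)` in its `init/1`, and the killer would run `kill_order/1` over the workers it collected from the supervision tree. A Registry or an ETS table could be used instead of the process dictionary; the dictionary is just the smallest thing that works for a sketch.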
Some people are doubtful about OOM killers, since there’s some amount of randomness involved, and killing random processes might leave the system permanently in a partially working state (which is worse than restarting everything). However, I think that by being conservative (killing only worker processes of the “main” app) an embedded OOM killer might prove to be useful. IMO this is best evaluated in practice, either in a real system or a synthetic one.