When is 'Hibernation' of Processes useful?

This question is applicable to both Elixir and Erlang, since it is a question about the OTP feature ‘Hibernation’.

I am currently working on a small library to create ‘persistent GenServers’: GenServers whose state is stored to disk when they are not being used and restored later when they are required again. I’ve come across multiple situations where I wanted such an abstraction (and ended up writing an unfinished, bug-ridden implementation of half of it).

At first I wanted to call this persisting/unpersisting ‘hibernating’ a process, but I quickly realized that the term hibernation is already overloaded in OTP. See for instance Process.hibernate/3, :erlang.hibernate/3 and the :hibernate_after option that can be passed to a GenServer on startup to trigger the same after an idle timeout:

Puts the calling process into a wait state where its memory allocation has been reduced as much as possible. This is useful if the process does not expect to receive any messages soon.
(The documentation then continues on to describe preconditions and postconditions to keep in mind when using it.)
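(For context, here is a minimal sketch of my own, not from the docs, of what calling :erlang.hibernate/3 directly looks like in a hand-rolled receive loop; the `Sleepy` module and its functions are hypothetical names:)

```elixir
defmodule Sleepy do
  def start do
    spawn(fn -> loop(0) end)
  end

  def loop(count) do
    receive do
      {:ping, from} ->
        send(from, {:pong, count + 1})
        # Discards the call stack, fully garbage-collects the heap,
        # and waits. When the next message arrives, the process
        # resumes by calling Sleepy.loop(count + 1) from scratch.
        :erlang.hibernate(__MODULE__, :loop, [count + 1])
    end
  end
end
```

Note that :erlang.hibernate/3 never returns: when a message arrives, the process restarts at the given module/function/arguments.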

Now while this explains what it does, it is not yet clear to me when exactly it is useful: a process that is hibernating still takes up space in RAM. It feels like a strange half-measure:

  • If the process is short-lived, no reason to hibernate.
  • If the process is long-lived but is in use all the time, no reason to hibernate.
  • If the process is long-lived but probably won’t be used for a long time, then why keep it around in memory at all? Chances are that if the process needs to stick around that long, the data it contains isn’t ephemeral but needs to survive external failures (like node restarts) as well.

It’s highly possible that my reasoning is flawed or missing something. Please help :slight_smile:!
Why was the ‘Hibernation’ functionality created in the first place? Were there certain performance issues that the OTP team wanted to address with it? In what situation(s) is process hibernation used in practice?


I’ve found it useful with Phoenix PubSub (websockets), since you end up with a few processes per online user that allocate a bunch of memory when there are events for that user, but otherwise are idle most of the time. Similar stuff happens in chat systems, push notifications (Nintendo uses it in their NPNS), etc. Anything where you have persistent connections.


I believe you understand things quite well.

IMO you’d need something like a million hibernated processes before they add up to an overhead of 4GB - 8GB of memory.

Don’t obsess over minimising memory usage unless you plan on running your project on a very minimal VPS. The trouble is, if you are on such a frail server, then it’s the I/O from serialising processes to and from storage that will block you, and that’s much worse than using some extra memory.

May I ask what’s your reason for seemingly pursuing such micro-optimisations?

I am not currently pursuing micro-optimisations like this one; I am only interested in why hibernation exists, especially in finding out whether it is indeed just an optimisation or whether there is another reason/use-case for it as well. :slightly_smiling_face:

@dom Thanks! That makes sense. That is clearly an example of long-running processes that do not make sense to persist to disk (since then you’d lose the connection to the WebSocket client). Cool :+1:!

I agree with this.

There are some situations where you need to keep a long-running process which is only used occasionally. As @dom mentions, a good example would be channel processes (which, from what I can tell, are now being hibernated by default). Another example I can think of is mediator processes which are used to serialize data flows (internal queues, gen stages, and such).

More generally, any occasionally used process which shouldn’t be dropped (e.g. because there’s a client on the other side who might interpret this as a netsplit) or which is costly to reinitialize, is a potential candidate for hibernation.

The reason why you might need hibernation lies in the fact that an idle process (AFAIK) won’t be GC-ed. So if a process accumulates a lot of garbage during the period of bursts, but then becomes idle for a longer amount of time, you may end up with a lot of needless memory allocated.

This can be particularly dangerous when combined with refc (aka large) binaries. If a large binary is passed through a process which will be idle for a long time, the binary might never be reclaimed. Ultimately, you might end up consuming the entire memory, and the system might be brutally killed. For this reason, I sometimes preemptively use hibernation in mediator processes, especially if I estimate that they will be long-lived and occasionally used. I figure that I’d rather sacrifice some processing time to get predictable memory usage.
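To make the mediator example concrete, here is a sketch of my own (the `Mediator` module name and its API are hypothetical, not from any library) of a forwarding GenServer that hibernates after every message, so references to any refc binaries it handled are dropped promptly:

```elixir
defmodule Mediator do
  use GenServer

  # Starts a mediator that forwards payloads to the given target pid.
  def start_link(target), do: GenServer.start_link(__MODULE__, target)

  @impl true
  def init(target), do: {:ok, target}

  @impl true
  def handle_cast({:forward, payload}, target) do
    send(target, {:payload, payload})
    # Returning :hibernate triggers a full GC after the callback,
    # so large (refc) binaries this process touched can be reclaimed.
    {:noreply, target, :hibernate}
  end
end
```

The trade-off is exactly the one described above: a full garbage collection on every message in exchange for a stable, minimal heap between messages.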

Of course, there are other techniques to control memory, such as spawning a one-off process where most of allocations are performed. But if you need to do these allocations in a long-running process, or if such process is propagating the data around, then hibernation can be a useful tool to keep the memory usage stable.


I think that some long-running processes benefit from hibernation when their activity cycle is: start, do a lot of work, and then hang around until they receive another message to “re-start”.

Like loading half a million records (or more) in batches from the db, then running a high number of list/reduce operations on them (aggregating, creating ets tables, reading from those ets tables, etc), and then technically going to “sleep” until they receive another message/start. Due to what they do, you’d rather have them always running instead of starting, exiting and then re-starting.

Perhaps other parts of the system need to query them to know if they’re “finished” doing whatever they’re supposed to do, or perhaps doing all of this on that one process simplifies everything else, etc.

Or when the processes have indefinite cycles, where you don’t know for sure how long they’ll be around, or at what rate they’ll receive messages, but you can have a ton of them running at the same time accumulating garbage.

So in these cases :hibernate fits nicely, because once they’re done (or in specific situations, like timeouts, or after doing a certain thing), you just set them to :hibernate and the GC will kick in. Once it does, the process’s memory usage is shrunk to the minimum needed.

For the first case this can also be implemented with an additional process that acts as a control point, letting the one that does the work exit (and so cleaning up the memory anyway), but sometimes it feels more natural to let the process stay alive (and it is less complex).

And some other times (the second example, which could be an abstraction such as channels in Phoenix) you can’t kill the process (or killing it would make the whole thing much more complex elsewhere). But you can have thousands of open channels, some of which might be running for a long time, though not 100% continuously. So you want their memory usage to be “stable” and not grow indefinitely (beyond the essential) as more channels are opened and used, so that you can plan the required memory for a given “expectation”.
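This timeout-then-hibernate pattern can be sketched roughly like this (my own example; the `IdleWorker` module, its API and the 30-second timeout are all assumptions, not taken from Phoenix):

```elixir
defmodule IdleWorker do
  use GenServer

  # Hibernate after 30s with no messages (an arbitrary choice).
  @idle_timeout 30_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})
  def get(pid, key), do: GenServer.call(pid, {:get, key})

  @impl true
  def init(:ok), do: {:ok, %{}, @idle_timeout}

  @impl true
  def handle_call({:put, key, value}, _from, state),
    do: {:reply, :ok, Map.put(state, key, value), @idle_timeout}

  def handle_call({:get, key}, _from, state),
    do: {:reply, Map.get(state, key), state, @idle_timeout}

  @impl true
  def handle_info(:timeout, state) do
    # No traffic for @idle_timeout ms: shrink the heap and wait.
    {:noreply, state, :hibernate}
  end
end
```

While the process is busy, every callback re-arms the timeout; only a genuinely idle process pays the cost of the full GC that hibernation implies.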

I think the BEAM had some recent work to better de-allocate/return memory to the OS, but I ran into some situations where loading a lot of stuff from the db and working on it would end up with the app crashing for requesting more memory than the machine had available, especially when I changed a lot of flows from streams (ecto streams and elixir streams) to batch loading & inserting (in order to simplify error handling, mostly db.connection timeouts).

Also, GC in the BEAM is extremely fast (I call it directly in some specific situations), so hibernating (in the case of a channel-like interface), when paired with timeouts, translates into an efficient way of keeping predictable memory usage.

I had the idea I read somewhere that the process is also taken out of scheduling until a message arrives, but I can’t find it now. If it works the way I think I read it, it would mean that with XXXXX processes (say channel processes), if half of them stay naturally idle, then pairing timeouts with hibernation saves some memory/scheduler work: the hibernated ones don’t compete at all for resources while they’re idle, while the “active” ones only enter hibernation after a given timeout. I’m not sure though if I was reading it correctly.


Chances are different when you’re running a phone exchange on hardware with 100x less RAM than a typical box today. There’s a lot of things in OTP that make a lot more sense in the light of the original use case.


I fully agree with the motivation, but can you please elaborate on what exactly you are doing in your code to achieve it? If a large binary is pointed at by several processes, then by virtue of how the BEAM works it won’t ever be GC-ed as long as at least one such process exists, correct? And if all other processes finish and that one last process is hibernated, does that mean that the large binary will be GC-ed?

Not sure I understand the intricacies here. :confused:

Hibernation includes garbage collection (and more; see the hibernate/3 docs for details). Therefore, to answer your question:

Yes, because when you hibernate a process you’re GC-ing it. Hibernation is all about reducing memory allocation of a process.
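You can observe this directly. Here is a quick experiment of my own (the `Demo` module is a hypothetical name, and the 200k-tuple figure is arbitrary): build a large heap, go idle, then hibernate and compare the numbers reported by Process.info/2:

```elixir
defmodule Demo do
  def start do
    parent = self()

    pid =
      spawn(fn ->
        # Allocate ~200k tuples on the heap, then go idle.
        # While idle, no GC runs, so the heap stays large.
        _garbage = Enum.map(1..200_000, fn i -> {i, i * i} end)
        send(parent, :ready)
        loop()
      end)

    # Wait until the garbage has actually been built.
    receive do
      :ready -> pid
    end
  end

  def loop do
    receive do
      :hibernate -> :erlang.hibernate(__MODULE__, :loop, [])
    end
  end
end

pid = Demo.start()
{:memory, before_bytes} = Process.info(pid, :memory)
send(pid, :hibernate)
Process.sleep(200)
{:memory, after_bytes} = Process.info(pid, :memory)
# after_bytes is typically far smaller than before_bytes
IO.inspect({before_bytes, after_bytes})
```

The drop comes from the full garbage collection that hibernation performs, plus the discarded call stack.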

So given what I just wrote, if I introduced an intermediate GenServer (the one which just forwards messages around), and I estimated that it might be idle for longer periods of time, I would include :hibernate as the last element of the return tuple in all callbacks (e.g. {:reply, response, new_state, :hibernate}). However, starting from (I think) OTP 20, start_link supports the :hibernate_after option, so today I’d probably use that instead.
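For example, a sketch of such a forwarder using :hibernate_after (the `Forwarder` module and the 15-second value are my own assumptions):

```elixir
defmodule Forwarder do
  use GenServer

  def start_link(target) do
    # With :hibernate_after, the gen_server machinery hibernates the
    # process automatically once it has been idle for 15s, so the
    # callbacks don't need :hibernate in every return tuple.
    GenServer.start_link(__MODULE__, target, hibernate_after: 15_000)
  end

  @impl true
  def init(target), do: {:ok, target}

  @impl true
  def handle_cast({:forward, msg}, target) do
    send(target, msg)
    {:noreply, target}
  end
end
```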