Given Erlang’s goals, it seems likely that folks have played around with application checkpointing:
Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application’s state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.
– Application checkpointing - Wikipedia
Indeed, I suspect that the BEAM provides facilities for this. However, I haven’t been able to find anything that seems relevant. So, please give me any clues (e.g., best practices, facilities) that might be useful.
Discussion
To focus the conversation a bit, I’ll throw in some history and a use case. As I recall, in the 1970s the CDC 3800’s Drum Scope OS supported rollin and rollout calls. These allowed folks to halt and record a Fortran process, so that it could be restarted after system maintenance.
Is there an analogous facility for sets of processes in Elixir? I don’t think explicit rollouts would be used much in production. However, they might be used occasionally in single-user (e.g., development, Livebook) sessions.
I suspect that there are all sorts of interesting corner cases, but my (naive) notion is that a magic message would go out to a set of processes, telling them to stop processing normal messages until further notice. Then (assuming all went well :-), the VM would monitor activity until the set was quiescent.
At this point, all of the process state could be recorded as a checkpoint. Rollout, if desired, could be added as an option. Finally, a restart facility would be available to reload the recorded state.
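To make the idea a bit more concrete, here is a rough sketch of what I have in mind, limited to GenServer-style processes. The `Counter` and `Checkpoint` modules are hypothetical names I made up for illustration; the only OTP pieces used are `:sys.suspend/1`, `:sys.get_state/1`, `:sys.resume/1`, and `:erlang.term_to_binary/1`. It assumes the state is plain data and ignores mailboxes, links, monitors, ETS tables, ports, and any pids stored inside the state, so it is nowhere near a general checkpoint. Also, I believe `:sys.get_state/1` is documented as a debugging aid, so treat this as an experiment rather than a recommended practice.

```elixir
defmodule Counter do
  # Hypothetical worker, used only to have some state to checkpoint.
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
  def increment(pid), do: GenServer.call(pid, :increment)
  def value(pid), do: GenServer.call(pid, :value)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, n), do: {:reply, n + 1, n + 1}
  def handle_call(:value, _from, n), do: {:reply, n, n}
end

defmodule Checkpoint do
  # Hypothetical helper; not an existing OTP or Elixir facility.

  # Suspend every process, record their states to a file, then resume them.
  def save(pids, path) do
    # While suspended, a process only handles system messages, so
    # ordinary calls and casts wait in its mailbox.
    Enum.each(pids, &:sys.suspend/1)
    states = Enum.map(pids, &:sys.get_state/1)
    File.write!(path, :erlang.term_to_binary(states))
    Enum.each(pids, &:sys.resume/1)
    :ok
  end

  # Read the recorded states and start one fresh process per state.
  # start_fun is assumed to return {:ok, pid}.
  def restore(path, start_fun) do
    path
    |> File.read!()
    |> :erlang.binary_to_term()
    |> Enum.map(fn state ->
      {:ok, pid} = start_fun.(state)
      pid
    end)
  end
end

# Usage: checkpoint two counters, then restart them from the file.
{:ok, a} = Counter.start_link(0)
{:ok, b} = Counter.start_link(10)
Counter.increment(a)

path = Path.join(System.tmp_dir!(), "counters.checkpoint")
:ok = Checkpoint.save([a, b], path)
[a2, b2] = Checkpoint.restore(path, &Counter.start_link/1)
```

The restored processes get new pids, so anything that held the old pids would need to be rewired, which is one of the corner cases I'm wondering about. Suspension also isn't the same as quiescence: callers blocked on a suspended process could time out before it is resumed.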
Is anything like this available? Comments? Clues? Suggestions?
-r