Application Checkpointing in Elixir?

Given Erlang’s goals, it seems likely that folks have played around with application checkpointing:

Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application’s state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.
Application checkpointing - Wikipedia

Indeed, I suspect that the BEAM provides facilities for this. However, I haven’t been able to find anything that seems relevant. So, please give me any clues (e.g., best practices or facilities) that might be useful.

Discussion

To focus the conversation a bit, I’ll throw in some history and a use case. As I recall, in the ’70s the CDC 3800’s Drum Scope OS supported rollin and rollout calls. These allowed folks to halt and record a Fortran process, so that it could be restarted after system maintenance.

Is there an analogous facility for sets of processes in Elixir? I don’t think explicit rollouts would be used much in production. However, they might be used occasionally in single-user (e.g., development or Livebook) sessions.

I suspect that there are all sorts of interesting corner cases, but my (naive) notion is that a magic message would go out to a set of processes, telling them to stop processing normal messages until further notice. Then (assuming all went well :-), the VM would monitor activity until the set was quiescent.

At this point, all of the process state could be recorded as a checkpoint. Rollout, if desired, could be added as an option. Finally, a restart facility would be available to reload the recorded state.
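To make the notion concrete, the closest sketch I can come up with leans on the `:sys` debug facilities that OTP already provides for OTP-compliant processes (`:sys.suspend/1`, `:sys.get_state/1`, `:sys.resume/1`). The module name and file path below are made up, and it assumes the captured state is meaningfully serializable:

```elixir
defmodule Checkpoint do
  # Naive sketch of "quiesce, snapshot, resume" for a set of OTP-compliant
  # processes. Not a consistent global checkpoint, just the general shape.
  def take(pids, path \\ "checkpoint.bin") do
    Enum.each(pids, &:sys.suspend/1)                    # stop normal message handling
    snapshot = Map.new(pids, &{&1, :sys.get_state(&1)})
    Enum.each(pids, &:sys.resume/1)

    File.write!(path, :erlang.term_to_binary(snapshot))
    snapshot
  end

  # Reload a snapshot. Pushing it back into live processes (perhaps via
  # :sys.replace_state/2) and fixing up pids/ports is the genuinely hard part.
  def load(path \\ "checkpoint.bin") do
    path |> File.read!() |> :erlang.binary_to_term()
  end
end
```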

Is anything like this available? Comments? Clues? Suggestions?

-r

I feel a facility like this should be provided by the OS. The idea itself is super interesting, but I’m not sure how many guarantees can be made in user space. A kernel-space facility should be much better!

Otherwise… sounds like decades of work. :smiley:

The OS knows about some things; the VM knows about others. Leaving things entirely up to either the OS or the VM is clearly going to miss part of the runtime state. Question is, how big a problem would this be?

Let’s say I’m using a Livebook and want to save a checkpoint. If one of the cells is actively running some code, I might need to shut that down first. However, this might be an acceptable cost. Also, if the app is coded in a fault-tolerant way, the impact of any breakage should be mitigated. Still, I agree that all this sounds non-trivial to do…

-r

Just wrangling persistent state around should be relatively easy, especially with container layers and whatnot. Pretty sure you can set that up locally fairly quickly.

When we get to the actual contents of RAM and a time-travelling code executor, however… that’s 100x harder.

This sounds like how a lot of stateful GenServer code works (periodically persisting state that can be reloaded on restart), though that isn’t automatic.

One tricky part of doing that at the system level in general would be that processes can have state that doesn’t serialize meaningfully: ports, remote PIDs, etc.
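As a rough illustration of that pattern (the module name, interval, and snapshot path here are invented, and only the serializable part of the state goes into the snapshot):

```elixir
# Sketch of "periodically persist, reload on restart" in a GenServer.
defmodule CounterServer do
  use GenServer

  @save_every :timer.seconds(15)
  @path "counter.snapshot"

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Only serializable data lives in the snapshot; ports, monitor refs and
    # the like would have to be rebuilt here instead.
    state =
      case File.read(@path) do
        {:ok, bin} -> :erlang.binary_to_term(bin)
        _ -> %{count: 0}
      end

    Process.send_after(self(), :persist, @save_every)
    {:ok, state}
  end

  @impl true
  def handle_cast(:increment, state), do: {:noreply, %{state | count: state.count + 1}}

  @impl true
  def handle_info(:persist, state) do
    File.write!(@path, :erlang.term_to_binary(state))
    Process.send_after(self(), :persist, @save_every)
    {:noreply, state}
  end
end
```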

You will likely find the OTP Design Principles document interesting, in particular the section on Release Handling.

This is a very interesting question. :blush:

A plain “make a full snapshot of the running app” runs into problems with distribution (cf. the fallacies of distributed computing).

However, at a much smaller level (i.e., at individual OTP applications and at individual processes), it is much easier to provide guarantees about snapshotting. In essence, “return to a valid state when failure happens” is how the core of the Actor Model works (if a process crashes, its supervisor will replace it with a new one that has a known-valid state).

This will not store the actual latest state of all of the processes or supervision trees themselves, however. But the question is whether you really want/need this. In most cases, only a part of the state of a process should be persisted (the rest being ephemeral ‘working memory’ that is mainly kept around at runtime as a form of caching).

The most common pattern, to my knowledge, is to have the important processes operate and synchronize (the persistent parts of) their state with (D)ETS, Mnesia or an external datastore/database.
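For illustration, a minimal sketch of that pattern using DETS (the module, table, and file names are invented): the table holds the durable part of the state, while the in-memory map stays ephemeral.

```elixir
defmodule Sessions do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    {:ok, table} = :dets.open_file(:sessions, file: ~c"sessions.dets", type: :set)
    {:ok, %{table: table, cache: %{}}}          # cache is ephemeral working memory
  end

  @impl true
  def handle_call({:put, id, data}, _from, state) do
    :ok = :dets.insert(state.table, {id, data}) # durable copy survives restarts
    {:reply, :ok, put_in(state.cache[id], data)}
  end

  def handle_call({:get, id}, _from, state) do
    case :dets.lookup(state.table, id) do
      [{^id, data}] -> {:reply, {:ok, data}, state}
      [] -> {:reply, :error, state}
    end
  end
end
```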

However, I do think that improvements to this status quo are possible. I’ve implemented the pattern of a process that bases its state on some data from the DB quite a number of times now, and I know others have too. Some of this is definitely reusable at a higher level.

This is something I happen to have been thinking about lately, so I really appreciate you sharing your experience.

One issue I thought of is that you’d likely want a somewhat intelligent supervisor that knows whether the process is restoring from persisted state or “fresh” state. It’s easy to imagine a situation where a bug leads to an invalid state that you continually restore from. Eventually, you’d want to give up on that state and restart from scratch. Or, at least, I’d think you would.
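A rough sketch of the kind of guard I’m imagining (the module, the `valid?/1` check, and the snapshot path are all hypothetical; actually counting failed restores and eventually giving up would need a bit more machinery):

```elixir
defmodule Restorable do
  use GenServer

  @path "restorable.snapshot"

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Don't trust the persisted state blindly: restore it only if it passes a
    # sanity check, otherwise fall back to a known-good fresh state.
    state =
      with {:ok, bin} <- File.read(@path),
           restored = :erlang.binary_to_term(bin),
           true <- valid?(restored) do
        restored
      else
        _ -> fresh_state()
      end

    {:ok, state}
  end

  defp fresh_state, do: %{jobs: []}

  # Hypothetical structural check over the restored snapshot.
  defp valid?(%{jobs: jobs}) when is_list(jobs), do: true
  defp valid?(_), do: false
end
```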

I’m curious if this situation is something you’ve ever come across and, if so, how you handled it.

I have a subsystem which auto-saves every 15 seconds.

It is making checkpoints, in spirit.

A more involved checkpoint process can indeed be implemented globally, but there is a non-trivial cost in performance / responsiveness to quiesce everything.

Is this related to continuations? I remember being intrigued years ago by a Smalltalk web server called Seaside that worked this way. It could effortlessly perform all sorts of stateful back-button behavior that seemed magical at the time.