Checkpointing and restarting a function mid-way through

I hope this question isn’t considered too off-topic. Elixir may or may not be the answer, but there’s a broad spread of experience in this group.

What I’m trying to find is a language which lets me easily express the idea of stopping a function mid-way through, saving its state to a persistent store, and restarting it later in a completely different process.

The idea is something like continuations, but (1) they must be serializable, and (2) the serialized representation needs to include not just the PC call stack but the local variable state as well.

Let me give a couple of examples of how I want to use this.

  1. Checkpointing

    def workflow() do
      a = first()    #1
      checkpoint()   #2
      b = second()   #3
      checkpoint()   #4
      third(a, b)    #5
    end

The checkpoint() function saves the current function state to disk. Let’s say the system blows up during step #3. I can start a new process, reload the state from disk, and continue from the checkpoint made at #2. Variable ‘a’ is bound to its previous value, and ‘b’ is unbound.
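To make the storage half of this concrete, here is a minimal Elixir sketch. It assumes the state is just the name of the next step plus a map of bound variables; the module name `CheckpointSketch`, the file name, and the value 42 are made up for illustration. It does not solve the hard part, which is capturing that state automatically from a running function.

```elixir
defmodule CheckpointSketch do
  # Hypothetical helper: persists the name of the next step plus the
  # variables bound so far. Only plain data terms are expected here.
  def save(path, next_step, bindings) when is_atom(next_step) and is_map(bindings) do
    File.write!(path, :erlang.term_to_binary({next_step, bindings}))
  end

  # Reads the checkpoint back; this can happen in a completely new process.
  def load(path) do
    path
    |> File.read!()
    |> :erlang.binary_to_term()
  end
end

path = Path.join(System.tmp_dir!(), "workflow_checkpoint.bin")

# State as it would look at checkpoint #2: `a` is bound, `b` is not yet.
CheckpointSketch.save(path, :second, %{a: 42})

# Later, possibly after a crash and restart:
{next_step, bindings} = CheckpointSketch.load(path)
IO.inspect({next_step, bindings})
```

Because the stored term is ordinary data, it can be inspected or edited before resuming, which is the kind of non-opaque state I'm after.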

  2. Long-duration messaging

    def workflow2() do
      a = first()
      send_msg(a)
      b = recv_msg()
      second(a, b)
    end

“recv_msg()” waits for a response from some other system. It may be days or even weeks before the response comes back; meanwhile, this process may be killed and restarted. So this is really another form of checkpoint, where calling recv_msg() causes a checkpoint save and stops execution of the function, while the restore is triggered by the arrival of a message.
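Written out by hand, the behaviour I have in mind for workflow2 looks roughly like the sketch below. Everything here is an assumption for illustration: the module name `SuspendSketch`, the stub bodies of `first/0`, `second/2` and `send_msg/1`, the `id` used to correlate the reply, and the use of a temp file as the persistent store.

```elixir
defmodule SuspendSketch do
  # Hypothetical stand-ins for the real workflow functions.
  def first(), do: 10
  def second(a, b), do: a + b
  def send_msg(_a), do: :ok

  defp path(id), do: Path.join(System.tmp_dir!(), "workflow2_#{id}.bin")

  # Runs up to the point where recv_msg() would block, saves what is
  # bound so far, and stops executing.
  def start(id) do
    a = first()
    send_msg(a)
    File.write!(path(id), :erlang.term_to_binary(%{waiting_at: :recv_msg, a: a}))
    :suspended
  end

  # Called when the reply arrives, possibly weeks later in a new process.
  # The incoming message plays the role of the return value of recv_msg().
  def deliver(id, msg) do
    %{waiting_at: :recv_msg, a: a} = :erlang.binary_to_term(File.read!(path(id)))
    File.rm!(path(id))
    second(a, msg)
  end
end

:suspended = SuspendSketch.start("order-1")
result = SuspendSketch.deliver("order-1", 5)
IO.inspect(result)
```

This is exactly the boilerplate I'd like the system to generate for me, so that the workflow can stay in its straight-line form.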

As you can see, what I’m really trying to do is to build a long-running “workflow”. I can of course write the logic as separate steps, where one step explicitly calls another step, and all state is held in some explicit state object which is serialized. What I would like is to make a more natural representation where this state is stored and carried forward automatically by the system, avoiding a ton of boilerplate, and making the workflow logic easy to read and maintain.

I’m aware that not everything can sensibly be serialized (e.g. open file handles or sockets), and I’m fine with such limitations. Simple data values are fine. I also don’t need to preemptively stop a function: stopping only at specific marker points is enough.

It would be a bonus if the stored state is not entirely opaque but can be inspected (e.g. to show the values of bound variables or the state of the call stack), and even modified, like you can do in a debugger.

SOLUTIONS

  1. The worst case scenario is that I’d write my own language and interpreter, so all the execution state is available. I may have to go this way, but I don’t want to re-invent the wheel if there’s an existing language which happens to implement this already.

  2. I’ve looked at some mainstream languages but not yet found a match.

Many languages have things like continuations or fibers. For example, Python generators nicely capture the idea of being able to stop a function (with “yield”) and restart it later; but it seems you can’t serialize a running generator. It might be possible to pickle generators in Stackless or PyPy.

Similarly, Ruby’s callcc and Fiber, and Go’s goroutines don’t seem to be serializable.

  3. A self-hosted language might expose enough execution state for me to do this - PyPy again maybe, or perhaps a LISP or Smalltalk. Or an embedded language (Lua?) might also let me run a function within a contained environment where I can capture its entire state.

  4. There are low-level approaches like CRIU which can checkpoint a running process. However, that’s likely to be hugely inefficient: if I run (say) a Python function in its own process, then I’d be checkpointing the entire Python interpreter. Also, you’d have to restore the state with exactly the same Python binary around.

  5. It might be possible to perform some sort of transformation on the code to convert it into, say, a continuation-passing form:

    first |> checkpoint |> second |> checkpoint |> third

Or into a form which breaks down a function into a series of steps, individual micro-functions which are called in turn, passing in the bound variables explicitly, and returning the updated set of bound variables and the name of the next step to be executed.

Those are things which Elixir’s AST inspection/modification might be able to accomplish, although for anything other than a simple pipeline it might become rather tricky. I want to be able to restore state even within control structures like conditionals and loops:

    def workflow3() do
      a = first()
      if somecond(a) do
        foo(a)
        checkpoint()
        bar(a)
      else
        baz(a)
        checkpoint()
        qux(a)
      end
    end
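For reference, here is a hand-written sketch of what the "series of micro-steps" form of workflow3 might look like, i.e. the kind of output I imagine a transformation producing. The step names (`:then_1`, `:else_2`, ...), the `trace` key, the stub function bodies, and the module name `Workflow3Steps` are all invented for illustration; this is one possible shape, not a known-working transformation.

```elixir
defmodule Workflow3Steps do
  # Hypothetical stand-ins for the functions called in workflow3.
  def first(), do: 3
  def somecond(a), do: a > 2
  def foo(_a), do: :foo
  def bar(_a), do: :bar
  def baz(_a), do: :baz
  def qux(_a), do: :qux

  # One clause per micro-step: takes the bound variables, returns the
  # name of the next step and the updated variables.
  def step(:start, vars) do
    a = first()
    next = if somecond(a), do: :then_1, else: :else_1
    {next, Map.put(vars, :a, a)}
  end

  def step(:then_1, vars), do: {:then_2, trace(vars, foo(vars.a))}
  def step(:then_2, vars), do: {:done, trace(vars, bar(vars.a))}
  def step(:else_1, vars), do: {:else_2, trace(vars, baz(vars.a))}
  def step(:else_2, vars), do: {:done, trace(vars, qux(vars.a))}

  # Records which calls have run, so the demo has something to show.
  defp trace(vars, tag), do: Map.update(vars, :trace, [tag], &[tag | &1])

  # Drives the steps, calling `save` at every step boundary.
  def run({:done, vars}, _save), do: vars

  def run({name, vars}, save) do
    next = step(name, vars)
    save.(next)
    run(next, save)
  end
end

path = Path.join(System.tmp_dir!(), "workflow3_checkpoint.bin")
save = fn state -> File.write!(path, :erlang.term_to_binary(state)) end

# Pretend the system died right after the checkpoint between foo and bar:
save.({:then_2, %{a: 3, trace: [:foo]}})

# A new process reloads the checkpoint and continues inside the `if` branch.
final = Workflow3Steps.run(:erlang.binary_to_term(File.read!(path)), save)
IO.inspect(final.trace)
```

The branch taken is encoded in the step name, so restoring inside a conditional needs no extra bookkeeping. Loops would need the loop variables carried in `vars` as well, which is where I expect a generic transformation to get tricky.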

If anybody has some suggestions, particularly where they know the state save/restore definitely works, that would be very much appreciated.

Thanks in advance,

Brian.


Have you looked at something like https://aws.amazon.com/swf/?

It seems quite close to what you are looking for.


Honestly the only language I’ve ever seen that fulfills your requirements is Stackless Python. I’ve used it in the past to do what you want with suspending a function, serializing it out (even to another machine) then resuming it where it left off.

However, in most languages, including Elixir, you can always split your functions at the ‘pause’ boundaries into more functions and pass the state along in a tuple or similar, to be called later (Elixir/Erlang even has syntax for this in the form of tuple modules; however, those are deprecated and may not work much longer, much to my sadness).
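A minimal sketch of that manual split, assuming the continuation is stored as a plain `{module, function, args}` tuple (this is not the tuple-module syntax, just ordinary data plus `apply/3`). The module `PauseSketch` and its two functions are made up for illustration.

```elixir
defmodule PauseSketch do
  # First half: does its work, then returns the continuation as plain data
  # ({module, function, args}) instead of calling the next function directly.
  def part_one(x) do
    a = x * 2
    {:paused, {__MODULE__, :part_two, [a]}}
  end

  # Second half: receives the carried-over state as ordinary arguments.
  def part_two(a), do: {:finished, a + 1}
end

{:paused, continuation} = PauseSketch.part_one(20)

# The tuple is plain data, so it can be stored or sent to another machine.
blob = :erlang.term_to_binary(continuation)

# Later: decode and resume with apply/3.
{mod, fun, args} = :erlang.binary_to_term(blob)
IO.inspect(apply(mod, fun, args))
```

The resuming side needs the same module loaded, and the arguments have to be serializable terms.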

And yes, a transformation via macro could do that. It’s not lightweight work for sure, but it can be done.
