Saving and restoring process state after crash

vrod · February 14, 2021, 12:18pm

I am trying to understand more about best practices with Elixir and processes. I wrote a GenServer that sometimes crashes, the way a more complex real-world process might.

defmodule Dynamite.BadCounter do
  @moduledoc """
  A counter that sometimes crashes (pretend this is a more complex process)
  """
  use GenServer

  @defaults [n: 0]

  def start_link(opts \\ []) when is_list(opts) do
    state =
      @defaults
      |> Keyword.merge(opts)
      |> Enum.into(%{})

    GenServer.start_link(__MODULE__, state)
  end

  @impl true
  def init(state) when is_map(state) do
    Process.send(self(), :increment, [])
    {:ok, state}
  end

  @impl true
  def handle_info(:increment, %{n: n} = state) do
    case Enum.random(1..10) do
      1 ->
        raise("Boom!")

      _ ->
        IO.puts(n)
        Process.send_after(self(), :increment, 300)
        {:noreply, %{state | n: n + 1}}
    end
  end
end

I can put this into application.ex as a child, and then the process will restart, but it always starts at 0 because the state is lost.

I have been reading about how to save the state in the event of a crash. For example, "GenServer with supervision tree and state recovery after crash" (Bounga’s Home) and "Concurrent Programming In Elixir" (Codementor).

The common solution is to use another process to store the state.
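A minimal sketch of that pattern, as I understand it from those articles. The names Dynamite.Stash and Dynamite.StashedCounter are made up for illustration, and the random crash is left out to keep it short:

```elixir
defmodule Dynamite.Stash do
  # Hypothetical helper process: remembers the last counter value.
  use Agent

  def start_link(initial) when is_integer(initial) do
    Agent.start_link(fn -> initial end, name: __MODULE__)
  end

  def get, do: Agent.get(__MODULE__, & &1)
  def put(n) when is_integer(n), do: Agent.update(__MODULE__, fn _ -> n end)
end

defmodule Dynamite.StashedCounter do
  # Hypothetical counter that reads its starting value from the stash.
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def increment, do: GenServer.call(__MODULE__, :increment)

  @impl true
  def init(:ok), do: {:ok, %{n: Dynamite.Stash.get()}}

  @impl true
  def handle_call(:increment, _from, %{n: n} = state) do
    new_n = n + 1
    # Copy every new value to the stash so a restarted counter can pick it up.
    Dynamite.Stash.put(new_n)
    {:reply, new_n, %{state | n: new_n}}
  end
end
```

In a supervision tree the stash would be listed before the counter so that it is already running when the counter's init/1 reads from it.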

However, this feels strange: the one process needs to know about the other process. It creates a dependency. The pattern looks similar to other languages where maybe a database or similar thing would be used to save the state.

I was more expecting to see this saving and restoring logic somewhere in the supervisor. My thinking is that my GenServer has only one job: to count. However the examples I have found so far want to make it have 2 jobs: to count AND to store and restore its state.

I hope I have explained this well. Does my confusion make sense? Is there some way to make the Supervisor handle the storage? Or am I just thinking about this wrong?

Thank you for your explanations!

John-Goff · February 14, 2021, 1:54pm

Saving and restoring state after a crash is difficult, because oftentimes it is incorrect state that made your GenServer crash in the first place. If you automatically save invalid state and then load it again when your process restarts, you’ve just created an endless loop of crashes. So supervisors restart their processes with “known good” state. The only state we can guarantee with 100% certainty to be “good” is the initial state (if it weren’t, the process would crash on init and you’d see it), so that’s the default state given to a newly restarted process.

al2o3cr · February 14, 2021, 1:56pm

IMO this is the trouble spot - a GenServer’s job is to maintain its state and handle messages related to that state. Guarantees about durability are part of that job.

The ecosystem offers a whole range of options for providing that durability:

- A process is the simplest option, but it is only durable for the lifetime of the VM.
- ETS is a more structured form of “use a process”, but it is still only in-memory.
- DETS can fix that by storing data on local disk (with some gotchas).
- Separate systems (Redis, Kafka, Postgres, etc.) can provide permanent storage, at the cost of additional complexity.

A supervisor can have many children, so putting this logic in the supervisor would potentially be a bottleneck. Dealing with persistence in the children avoids this.
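As a rough sketch of the ETS option, assuming a hypothetical Dynamite.EtsCounter module and table name: the table has to be created by a process that outlives the counter, because an ETS table is deleted when its owner exits.

```elixir
defmodule Dynamite.EtsCounter do
  # Hypothetical counter that mirrors its value into a named ETS table.
  use GenServer

  @table :dynamite_counter

  # Call this once from a long-lived process (for example in Application.start/2).
  # If the counter itself created the table, the table would die with it.
  def create_table do
    :ets.new(@table, [:named_table, :public, :set])
  end

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def increment, do: GenServer.call(__MODULE__, :increment)

  @impl true
  def init(:ok) do
    # Resume from the stored value if there is one, otherwise start at 0.
    n =
      case :ets.lookup(@table, :n) do
        [{:n, value}] -> value
        [] -> 0
      end

    {:ok, %{n: n}}
  end

  @impl true
  def handle_call(:increment, _from, %{n: n} = state) do
    new_n = n + 1
    :ets.insert(@table, {:n, new_n})
    {:reply, new_n, %{state | n: new_n}}
  end
end
```

The value survives a restart of the counter but not a restart of the VM, which matches the trade-off described in the list above.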

dimitarvp · February 14, 2021, 4:07pm

As @John-Goff said, there’s no guarantee that the last known state that your GenServer possesses is valid, hence OTP took a defensive stance and only ever restarts an actor with its initial state.

But if you are very convinced that you can guarantee last good valid state, you can do something like this in your GenServer:

def init(...) do
  state = load_persisted_state(...) # load last known good state from persistent DB
  {:ok, state} # init/1 must return {:ok, state}
end

def handle_call(...) do
  state = ...
  do_stuff_with(state)
  # if the code hasn't crashed at this point
  # then the current state can be assumed safe.
  persist(state) # store last known good state to persistent DB
  {:reply, :ok, state} # handle_call/3 must return a reply tuple
end

Personally I wouldn’t recommend it for a simple counter. But if you want it to be persistent, you can use DETS or Mnesia to load/persist it as shown above. Or Redis if you already have it in your stack.
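For what it's worth, here is one way the load/persist outline could look with DETS. This is only a sketch: the module name, table name and the :path option are assumptions for illustration, not an established API.

```elixir
defmodule Dynamite.DetsCounter do
  # Hypothetical counter that persists its value to a DETS file on local disk.
  use GenServer

  @table :dynamite_counter_dets

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def increment, do: GenServer.call(__MODULE__, :increment)

  @impl true
  def init(opts) do
    # :path is an assumed option; DETS expects the file name as a charlist.
    path = Keyword.get(opts, :path, "counter.dets")
    {:ok, @table} = :dets.open_file(@table, file: String.to_charlist(path), type: :set)

    # Load the last persisted value, or start from 0 on the first run.
    n =
      case :dets.lookup(@table, :n) do
        [{:n, value}] -> value
        [] -> 0
      end

    {:ok, %{n: n}}
  end

  @impl true
  def handle_call(:increment, _from, %{n: n} = state) do
    new_n = n + 1
    # Reaching this line means the update did not crash, so persist the result.
    :ok = :dets.insert(@table, {:n, new_n})
    {:reply, new_n, %{state | n: new_n}}
  end

  @impl true
  def terminate(_reason, _state), do: :dets.close(@table)
end
```

Writing to disk on every call is the simplest choice but also the most expensive one; for a busy process you would probably want to batch the writes.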

vrod · February 14, 2021, 4:12pm

Thanks for the replies! This counter is only an example to help demonstrate the architecture. I still have a hard time reconciling this with SRP (the single responsibility principle) if the process needs to do business logic and also worry about saving and reloading state. Maybe I can try some more of these ideas and see if they make more sense after trying.

derek-zhou · February 14, 2021, 4:26pm

To add on top of what @dimitarvp has said, I would use a :gen_statem instead of a plain GenServer. The reason is that the persist call could be expensive, so you don’t want to do it every single time you mutate the state. With :gen_statem it is trivial to build a simple two-state state machine: dirty and clean. You can set a state timeout in dirty of e.g. 5 seconds, and once the timeout hits you persist the state and go back to the clean state.
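A minimal sketch of that idea, under some assumptions: the module name is hypothetical, the persistence step is passed in as a one-argument function, and the flush interval is a parameter that defaults to 5 seconds.

```elixir
defmodule Dynamite.BufferedCounter do
  # Hypothetical two-state machine: :clean (nothing to save) and :dirty (unsaved changes).
  @behaviour :gen_statem

  def start_link(persist_fun, flush_after \\ 5_000) when is_function(persist_fun, 1) do
    :gen_statem.start_link(__MODULE__, {persist_fun, flush_after}, [])
  end

  def increment(pid), do: :gen_statem.call(pid, :increment)

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init({persist_fun, flush_after}) do
    {:ok, :clean, %{n: 0, persist: persist_fun, flush_after: flush_after}}
  end

  @impl true
  # First change after a flush: switch to :dirty and start the state timeout.
  def handle_event({:call, from}, :increment, :clean, data) do
    data = %{data | n: data.n + 1}
    actions = [{:reply, from, data.n}, {:state_timeout, data.flush_after, :flush}]
    {:next_state, :dirty, data, actions}
  end

  # Further changes while :dirty: the state does not change, so the timer keeps running.
  def handle_event({:call, from}, :increment, :dirty, data) do
    data = %{data | n: data.n + 1}
    {:keep_state, data, [{:reply, from, data.n}]}
  end

  # Timeout fired: persist once for all buffered changes, then go back to :clean.
  def handle_event(:state_timeout, :flush, :dirty, data) do
    data.persist.(data.n)
    {:next_state, :clean, data}
  end
end
```

The trade-off is that a crash while dirty loses up to one flush interval of changes, so the interval should reflect how much loss is acceptable.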

dimitarvp · February 14, 2021, 4:48pm

Usually the best thing you can do. Avoid analysis paralysis by experimenting more.

andreyvolokitin · December 6, 2024, 4:48pm

Are there any general guidelines as to what persist frequency should be considered expensive? E.g. in the case of a chat application with many rooms, where messages can be quite frequent.

In my understanding, in many use cases (and across languages and ecosystems) the standard approach is to persist any state change straight away without much consideration. Clean/dirty state machines seem more like a niche solution, but I don’t know much about Elixir practices or about general backend practices.