Make GenServers survive a system restart, persisting to and initializing from Ecto

Hi everyone

I’d like to create a system where GenServers are dynamically spawned to periodically monitor some backend service. Supervisors, a Registry and a Worker implemented as a GenServer make it very easy to create such a setup. No problems there. I’d like to take it one step further and make sure the monitors can survive a system restart. I got great inspiration from this blog post from José Valim about homemade analytics with Ecto and Elixir. It made me realise that there is nothing preventing me from using Ecto in the context of GenServers.

But I’m realising that keeping the monitor specifications in the database in sync with the actual running GenServers might be a little tricky. My first question is: where should I initialize these GenServers, based on a bunch of specifications queried with Ecto? Should I hook into the application startup lifecycle? But then things won’t restart when the Registry or the DynamicSupervisor needs to restart for some reason. Maybe the start_link/1 or init/1 functions of the parent supervisor are a better fit for this initialization?
A second question I have is about possible race conditions. The startup logic might compete with any out-of-band user action to start/delete/update a monitor. My intuition leads me to think I need to go through the monitor GenServers themselves, so they act as a gateway for any modification, making sure the requested action, the running GenServer and the persisted specification of the monitor stay in sync. When the GenServer is down for some reason, the user won’t be able to do anything in the meantime, I guess, but at least there wouldn’t be any discrepancy. But I can’t see how it all fits together just yet.

Anyway, if anyone has good ideas about this (I reckon this is not the first time such a requirement is implemented) I’d appreciate any advice. Maybe someone has blogged about this precise problem?

You can take a look at this mini POC GitHub project to get more details: GitHub - linusdm/live_tcp: Experimenting with elixir, GenServers and gen_tcp

Thank you!

I do load some GenServer state from the DB with the help of handle_continue…

https://elixirschool.com/blog/til-genserver-handle-continue/

But the source of truth is in the GenServers, not in the DB. They are not responsible for persisting to the DB… they just load state from it.
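A minimal sketch of what that can look like, assuming a hypothetical MyApp.Repo and MyApp.Monitor schema:

  defmodule MyApp.MonitorWorker do
    use GenServer

    def start_link(monitor_id) do
      GenServer.start_link(__MODULE__, monitor_id)
    end

    @impl true
    def init(monitor_id) do
      # Return quickly so the supervisor is not blocked; defer the DB read.
      {:ok, %{id: monitor_id, spec: nil}, {:continue, :load_state}}
    end

    @impl true
    def handle_continue(:load_state, state) do
      spec = MyApp.Repo.get!(MyApp.Monitor, state.id)
      {:noreply, %{state | spec: spec}}
    end
  end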


I’ve worked on a similar system to what you describe; we solved the “when to query the DB and start processes” problem with a module-based DynamicSupervisor (started from the application’s supervision tree) and a custom start_link/1 function like:

  def start_link(_) do
    with {:ok, pid} <- DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__),
         {:ok, _} <- DynamicSupervisor.start_child(__MODULE__, {Task, &start_processes/0}) do
      {:ok, pid}
    end
  end

This ensures that when the DynamicSupervisor restarts, the processes are reloaded.
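The start_processes/0 task then just queries the persisted specs and starts one child per row; a rough sketch, with placeholder names (MyApp.Repo, MyApp.Monitor, MonitorWorker):

  defp start_processes do
    # Query the persisted specs and start one worker per row.
    MyApp.Monitor
    |> MyApp.Repo.all()
    |> Enum.each(fn monitor ->
      {:ok, _pid} = DynamicSupervisor.start_child(__MODULE__, {MonitorWorker, monitor.id})
    end)
  end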


Regarding consistency, routing configuration updates through the running processes is a good practice - the GenServer’s one-message-at-a-time behavior ensures that modifications are atomic.
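For example, an update could be handled in one handle_call so the DB write and the in-memory state change stay together (Repo and changeset names are placeholders):

  @impl true
  def handle_call({:update_spec, attrs}, _from, state) do
    case state.spec |> MyApp.Monitor.changeset(attrs) |> MyApp.Repo.update() do
      {:ok, spec} ->
        {:reply, {:ok, spec}, %{state | spec: spec}}

      {:error, changeset} ->
        # Keep the old state when persistence fails, so DB and process stay in sync.
        {:reply, {:error, changeset}, state}
    end
  end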

Hmmm, interesting.

Another take on this I stumbled upon while googling this problem is described here: OTP as the Core of Your Application Part 2-Alex Koutmos | Engineering Blog

The author created a supervisor with the sole purpose of ‘rehydrating’ the GenServers. It starts up when the parent supervisor starts up, and goes away immediately because its init/1 function returns :ignore. The only thing I’m worried about is that this supervisor is a child of a supervisor with a one_for_one strategy, so the rehydration would not run again if the registry or the dynamic supervisor for the workers dies. I would couple those three components (registry, dynamic supervisor and hydrator server) under a separate supervisor with a ‘one_for_all’ strategy, making sure everything goes down and comes up again together.
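Roughly what I have in mind, with illustrative module names (the hydrator loads the specs from the database, starts the workers and then returns :ignore so it never stays running):

  defmodule MyApp.MonitorSupervisor do
    use Supervisor

    def start_link(init_arg), do: Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)

    @impl true
    def init(_init_arg) do
      children = [
        {Registry, keys: :unique, name: MyApp.MonitorRegistry},
        {DynamicSupervisor, strategy: :one_for_one, name: MyApp.WorkerSupervisor},
        MyApp.Hydrator
      ]

      # one_for_all: if the registry or dynamic supervisor dies,
      # everything restarts and the hydrator runs again.
      Supervisor.init(children, strategy: :one_for_all)
    end
  end

  defmodule MyApp.Hydrator do
    use GenServer

    def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

    @impl true
    def init(_arg) do
      # Hypothetical schema/Repo names; start one worker per persisted spec.
      for monitor <- MyApp.Repo.all(MyApp.Monitor) do
        DynamicSupervisor.start_child(MyApp.WorkerSupervisor, {MyApp.MonitorWorker, monitor.id})
      end

      :ignore
    end
  end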

I’m still in the dark about possible race conditions where a client could start a monitor while the system is restarting. But maybe I’m worrying too much, and the registry will keep everything nicely idempotent?
Designing a system like this will force me to take into account that the GenServers monitoring the backend can be unavailable. I guess there is no way around this, and at least I can be explicit about it.

What is exactly a “monitor” in this context?

You can only have one source of truth:

  • If it’s the database: when a monitor is created, an entry is created in the database and a “ping” is sent to a process that is responsible for diffing the current database contents against the currently live GenServers and starting/stopping your servers to stay in sync with the database (see the sketch after this list).
  • If it’s the live system: you could regularly update the database from the live servers, but you need to decide how to know which “monitors” to start in the first place.
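
A rough sketch of that diff-and-reconcile step, with placeholder names for the Repo, the Registry and the DynamicSupervisor, and workers registered under their monitor id:

  # Hypothetical names throughout; the point is only the diff-and-reconcile step.
  def sync_with_database do
    wanted = MapSet.new(MyApp.Repo.all(MyApp.Monitor), & &1.id)
    running = MapSet.new(Registry.select(MyApp.MonitorRegistry, [{{:"$1", :_, :_}, [], [:"$1"]}]))

    # Start what is in the DB but not running.
    for id <- MapSet.difference(wanted, running) do
      DynamicSupervisor.start_child(MyApp.WorkerSupervisor, {MyApp.MonitorWorker, id})
    end

    # Stop what is running but no longer in the DB.
    for id <- MapSet.difference(running, wanted) do
      [{pid, _}] = Registry.lookup(MyApp.MonitorRegistry, id)
      DynamicSupervisor.terminate_child(MyApp.WorkerSupervisor, pid)
    end

    :ok
  end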

Also, if by “monitoring a backend service” you mean sending a message to them every once in a while to see if they respond, you do not really need a GenServer per backend service. If you poll them every 30 seconds, then each polling process has 30 seconds to poll as many services as it can, and that is a lot.


Long-running code in an init callback is pretty strongly discouraged - the parent supervisor is also blocked during that time.

In that article’s approach, the web server doesn’t even start booting until all the books have finished loading. This could be troublesome in some deployment environments; for instance, services like Heroku expect web dynos to bind to the assigned port within 60s.

One way to cut down that window of unavailability is to make sure to register as early as possible in the GenServer’s setup; that way incoming messages can queue up while the process finishes booting.
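Concretely: pass the name to GenServer.start_link/3 (the name is registered before init/1 is invoked) and push the slow loading into handle_continue; a sketch with a hypothetical Registry name:

  # The via name is registered during start_link, before init/1 runs, so callers
  # can already address the process while it finishes loading in handle_continue.
  def start_link(monitor_id) do
    name = {:via, Registry, {MyApp.MonitorRegistry, monitor_id}}
    GenServer.start_link(__MODULE__, monitor_id, name: name)
  end

  @impl true
  def init(monitor_id) do
    {:ok, %{id: monitor_id, spec: nil}, {:continue, :load_state}}
  end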


With ‘monitor’ I mean a GenServer that polls some other service periodically (no reference to Process.monitor/1). This service responds with a value (think: a proxy for some physical device that reports a status, like :on or :off, and I’m interested in seeing changes of this value). Each monitor (or GenServer, or Worker, if you will) is responsible for keeping track of and polling one backend device (identified by some unique identifier). It’s fine to poll this backend service once every few minutes to detect a change in value. By scheduling a :poll message with a random time offset I hope to get all GenServers polling the backend without peaks of activity.
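Something along these lines is what I mean by the random offset (interval values and helper names are placeholders):

  # Schedule the next :poll with some jitter so the workers don't all fire at once.
  defp schedule_poll do
    base = :timer.minutes(5)
    jitter = :rand.uniform(:timer.seconds(30))
    Process.send_after(self(), :poll, base + jitter)
  end

  @impl true
  def handle_info(:poll, state) do
    value = poll_backend(state.spec)            # hypothetical polling helper
    state = track_value_change(state, value)    # hypothetical change tracking
    schedule_poll()
    {:noreply, state}
  end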

But your comment about ‘where is the source of truth’ is very valuable. And to be honest I can’t answer it yet. I’m not sure I understand the nuance of both alternatives. Can you elaborate a little bit on this?

Completely agree and perhaps I didn’t explicitly call that out in the article. If the startup/hydrator procedure is something that takes too long, then perhaps this is not a good route to go. Like many things in engineering this is a possible solution and not the end-all solution.

IIRC with that sample app, loading up and starting 5,000 processes didn’t add much startup time… but everyone should do their own measurements in their particular environment.

Another option, when the number of GenServers to start is too large, is to start the GenServers as needed and perhaps have them stick around for a preset amount of time. This is something that I have done with Swarm in the past using whereis_or_register_name/5, where no processes would be started from the get-go; they would be started only when they were requested, and hydrated from the database on that initial start.


Well, there are many ways to implement that. But we need to know a few things first:

  • Who decides that a monitor has to be started or stopped?
  • How many monitors: 10, 100, 10,000, etc.?

If I wanted those monitors to be persistent I would do the following:

  • When I want a monitor to exist or not, I insert it into or delete it from the database.
  • I would have only one GenServer for all monitors (sketched below). You would not have peaks of activity if checks are made one by one. When it is started, it loads the monitors from the database. From there you can send(self(), {:poll, monitor_id}) for each monitor, and then use Process.send_after/3 from handle_info/2: if the response is :on you re-check after one minute, if the response is :off you can retry 10 seconds later. For that kind of stuff I implemented TimeQueue because it is easier to reason about and to test than to have timer messages all around.
  • When you insert/delete a monitor in the database, you also send a message to your GenServer so it reloads from the DB.

And I think that’s all, it should work.
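
A minimal sketch of that single-server approach, with placeholder schema/Repo names and arbitrary intervals:

  defmodule MonitorServer do
    use GenServer

    def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

    @impl true
    def init(_arg) do
      monitors = MyApp.Repo.all(MyApp.Monitor)
      for monitor <- monitors, do: send(self(), {:poll, monitor.id})
      {:ok, Map.new(monitors, &{&1.id, &1})}
    end

    @impl true
    def handle_info({:poll, id}, monitors) do
      case monitors[id] do
        nil ->
          # The monitor was deleted in the meantime; just drop the timer.
          :ok

        monitor ->
          # check_service/1 is a hypothetical function polling one backend.
          delay =
            case check_service(monitor) do
              :on -> :timer.minutes(1)
              :off -> :timer.seconds(10)
            end

          Process.send_after(self(), {:poll, id}, delay)
      end

      {:noreply, monitors}
    end
  end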

The only problem is if you fail to send the message to the GenServer after inserting your monitor (maybe the transaction is committed but then you crash); the GenServer will not know that the database has changed. But that is a common problem so I would just ignore it. If you need to handle that too then you can:

  • Make the inserts/deletes from the server directly, in handle_call/3; that gives you the whole component as a single GenServer.
  • Link to the server before inserting/deleting and unlink after sending the update message to it (sketched below).
  • Have the server poll the database every N minutes (but in that case the server does not even need to keep state: it polls the DB, then polls all services, and starts over one minute later).
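
The second option could look roughly like this on the caller side (MonitorServer, the schema and the changeset are placeholder names):

  # If this function crashes between the insert and the notification, the link takes
  # the (non-exit-trapping) MonitorServer down with it; on restart the server reloads
  # from the database, so the update is not lost.
  def create_monitor(attrs) do
    pid = Process.whereis(MonitorServer)
    Process.link(pid)

    {:ok, monitor} =
      %MyApp.Monitor{}
      |> MyApp.Monitor.changeset(attrs)
      |> MyApp.Repo.insert()

    GenServer.cast(MonitorServer, :reload)
    Process.unlink(pid)
    {:ok, monitor}
  end

Keep in mind the link is bidirectional: if the server goes down during that window, the caller crashes too, so only use this where that is acceptable.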

This may be tangential to OP’s case, but I’m curious about the tradeoffs between a strategy like this, maintaining database entries, versus a separate “monitor” application that polls the main application, holds some description of the main application’s current state, and runs in a different context. You could still have a catastrophic hardware failure that takes down both applications, but that is true for the database as well. I’m just wondering if there might be a middle ground that offers some potential for recovering state on restart while being more performant or carrying less organizational overhead than maintaining a database.


How would this new app avoid using a database if the monitors have to be persisted?

While the OP specified surviving a system restart I wasn’t clear how the system was defined. This is why I said the idea may be tangential to OP’s use case. I also realize that the system defined in my post would not have persistence in every sense. If system is defined as the BEAM instance running the main application, however, and your monitoring application is storing snapshots of the main application state in a different BEAM instance, that state would be recoverable and therefore semi-persistent in the event that the main application’s BEAM instance crashes. If that monitoring application BEAM instance crashes then of course the state is no longer persistent. I was curious if there were potential performance tradeoffs that would make that sort of lesser persistence worthwhile. I’m not really suggesting it as a solution for OP so much as trying to further my own understanding of the problem.

Those suggestions make a lot of sense. Your last point would make this system much simpler. It all depends on what you consider to be ‘an actor’ in the system. Is it a server that monitors one endpoint? Or is it a server that monitors all endpoints at once? Both setups will have their pros and cons.

The linking/unlinking trick you describe sounds useful in some situations where you want to couple (link) two GenServers temporarily.

I do not really think in terms of actors in Elixir, but rather just about concurrency: what are the parts that need to run on their own.

First you write the code that does something, and then if it has to run on its own you make it a process.

Polling external APIs once in a while is a simple function. Polling several of them is just an Enum.map wrapper around this simple function, so to me that does not necessarily need one process per monitor.

Now if you need to do a lot of stuff when a monitor is :off, then maybe it deserves its own process. But if you will just send a message to another process that will do the work then probably not.

We all want to build powerful and elegant systems with lots of features, but it is actually better to write the most basic solution and move on to something else, because it will be good enough 99% of the time. So I would just fetch from the database, ping all services, send error messages, sleep for one minute, and loop. Done.
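
In code, that whole loop could be as small as this (the Repo/schema and the ping/notify helpers are placeholders):

  defmodule MonitorLoop do
    use Task, restart: :permanent

    def start_link(_arg), do: Task.start_link(&loop/0)

    defp loop do
      for monitor <- MyApp.Repo.all(MyApp.Monitor) do
        case ping(monitor) do                          # hypothetical: poll one service
          :ok -> :ok
          {:error, reason} -> notify(monitor, reason)  # hypothetical alerting
        end
      end

      Process.sleep(:timer.minutes(1))
      loop()
    end
  end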