DGen - A distributed GenServer

Ah, I see we have QuiCK at home :slight_smile:

I wrote this somewhere before, but the funny thing about QuiCK is that it’s not interesting at all. Apple just implemented the dumbest possible queue on top of a really, really good database.

The hard part of implementing something like a quick queue is keeping the multi-level index in sync. Except it’s not hard at all, because you can literally just write the two completely disparate index rows along with the user data in a single atomic transaction with perfect consistency, with zero effort, because FDB. That is the power of actually taking the time to design a system with useful guarantees.
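For anyone who hasn’t touched FDB, here is a minimal sketch of that point, calling erlfdb from Elixir. The key layout ("q", "idx_topic", "idx_deadline") is invented for illustration and is not QuiCK’s actual schema:

```elixir
# Hypothetical key layout; the point is the atomicity, not the schema.
db = :erlfdb.open()

msg_id = "msg-0001"
topic = "billing"
deadline = 1_700_000_000
payload = :erlang.term_to_binary(%{body: "hello"})

:erlfdb.transactional(db, fn tx ->
  # The payload and both index rows commit (or fail) as one unit, so the
  # multi-level index can never drift out of sync with the data.
  :erlfdb.set(tx, :erlfdb_tuple.pack({"q", msg_id}), payload)
  :erlfdb.set(tx, :erlfdb_tuple.pack({"idx_topic", topic, msg_id}), <<>>)
  :erlfdb.set(tx, :erlfdb_tuple.pack({"idx_deadline", deadline, msg_id}), <<>>)
end)
```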

I mean I don’t speak for anyone else, but I’ll probably generalize versionstamps by just implementing them exactly like FDB because they’re pretty good. For a literally-serialized system like SQLite any counter will do.
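To illustrate the “any counter will do” remark, a toy sketch (all names invented): in a fully serialized system, a single monotonic counter gives you the same total ordering that FDB versionstamps provide.

```elixir
defmodule ToyVersionstamp do
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> 0 end, name: __MODULE__)

  # Returns a 10-byte big-endian binary, mimicking the shape of an FDB
  # versionstamp so that keys built from it sort in commit order.
  def next do
    n = Agent.get_and_update(__MODULE__, fn n -> {n + 1, n + 1} end)
    <<n::unsigned-big-integer-size(80)>>
  end
end
```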

Watches are bad, though.

1 Like

Oh, so that’s a replicated state machine then, and this one is backed strictly by FoundationDB. There were some other projects that built similar replicated state machines on top of other databases, but I can’t remember any names.

That’s better, but I still don’t understand why input must come from messages. If input were provided to the state machine by a function call (not GenServer.call or cast), you could atomically add batches of input, you could perform dirty actions without this returning-the-closure pattern, and you wouldn’t have the problems of unexpected casts arriving, of these messages carrying temporary data, and so on. And even if a user wants input to come from messages, they could write their own wrapper GenServer with very explicit control over what gets added to the replicated state machine’s input queue and what gets ignored.
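To make that concrete, here is a hypothetical shape of such a function-call API; none of these module or function names exist in DGen today.

```elixir
defmodule MyCounter do
  # The state machine is a pure function of (state, input); nothing can
  # reach it except through an explicit append.
  def init(_args), do: 0
  def apply_input(state, {:add, n}), do: state + n
end

# Hypothetical usage: the whole batch is appended to the replicated input
# log atomically, and stray casts can never sneak in.
#
#   {:ok, machine} = DGen.RSM.start(MyCounter, backend: :fdb)
#   :ok = DGen.RSM.append(machine, [{:add, 1}, {:add, 2}, {:add, 3}])
```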

I’d suggest something like FDBReplicatedStateMachine or fdb_rsm_server or FoundationReplicatedServer, because it describes what your program does. The key words here are “Replicated”, “State machine”, and “FoundationDB”.


I am still reading the library code. So far, there is a lot of room for optimization. For example, there is an unnecessary double await in call. There is also a case dgen_queue:length(...) of 0 -> ... clause, which could be optimized to skip computing the length and just check whether the queue is empty.

There are also some strange design decisions. For example, if dgen_config:init is not called, the config still works; it just silently ignores user-provided values. This also makes it impossible to have two DGenServers with different backends.
The next thing is that the current dgen_backend behaviour simply mirrors the erlfdb interface. I think you should narrow the behaviour and make it more generic, because not all distributed databases support futures, directory, and keyspace operations; otherwise you won’t see any other backends in the future. For example, the sophisticated state encoding/decoding approach you’re using is purely an artifact of the erlfdb implementation. If I were to implement a Postgres backend, I would not need that encoding approach.
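One possible shape of a narrower contract, sketched as an Elixir behaviour with hypothetical callbacks: a backend only has to provide atomic transactions and point reads/writes, not erlfdb’s futures, directories, or keyspaces.

```elixir
defmodule DGen.Backend do
  @moduledoc "Hypothetical narrower backend contract, not DGen's current API."

  @type key :: binary()
  @type value :: binary()

  # Run the function atomically: a Postgres backend could wrap it in a SQL
  # transaction, an FDB backend in erlfdb:transactional/2.
  @callback transact(backend :: term(), (term() -> result)) ::
              {:ok, result} | {:error, term()}
            when result: var

  @callback get(tx :: term(), key()) :: {:ok, value()} | :not_found
  @callback put(tx :: term(), key(), value()) :: :ok
  @callback delete(tx :: term(), key()) :: :ok
end
```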

2 Likes

Also, I think @Asd is probably right about the name making no sense, but there is a strong counterpoint to be made that calling the library “degen server” is extremely funny.

3 Likes

The original idea of a GenServer was that if there is a bug in the execution state, it crashes and starts over, recovering execution from a blank state.

If you start over with the same state, then you have the same bug. If you need persistence, why not just use a proper database?

I read everything, but still can’t get the point. Is it an experiment to learn?

2 Likes

Thanks for taking the time to review the project.

w.r.t. the double await, I think you’re talking about dgen:call/4. The first await is a plain call to a BEAM process. This puts the message onto the durable kv-queue and returns a sentinel key from which the caller can receive the final result. Given the current design, this is necessary because the caller doesn’t know the details of the queue’s identity. Via regular BEAM message passing, DGen allows anyone to push a message, as long as they have a pid or can look one up. I could have instead chosen to represent the queue details in a struct that the caller must hold in order to push. That choice would violate the premise, which was to mimic the GenServer interface, because I like it and find it useful for composing programs. You may disagree with the premise, which is fine, but this is not an unnecessary action.

The second await in dgen:call/4 is receiving that final result, which arrives via the server-pushed resolution of the watch future. In this case the message comes from the storage backend rather than from the DGenServer process. The result of the operation is then retrieved.
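A simplified sketch of that two-await shape; the sentinel-key helpers are hypothetical stand-ins for DGen’s internals, not its real API.

```elixir
defmodule DGenCallSketch do
  # Await #1: a plain BEAM call. The server durably enqueues the request
  # and hands back a sentinel key, so the caller never needs to know the
  # queue's identity.
  def call(pid, request, timeout \\ 5_000) do
    {:ok, sentinel_key} = GenServer.call(pid, {:enqueue, request}, timeout)

    # Await #2: block until the storage backend resolves the watch on the
    # sentinel key, then fetch the final result from the kv store.
    :ok = await_watch(sentinel_key, timeout)
    fetch_result(sentinel_key)
  end

  # Stubs standing in for the backend watch machinery.
  defp await_watch(_sentinel_key, _timeout), do: :ok
  defp fetch_result(_sentinel_key), do: {:ok, :reply}
end
```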

This is not correct. The length of :dgen_queue is computed as the difference of two values in the kv store (number of pushes minus number of pops). There is no key that represents the “emptiness” of the queue, and adding one would force more key conflicts onto the push and pop functions, which would likely slow them down. As it stands now, we retrieve the two values concurrently.
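Roughly like this, with invented key names; both reads are issued as futures before either is awaited, so they overlap.

```elixir
db = :erlfdb.open()

len =
  :erlfdb.transactional(db, fn tx ->
    # Inside a transaction, get/2 returns futures, so the two point reads
    # are in flight concurrently before we wait on the results.
    push_f = :erlfdb.get(tx, "q/push_count")
    pop_f = :erlfdb.get(tx, "q/pop_count")
    [push_bin, pop_bin] = :erlfdb.wait_for_all([push_f, pop_f])

    decode = fn
      :not_found -> 0
      bin -> :binary.decode_unsigned(bin, :little)
    end

    # Length is derived, never stored, so push/pop add no extra conflicts.
    decode.(push_bin) - decode.(pop_bin)
  end)
```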

You’re right. It’s awkward and wrong.

I tried to be clear about this in the post: I don’t know how to put another backend in here, but I desperately want to, and the interface of :dgen_backend will definitely have to change a lot.

So why do this at all? I find FDB Layers useful. They can be composed into higher-level abstractions, and the result is the most ergonomic state management I’ve ever worked with. However, they necessarily tie you to FDB. While I happily run FDB, I don’t want to forever. Since there are other great projects in development that are inspired by FDB, I hope DGen becomes a real Layer that is compatible with those projects. A Postgres backend is not interesting - the community already has Oban.

2 Likes

At risk of being overly pedantic, I’m going to challenge this, but only slightly. The original idea of supervisor is to do this, but gen_server itself is not opinionated about how, when, or why it’s restarted, or if it is at all.

Of course, the design of gen_server is amenable to being used by the supervisor in a powerful and useful way, just like you describe. I’m a direct beneficiary of the genius design of this simple idea.

DGenServer breaks the rules a little bit. It can still be stopped and restarted by the supervisor, but if a poison message is the cause of the crash, it may very well require an operator to intervene - either by correcting database state, fixing a bug in the code, or changing some upstream service. I agree this is a weakness in the design.

This is dismissing FoundationDB as a proper database. Why?

2 Likes

This is a prime example of why asking a one-line “why” question in a forum is such a bad idea. I am sure both of you have good intentions, but because of the lack of common context, trading a bunch of short “why” questions will only steer the discussion further away from truth seeking.

1 Like

Oh not at all. It can be FoundationDB for sure. That was dismissing gen_server as a proper database.

Now I understand the idea further, thanks a lot for taking the time to explain!

Indeed, any stateful program will be at risk of persisting a bugged state. This leads to the uncomfortable realization that one of Erlang’s core ideas is probably wrong, or at least inadequate for large swaths of real-world programs. Aggressive correctness testing (FoundationDB is a good example) is a more fruitful path to ensuring that such states are unreachable.

Another fruitful path is to structure your code in such a way that bugged states are less likely to arise. Programming in a declarative style, where the program rebuilds its state by re-executing itself from the top rather than transitioning between states through piecemeal manipulation, is a helpful strategy. OTP supervisors offer a form of this, but they are fairly primitive. React’s engine is a much more sophisticated tool in this area, as it allows for stateful components with incremental execution and has escape hatches to integrate with non-incremental code.
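A toy contrast of the two styles in GenServer terms; the Store module is a hypothetical source of truth, stubbed with ETS so the sketch is self-contained.

```elixir
defmodule Store do
  # Hypothetical persistent source of truth, stubbed with ETS.
  def insert(item) do
    :ets.insert(table(), {item})
    :ok
  end

  def all, do: for({item} <- :ets.tab2list(table()), do: item)

  defp table do
    case :ets.whereis(:store) do
      :undefined -> :ets.new(:store, [:named_table, :public, :bag])
      _tid -> :store
    end
  end
end

defmodule DeclarativeCart do
  use GenServer

  def init(_opts), do: {:ok, rebuild_state()}

  # Instead of transitioning the in-memory state piecemeal, each write
  # re-derives the whole state from the source of truth, so a transient
  # bug cannot leave a permanently corrupt in-memory state behind.
  def handle_cast({:add_item, item}, _state) do
    :ok = Store.insert(item)
    {:noreply, rebuild_state()}
  end

  defp rebuild_state do
    items = Store.all()
    %{items: items, count: length(items)}
  end
end
```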

Something that looks less like a state machine and more like a React (function) component is what I would like to see. But all experimentation is valuable, and what’s special about the “layers” paradigm is that it enables experimentation. You do not one-shot great tools; they are evolved.

1 Like

How did you come up with that idea? The point of supervisors in Erlang is that if there is abnormal state, then it is better to restart and begin again from a known state. Just like restarting a computer with borked state in RAM.

So you do not restart to the previous state, as that would indeed be pointless; you restart to a “clean” state in the hope that it was a one-time error that was out of your control.

2 Likes

There is no such dichotomy here.

The error might be provoked by a wrong message sent to the process, not by the process’s inner state in the first place. In such a case it makes total sense to restart while preserving the latest state.

The error might be provoked by a wrong state+message combination.

Also, the error might indeed be provoked by a wrong state, but the previous state might have been all right, and it might make sense to roll back to that “previous” state rather than to a blank one.

2 Likes

From my understanding, the OTP framework designs gen_server (called GenServer in Elixir) to act as both a server and a client, similar to the TCP server/client pattern. It maintains state, but developers need to handle persistence themselves if they want to restore the state after a crash. It also receives configuration/options from a supervisor (or via a manual start) and initializes its state in the init callback.

I think this can be annoying for developers when handling state in some cases.

In my opinion, a GenServer with a pluggable architecture, similar to Phoenix, would be more flexible and better aligned with the Elixir style.

Another interesting component is gen_statem. It is quite suitable for working with state transitions. I saw that Elixir implemented it in the early days, but it seems to have been discontinued.

For DGen, at first glance, it looks like a remote GenServer running on another node. If the goal is to share or persist state, this could be achieved by adding an adapter layer (similar to Plug). I think this would help users support more use cases, such as storing data in Redis, Postgres, etc.

Some programs have pesky correctness conditions like “do not lose committed data ever”.

Of course, but that isn’t something that OTP provides for the gen_* modules for you. If you need such behaviour, then it is up to you to decide what “committed data” is and how to tell the user it was committed at all.

2 Likes

All I’m saying is that “turn it off and on again” is inadequate for maintaining availability in a persistent system because you will either persist the bugs or lose data, neither of which is acceptable. You need correctness testing.

1 Like

Of course you need correctness testing. However, it is a chicken-and-egg problem in the real world: what do you do before you reach absolute correctness? Nothing? With OTP you can at least limp on and monitor the log file, find out what went wrong, add a test case, and fix the bug for good. It gives you a path to correctness, but not the correctness itself.

4 Likes

Some stuff may be acceptable in some cases. The perfect example here is a telephone switch (what a coincidence):

In the case of a telephone call, if there is a bug in the software, we want to reduce the impact on the overall system. If it was a one-off issue, then the callers will call again, conclude that “something broke”, and everyone will go back to their lives. But if a bug in a single process (call) can cascade to other calls, that is highly undesirable.

It is a similar thing with HTTP services: if there is some issue on the line, the user will simply hit “refresh”. If that issue isn’t common and was a one-off, then no one will notice it (the browser may even refresh on its own in some cases).

There are a lot of systems (especially network-related ones) where simply restarting the process (often without even needing to do so automatically) is enough error handling for one-off errors.

1 Like

I think that you’re missing the point. Some time ago I saw your Peeper library, which stores the state of a GenServer in ets and loads it back when the GenServer restarts. It got me thinking, and I decided that this approach is just reinventing the wheel.

You are completely right that restoring the latest correct state after a crash is the best option. However, it is not a safe assumption that this latest correct state is the state the GenServer was in right before the crash, or before it received the message which crashed it. Even more, there is no generic answer for how to decide which state (the one the GenServer is in at some moment) is correct and which is not.

That’s why GenServer has callbacks. Namely, init/1 is the callback that executes code which has to recreate a state that is correct for sure. This approach is generic because it imposes no expectations and lets the developer decide which state is correct and which is not. If you have a bug in init/1 that returns an incorrect state, then your server will restart with an incorrect state, but that’s just one callback, and it’s a function, which may return different results on different runs. That means a GenServer will recover if init/1 returns a correct state at least once.
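A sketch of that idea: init/1 re-derives a known-correct state from a trusted source every time (load_pending_orders/0 is a hypothetical placeholder for, say, a database query).

```elixir
defmodule OrderTracker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Whatever state the previous incarnation held is discarded; the new
    # state is recomputed from data we trust. Even if this derivation is
    # buggy today, the server recovers as soon as it succeeds once.
    {:ok, %{pending: load_pending_orders()}}
  end

  # Hypothetical: in a real application this would query a database.
  defp load_pending_orders, do: []
end
```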

Rolling back to one of the previous states imposes a hard requirement on the developer, who now needs to write the code so that no handle_* callback ever returns an incorrect state. If one returns an incorrect state even once, you’re stuck with it forever: every subsequent crash will restart the server with the incorrect state, indefinitely persisting it without any chance of automatic recovery. That means such a GenServer will recover only if all callbacks return a correct state all the time.
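A stripped-down illustration of that failure mode (not Peeper’s actual implementation): state is saved on every callback and restored verbatim in init/1, so one bad transition is resurrected by every subsequent restart.

```elixir
defmodule StickyState do
  use GenServer

  # Assumes the :sticky_state ETS table is created by a longer-lived
  # process (e.g. the supervisor), so it survives this server's crashes.
  @table :sticky_state

  def init(_opts) do
    # A corrupt saved state is restored just as faithfully as a correct one.
    state =
      case :ets.lookup(@table, :state) do
        [{:state, saved}] -> saved
        [] -> %{}
      end

    {:ok, state}
  end

  def handle_cast({:put, key, value}, state) do
    new_state = Map.put(state, key, value)

    # If new_state is ever wrong, every future restart begins from it.
    :ets.insert(@table, {:state, new_state})
    {:noreply, new_state}
  end
end
```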

1 Like

I think that not all the people whose opinion differs from yours are missing the point :slight_smile:

If there were a rock-solid solution allowing the developer to properly recover from anything, preserving a proper, good state, it would have been incorporated into OTP already, I’m 102% positive. Obviously, there is no such silver bullet.

It does not mean the developer cannot narrow their use cases to some less general surface. For some cases, like the aforementioned “wrong message, correct state,” Peeper does everything right. If I can ensure that my code does not corrupt the state under any circumstances, Peeper prevents a lot of hassle. Does it work for everyone under any circumstances? Of course not. Small libraries don’t usually cover each and every need of every developer across the world; the standard library (OTP) does.

1 Like

If I knew that, wouldn’t it be simpler to reject the wrong message and keep the GenServer humming?

Your library might be useful for cases like LiveView, where the life span of the process is tied to the health of the socket. However, in the case of LV, the restart of the process is triggered by an async user action, not a supervisor, so couldn’t there be race conditions between the serialization and de-serialization, thus corrupting the state for good?

1 Like