What can go wrong with GenServers?

Howdy! I’m a researcher trying to understand better how people use Elixir. I have a fair amount of personal experience with Elixir, but I need more than one data point. :slight_smile: I’m particularly interested in GenServers—particularly how they can go wrong.

Have you ever encountered a GenServer that was sensitive to the order of certain messages? E.g. imagine you had a GenServer acting as a proxy for a web service; you might need to send the server :connect to get a token, and only after that could you send it {:request, body} messages.
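To make the scenario concrete, here is a minimal sketch of such a proxy. The module name, the client functions, and the fake token are all hypothetical; the point is only that a request arriving before :connect gets an explicit error instead of crashing the server.

```elixir
defmodule ProxySketch do
  use GenServer

  # Client API (all names here are hypothetical).
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def connect(server), do: GenServer.call(server, :connect)
  def request(server, body), do: GenServer.call(server, {:request, body})

  @impl true
  def init(:ok), do: {:ok, %{token: nil}}

  @impl true
  def handle_call(:connect, _from, state) do
    # Stand-in for fetching a real token from the web service.
    token = "token-" <> Integer.to_string(System.unique_integer([:positive]))
    {:reply, :ok, %{state | token: token}}
  end

  # A request that arrives before :connect is rejected, not crashed on.
  def handle_call({:request, _body}, _from, %{token: nil} = state) do
    {:reply, {:error, :not_connected}, state}
  end

  def handle_call({:request, body}, _from, %{token: token} = state) do
    # Stand-in for the real HTTP call.
    {:reply, {:ok, {token, body}}, state}
  end
end
```

Calling ProxySketch.request/2 on a fresh server returns {:error, :not_connected}; after ProxySketch.connect/1 the same call returns an {:ok, _} tuple.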

When have you had bugs with GenServers?


One dangerous failure mode with a GenServer is leaking memory when handling large binaries. Depending on your orchestration, running out of memory can easily take the whole server down too.

There are a lot of articles on this topic; here is one of them: https://medium.com/coletiv-stories/genservers-memory-issues-ef42cc42e3e9
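One common way this happens (an assumption here; the linked article may focus on other causes) is a long-lived process keeping a small slice of a large binary in its state. The slice is a sub-binary that still references the whole original, so the original cannot be freed. A minimal sketch with a hypothetical HeaderCacheSketch module, using :binary.copy/1 to break that reference:

```elixir
defmodule HeaderCacheSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def put(server, payload) when is_binary(payload), do: GenServer.call(server, {:put, payload})
  def get(server), do: GenServer.call(server, :get)

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:put, payload}, _from, _state) do
    # binary_part/3 returns a sub-binary that still points at the
    # whole payload, which would keep the large binary alive.
    header = binary_part(payload, 0, min(16, byte_size(payload)))

    # :binary.copy/1 creates an independent copy, so only the small
    # header is retained in the state.
    {:reply, :ok, :binary.copy(header)}
  end

  def handle_call(:get, _from, state), do: {:reply, state, state}
end
```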


A GenServer handles messages in order, so message ordering is usually not a problem. My usual bugs with GenServers are:

  • Startup sequence. For example, GenServer A needs to be started before GenServer B, but I did not have the right order. Or, if the system is not properly designed, who calls whom first may not even be deterministic.
  • Deadlocks. For example, GenServer A calls GenServer B, and B calls back into A.
  • Timeouts. In pathological cases, one GenServer could have a large backlog of messages, so calls to it could time out.
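The deadlock case can be reproduced in a few lines. This is only a sketch with a hypothetical MutualCallSketch module: the second server uses a short 100 ms timeout and catches the exit so the cycle is reported instead of hanging for the default 5 seconds.

```elixir
defmodule MutualCallSketch do
  use GenServer

  # Started unlinked so the demo cannot take the caller down.
  def start, do: GenServer.start(__MODULE__, :ok)
  def ask(a, b), do: GenServer.call(a, {:ask, b})

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  # A handles :ask by calling B, so A is blocked until B replies.
  def handle_call({:ask, peer}, _from, state) do
    {:reply, GenServer.call(peer, {:ask_back, self()}), state}
  end

  # B calls back into A, which cannot answer while it waits on B.
  def handle_call({:ask_back, caller}, _from, state) do
    result =
      try do
        GenServer.call(caller, :ping, 100)
      catch
        :exit, {:timeout, _} -> {:error, :deadlock}
      end

    {:reply, result, state}
  end

  def handle_call(:ping, _from, state), do: {:reply, :pong, state}
end
```

With two servers a and b started via MutualCallSketch.start/0, MutualCallSketch.ask(a, b) comes back with {:error, :deadlock} once the inner call times out.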

As @derek-zhou points out, message-ordering isn’t generally much of a problem. However, the code you described sounds like it has a different issue that’s more common: using one process when you really want a supervision tree.

Instead of a single “proxy” that tracks all the connections made through it, a “BEAMier” approach would make the initial connect spawn a new process focused solely on that connection.
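As a rough sketch of that shape (module names are hypothetical, and the token is passed in rather than fetched), each connect starts one child under a DynamicSupervisor. A connection process only exists once it is connected, so the "request before connect" state disappears:

```elixir
defmodule ConnectionSketch do
  # :temporary means a crashed connection is not restarted.
  use GenServer, restart: :temporary

  def start_link(token), do: GenServer.start_link(__MODULE__, token)
  def request(conn, body), do: GenServer.call(conn, {:request, body})

  @impl true
  def init(token), do: {:ok, token}

  @impl true
  def handle_call({:request, body}, _from, token) do
    # Stand-in for the real call to the web service.
    {:reply, {:ok, {token, body}}, token}
  end
end

defmodule ConnectionPoolSketch do
  def start_link do
    DynamicSupervisor.start_link(strategy: :one_for_one)
  end

  # Each connect spawns a process that owns exactly one connection.
  def connect(sup, token) do
    DynamicSupervisor.start_child(sup, {ConnectionSketch, token})
  end
end
```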

On the flip-side of “too much state in one process” is “no state in the process”, the dreaded “using GenServers for code organization”. I’ll leave further discussion to the docs.

A gotcha that’s easy to miss until it becomes a headache in production: the restart threshold max_restarts in supervisors needs to be tuned once there are many children. Very frustrating to have 3 crashes in 5s bounce your DynamicSupervisor with 3k children :crying_cat_face:
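For reference, the threshold is set when the supervisor starts. The value 100 below is purely illustrative, not a recommendation; the right number depends on how many children you have and how often they legitimately crash.

```elixir
{:ok, sup} =
  DynamicSupervisor.start_link(
    strategy: :one_for_one,
    # The defaults are max_restarts: 3 within max_seconds: 5.
    max_restarts: 100,
    max_seconds: 5
  )
```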

When that supervisor restarts (or on a cold system boot), you can also run into another scaling problem: the “thundering herd”, where those 3k children start up and ALL try to make a SQL query in the same 10ms. Combine those DB timeouts with a low max_restarts on the supervisor and you are now officially Having A Bad Time.
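One possible mitigation (a sketch, not the only approach) is to have each child return from init quickly and defer its expensive work by a random delay. JitteredWorkerSketch and max_jitter_ms are hypothetical names, and the SQL query is replaced by a stand-in.

```elixir
defmodule JitteredWorkerSketch do
  use GenServer

  # max_jitter_ms is a hypothetical tuning knob.
  def start_link(max_jitter_ms), do: GenServer.start_link(__MODULE__, max_jitter_ms)
  def loaded?(server), do: GenServer.call(server, :loaded?)

  @impl true
  def init(max_jitter_ms) do
    # Return quickly so the supervisor is not blocked, and defer the
    # expensive load by a random delay between 1 and max_jitter_ms.
    Process.send_after(self(), :load, :rand.uniform(max_jitter_ms))
    {:ok, %{loaded?: false}}
  end

  @impl true
  def handle_info(:load, state) do
    # Stand-in for the real SQL query.
    {:noreply, %{state | loaded?: true}}
  end

  @impl true
  def handle_call(:loaded?, _from, state), do: {:reply, state.loaded?, state}
end
```

The trade-off is that callers must cope with a worker that is alive but not yet loaded.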


messages in order

Yes, I know the BEAM guarantees that messages will be delivered in the order that they were sent from a process, but there isn’t any guarantee on the order that two different processes’ messages arrive.

More generally, I’m looking for cases where there’s some implicit protocol—maybe (hopefully!) explained in the docs—but where clients of the GenServer have to be mindful about e.g. doing some kind of set-up with the GenServer prior to sending other kinds of messages.

Without knowing what kind of problem you are trying to solve, I do not know what to say. A GenServer is a stateful entity that pretends to be a collection of functions. If you require the client to be “mindful” of the state of the GenServer, then the illusion of function calls quickly fades, and you also need to duplicate part of the state on the client side, which means you no longer have a single source of truth. I hope the challenges that you are facing are big enough to warrant all this complexity.

Just to clarify, I’m not trying to build anything myself per se; I’m investigating how people use GenServers, some problems they might encounter, and potential ways to help fix those problems.

require the client to be “mindful” of the state of the GenServer…

It might be just as simple as remembering to perform one action before another. A lot of people seem to work hard on making “bulletproof” GenServers—which is a great thing from a software engineering perspective. Something tells me there should be some cases where it’s hard to make GenServers as bulletproof as one might like.

As you have already mentioned there are no real guarantees in which order the messages sent from different processes will arrive. If this is really important to you then you will have to synchronise the sending processes in some way.

One thing which is guaranteed is that the GenServer process handles the messages in the order in which they arrive, and it only handles one message at a time. So, for example, when it gets a call message it will call the handle_call callback, and the GenServer top loop won’t handle any other messages until handle_call returns.

Note, however, that if one of your callbacks communicates with other processes (this is very, very common), then because there is only one message queue per process (NO, there is no way around this), the replies you are waiting for may be interspersed with more requests coming into the GenServer. It is up to you to make sure your message tagging ensures that you don’t receive any GenServer requests yourself, because then they are gone. Also remember that any messages you leave in the message queue will be picked up by the GenServer top loop and will most likely be processed in a handle_info callback.
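A minimal sketch of safe tagging, assuming a hypothetical worker process (DoublerSketch) that the GenServer talks to from inside handle_call. The receive matches only on a fresh reference, so queued GenServer requests are left alone for the top loop.

```elixir
defmodule DoublerSketch do
  # A plain process standing in for "some other process".
  def start, do: spawn(&loop/0)

  defp loop do
    receive do
      {:double, from, ref, n} ->
        send(from, {ref, n * 2})
        loop()
    end
  end
end

defmodule TaggedReceiveSketch do
  use GenServer

  def start_link(worker), do: GenServer.start_link(__MODULE__, worker)
  def double(server, n), do: GenServer.call(server, {:double, n})

  @impl true
  def init(worker), do: {:ok, worker}

  @impl true
  def handle_call({:double, n}, _from, worker) do
    ref = make_ref()
    send(worker, {:double, self(), ref, n})

    # Match only the reply carrying our unique ref. Everything else,
    # including other callers' requests, stays in the mailbox.
    receive do
      {^ref, result} -> {:reply, {:ok, result}, worker}
    after
      1_000 -> {:reply, {:error, :timeout}, worker}
    end
  end
end
```

A catch-all clause in that receive is exactly what would swallow other callers' requests.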

Remember that the BEAM/Erlang/Elixir process semantics are very simple and straightforward: one message queue where everything is entered in order, and there are no interrupts. It’s very much the KICASS principle. :wink: :laughing:


It is up to you to make sure your message tagging ensures that you don’t receive any GenServer requests, because then they are gone.

I’m a bit lost on that and would love more explanation.

He’s saying that if your GenServer process sends and receives messages itself, you must ensure that your receive clauses match only the messages you expect, and not messages that are meant for your handle_call/handle_cast callbacks. Otherwise your custom code consumes the message and your handle_call callback is never invoked, simply because the GenServer module’s own receive loop never sees it.


Yes. The thing to really point out is that there are no “special” messages, just messages. This is why you need to get the tagging right so that the server and your callback code don’t get mixed up.


Indeed, this is a confusing trap for beginners, especially if you come from languages where the order of functions doesn’t matter. Thank God for compiler warnings.

You can read about it here, it’s IMO well-explained: Elixir, A Little Beyond The Basics - Part 8: genservers

One very important excerpt about the process messages:

# send
:hello

# cast
{:"$gen_cast", :hello}

# call
{ :"$gen_call",
  {#PID<0.110.0>, [:alias | #Reference<0.426212949.4092919813.244647>]},
  :hello
}

GenServer.cast and GenServer.call basically just send messages with an opinionated format. There’s no magic, just stuff that’s handled for you so you have an easier time. That’s all, really.

But if you want to start peeking under the hood, you have to make sure you either don’t match on these special messages, or you do match them but also forward them on; otherwise GenServer.cast and GenServer.call will not work.
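To see that there really is no magic, a plain process can answer GenServer.call and GenServer.cast just by matching those shapes. This is a sketch for illustration only: RawServerSketch is a hypothetical module, and the tuple formats are internal details that you should not depend on in real code.

```elixir
defmodule RawServerSketch do
  # report_to is the pid that gets told about casts and plain sends.
  def start(report_to), do: spawn(fn -> loop(report_to) end)

  defp loop(report_to) do
    receive do
      # What GenServer.call sends. Replying through GenServer.reply/2
      # keeps the exact shape of `from` opaque.
      {:"$gen_call", from, request} ->
        GenServer.reply(from, {:echo, request})
        loop(report_to)

      # What GenServer.cast sends.
      {:"$gen_cast", request} ->
        send(report_to, {:got_cast, request})
        loop(report_to)

      # Anything else is a plain send/2.
      other ->
        send(report_to, {:got_info, other})
        loop(report_to)
    end
  end
end
```

GenServer.call(pid, :hello) against such a process returns {:echo, :hello}, even though no GenServer module is involved on the receiving side.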
