Design issue in Pooly example from The Little Elixir and OTP Guidebook

jaggard · November 5, 2017, 2:05pm

I am making my way through the Little Elixir & OTP book. I like the philosophy behind this book: it doesn’t mess about with lots of chapters teaching you about recursion and pattern matching for the nth time. Instead it jumps straight into the core principles of Erlang/Elixir, i.e. achieving reliability with processes and supervisors.

By chapter 6 it is taking the reader through a complex example of creating a worker process pool. Now you start to see why most books opt for simpler examples, as errors start to creep into the text and the code. Viewing the errata on the Manning website is a must, but it doesn’t cover all the issues.

I am digressing… Anyway, by page 155 you have a supervision tree with a set of worker processes that are supervised by a Worker Supervisor, which in turn is supervised by a pool supervisor. There is a pool_server at the same level as the Worker Supervisor (<0.119.0> is the Worker Supervisor).

Here is the code for the Worker Supervisor:

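# Worker Supervisor callback, using the Supervisor.Spec helpers from the book
def init([pool_server, {m, f, a}]) do
    # Link this supervisor process to the pool server
    Process.link(pool_server)
    worker_opts = [restart: :temporary,   # terminated workers are never restarted
                   shutdown: 5000,        # ms allowed for a clean shutdown
                   function: f]           # name of the worker's start function

    children = [worker(m, a, worker_opts)]

    opts = [strategy: :simple_one_for_one,
            max_restarts: 5,
            max_seconds: 5]

    supervise(children, opts)
end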

As far as I can tell it doesn’t do anything! The restart option is set to temporary so it doesn’t restart any of the workers if they are killed.

The restart is actually handled by the Pool Server as this has been made into a system process to trap :EXIT signals and has a link to each of the worker process (as you can see in the diagram).
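To make the mechanism concrete, here is a minimal sketch of a server that traps exits and replaces linked workers itself. It is not the book's Pooly code: the module name `TrapExitSketch.Server` is hypothetical, and plain `Agent` processes stand in for the real workers.

```elixir
defmodule TrapExitSketch.Server do
  use GenServer

  # Hypothetical stand-in for a pool server; all names are illustrative.
  def start_link(size), do: GenServer.start_link(__MODULE__, size)
  def workers(server), do: GenServer.call(server, :workers)

  @impl true
  def init(size) do
    # Trapping exits converts a linked worker's exit signal into a message.
    Process.flag(:trap_exit, true)
    {:ok, Enum.map(1..size, fn _ -> start_worker() end)}
  end

  @impl true
  def handle_call(:workers, _from, workers), do: {:reply, workers, workers}

  @impl true
  def handle_info({:EXIT, pid, _reason}, workers) do
    if pid in workers do
      # Replace the dead worker by hand, as a supervisor would otherwise do.
      {:noreply, [start_worker() | List.delete(workers, pid)]}
    else
      {:noreply, workers}
    end
  end

  # Agent.start_link/1 runs inside the server, so the worker is linked to it.
  defp start_worker do
    {:ok, pid} = Agent.start_link(fn -> :idle end)
    pid
  end
end
```

Killing one of the pids returned by `TrapExitSketch.Server.workers/1` should leave the server running with a fresh replacement in its list, which is the "manual supervisor" behaviour being questioned here.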

Why does the Worker Supervisor exist in this instance? Why not make the Pool server a real supervisor rather than using the manual method of trapping exit signals and restarting workers?

I don’t mean to sound disparaging about this book. It has obviously engaged me as I am spotting errors and questioning the code - some books are so dull that they go in one ear (eye?) and out the other…

kokolegorille · November 5, 2017, 3:32pm

I also started with this book, as it quickly shows the promise of concurrency, instead of focusing on what FP is.

For your question, I don’t think they share the same responsibility: the worker supervisor is in charge of starting workers, while the pool server manages… the pool, and it ensures there are no more than x workers at a time.

Using a restart strategy will not work under low load, as there will always be the same number of workers, even if the load does not require them.

The pool server lets you check workers in and out on demand, provided they do not exceed the configured limit.

The worker supervisor's role might look limited, as it only starts temporary workers, but it is the place where you can get the list of running workers… with Supervisor.which_children/1
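As a rough illustration of listing running workers, the sketch below uses `DynamicSupervisor`, the modern replacement for `:simple_one_for_one`. This is an assumption on my part, since the book's code predates it, and the `Agent` workers are placeholders.

```elixir
# Start a dynamic supervisor and three temporary workers under it.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

worker_spec = %{
  id: :worker,
  start: {Agent, :start_link, [fn -> :idle end]},
  restart: :temporary
}

for _ <- 1..3 do
  {:ok, _pid} = DynamicSupervisor.start_child(sup, worker_spec)
end

# Each entry has the shape {id, pid, type, modules}; keep only live worker pids.
worker_pids =
  for {_id, pid, :worker, _modules} <- DynamicSupervisor.which_children(sup),
      is_pid(pid),
      do: pid

IO.inspect(length(worker_pids), label: "running workers")
```

The supervisor ends up being the single place that knows which worker processes exist, even though it never restarts them.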

Also, it is based on a real library, poolboy. So the book explains how it was designed, as an example of supervision tree.

Also, it does not use the most recent version of Elixir.

But as with anything with OTP in the title, the book is good at showing you what you might expect when working with Elixir/Erlang soft concurrency.

dom · November 6, 2017, 12:08am

This is a classic example of the pattern discussed here: The basic Erlang service ⇒ worker pattern – The Intellectual Wilderness

Why not just call game_mob_sup directly? For two reasons:

Defining spawn_mob/N within the supervisor still requires acquisition of world configuration and current game state, and supervisors do not hold that kind of state, so you don’t want data retrieval tasks or evaluation logic to be defined there. Any calls to a supervisor’s public functions are being called in the context of the caller, not the supervisor itself anyway. Don’t forget this. Calling the manager first gives the manager a chance to wrap its call to the supervisor in state and pass the message along, which is quite natural.

game_mob_sup is just a supervisor, it is not the mob service itself. It can’t be. OTP already dictates what it is, and its role is limited to being a supervisor (and in this particular case of dynamic workers, a simple_one_for_one supervisor at that).

jaggard · November 7, 2017, 7:00pm

Thanks for the replies and the link to the pattern.

In the link, why does game_mob_sup exist at all? The text describes how game_mob_sup should not restart children; that is up to game_mob_man. So game_mob_man must monitor the lifetime of the child processes (e.g. by trapping EXIT signals), which means game_mob_man is itself a form of supervisor. game_mob_man could also start the child processes by simply calling game_mob:start_link directly. It seems overkill to create a supervisor module just for that.

What is the purpose of game_mob_sup?

The above can be applied to the Pooly example. The Pooly_server can call start_link directly on the worker processes module. Why is the worker process creation delegated to the Worker_supervisor? What is it giving us?

sasajuric · November 7, 2017, 9:39pm

I can’t say exactly for this example (I’d have to reread it), but I can give a more general answer.

The essential roles of a supervisor process are starting and stopping processes. In other words, a supervisor is a lifecycle manager. Having processes under a supervisor ensures that a termination of the supervisor will also terminate its descendants recursively. So if some larger portion of the subtree is taken down, you can be sure that no dangling processes are left behind.
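A small sketch of that lifecycle guarantee, with two placeholder `Agent` workers (the child ids are made up for illustration): once the supervisor is stopped, none of its children should still be alive.

```elixir
children = [
  %{id: :worker_a, start: {Agent, :start_link, [fn -> :a end]}},
  %{id: :worker_b, start: {Agent, :start_link, [fn -> :b end]}}
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Remember the child pids before taking the supervisor down.
pids = for {_id, pid, _type, _modules} <- Supervisor.which_children(sup), do: pid

# Supervisor.stop/1 returns only after the supervisor has terminated,
# and the supervisor shuts its children down before it exits.
:ok = Supervisor.stop(sup)

any_left = Enum.any?(pids, &Process.alive?/1)
IO.inspect(any_left, label: "dangling children?")
```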

This is why I say that in most cases, workers should sit directly under the supervisor process. It’s not a strict rule, so it’s fine if you sometimes override it for some specific reasons, but it’s a good default which reduces the chances of leaving dangling processes. Therefore, I think that starting workers under some supervisor is never a bad practice.

That said, there are occasional cases where it makes more sense to bypass a supervisor and start workers directly under another worker. This is definitely possible (and sometimes done in practice), but you need to reinvent some parts of the Supervisor abstraction, so you need some tangible reasons for that. If you’re not really sure whether you want to start a child under a supervisor or not, I’d say that it’s better to choose a supervisor as the parent.

One special exception is one-off finite-time processes, such as tasks. We frequently start tasks directly under workers with Task.async. But this is usually fine, because tasks don’t run indefinitely, and they shouldn’t trap exits, so the problem of dangling processes should not bite us here. If the parent of the task(s) terminates, they will be taken down as well, owing to the link mechanism.
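The link behaviour for tasks can be sketched like this. The parent process and the `:crash` message are made up for the demonstration, and the task sleeps forever only so that we can observe it being taken down.

```elixir
test_pid = self()

# A throwaway parent that starts a task and then waits to be told to crash.
parent =
  spawn(fn ->
    task = Task.async(fn -> Process.sleep(:infinity) end)
    send(test_pid, {:task_pid, task.pid})

    receive do
      :crash -> exit(:boom)
    end
  end)

task_pid =
  receive do
    {:task_pid, pid} -> pid
  end

# Monitor the task so we can observe its exit without being linked to it.
ref = Process.monitor(task_pid)
send(parent, :crash)

outcome =
  receive do
    {:DOWN, ^ref, :process, ^task_pid, _reason} -> :task_went_down
  after
    1_000 -> :timeout
  end

IO.inspect(outcome, label: "task after parent exit")
```

Because `Task.async/1` links the task to its caller and the task does not trap exits, the parent's abnormal exit propagates to it.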

jaggard · November 8, 2017, 12:21pm

Thank you sasajuric. I was missing the point of the supervisor existing to stop the worker processes.

In the specific case of the Pooly example the Pooly_server creates links to the worker processes and so does not need the supervisor to stop the worker processes if some larger portion of the sub tree is taken down. Maybe this example would be better if the server only monitored the workers rather than linking them?
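For comparison, this is roughly what monitoring looks like (a sketch with an unlinked `Agent` as the worker, not code from the book). A monitor is one-directional: the observer learns about the worker's death, but the observer's own death does not touch the worker.

```elixir
# Start a worker without linking it to the current process.
{:ok, worker} = Agent.start(fn -> :idle end)

# An observer that monitors the worker and then crashes.
observer =
  spawn(fn ->
    Process.monitor(worker)
    exit(:observer_crashed)
  end)

obs_ref = Process.monitor(observer)

receive do
  {:DOWN, ^obs_ref, :process, ^observer, _reason} -> :ok
end

# Unlike with a link, the worker is unaffected by the observer's exit.
worker_survived = Process.alive?(worker)

# In the other direction, the monitoring process gets a :DOWN message.
ref = Process.monitor(worker)
Process.exit(worker, :kill)

result =
  receive do
    {:DOWN, ^ref, :process, ^worker, reason} -> {:worker_down, reason}
  after
    1_000 -> :timeout
  end

IO.inspect({worker_survived, result})
```

With monitors only, nothing would stop the workers when the server goes away, so something else (such as a supervisor) would have to own their shutdown.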

However, in the general case (such as the example provided by dom: https://zxq9.com/archives/1311) a supervisor is required, as there may not be another process linked to the workers that will stop them in case of a failure further up.

Thank you to all respondents for your input.

dom · November 8, 2017, 1:16pm

Supervisors have a few other roles in the process lifecycle besides just start / stop:

Log progress and error reports to SASL
Kill children if they do not shut down within the given timeout
Support code replacement (details in release handling and appup cookbook)

It’s safer not to mix these concerns with application logic, since bugs in a supervisor have a lot more impact than bugs in a worker.

I don’t have the book around, but I would think it’s using links to ensure the workers die if the server crashes and gets restarted. Otherwise when the server restarts its state would be out of sync with the pool’s real state; it has no way of knowing which workers are available and which are not.

This could be avoided by making the top supervisor a one_for_all, so that if the server crashes the pool is restarted as well. Then monitors would be enough.
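A rough sketch of the one_for_all idea, using an `Agent` as a stand-in for the pool server and a `DynamicSupervisor` as the worker supervisor (both choices and the child ids are assumptions for illustration). Killing the server should cause the worker supervisor to be restarted as well, so no stale workers survive.

```elixir
children = [
  %{id: :pool_server, start: {Agent, :start_link, [fn -> %{} end]}},
  %{
    id: :worker_sup,
    start: {DynamicSupervisor, :start_link, [[strategy: :one_for_one]]},
    type: :supervisor
  }
]

{:ok, top} = Supervisor.start_link(children, strategy: :one_for_all)

# Helper to look up the current pid of a child by its id.
pid_of = fn id ->
  {^id, pid, _type, _modules} = List.keyfind(Supervisor.which_children(top), id, 0)
  pid
end

old_server = pid_of.(:pool_server)
old_sup = pid_of.(:worker_sup)

# Wait until the old worker supervisor is gone after the server is killed.
ref = Process.monitor(old_sup)
Process.exit(old_server, :kill)

receive do
  {:DOWN, ^ref, :process, ^old_sup, _reason} -> :ok
after
  1_000 -> :timeout
end

new_server = pid_of.(:pool_server)
new_sup = pid_of.(:worker_sup)
IO.inspect(new_sup != old_sup, label: "worker supervisor restarted?")
```

Since the whole pool restarts together, the server's fresh state and the (empty) set of workers start out in sync, and the server would only need monitors to track workers afterwards.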

jaggard · November 9, 2017, 8:01pm

Thank you dom. That’s very helpful. I haven’t got as far as SASL or code replacement yet.