Elixir Blog Post: The many and varied ways to kill an OTP Process

paulanthonywilson · May 31, 2021, 2:16pm

This is an overview of different ways to try and kill an OTP process (in Elixir) and the behaviour to expect when that happens. I know this is fairly basic stuff, but it’s the kind of thing I find myself forgetting and having to recheck, and the documentation I know about is a bit scattered.

I hope to follow up with the impact on linked and monitoring processes.

As a bonus, the post is an executable LiveBook page, which you can download and execute yourself.

Posted via Devtalk (see this thread for details).

the_wildgoose · June 7, 2021, 3:39pm

This is excellent. Thank you so much for writing this!

Just one point: I think there is a further difference in behaviour if the process doing the killing is the direct parent of the child? I’m actually not sure I understand the specifics, but there was a post on elixirforum a month or so back about this and the insight was that the child can’t catch its own exit if the thing sending the message is the direct parent? I think if this is correct then it would be quite nice to get this documented in your rather comprehensive write-up on this stuff!

I confess I still have a lot of doubts over how to correctly use OTP and especially how best to link some children and guarantee cleanup. I quite like the “parent” library for some of these things (“director” seems nice as well). I think it’s clear from your document that there are a few preferred ways to kill processes if you want to allow them to call their terminate() function. It does leave me with some doubts on how to handle this as a library author as well…

Looking forward to seeing whether you do a follow-up on this one. Perhaps with regards to how to build robust links, clean up on failure, monitor callers to ensure we terminate when our invoker terminates, etc?

paulanthonywilson · June 8, 2021, 2:29pm

I don’t really follow. I am not sure I’ve seen the concept of parent/children in processes (beyond that a supervisor starts and supervises its children). Anyone?

Thanks though. I have just pushed up a post on what happens when a linked process exits.

Sebb · June 8, 2021, 7:11pm

I think so too. Links are bidirectional, see Process — Elixir v1.12.3

But you can implement some stuff that makes one process behave like a parent.
See Parent - custom parenting of processes

Thank you for this great and deadly blog-series!

the_wildgoose · June 14, 2021, 10:19am

I think you raise a good point and something which is currently causing me a lot of doubt when I’m building elixir code myself.

Consider writing a new genserver where part of its functionality is knowing whether a file has changed, so you might use, say, the “FileWatcher” library, with which you can start a linked genserver that watches your file.

However, I find two practical problems here:

1) If for some reason you find yourself handling EXIT signals, eg perhaps you needed to do some cleanup if the genserver you started exits, then you need a lot of boilerplate to handle cases of your process dying, child dying, etc. I’m not sure I can see how to do this more neatly if I need to maintain the semantics that if the parent process dies, it needs to cascade that to the child (ie a link), but in reverse you just want a “monitor”, ie to know if the child goes down?

2) Third party libraries tend to be quite well isolated, but when writing some integrated code, it might be more tempting to couple the parent and child. Also, this could happen in the case of running tests, where the parent will start a genserver to test, but then want to terminate it after tests are completed. If the child is trapping exits then it turns out there are some cases it can’t trap. There is a blog article that I now can’t find, where this was discussed and it appears that if you send an exit message to the child and the “from” PID is the literal parent of the genserver, then the genserver is killed without the exit being trappable.

The blog article had discovered this in the context of starting/stopping genservers in test and they solved it by simply using a different PID to send the exit message to the child under test. However, I speculate that it’s a trap for the unwary if handling your own exits. eg consider my case 1) above, I might want to trap my own exit to cleanup, then send an exit to the child, which in turn wants to catch exits to cleanup, but based on this, it appears you need to be careful about how you cascade the EXIT message to the child?

the_wildgoose · June 14, 2021, 10:56am

Bigger picture, and I wonder if you might offer your thoughts on this. I find myself struggling with several architecture issues with Erlang/Elixir, which I don’t find well covered, and largely the underlying issue is about understanding the detail of exit signals, usually in the context of either needing to clean up some resource or cascading that failure to something else in a controlled way.

For example,

1 ) Given a manager process that starts worker processes: if workers dying shouldn’t take down the manager process then it would seem like the right structure is to start the workers under their own supervisor. However:

Q: If you wanted the termination of the manager to stop the children, then how to construct this? I guess I struggle with knowing how many ways the manager could die and whether any of them need special handling in order to guarantee cleanup of the workers? Would it be enough to have the manager and child-supervisor started under a one-for-all supervisor?

Q: How and where should the code live to terminate the whole structure, eg if we want to implement a “graceful stop” function for the whole subsystem? Should the manager send a message to its parent supervisor, which handles the shutdown of everything? Should it instead notify its own children directly? (Sure I realise different apps have different needs, I’m more keen to avoid patterns which don’t work well due to races, etc)

Q: How to handle shutdown of the whole app? I think I keep hitting problems where I’m trying to restart stuff as the system is stopping them. This quite possibly is due to handling exits (misunderstanding the exit signals). I think during shutdown, things are killed in reverse order? So if I had a supervisor creating a manager process, and a children supervisor, then my children will get killed by the erlang runtime first, but if I’m monitoring them, then how not to restart them?

2 ) Resource cleanup tends to leave me wanting to handle exit signals, which then leads to needing to understand the semantics of those to a very high degree…

Q: If you needed to create a one way link, how would you go about it? eg I have some genserver, which needs another dynamically started genserver that effectively wraps some OS resource, eg I want to monitor an LTE modem, so I request one of the 7 QMI handles that the OS can allocate, it then uses this handle to do some monitoring and send the answer to the parent. If this process dies it needs to handle its exit and release the handle. If the parent dies I need to stop the monitoring and release the handle. If the child process dies I would want to restart the process in the parent (as the parent and child need to have knowledge of each other to send messages, etc).

This feels like a need for a one way link? I’m not sure how to model this without just trapping exits, which as your article shows is problematic and easy to get wrong… I did ponder if I couldn’t model this with a dynamic supervisor starting my resource genserver, “monitoring” this from the parent, then set up a separate process a) “linked” to the child and b) monitoring the parent server process… However, this feels ugly and racy to start up.

I can construct similar problems of how to handle a scarce resource which is important to deallocate, and given that anything can be arbitrarily killed without running the terminate function, there isn’t a lot of guidance on how to wrap resources to ensure that they are cleaned up…

I do like Sasa’s “parent” library, it tackles some of these concerns directly. However, your article series is very helpful as it is a bit of an authoritative document on how exit signals function. (caveat that I believe there is this one difference in behaviour for :EXIT messages when the sender pid is given as the parent process, which causes the message to be converted to a “kill”, ie child can’t trap it? The explanation given was this is a beam behaviour, so it happens out of sight of your app)

paulanthonywilson · June 15, 2021, 8:57am

Hi,

So this is quite a large architectural set of concerns, and I don’t think my attempt at an example-based reference for the specifics of killing (or otherwise) really helps much. I meant it as a reference for something that is hard to remember because I don’t have to deal with it much. But anyway, I’ll have a shot and maybe someone cleverer might chip in.

Yes, I’d be inclined to always start things under their own supervisor; that’s what they’re there for.

> Would it be enough to have the manager and child-supervisor started under a one-for-all supervisor?

Or :rest_for_one depending on your inclination, but I don’t see why not.
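A minimal sketch of that layout, assuming hypothetical module names (`Subsystem.Supervisor`, `Subsystem.Manager`, `Subsystem.WorkerSupervisor`) and an `Agent` standing in for a real worker. With `:one_for_all`, a crash of the manager should also take down the worker supervisor and therefore every worker under it:

```elixir
defmodule Subsystem.Manager do
  use GenServer

  # Hypothetical manager; it holds no state in this sketch.
  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil), do: {:ok, nil}
end

defmodule Subsystem.Supervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # Workers live under their own supervisor, so a worker crash
      # does not reach the manager.
      {DynamicSupervisor, name: Subsystem.WorkerSupervisor, strategy: :one_for_one},
      Subsystem.Manager
    ]

    # If either child dies, both are stopped and restarted together.
    Supervisor.init(children, strategy: :one_for_all)
  end
end

defmodule Subsystem.Demo do
  # Kills the manager and reports whether a worker went down with it.
  def run do
    {:ok, _sup} = Subsystem.Supervisor.start_link([])

    {:ok, worker} =
      DynamicSupervisor.start_child(Subsystem.WorkerSupervisor, {Agent, fn -> :ok end})

    ref = Process.monitor(worker)
    Process.exit(Process.whereis(Subsystem.Manager), :kill)

    receive do
      {:DOWN, ^ref, :process, ^worker, _reason} -> :worker_down
    after
      2_000 -> :worker_still_running
    end
  end
end
```

Swapping the strategy to `:rest_for_one` with the same child order would restart only the manager when the worker supervisor dies, but would leave the workers running when the manager dies, so the child order matters for that variant.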

> How and where should the code live to terminate the whole structure, eg if we want to implement a “graceful stop” function for the whole subsystem?

I wonder if you would want to be thinking about grouping the “subsystem” in an Application which I think would have the stop behaviour that you are looking for.

> This feel like a need for a one way link? I’m not sure how to model this without just trapping exits, which as your article shows is problematic and easy to get wrong… I did ponder if I couldn’t model this with a dynamic supervisor starting my resource genserver, “monitoring” this from the parent, then setup a separate process a) “linked” to the child and b) monitoring the parent server process… However, this feels ugly and racey to start up.

It sounds like you’re thinking of linking and trapping exits to “monitor” when a process goes down, to manage resources. I would be strongly inclined not to do that and use `Process.monitor/1` instead.
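As a rough illustration of the monitor-based approach, here is a sketch in which a resource-holding process watches its owner and cleans up when the owner goes down. `HandleKeeper`, `report_to`, and the `{:released, reason}` message are hypothetical names for the demo; the `send/2` call stands in for releasing a real resource such as the QMI handle.

```elixir
defmodule HandleKeeper do
  use GenServer

  # Started unlinked on purpose: only the monitor ties the keeper to its owner.
  def start(owner, report_to), do: GenServer.start(__MODULE__, {owner, report_to})

  @impl true
  def init({owner, report_to}) do
    ref = Process.monitor(owner)
    {:ok, %{owner_ref: ref, report_to: report_to}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, %{owner_ref: ref} = state) do
    # Stand-in for releasing the real resource.
    send(state.report_to, {:released, reason})
    {:stop, :normal, state}
  end

  def handle_info(_other, state), do: {:noreply, state}
end

defmodule MonitorDemo do
  # Kills the owner and reports what the keeper saw and how the keeper exited.
  def run do
    owner = spawn(fn -> Process.sleep(:infinity) end)
    {:ok, keeper} = HandleKeeper.start(owner, self())
    keeper_ref = Process.monitor(keeper)

    Process.exit(owner, :kill)

    released =
      receive do
        {:released, reason} -> reason
      after
        1_000 -> :timeout
      end

    keeper_exit =
      receive do
        {:DOWN, ^keeper_ref, :process, ^keeper, reason} -> reason
      after
        1_000 -> :timeout
      end

    {released, keeper_exit}
  end
end
```

The owner can monitor the keeper in the same way and restart it on `:DOWN`, which gives the one-directional behaviour asked about without trapping exits. This does not help if the keeper itself is killed with `:kill`, since no callback runs in that case.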

Have you looked at Learn You Some Erlang? I think the chapter on building a process pool along with the next chapter is very relevant to your questions.

Alternatively you could outsource much of this to something like Poolboy.

> caveat that I believe there is this one difference in behaviour for :EXIT messages when the sender pid is given as the parent process, which causes the message to be converted to a “kill”

I really don’t think that’s a thing; child and parent are not a concept at this level. It is true that :kill is specifically untrappable (when not sent through a link).
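A small sketch of the untrappable `:kill` case, using a plain spawned process rather than a GenServer. `KillDemo` and the `{:trapped, reason}` message are hypothetical names for the demo:

```elixir
defmodule KillDemo do
  # Sends :shutdown and then :kill to a process that traps exits.
  def run do
    parent = self()

    pid =
      spawn(fn ->
        Process.flag(:trap_exit, true)
        send(parent, :ready)
        loop(parent)
      end)

    # Wait until the trap_exit flag is set before sending any signals.
    receive do
      :ready -> :ok
    end

    ref = Process.monitor(pid)

    # A trapped exit signal arrives as an ordinary message.
    Process.exit(pid, :shutdown)

    trapped =
      receive do
        {:trapped, reason} -> reason
      after
        1_000 -> :timeout
      end

    # :kill sent with Process.exit/2 cannot be trapped.
    Process.exit(pid, :kill)

    down =
      receive do
        {:DOWN, ^ref, :process, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    {trapped, down}
  end

  defp loop(parent) do
    receive do
      {:EXIT, _from, reason} ->
        send(parent, {:trapped, reason})
        loop(parent)
    end
  end
end
```

The monitor reports the reason as `:killed` rather than `:kill`, which is the reason linked processes see as well.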

the_wildgoose · June 15, 2021, 11:36am

Aha, found a different link, but here is one thing that you don’t yet mention (would be cool to have an absolutely authoritative article!)

I’ve not created a test to prove it, but the claim is if you send an :EXIT message, where the “from” PID is that of the parent, then this message is intercepted by the genserver internals and I think this implies that terminate isn’t called in your genserver?

Whilst I realise this may be unusual, I think it might happen in 2 main cases:

1) In testing, you start something to test it, when you stop it, it will get stopped in a different way to that which you expect

2) I see regularly people doing something like Blah.start_link(…) inside their genservers (I know I’m doing this, perhaps unwisely), and near as I can tell this is fine since everything is linked, but it has an implication for how you would stop this linked server later (if you needed to)

Please don’t misunderstand. I was just trying to highlight something I don’t yet understand fully, that perhaps you could include in your blog series. Many thanks for writing it, very helpful!

paulanthonywilson · June 16, 2021, 9:11am

Oh, nice. Thanks. So the OTP contract specifies special behaviour for when the parent dies and, of course, GenServers follow that contract. As I was specifically saying this was about OTP processes, then it is relevant. I’ll mention that somewhere.
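A sketch of how that contract can be observed, assuming hypothetical module names (`TrappingServer`, `ParentExitDemo`) and a `report_to` pid used only to make the callbacks visible. The process that calls `start_link` counts as the parent. An exit signal from any other process reaches `handle_info/2` as an `{:EXIT, from, reason}` message, while the same signal from the parent stops the server, with `terminate/2` being called:

```elixir
defmodule TrappingServer do
  use GenServer

  def start_link(report_to), do: GenServer.start_link(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    Process.flag(:trap_exit, true)
    {:ok, report_to}
  end

  @impl true
  def handle_info({:EXIT, from, reason}, report_to) do
    send(report_to, {:handle_info_exit, from, reason})
    {:noreply, report_to}
  end

  @impl true
  def terminate(reason, report_to) do
    send(report_to, {:terminate_called, reason})
  end
end

defmodule ParentExitDemo do
  def run do
    # The caller traps exits so that the linked server's exit does not
    # take the caller down with it.
    Process.flag(:trap_exit, true)
    {:ok, pid} = TrappingServer.start_link(self())

    # Exit signal from an unrelated process: delivered to handle_info/2.
    stranger = spawn(fn -> Process.exit(pid, :shutdown) end)

    from_stranger =
      receive do
        {:handle_info_exit, ^stranger, :shutdown} -> :handled_as_message
      after
        1_000 -> :timeout
      end

    still_alive = Process.alive?(pid)

    # Exit signal from the parent: the server stops instead.
    Process.exit(pid, :shutdown)

    from_parent =
      receive do
        {:terminate_called, :shutdown} -> :terminate_called
        {:handle_info_exit, _from, _reason} -> :handled_as_message
      after
        1_000 -> :timeout
      end

    server_exit =
      receive do
        {:EXIT, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    {from_stranger, still_alive, from_parent, server_exit}
  end
end
```

Note that the parent's signal is not converted into a kill: the server still gets to run `terminate/2`, it just never sees the signal in `handle_info/2`.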

the_wildgoose · June 20, 2021, 8:46pm

That’s super! OTP has so many subtle but important contracts and whilst it’s true they are documented, that documentation is subtle and spread over a wide area.

I think your post is the most encyclopedic that we currently have! Thanks!

ityonemo · June 20, 2021, 9:25pm

this is where the special “exit from parent” is implemented: otp/gen_server.erl at master · erlang/otp · GitHub

paulanthonywilson · June 28, 2021, 8:47pm

I got round to writing up the parent / child exit signal behaviour. Death, Children, and OTP | The log of Paul Wilson

I’ll link up the blog posts tomorrow. Thanks for pointing out the behaviour change.