Helping the workers of a supervisor get to know one another

easco · March 16, 2017, 3:31pm

I’ve been thinking about the problem of letting the workers of a supervisor know one another’s pids, especially in the face of process restarts.

The particular problem I’m working on now I have a Supervisor with two workers. The first worker spins up to own an ets table - let’s call him the TableOwner. The second worker starts up and needs to work with the ets table - we’ll call him the ActualWorker.

(The ets table is owned by a separate process so that if the ActualWorker dies and is restarted he can grab the ets table again and pick up where his previous incarnation left off - think journaling)

These three processes, the Supervisor, the TableOwner, and the ActualWorker form a unit. There could be several of these unit started up at any given time. They care about each other, but they don’t really need to be “globally” available.

What I struggle with is how does the ActualWorker get access to the pid for the TableOwner when it starts (or restarts)? Registering the TableOwner, say with a unique ID that is shared with the ActualWorker, in the process registry seems heavy handed since nobody else cares about getting the TableOwner’s pid. Is that just my own hang-up though; is that the right approach nevertheless?

Or is there another pattern that is often used to pass the pid of workers of the same Supervisor between themselves during setups and restarts?

Kabie · March 16, 2017, 3:43pm

Why not just let the Supervisor own the table?

easco · March 16, 2017, 4:07pm

Why not just let the Supervisor own the table?

A separation of concerns. The supervisor’s job is not to own a table, the supervisor’s job is to make sure that its workers and supervsior’s are restarted when they die.

easco · March 16, 2017, 4:29pm

From further investigation it appears that one approach is to use a transient process at the end of the supervisor whose job it is to introduce the children to one another. I imagine I could use that technique along with a “rest for one” restart strategy to ensure that the process running introductions gets re-run each time.

Here is the solution I came up with based on that idea:

  def init(_) do
    import Supervisor.Spec

    me = self()
    setup_supervior = fn -> setup(me) end

    children = [
      worker(MetricsCollector.IssuesJournal, [], id: :issues_journal),
      worker(MetricsCollector.RepositoryDB, [%{}], id: :repositorydb),
      worker(Task, [setup_supervior], restart: :transient)
    ]

    supervise(children, strategy: :rest_for_one)
  end

  defp setup(setup_supervisor) do
    issues_journal = find_child(setup_supervisor, :issues_journal)
    repositorydb = find_child(setup_supervisor, :repositorydb)

    MetricsCollector.IssuesJournal.give_journal_ownership(issues_journal, repositorydb)
  end

In this solution, I capture the supervisor (in the variable me) and construct a function that calls setup passing that supervisor. In my worker specs, I give each worker that I want to get ahold of an id:. In the setup function I can look up the workers by their id and then pass messages to them to complete the setup of the whole supervisory unit.

The last child in the supervisor’s worker set is a task that calls setup and that is marked :transient so the supervisor won’t automatically restart it when it ends normally. The restart strategy is set to :rest_for_one so that if the :repositorydb process dies the supervisor will restart it and the task that links it to its journal.

Testing shows that this behaves the way I want it to. Constructive criticism is, of course, welcome. But for the time being I’m going to roll with this.

dmarko484 · March 16, 2017, 7:08pm

Isn’t Registry what you actually mean ?
https://hexdocs.pm/elixir/Registry.html

OvermindDL1 · March 16, 2017, 9:12pm

I just have the children ask the supervisor for the PID of the sibling they want via a timeout message on init… >.>

easco · March 17, 2017, 3:45am

Isn’t Registry what you actually mean ?
Registry — Elixir v1.16.0

No, as I alluded to my post, Registry is not what I meant.

In this case there could be several of these “supervisory units” started. I could try to come up with some sort of unique id scheme that would allow me to tag each facet of the unit and uniquely identify them in the node’s registry, but that’s a bit heavy handed. There is no need for other processes on the node to be able to find these they just need to be identified to one another, and only when they are first set up or when one of them restarts.

easco · March 17, 2017, 3:50am

I just have the children ask the supervisor for the PID of the sibling they want via a timeout message on init… >.>

How do you discover, or make, the identity of the supervisor known to the workers. Are you simply passing the supervisor as a parameter?

OvermindDL1 · March 17, 2017, 2:14pm

You know how you pass a child-spec to the supervisor, and that child-spec will call something like start_link, that start_link is called in the supervisor PID as I recall, so you can grab self() there and pass it to your module, or you can pass in self() from the supervisor init to the child-spec’s, etc… a few more ways too.

sasajuric · March 17, 2017, 2:40pm

In this particular case, I usually create the table in the supervisor’s init and pass it to the worker. I saw that you already dismissed that option because of separation of concerns, but I think that this argument is mostly academic in this case. Making the supervisor own the table means you only need one worker process, as opposed to three you have in your current proposal. In addition, the coordination you propose can lead to one subtle edge case: if the process is registered under an alias, requests might hit the Repository worker before it obtained the ETS table. This can be tackled explicitly, but you need to be aware of the issue. So IMHO, supervisor-as-the-ets-owner is a much simpler solution for this problem.

In some cases, obtaining a pid of the sibling will be the simplest approach. If taking that path, I’d do what @OvermindDL1 suggested, except I’d use a self-sent after_init message instead of a timeout. The edge case I mentioned can still happen, so you need to pay attention to that.

OvermindDL1 · March 17, 2017, 2:46pm

Well that is what I mean by a timeout, I don’t mean using adding a timeout to the init return, but literally just sending a message to yourself via send_after or something. ^.^

sasajuric · March 17, 2017, 2:57pm

I just do send(self(), :after_init)

OvermindDL1 · March 17, 2017, 2:59pm

Lol, I have that exact line of code in probably 20 modules in this project. ^.^

easco · March 21, 2017, 8:28pm

Thank you all. I will reconsider

One interesting thought that comes to mind is if the supervisor owns the table then the table will have to be publicly writable. There doesn’t appear to be a way for a supervisor to handle the “info” messages when an ets table with private permissions is given away.

But think that takes us too far off-topic from the original intent of this thread.