Supervision tree conflict in an umbrella app

Background

I have an umbrella app that has many smaller apps inside. One of these apps, called A, needs to be able to spin up and supervise another app, called B.

B, being an app in its own right, exposes a public API and has a GenServer, responsible for receiving requests that it then redirects to the logic modules and such.

Issue

So, I have two requirements:

  1. I must be able to launch B independently and have it work as a normal standalone app.
  2. A must be able to have B in its children and restart/manage it, should such a need arise.

The problem is that with my current code I can achieve either 1 or 2, but not both.

Code

So, the following is the important code for app B:

application.ex

defmodule B.Application do
  @moduledoc false

  use Application

  alias B.Server
  alias Plug.Cowboy

  @test_port 8082

  @spec start(any, nil | maybe_improper_list | map) :: {:error, any} | {:ok, pid}
  def start(_type, _args) do
    # B.Server is a module containing GenServer logic and callbacks
    children = [Server]

    opts = [strategy: :one_for_one, name: B.Supervisor]
    Supervisor.start_link(children, opts)
  end

end

server.ex (simplified)

defmodule B.Server do
  use GenServer

  alias B.HTTPClient

  #############
  # Callbacks #
  #############

  @spec start_link(any) :: :ignore | {:error, any} | {:ok, pid}
  def start_link(_args), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl GenServer
  @spec init(nil) :: {:ok, %{}}
  def init(nil), do: {:ok, %{}}

  @impl GenServer
  def handle_call({:place_order, order}, _from, _state), do:
    {:reply, HTTPClient.place_order(order), %{}}

  @impl GenServer
  def handle_call({:delete_order, order_id}, _from, _state), do:
    {:reply, HTTPClient.delete_order(order_id), %{}}

  @impl GenServer
  def handle_call({:get_all_orders, item_name}, _from, _state), do:
    {:reply, HTTPClient.get_all_orders(item_name), %{}}

  ##############
  # Public API #
  ##############

  def get_all_orders(item_name), do:
    GenServer.call(__MODULE__, {:get_all_orders, item_name})

  def place_order(order), do:
    GenServer.call(__MODULE__, {:place_order, order})

  def delete_order(order_id), do:
    GenServer.call(__MODULE__, {:delete_order, order_id})

end

And here is the entrypoint of B

b.ex

defmodule B do
  @moduledoc """
  Port for http client.
  """

  alias B.Server

  defdelegate place_order(order), to: Server

  defdelegate delete_order(order_id), to: Server

  defdelegate get_all_orders(item_name), to: Server

  @doc false
  defdelegate child_spec(args), to: Server
end

b.ex is basically a facade for the Server, with some extra context information such as specs, type definitions, etc (omitted here for the sake of brevity).

How does A manage the lifecycle?

It is my understanding that supervision trees are specified in the application.ex file of apps. So, from my understanding, I have created this application file for A:

defmodule A.Application do
  @moduledoc false

  use Application

  alias B

  def start(_type, _args) do
    children = [B]

    opts = [strategy: :one_for_one, name: A.Supervisor]
    Supervisor.start_link(children, opts)
  end

end

Which should work, except it doesn’t.

When inside A's folder, if I run iex -S mix, instead of having a nice launch I get the following error:

** (Mix) Could not start application a: A.Application.start(:normal, []) returned an error: shutdown: failed to start child: B.Server
    ** (EXIT) already started: #PID<0.329.0>

My current understanding of the issue is that A's application.ex file is conflicting with B's application file.

Questions

  1. How do I fix this conflict?

Passing B as a child spec means “Call B.child_spec/1 and use that”; that’s delegated to B.Server.child_spec, which returns a spec with name: B.Server.

There can only be one process on the node named B.Server, so when A.Supervisor tries to start B it fails with :already_started - because the umbrella app plumbing already starts apps listed as in_umbrella: true dependencies.
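The name clash can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical NamedServer module that registers itself under its module name the same way B.Server does; it is not the code from the question.

```elixir
defmodule NamedServer do
  # Hypothetical stand-in for B.Server: registers itself under its module name.
  use GenServer

  def start_link(_args), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl GenServer
  def init(nil), do: {:ok, %{}}
end

# First start succeeds, as when the :b application boots on its own.
{:ok, first} = NamedServer.start_link([])

# Second start under the same name fails, as when A.Supervisor tries to start B.
second_result = NamedServer.start_link([])
IO.inspect(second_result)
```

The second call does not raise; it returns an error tuple carrying the pid of the process that already owns the name, which is what the supervisor reports as "already started".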

A must be able to have B in its children and restart/manage it, should such a need arise.

Regarding your original question, AFAIK there’s nothing stopping a process in A from monitoring etc a process in B.
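As a rough illustration of monitoring across applications, the sketch below uses an Agent as a stand-in for a process owned by :b (an assumption for the sake of a self-contained example); the calling process plays the role of a process in :a.

```elixir
# An Agent stands in for a process owned by :b, such as B.Server.
{:ok, watched} = Agent.start(fn -> :ok end)

# Process.monitor/1 accepts a pid or a registered name.
ref = Process.monitor(watched)

# Stop the watched process; the monitoring process is notified, not killed.
Agent.stop(watched)

reason =
  receive do
    {:DOWN, ^ref, :process, ^watched, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(reason)
```

Monitoring only delivers a notification. Restarting the watched process remains the job of whichever supervisor started it.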

Applications are not processes. Both can be started and stopped (granted, through different APIs), but that’s where the similarities end.

Which applications are available or even started on a given beam instance is defined by the applications list (in elixir usually implicitly built from all applications in a mix project, their deps and extra_applications in the mix.exs).

This is almost completely separate to processes and supervision trees. Supervisors manage processes and processes only. Those processes can run code from the same application or code from other available applications.

The only place where supervision trees (processes) and applications meet is when a callback module (commonly MyApp.Application) is defined for a given application (making it stateful), which then must start a process when called and return its pid to the beam. If the process with the returned pid crashes, the application is considered not working and is therefore stopped by the beam (no retries or anything). Usually the process returned here is the root level supervisor for many other processes of an application, but it doesn’t need to be.


So, I understand the issue is happening because application A is trying to start application B (namely B.Server), which then explodes because B is an umbrella application and the umbrella app plumbing already starts it in the first place.

So a new question arises. If I don’t place B as a child of A, who is then responsible for restarting B and making sure it is restarted properly should something fail?

In which file is this decided in umbrella apps?

By this logic, if I understand correctly, the umbrella app will have both application A and application B at the same level in the supervision tree. The only thing noticeable is that A depends on B to work properly.

Am I understanding this correctly?


@LostKobrakai Although I believe I understand what you are trying to say, I must confess I fail to connect the dots in regards as to how this helps solve my particular issue. Could you elaborate on how this would manifest in the code samples I have?

Applications are never automatically restarted. Depending on the mode they’re started in, an application being stopped either means the whole beam process stops (default) or nothing at all is done.

There is no “supervision” tree at the application level. Supervision only happens for processes started within applications. Applications kinda loosely “exist” side by side.

Can you elaborate why A (or maybe better :a as applications are identified by atoms, not modules) needs to manage the lifecycle of B a.k.a. :b in the first place? Can’t :a just depend on :b (in deps()) and call into whatever :b starts?

The process I’d consider “application B” is the one started by Supervisor.start_link/2 in B.Application.start/2, named B.Supervisor. That supervisor is responsible for restarting the process named B.Server if it crashes.

IIRC if B.Supervisor crashes too many times too fast then The System Is Down, but there may be ways to alert about that.

The mix.exs file for project A declares {:b, in_umbrella: true} alongside its other dependencies.
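For reference, a mix.exs for A along those lines might look like the sketch below. The module name, version and relative paths are assumptions based on what Mix typically generates for an app inside an umbrella; only the {:b, in_umbrella: true} entry comes from this thread.

```elixir
# apps/a/mix.exs (sketch; names and version are assumptions)
defmodule A.MixProject do
  use Mix.Project

  def project do
    [
      app: :a,
      version: "0.1.0",
      # Shared umbrella locations, relative to apps/a.
      build_path: "../../_build",
      config_path: "../../config/config.exs",
      deps_path: "../../deps",
      lockfile: "../../mix.lock",
      deps: deps()
    ]
  end

  def application do
    [extra_applications: [:logger]]
  end

  defp deps do
    # Declaring :b here makes Mix start :b before :a.
    [{:b, in_umbrella: true}]
  end
end
```

With this in place, code in A can call B.place_order/1, B.delete_order/1 and B.get_all_orders/1 directly, without listing B among A's own children.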


The top level process crashing once should be enough. It’s not a child of another supervisor at this point.

Alright, so, :a and :b are both applications that exist side by side. Since :a depends on :b working properly, :a should supervise it because if :b fails, :a won’t be around for much longer.

This is why I think :a should supervise it. At the very least, someone should supervise it, otherwise I will be missing out on Elixir’s self healing, which is one of the main reasons for using it (IMO).

  • Does this mean that if I want my application to have a supervision tree (where different apps in the umbrella supervise each other), I should not make umbrella apps? If so, what’s the use case for them?

There is explicitly no level above applications and therefore no supervision of any kind. If any application* crashes the whole beam instance exits. The only way to recover from that is using system level supervisors like e.g. systemd on linux or setting up erlang’s heart to try to restart the whole instance from the outside.

An application crashing is the very end of trying to self-heal from within the beam. The application itself stopping because the root process crashed is basically the equivalent of: Restarts didn’t help, now it’s time to stop trying.

By the above logic this dependency is irrelevant, as when :b crashes it will take down the whole beam instance including :a anyways.

You need to adjust your mental model of applications. Applications are groups of code and maybe a set of stateful processes started when starting an application. That’s it. Besides order of startup (based on dependencies between applications) applications stand in no hierarchy to each other and there is also no supervision of any kind. If any application* fails the whole beam instance fails.

Stateless applications are often called libraries or library applications, so maybe thinking of stateful applications as libraries with state might make it more obvious.

Supervision trees are a completely different thing. Here you’re dealing with processes, supervision and restarts, the possibility to self heal and so on. Resilience in your system comes from splitting up code execution into different processes, while splitting code into different applications is mostly for organization of code and/or functionality.

  * Technically there are :transient and :temporary applications as well. Those are rarely used, however.

So, if I understand correctly, since these applications (:a and :b) will be deployed on the same physical machine (and in the same BEAM instance, I assume), if application :a fails, it will also take down application :b, even though :b is working perfectly fine. In fact, any application failing will kill everything, right?

Since supervision trees are concepts aimed at how we organize processes, it makes no sense to say

"I want my umbrella app :a to supervise my umbrella app :b"

If I want supervision, :b has to become a process inside of :a, which means that by then :b won’t be an umbrella app any longer, just a process inside :a.

This all means my application architecture is incorrect.

Am I getting it right?

Seems about right. Just the last statement I can’t really assess without having deep insight into your actual project.

However you might want to keep one thing in mind: As applications can be for code organization you could still have :a and :b, where :a starts processes using modules of :b. But generally it’s easier to keep code in a single application unless there’s a concrete reason not to.
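A minimal sketch of that idea: the worker module below stands in for code that would live in :b, while the supervisor is started by whoever owns the tree (here the script itself, standing in for :a). Lib.Worker and Demo are hypothetical names, and the polling helper exists only to make the restart observable in a script.

```elixir
defmodule Lib.Worker do
  # Hypothetical module that would live in :b; it defines the process but starts nothing.
  use GenServer

  def start_link(_args), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl GenServer
  def init(nil), do: {:ok, %{}}
end

defmodule Demo do
  # Polls until the registered name points at a pid other than old_pid.
  def await_restart(name, old_pid, attempts \\ 50)
  def await_restart(_name, _old_pid, 0), do: :timeout

  def await_restart(name, old_pid, attempts) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid, attempts - 1)
    end
  end
end

# This supervisor would be part of :a's supervision tree.
{:ok, _sup} = Supervisor.start_link([Lib.Worker], strategy: :one_for_one)

old_pid = Process.whereis(Lib.Worker)
Process.exit(old_pid, :kill)

# The supervisor restarts the worker under the same name with a new pid.
new_pid = Demo.await_restart(Lib.Worker, old_pid)
IO.inspect({old_pid, new_pid})
```

For this to work in the original setup, :b must not also start the same named process from its own application callback, otherwise the name clash from the first post returns.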


Where :a is :manager and :b is :auction_house.

At the bottom there is a graph that illustrates what I am trying to achieve:

[Image: Screenshot 2021-01-05 at 16.23.09]

The main lesson I am taking from this is that umbrella apps are useful if you want to do a micro-services architecture where each app is deployed separately, but not if you want to just have a better means of organizing your code.

Since in my case, the apps will be deployed together, umbrella brings more issues than advantages.


I do have one issue however. I want to be able to have different types of interfaces. Right now I have :cli which is a Command Line Interface, but I will also have something with phoenix live. To me, it makes sense to have these separated into different apps co-existing in an umbrella, where I can define different releases…

manager and store don’t start any processes at all, and auction_house only starts an empty Supervisor except in the test environment. I’m not following what exactly could crash or need supervision…

IMO “umbrella app” is a slightly confusing name - it really should be “umbrella project” since every directory in apps is expected to be a Mix project but doesn’t have to have a mod: key in its application function in mix.exs.
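To make the mod: key concrete, here is a sketch of a mix.exs for a project that does start processes. The names AuctionHouse.MixProject and AuctionHouse.Application are assumptions chosen to match the :auction_house app mentioned in this thread.

```elixir
# apps/auction_house/mix.exs (sketch; module names and version are assumptions)
defmodule AuctionHouse.MixProject do
  use Mix.Project

  def project do
    [app: :auction_house, version: "0.1.0", deps: []]
  end

  def application do
    [
      extra_applications: [:logger],
      # With :mod, starting :auction_house calls AuctionHouse.Application.start/2.
      # Without this key the project is a plain library and starts no processes.
      mod: {AuctionHouse.Application, []}
    ]
  end
end
```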

There are good reasons to split code like this that don’t involve runtime processes / applications - for instance, at work we have a project in the umbrella that’s devoted solely to parsing a complex telecommunications message format. Making it a separate project in the umbrella keeps the code, documentation and tests for that subsystem in a well-defined area; the tests for just that project can be run separately easily, etc.

Another way you could organize your umbrella with a future LiveView interface:

  • market_manager which contains all the logic to interact with the external service and persist data
  • cli which calls functions in market_manager based on user CLI commands and formats output
  • web_interface which calls functions in market_manager based on HTTP calls and formats output

In this structure, the separate projects also help ensure dependencies point the right way: calling functions in cli from market_manager will make the compiler complain.


That is something I want to change with my next update, which is why I am asking here.

The next step in my small project is to add GenServers to :manager, :store and :auction_house for some extra functionality that will come later (ETS/DETS tables, timers, etc).

For now, imagine they have GenServers.

So, for example, an umbrella project can have libraries and stateful libraries alongside its applications. My question then is, how do you differentiate between them without entering the code?

This is exactly my goal !

But now I have a question regarding this project:

  • Is it really a good idea for it to be umbrella? Think about it, if my :cli breaks because of some bad user input, instead of being restarted, it will kill everything, market_manager included (and vice-versa). I have no self-healing at all, just a massive crash.

This is what is burning my head off. I want the organizational benefits of an umbrella project, but with the self healing and lifecycle that single apps have …

This is why the top-level process of an application is a Supervisor and not an application GenServer; barring weird hardware errors, the only reason Supervisor will exit is if its children are restarting too often - see the :max_restarts and :max_seconds options.
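As a small sketch of those options, the supervisor below sets the restart limits explicitly. The values shown are the documented defaults, and the Agent child is only a stand-in for a real worker such as B.Server.

```elixir
children = [
  # An Agent stands in for a real worker such as B.Server.
  {Agent, fn -> %{} end}
]

opts = [
  strategy: :one_for_one,
  # Documented defaults: more than 3 restarts within 5 seconds
  # makes the supervisor itself give up and exit.
  max_restarts: 3,
  max_seconds: 5
]

{:ok, sup} = Supervisor.start_link(children, opts)
IO.inspect(Supervisor.count_children(sup))
```

Raising these limits makes the supervisor more tolerant of bursts of failures; it does not help when a child can never start successfully.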

I’ve seen this happen in production, but it was because the supervised process had a bug and got a MatchError when running init - so no amount of restarting would help.


An application should in an ideal world never exit because of things known in advance to have the potential to fail. E.g. just because http requests in phoenix break for a multitude of reasons, this won’t make that application stop. The processes handling the requests crash, errors are logged, but that’s it. This is partly due to supervisors, but also how cowboy/ranch deal with request processes. It’s still your job to decide how exactly errors shall propagate within a supervision tree or so. Sometimes it even makes sense to not propagate errors at some level.

After several posts from you guys I have settled on what I believe is a good organization for the project:

  • :manager, :store and :auction_house will be stateful libraries, to use @LostKobrakai’s term. They will be used like the parsing project @al2o3cr mentioned.
  • :cli and :web_interface will be real applications that will have a :mod key in their application function in mix.exs.
  • When using mix release I will have a release for the cli app and one for the web_interface app.
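The two releases from the list above could be declared in the umbrella's root mix.exs roughly as follows. This is a sketch: the module name and version are assumptions, and it assumes the umbrella apps are named :cli and :web_interface.

```elixir
# mix.exs at the umbrella root (sketch; module name and version are assumptions)
defmodule MarketManager.Umbrella.MixProject do
  use Mix.Project

  def project do
    [
      apps_path: "apps",
      version: "0.1.0",
      start_permanent: Mix.env() == :prod,
      deps: [],
      releases: [
        # Each release lists its entry application; that app's
        # dependencies (for example :manager) are pulled in with it.
        cli: [applications: [cli: :permanent]],
        web_interface: [applications: [web_interface: :permanent]]
      ]
    ]
  end
end
```

Each release is then built by name, for example with `MIX_ENV=prod mix release cli`.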

I think this structure gives me the benefits of umbrella’s organization, while still giving me the benefits of self healing that I so much value in elixir.

Thank you everyone for your help!