Wanted: Your OTP architecture diagram

I’ve been reading Elixir in Action to get a better grasp of OTP. Part of the book works through building up an OTP app, and shows a diagram, which I’ve recreated here in ASCII using Monodraw:

                ┌────────────────────┐
                │  Todo.Supervisor   │
                │   (rest_for_one)   │
                └────────────────────┘
                           │
          ┌────────────────┴───────────────┐
┌─────────▼───────────┐       ┌────────────▼─────────────┐
│Todo.ProcessRegistry │       │  Todo.SystemSupervisor   │
└─────────────────────┘       │      (one_for_one)       │
                              └──────────────────────────┘
                                            │
                ┌───────────────────────────┼──────────────────────┐
    ┌───────────▼──────────┐  ┌─────────────▼────────────┐  ┌──────▼──────┐
    │ Todo.PoolSupervisor  │  │  Todo.ServerSupervisor   │  │ Todo.Cache  │
    │    (one_for_one)     │  │   (DynamicSupervisor)    │  │             │
    └──────────────────────┘  │                          │  └─────────────┘
                │             └──────────────────────────┘
                │                           │
                │                           │
     ┌──────────▼─────────┐         ┌───────▼───────┐
     │Todo.DatabaseWorker ├─┐       │  Todo.Server  ├┐
     └─┬──────────────────┘ ├┐      └┬──────────────┘├┐
       └┬───────────────────┘│       └┬──────────────┘│
        └────────────────────┘        └───────────────┘

This diagram and the description of the app were very illuminating for me. @sasajuric explains why we spawn a fixed number of Todo.DatabaseWorker processes but a dynamic number of Todo.Server processes, why we use one_for_one on Todo.SystemSupervisor but rest_for_one on Todo.Supervisor, etc. (I would explain here if @sasajuric doesn’t mind, but I don’t think it would be polite otherwise. Buy his book! :smile: )
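For readers who prefer code to boxes, here is a minimal sketch of the shape of the tree in the diagram above. It is not the book's code: the `Sketch.*` module names, the `Sketch.Stub` stand-in worker, the pool size of 3, and the use of a plain `Registry` in place of Todo.ProcessRegistry are all assumptions made for illustration.

```elixir
defmodule Sketch.Stub do
  # Hypothetical stand-in for a real worker: an Agent with no useful state.
  def child_spec(id) do
    %{id: id, start: {Agent, :start_link, [fn -> nil end]}}
  end
end

defmodule Sketch.PoolSupervisor do
  use Supervisor

  # Assumed pool size; the diagram only shows "several" database workers.
  @pool_size 3

  def start_link(_arg), do: Supervisor.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_arg) do
    # A fixed number of workers, all known at boot time.
    children = for n <- 1..@pool_size, do: {Sketch.Stub, {:database_worker, n}}
    Supervisor.init(children, strategy: :one_for_one)
  end
end

defmodule Sketch.SystemSupervisor do
  use Supervisor

  def start_link(_arg), do: Supervisor.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      Sketch.PoolSupervisor,
      # Children of this supervisor are started on demand, not listed here.
      {DynamicSupervisor, name: Sketch.ServerSupervisor, strategy: :one_for_one},
      {Sketch.Stub, :cache}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

defmodule Sketch.Supervisor do
  use Supervisor

  def start_link(arg \\ nil), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      {Registry, keys: :unique, name: Sketch.ProcessRegistry},
      Sketch.SystemSupervisor
    ]

    # rest_for_one: if a child terminates, the children started after it
    # are restarted as well.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end

{:ok, _sup} = Sketch.Supervisor.start_link()

# A server is added at runtime under the dynamic supervisor.
{:ok, _server} =
  DynamicSupervisor.start_child(Sketch.ServerSupervisor, {Sketch.Stub, :todo_server})

IO.inspect(Supervisor.count_children(Sketch.PoolSupervisor))
```

The only point of the sketch is the nesting and the `strategy:` option at each level; the reasoning behind those choices is in the book.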

My request to you all: If you have a production OTP app and don’t mind, please share a diagram like this one of your application (or some interesting part of it) and explain the architecture choices in general terms. Maybe start with :observer.start(), then make an ASCII graph (e.g. with Monodraw or ASCIIFlow) and edit for clarity and/or discretion. I’m looking for things like:

  • Which processes talk to each other and how do they find each other?
  • Why did you choose multiple processes of type X but only one of type Y?
  • Why did you choose to put A and B under different supervisors?
  • Why did you choose the restart strategies you did?

I think this kind of high-level discussion would be very educational, without getting into all the details of the code.

Please share!

Thanks for creating this topic! I think it’s a discussion that will be very enlightening for all involved.

I know that it is not quite what you want, but I really like ElixirConf 2016 - Selling Food With Elixir by Chris Bell, which has a great description of how they set up their OTP architecture (and that link goes to that portion of the talk). I especially find it cool how their design mirrors their business domain.

Crickets, eh? :laughing: Alright, I’ll post one. Here’s part of a supervision tree from an application I’ve been looking at, built by @alexgaribay.

         ┌────────────────────────┐
         │ Streams RootSupervisor │
         │     (one_for_all)      │
         └────────────────────────┘
                      │
              ┌───────┴─────────────────────┐
              ▼                             ▼
  ┌──────────────────────┐      ┌───────────────────────┐
  │ Stream Info Fetcher  │      │  Streams Supervisor   │
  │                      │      │(dynamic, one_for_one) │
  └──────────────────────┘      └───────────────────────┘
                                            │
                     ┌──────────────────────┴───────────────────────┐
                     ▼                                              ▼
           ┌──────────────────┐                           ┌──────────────────┐
           │Stream Supervisor │                           │Stream Supervisor │
           │  (one_for_all)   │                           │  (one_for_all)   │
           └──────────────────┘                           └──────────────────┘
                     │                                              │
          ┌──────────┴──────────┐                        ┌──────────┴──────────┐
          │                     │                        │                     │
          ▼  GenStage Pipeline  ▼                        ▼  GenStage Pipeline  ▼
┌──────────────────┐  ┌──────────────────┐     ┌──────────────────┐  ┌──────────────────┐
│  Stream Fetcher  │  │ Stream Processor │     │  Stream Fetcher  │  │ Stream Processor │
└──────────────────┘  └──────────────────┘     └──────────────────┘  └──────────────────┘

There are multiple incoming streams of data that need to be processed. We don’t know exactly what those streams are when the application boots.

Streams RootSupervisor starts one Stream Info Fetcher, which makes a network request and gets back a list of the streams. Stream Info Fetcher tells Streams Supervisor to start one Stream Supervisor for each stream we need to process. This is why Streams Supervisor is “dynamic” - we start its children as we discover that we need them.

Each Stream Supervisor is in charge of making sure one stream of data is processed. To that end, it starts a Stream Fetcher to pull down data for that stream and a Stream Processor to process it, with the two connected in a GenStage pipeline.
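A rough sketch of that "one supervisor per discovered stream" step, under some loud assumptions: the `StreamsSketch.*` names are made up, the stream list is hard-coded instead of fetched over the network, and plain Agents stand in for the GenStage producer and consumer (GenStage is a separate package, so it is left out to keep this runnable as a script).

```elixir
defmodule StreamsSketch.StreamSupervisor do
  use Supervisor

  def start_link(stream_id), do: Supervisor.start_link(__MODULE__, stream_id)

  @impl true
  def init(stream_id) do
    children = [
      # Stand-ins for Stream Fetcher and Stream Processor.
      %{id: :fetcher, start: {Agent, :start_link, [fn -> {:fetcher, stream_id} end]}},
      %{id: :processor, start: {Agent, :start_link, [fn -> {:processor, stream_id} end]}}
    ]

    # one_for_all: if either child crashes, both are restarted.
    Supervisor.init(children, strategy: :one_for_all)
  end
end

# The "dynamic" supervisor starts with no children at all.
{:ok, _pid} =
  DynamicSupervisor.start_link(name: StreamsSketch.StreamsSupervisor, strategy: :one_for_one)

# Hypothetical result of the info fetch; the real list comes from a network request.
discovered_streams = ["stream-a", "stream-b"]

for stream_id <- discovered_streams do
  {:ok, _sup} =
    DynamicSupervisor.start_child(
      StreamsSketch.StreamsSupervisor,
      {StreamsSketch.StreamSupervisor, stream_id}
    )
end

IO.inspect(DynamicSupervisor.count_children(StreamsSketch.StreamsSupervisor))
```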

Those are all the parts. Regarding supervision strategies:

  • At the top, Streams RootSupervisor uses one_for_all because its two children depend on one another; if one crashes, they both need restarting.
  • Streams Supervisor uses one_for_one because each stream is processed independently; if one crashes, the others should carry on.
  • Stream Supervisor uses one_for_all because a Stream Fetcher and its Stream Processor are useless without each other; if one crashes, they both need restarting.

Note that once Stream Info Fetcher has done its work at boot time, it’s no longer needed and terminates. This doesn’t cause Streams RootSupervisor to restart all its children unless Stream Info Fetcher actually crashes: because Stream Info Fetcher is :transient, it’s expected to terminate normally and is not restarted when it does.
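The :transient behaviour can be checked in isolation. The sketch below is not the application's code; `TransientSketch.BootTask` and `TransientSketch.Sibling` are hypothetical names, and the boot task does nothing except return. It puts a transient task next to a permanent sibling under a one_for_all supervisor and compares the sibling's pid before and after the task has exited normally.

```elixir
defmodule TransientSketch.BootTask do
  # :transient means "restart only on abnormal exit".
  use Task, restart: :transient

  def start_link(_arg), do: Task.start_link(fn -> :ok end)
end

children = [
  %{
    id: :sibling,
    start: {Agent, :start_link, [fn -> :state end, [name: TransientSketch.Sibling]]}
  },
  TransientSketch.BootTask
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_all)

pid_before = Process.whereis(TransientSketch.Sibling)
# Give the task time to finish and the supervisor time to notice.
Process.sleep(100)
pid_after = Process.whereis(TransientSketch.Sibling)

# The sibling keeps its pid: a normal exit of a transient child
# does not trigger the one_for_all restart.
IO.inspect(pid_before == pid_after)
```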

Hard to make it manually, especially with large sets. :wink:

What we need is a plugin that inspects the running system and generates a supervision graph with indicators showing children that pop in and out of existence and so forth. ^.^
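As a starting point for such a tool, here is a small sketch that walks a running tree with `Supervisor.which_children/1` and returns indented text lines. `TreeSketch` and the demo tree are made up for illustration. `which_children/1` does not report each supervisor's restart strategy, so this sketch leaves strategies out, and it makes no attempt to track children that come and go.

```elixir
defmodule TreeSketch do
  # Returns one indented line per process, depth-first from `supervisor`.
  def lines(supervisor, label, depth \\ 0) do
    child_lines =
      supervisor
      |> Supervisor.which_children()
      |> Enum.flat_map(fn
        {id, pid, :supervisor, _modules} when is_pid(pid) ->
          lines(pid, inspect(id) <> " (supervisor)", depth + 1)

        {id, pid, _type, _modules} when is_pid(pid) ->
          [indent(depth + 1) <> inspect(id)]

        {id, status, _type, _modules} ->
          # status is :undefined or :restarting when the child is not running
          [indent(depth + 1) <> inspect(id) <> " (" <> inspect(status) <> ")"]
      end)

    [indent(depth) <> label | child_lines]
  end

  defp indent(depth), do: String.duplicate("  ", depth)
end

# A tiny demo tree: one worker and one nested supervisor with its own worker.
worker = fn id -> %{id: id, start: {Agent, :start_link, [fn -> nil end]}} end

inner = %{
  id: :inner_sup,
  type: :supervisor,
  start: {Supervisor, :start_link, [[worker.(:worker_b)], [strategy: :one_for_one]]}
}

{:ok, root} = Supervisor.start_link([worker.(:worker_a), inner], strategy: :one_for_one)

root |> TreeSketch.lines("root") |> Enum.each(&IO.puts/1)
```

Children of a DynamicSupervisor show up with the id `:undefined`, so a real tool would probably want to label them by pid or registered name instead.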

No need to post the whole thing! Pick a part of the tree with something interesting happening and just explain that.

That would be great! :thumbsup:

Here is one of my OTP engines.

I needed to generate a token that is valid for a given period of time. The worker just calls stop after this period, and the GenServer traps the worker’s EXIT.

I often use a main supervisor coupled with a GenServer and a worker supervisor, spawning workers on demand. I could have added a stash and a cache to this example.

This replaces the old approach of using simple_one_for_one supervisors and removes the need for gproc.

It can go fractal, as the workers can be a combination of the same pattern (One main Sup, one GenServer, one worker dynamic supervisor and many workers).

It was a nice way to play with Registry and DynamicSupervisor. It is also useful to see linked processes in the observer.
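A minimal sketch of the pattern described above, not the poster's actual code. The `TokenSketch.*` names, the one_for_one strategy on the main supervisor, the `valid?/1` helper, and the token format are all assumptions; in particular the token here is just a unique integer and is not suitable as a real secret.

```elixir
defmodule TokenSketch.Worker do
  # One process per token; it is never restarted once it stops.
  use GenServer, restart: :temporary

  def start_link({token, ttl_ms}) do
    name = {:via, Registry, {TokenSketch.Registry, token}}
    GenServer.start_link(__MODULE__, ttl_ms, name: name)
  end

  @impl true
  def init(ttl_ms) do
    Process.send_after(self(), :expire, ttl_ms)
    {:ok, nil}
  end

  @impl true
  def handle_info(:expire, state), do: {:stop, :normal, state}
end

defmodule TokenSketch.Manager do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def issue(ttl_ms), do: GenServer.call(__MODULE__, {:issue, ttl_ms})

  # A token is valid while its worker is still registered and alive.
  def valid?(token), do: Registry.lookup(TokenSketch.Registry, token) != []

  @impl true
  def init(_arg) do
    # Turn exit signals from linked workers into {:EXIT, pid, reason} messages.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def handle_call({:issue, ttl_ms}, _from, tokens) do
    # Placeholder token, NOT cryptographically secure.
    token = Integer.to_string(:erlang.unique_integer([:positive]))

    {:ok, pid} =
      DynamicSupervisor.start_child(
        TokenSketch.WorkerSupervisor,
        {TokenSketch.Worker, {token, ttl_ms}}
      )

    Process.link(pid)
    {:reply, token, Map.put(tokens, pid, token)}
  end

  @impl true
  def handle_info({:EXIT, pid, _reason}, tokens) do
    # The worker expired (or crashed); forget its token.
    {:noreply, Map.delete(tokens, pid)}
  end
end

children = [
  {Registry, keys: :unique, name: TokenSketch.Registry},
  {DynamicSupervisor, name: TokenSketch.WorkerSupervisor, strategy: :one_for_one},
  TokenSketch.Manager
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

token = TokenSketch.Manager.issue(50)
IO.inspect(TokenSketch.Manager.valid?(token))
Process.sleep(200)
IO.inspect(TokenSketch.Manager.valid?(token))
```

The Registry lookup by token is what makes an external process registry such as gproc unnecessary here, and the DynamicSupervisor takes the place of a simple_one_for_one supervisor.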

I still think it’d be cool to have a task that generated these ascii/dot/pdf diagrams for you, especially in the spirit of always-up-to-date project documentation. Observer is super for understanding the runtime behaviour of running systems, but screenshots of it are a poor way to document, share, and understand the compile-time design of an application’s architecture.

I come back to this concept now and again but still haven’t gotten around to trying to implement it, partially because observer seems to suffice for most people so I’m not sure it’s worth the effort. I really want it so I’m sure I’ll get around to it one of these days.

Great idea for a thread! Yes, Elixir in Action rocks!

I wrote this a couple of years ago. Here’s my diagram with fun colors!

Parallel hangman game play works by spinning up pairs of player workers and game servers (which talk to each other). The rest of the process tree is mostly administrative support. There are more diagrams below on GitHub… I didn’t get around to diagramming the GenStage dictionary ingestion pipelines.
