Legitimate reasons to use unlinked processes in production

Pretty much as the title says: is there any reason to start a process that is not linked to any other process, particularly if that process is long-lived? I.e., when have you used GenServer.start over GenServer.start_link?

5 Likes

Shooting from the hip - when would you use UDP instead of TCP on a network? Sometimes, reliability isn't necessary. Maybe logging or telemetry is optional, but if it dies and refuses to start up, you'd rather keep on processing than kill the whole service.

3 Likes

That is still not a valid reason, because you can still put it under a supervisor and tell the supervisor not to care about failures at all.

So even if the job of a process is completely discardable, you should still start it under a supervision tree, because it gives you visibility into your application structure and gives you sane shutdown semantics (even if those semantics are "kill them all").
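For example, something along these lines (MyApp.DiscardableWorker is just a made-up module name):

  # restart: :temporary means the supervisor will never restart this child,
  # but it still owns it: the child is started and shut down with the tree.
  children = [
    Supervisor.child_spec({MyApp.DiscardableWorker, []}, restart: :temporary)
  ]

  Supervisor.start_link(children, strategy: :one_for_one)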

9 Likes

That sounds as good as unsupervised.

Except that within the OTP supervision framework it is easy to introspect and get information about. I've been of the opinion for over a decade that every single process should be supervised or linked somewhere.

5 Likes

I think it's more about keeping track of the processes and making sure they shut down when a part of the application is no longer needed, more so than anything related to reliability.

You might use UDP for performance, but I don't think not linking is going to have much benefit for performance.

3 Likes

This is IMO the most important reason for start_link. If you use plain start, there's always a chance you'll leave some dangling process behind, and that might cause various weird behaviours.

This is why a worker should sit under a supervisor even if you don't want to restart it. Supervisors are not just about restarting; they also give you synchronized starting in the proper order, as well as proper termination and cleanup.
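As a small illustration (module names are made up): children listed in a supervisor are started top to bottom, and terminated in the reverse order when the supervisor shuts down.

  children = [
    MyApp.Repo,      # started first, stopped last
    MyApp.Cache,
    MyApp.Endpoint   # started last, stopped first
  ]

  Supervisor.start_link(children, strategy: :one_for_one)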

6 Likes

So, my conclusion from this thread is the following: there is not really any reason to start processes without a link.

5 Likes

That would be my conclusion as well :slight_smile:

1 Like

Related to this: What about a process that is only monitored?

start_link ensures that a bidirectional link is created between the two processes, with the supervisor trapping exits so that it can respond intelligently to an exiting child instead of just blowing up along with it. In the other direction we of course do not trap exits: when the supervisor crashes, the children immediately crash as well.

Using a monitor (the supervisor monitoring the child), we would lose this second behaviour. This means that we would lose the cleanup that would otherwise happen when the supervisor disappears.
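A quick sketch of the difference, as it would look in a shell:

  # Link + trap_exit: the parent gets an {:EXIT, ...} message when the
  # child dies; without trapping, it would die along with it.
  Process.flag(:trap_exit, true)
  child = spawn_link(fn -> exit(:boom) end)
  receive do
    {:EXIT, ^child, reason} -> reason   #=> :boom
  end

  # Monitor: one-way only. The watcher learns about the death via :DOWN,
  # but nothing happens to the watched process if the watcher dies.
  pid = spawn(fn -> Process.sleep(100); exit(:boom) end)
  ref = Process.monitor(pid)
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason   #=> :boom
  end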

My intuition tells me that this situation is less desirable, but I think it would be good to discuss this related possibility in more detail. Are there any cases in which unidirectional monitoring would be better than a bidirectional link?

IMO, these things serve different purposes, and therefore don't exclude each other. As mentioned in this thread, start_link is a prerequisite for proper termination of processes. I can't think of any scenario where a dangling process is desirable, so I think that every process should be start_link-ed under some parent (which should IMO most often be a supervisor).

A monitor is useful if you care about a process's termination, but you don't want your own termination to affect that process.

For example, let's say that some message arrives over a websocket. We want to handle it and send the response back to the other side. If there's an error, we need to report that as well. But if the websocket connection is closed, we don't want to stop the handling.

To make this happen, we could handle the message in a separate process. We'd start this process somewhere else in the supervision tree, so the process is start_link-ed to some parent (e.g. a :simple_one_for_one supervisor). That process would then send a message back to the communication process when it has finished. To handle an error, we need to monitor that process and handle the :DOWN message with an abnormal exit reason. Such a setup decouples message handling from the communication process, but at the same time ensures that the communication process can detect a failure of the message handler.

A more generalized case of this is: "start a job, report a success or failure". This means we have two activities: the job itself, and monitoring of its lifecycle. I usually handle this by starting the job under a :simple_one_for_one supervisor and having the reporter process monitor the job. If you want to ensure that the termination of the reporter takes down all the associated jobs, you can bundle the reporter and the job supervisor under a common :one_for_all supervisor.
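Roughly, the reporter side could look something like this (just a sketch: JobSupervisor, report_failure/1, and the jobs map in the state are all made up, and the job is assumed to send its result back before exiting normally):

  # in the communication (reporter) process, a GenServer
  def handle_info({:handle_message, msg}, state) do
    # the job is linked to JobSupervisor, not to this process
    {:ok, pid} = Supervisor.start_child(JobSupervisor, [msg, self()])
    ref = Process.monitor(pid)
    {:noreply, %{state | jobs: Map.put(state.jobs, ref, pid)}}
  end

  def handle_info({:DOWN, ref, :process, _pid, :normal}, state) do
    # the job finished normally (it already sent its result back)
    {:noreply, %{state | jobs: Map.delete(state.jobs, ref)}}
  end

  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    # abnormal exit: report the failure of this particular job
    report_failure(reason)
    {:noreply, %{state | jobs: Map.delete(state.jobs, ref)}}
  end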

1 Like

I just ran into this, as a matter of fact: I changed from using start_link to start. It's a one-off device discovery that uses UDP multicast, and the process automatically shuts down after a couple of seconds. It's almost like using a port.

There are other reasons that processes use links which have nothing to do with their supervisor, if they have one. One typical case is when a process has allocated resources: the servers managing those resources will typically link to the process, so that if/when it dies they are notified and can clean up after it. This definitely has nothing to do with the supervisor, which should not be handling things like this.

One benefit of this is that when it is done properly, I never have to worry about cleaning up after a process. I can just let it die, and other processes will detect this and automatically clean up after it.
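A rough sketch of what such a resource-owning server might look like (ResourceManager and release/1 are made up):

  defmodule ResourceManager do
    use GenServer

    def init(_arg) do
      # turn exits from linked clients into {:EXIT, pid, reason} messages
      Process.flag(:trap_exit, true)
      {:ok, %{}}
    end

    def handle_call({:checkout, resource}, {client, _tag}, state) do
      # link to the client so we hear about its death, however it happens
      Process.link(client)
      {:reply, :ok, Map.put(state, client, resource)}
    end

    def handle_info({:EXIT, client, _reason}, state) do
      # the client died: release whatever it had checked out
      {resource, state} = Map.pop(state, client)
      release(resource)
      {:noreply, state}
    end

    defp release(resource), do: IO.puts("releasing #{inspect(resource)}")
  end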

I sort of see it as linking in two directions: vertically, in a supervision tree, and horizontally, between workers. And the two have different purposes.

4 Likes

I agree, IF we are talking about long-lived processes and not very short tasks. But I believe it is perfectly fine to spawn simple processes in the middle of an OTP app when those processes have a short life and are allowed to fail without consequences. Sometimes it is overkill to create Tasks for them, or to create a simple_one_for_one supervisor with a transient child spec just to host these one-off processes. There's just no upside to supervising certain processes.

However, is there any disadvantage to supervising these short-lived processes?

Nope, well, a 'touch' of initial spawn overhead, but really, nope. I still supervise everything just for the OTP tree functionality like introspection, reporting, and all such.
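For example (MyApp.Supervisor is a placeholder name, and the output is illustrative):

  Supervisor.which_children(MyApp.Supervisor)
  #=> [{MyApp.Worker, #PID<0.123.0>, :worker, [MyApp.Worker]}, ...]

  Supervisor.count_children(MyApp.Supervisor)
  #=> %{active: 3, specs: 3, supervisors: 1, workers: 2}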

1 Like

I'd like to know if there's any advantage to it? :slight_smile:
I believe there isn't, but please let me know if I'm missing something important here.

A process that's heavily working on a task is not able to reply to {:system, ...} messages, so I don't see the point of using Tasks or GenServers for this purpose either.

What's more, supervision can start working against you, since too frequent restarts can cause the supervisor to crash itself. TaskSupervisor is configured with :temporary children, so they get restarted when they fail. (What if your processes are allowed to crash?) Then the starter process loses the pid of those restarted tasks, so I really don't see the benefit of supervising short-lived processes.

One more thing: the tiny little overhead of supervising the process can be huge when the process has a short life.

I think the main reason is that you want to find out when your 'short-lived process' for some reason turns out to stay around much longer than you intended.

I also believe that processes that do not respond to {:system, ...} messages probably end up in some special section in the introspection tools, and at least many of the introspection functions work directly with the schedulers (so outside the process' own execution scope) to ensure that things like infinite loops will not prevent you from, e.g., seeing how much memory the process is claiming.
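For instance, Process.info/2 is answered by the runtime rather than by the process itself, so it works even when the target is stuck in a busy loop (the output below is illustrative):

  pid = spawn(fn -> Stream.cycle([1]) |> Enum.sum() end)   # never returns
  Process.info(pid, [:memory, :message_queue_len])
  #=> [memory: 2688, message_queue_len: 0]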

As I mentioned here, the main advantage of a supervised process is that it sits in the process hierarchy, and therefore it can be properly terminated when any of its ancestors terminates.

In contrast, a vanilla spawned process will linger on, which may cause problems such as reentrancy and race conditions.

Of course, if a process is "short" (whatever that means, b/c in some cases even a single millisecond can be long), the chances of that happening are smaller. But they are still greater than zero, whereas with a proper OTP supervision tree this can't happen, at least not with default settings.

Moreover, the "shortness" of a process is a tricky thing to guarantee. You need to be absolutely sure that no matter what kind of input is given, the process is going to finish "quickly". If there's some bug which causes the process to run longer, the chances of a dangling process increase. If a bug causes a process to hang indefinitely, the system might not be able to fix that automatically, and a human operator needs to fix the problem manually.

Therefore, I would never advise using plain spawns in production, or otherwise bypassing the OTP hierarchy of processes. By doing this, you're creating processes in a limbo, outside of any OTP app or supervision tree. In many cases it might work fine, but when it bites you, it can be nasty and hard to understand. I'm speaking from my personal experience here :slight_smile:

A middle-ground between a supervised process and a plain spawn could be to start a Task directly with start_link or async. At least with that, the OTP hierarchy is preserved, and the risks are reduced. You can still mess things up, but the risk surface is smaller than with plain spawns.
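For instance (do_work/0 is a placeholder):

  # fire-and-forget, but still linked to the caller:
  {:ok, _pid} = Task.start_link(fn -> do_work() end)

  # or, if the caller wants the result back:
  task = Task.async(fn -> do_work() end)
  result = Task.await(task)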

Task.Supervisor indeed uses :temporary restart strategy by default. However, that means that tasks are not restarted when they fail, and this is usually what you want.
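For example (MyApp.TaskSupervisor and do_work/0 are placeholders):

  # supervised, :temporary by default (never restarted), not linked to the caller:
  Task.Supervisor.start_child(MyApp.TaskSupervisor, fn -> do_work() end)

  # supervised, and monitored (but not linked) by the caller, which receives
  # either the result or a :DOWN message:
  task = Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn -> do_work() end)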

The overhead of asking a supervisor to start some child is in most cases insignificant. In 7 years of working with Erlang, I've never personally encountered a case where supervisor overhead was a problem.

That said, there is a known bottleneck if many processes frequently ask the same supervisor to start some child. Since all those requests are serialized in the same process, the supervisor can become overloaded, and that can cause problems. If that's the case, sharding (using multiple supervisors) can usually help.
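One possible sketch of such sharding (the names and the shard count are assumptions; it presumes :"job_sup_0" .. :"job_sup_7" are :simple_one_for_one supervisors registered elsewhere in the tree):

  defmodule JobSharding do
    @shards 8

    def start_job(key, args) do
      # deterministically pick a shard based on the key
      shard = :erlang.phash2(key, @shards)
      Supervisor.start_child(:"job_sup_#{shard}", args)
    end
  end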

1 Like

Hi sasa,

As I wrote a few comments above, I agree with supervising processes that are there for a long time. What I tried to say is that in my opinion really short processes are fine to go without supervisors. And when I write short-lived processes, I really mean short-lived ones.

I accept that supervision in this case can help reveal that a short process has turned into a long-running one. You guys are right; this is a valid reason to supervise them.

What was in my mind when I wrote that is a system I had to write two weeks ago. It consists of two processes: 1) a GenServer that receives network packets from an Ethernet interface, and 2) another process that is a GenEvent event manager. The purpose of the system is to generate events for other applications when certain packets appear on the LAN. I don't want the GenServer to be a bottleneck, so I don't parse the raw packets there. Instead, I spawn a process with this function:

  def assimilate(raw_packet) do
    {:packet, _link_type, time, _pkt_len, frame} = raw_packet
    # crash on purpose unless the frame is a TCP SYN+ACK over IPv4
    # (the two 1s in the TCP tuple are the SYN and ACK flags)
    {:ok,
      {[
        {:ether, local_mac, _remote_mac, _, _},
        {:ipv4, _, _, _, _, _, _, _, _, _, _, _, remote_ip, _local_ip, _},
        {:tcp, _, _, _, _, _, _, _, _, _, 1, _, _, 1, _, _, _, _, _raw_tcp}
       ], _}
    } = :pkt.decode(frame)
    # crash unless the remote IP is on the predefined host list
    true = :ets.member(Sniffer.HostList, remote_ip)
    timestamp = :calendar.now_to_universal_time(time)
    GenEvent.notify(Sniffer.Events, {timestamp, local_mac, remote_ip})
  end

It uses an Erlang lib to parse the packet. I have to filter the SYN+ACK TCP packets (those are the 1s in the pattern) where the IP address is on a predefined list. Because of the pattern matches, this process crashes as soon as it finds out that the packet is not of the specified type. The last line is only reached when everything is a-okay with it.

Do you really think that this simple code should run under a supervisor? Is there a real danger that this one does not stop for a long time and the process remains alive? I really would like to know if you guys think that this solution is wrong, or could be rewritten in a more stable, more effective, or more elegant way.

I believe it depends on the exact situation whether it's okay to have raw processes or it's better to use OTP. And yes, I accept that as a general rule it's better to always use supervisors.

1 Like