There are a few issues with the error handling.
The naive assumptions about the `:DOWN` reasons are incorrect in many situations, the simplest being a failed `GenServer.call`. A library should not (because it cannot) make assumptions about what happened based on the exit reason of a process. However, the reasons can be formatted with the `Exception` module, which will get it right 99% of the time.
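For example, a `:DOWN` handler might look something like this (the `handle_info/2` shape and the `remove_job/2` helper are assumptions for illustration, not the library's actual API):

```elixir
def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
  # Exception.format_exit/1 produces a human-readable description for any
  # exit reason: raised exceptions, :killed, {:shutdown, term}, bad calls,
  # and so on - no need to pattern match on the reason ourselves.
  Logger.error("job failed: " <> Exception.format_exit(reason))
  {:noreply, remove_job(state, ref)}
end
```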
Note that it is possible for a job to call `exit(:normal)`, which means there will be a "stopped gracefully" log message but no "job ok" one. The queue will treat this as success, but a task that calls `exit(:normal)` should be treated as a failure because it did not return a response.
It is also possible for an async task to send a response but still exit abnormally: the process can receive an exit signal between sending the response and exiting. Therefore, when you receive a response you want to treat that as a successful job and demonitor with the `:flush` option.
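This is the standard `Task.Supervisor.async_nolink/2` pattern (the supervisor name and `run_job/1` are placeholders):

```elixir
%Task{ref: ref} =
  Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn -> run_job(job) end)

receive do
  {^ref, result} ->
    # A reply means the job completed, even if the task process exits
    # abnormally afterwards. Demonitor with :flush to discard any :DOWN
    # message that may already be sitting in the mailbox.
    Process.demonitor(ref, [:flush])
    {:ok, result}

  {:DOWN, ^ref, :process, _pid, reason} ->
    # No reply was sent before the task exited, so the job failed.
    {:error, reason}
end
```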
Calling `Logger` macros in a separate function is a poor pattern because information is lost. Metadata such as the module, function, and line is included in the log event automatically, so moving the log messages to their own function loses the function and line information. It would also be more idiomatic to include custom metadata in the log messages rather than prefixing them with the metadata.
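Concretely, logging at the call site with custom metadata keeps both the automatic and the custom context (`job_id` here is a hypothetical field):

```elixir
require Logger

# Logged at the call site, so :module, :function and :line metadata point
# at the real caller. The job id travels as metadata, not as a message
# prefix, so log backends can filter and format it independently.
Logger.info("job ok", job_id: job.id)
```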
The fault tolerance of a queue is not ideal, and it somewhat breaks the guarantees that the supervision (and queue) are trying to provide. A supervision tree intends that when a process in the tree exits, everything below it has been terminated or will exit asynchronously when it does (descendants are neighbours and not trapping exits). If a child is trapping exits then it may take "some time" to terminate because the exit signal is not propagated. This means the child can still be alive when the process is restarted, because it is temporarily orphaned. Therefore, if a child process traps exits, the parent should also trap exits and terminate that child in its `terminate/2` callback. This guarantees that clean-up occurs before the restart, so a restart is given a clean slate. Note that in the current implementation the `Task.Supervisor` traps exits, so during a restart the concurrency limit is not enforced.
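A sketch of that pattern, assuming the queue server starts the `Task.Supervisor` itself (module and state names are made up for illustration):

```elixir
defmodule MyApp.Queue.Server do
  use GenServer

  @impl true
  def init(arg) do
    # Trap exits so terminate/2 runs when our parent sends us an exit
    # signal, giving us a chance to shut down our own child first.
    Process.flag(:trap_exit, true)
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, %{sup: sup, arg: arg}}
  end

  @impl true
  def terminate(_reason, %{sup: sup}) do
    # Synchronously stop the Task.Supervisor (which traps exits) before
    # this process exits, so a restart starts from a clean slate and the
    # concurrency limit cannot be exceeded by leftover tasks.
    Supervisor.stop(sup)
  end
end
```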
I think in this situation a Queue should be a `one_for_all` supervisor, because the `Task.Supervisor` needs to be started before the queue server can start any jobs, and the `Task.Supervisor` should be shut down when the queue server exits. Providing a supervisor at the top of the library's tree also allows more freedom to change the supervision structure in the future. It also follows OTP principles more closely and leaves error handling to a supervisor.
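Such a tree might look like this (the module names are placeholders):

```elixir
defmodule MyApp.Queue do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    children = [
      # Started first, so it exists before the server submits any jobs
      # and is shut down last when the tree terminates.
      {Task.Supervisor, name: MyApp.Queue.TaskSupervisor},
      {MyApp.Queue.Server, opts}
    ]

    # :one_for_all restarts both children together, so the queue server
    # never runs against a stale or missing Task.Supervisor.
    Supervisor.init(children, strategy: :one_for_all)
  end
end
```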
If there is a single job in the queue and it fails, it will be run X number of times and then be lost forever. I am unsure whether the tight retry loop is desirable and whether there should be a user callback to handle a dropped job.