This question might dive into sub-areas, so I would like to clarify as much as possible.
Application:
The application has multiple GenServer processes started through a Supervisor and a DynamicSupervisor. It also uses Elixir's Logger with LoggerFileBackend to log to different files.
Behaviour:
Sometimes some of the processes crash (and recover successfully). Sometimes the Logger process crashes completely and logging to all files stops. This is completely random: it can happen months apart or several times in a day. There is no pattern and no log output that might indicate the reason, and the Logger process does not recover. However, the application itself runs fine (for months without logging), and there are no logs or crash reports during this event. The issue is only resolved by restarting the application (which is not acceptable).
Observations:
As there are no log reports, this is based on the debugging I was able to do:
From the application's iex shell, I can confirm that Process.whereis(Logger) returns nil.
I assume that some GenServer might be crashing repeatedly, causing Logger to crash entirely. The application's GenServer is able to recover, but the Logger doesn't. This is only my assumption, as I don't get a crash report for any of the application's GenServers during this event. Other GenServer crashes do get logged properly (when Logger has not crashed).
Environment:
FROM erlang:25.1-alpine
ENV ELIXIR_VERSION="v1.13.4" LANG=C.UTF-8
Flying blind would worry me. As a first measure I’d do my best to find out why, even if it takes a while. As a second measure I’d fire up a separate process that periodically checks whether the logger is alive and tries to start it if it’s dead; I reckon once every 5 seconds is enough.
But really, don’t you get any stack traces? If you configure, for example, Loki, then all the stdout and stderr output ends up in Grafana’s logging backend and you’ll see everything.
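A minimal sketch of such a watchdog, in case it helps. The module name LoggerWatchdog and its :interval option are made up for illustration. It also assumes that "the logger is dead" means the :logger OTP application has stopped, and that Application.ensure_all_started(:logger) is enough to bring it back together with the backends from your config; I have not verified that against your setup.

```elixir
defmodule LoggerWatchdog do
  @moduledoc false
  # Hypothetical helper: polls the :logger application and restarts it if it is down.
  use GenServer

  @default_interval :timer.seconds(5)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # True when the :logger OTP application is currently running.
  def logger_running? do
    List.keymember?(Application.started_applications(), :logger, 0)
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :interval, @default_interval)
    schedule(interval)
    {:ok, %{interval: interval, restarts: 0}}
  end

  @impl true
  def handle_info(:check, state) do
    state =
      if logger_running?() do
        state
      else
        # Write straight to stderr, since Logger itself is unavailable here.
        IO.puts(:stderr, "[watchdog] :logger is down, trying to restart it")
        result = Application.ensure_all_started(:logger)
        IO.puts(:stderr, "[watchdog] restart result: #{inspect(result)}")
        %{state | restarts: state.restarts + 1}
      end

    schedule(state.interval)
    {:noreply, state}
  end

  # Schedule the next check as a message to ourselves.
  defp schedule(interval), do: Process.send_after(self(), :check, interval)
end
```

You would add {LoggerWatchdog, interval: 5_000} to the children of your own supervision tree. Printing to stderr is deliberate: at the moment the watchdog fires, Logger is by definition not there to log the event.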
That is not surprising, as there is no such process at all in Logger. It seems that the Logger entry in :registered_names in the logger.app file is just an oversight. You can easily test that by running:
$ iex
iex> Process.whereis(Logger)
nil
My bet is that there is either a message saying that a handler was removed because of some failure, or a message saying that there are too many messages in the process inbox and you are running in drop mode.
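To check both possibilities from a running iex shell, something along these lines might help. This is only a sketch: LoggerDiag is a hypothetical module name, and whether a process is registered under the name Logger depends on the Elixir version, so nil is treated as a valid answer. On versions where that process exists, the sketch assumes it is a :gen_event manager.

```elixir
defmodule LoggerDiag do
  @moduledoc false
  # Hypothetical helper: collects a few facts about the logging pipeline.

  def snapshot do
    manager = Process.whereis(Logger)

    %{
      # Is the :logger OTP application still running?
      app_running?: List.keymember?(Application.started_applications(), :logger, 0),
      # Handlers currently attached to Erlang's :logger.
      erlang_handler_ids: :logger.get_handler_ids(),
      # Process registered as Logger, if any (nil otherwise).
      manager_pid: manager,
      # A large inbox here would point at overload rather than a crash.
      manager_queue_len: queue_len(manager),
      # gen_event handlers (backends) still attached, if the process exists.
      gen_event_handlers: handlers(manager)
    }
  end

  defp queue_len(nil), do: nil

  defp queue_len(pid) do
    case Process.info(pid, :message_queue_len) do
      {:message_queue_len, n} -> n
      nil -> nil
    end
  end

  defp handlers(nil), do: []

  defp handlers(pid) do
    :gen_event.which_handlers(pid)
  catch
    # The process may die between the lookup and the call.
    :exit, _ -> []
  end
end
```

Calling LoggerDiag.snapshot() once while logging works and once after it has stopped, then comparing the two maps, should show whether a handler disappeared or the whole application went away.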
I get all the logs before the Logger crash, many of them in separate log files, and all the error or stdout logs on the console (as configured in LoggerFileBackend). @dimitarvp As there are no logs after this event, I am not able to debug further. I already use a continuous monitoring and alerting stack, and I am able to restart the application once the issue occurs. In production this is all very helpful, but the end result is the same: my whole application has to be restarted.
Can you suggest how to restart the Logger service once it's dead? Honestly, I might be missing something here (and maybe my question is too naive), but I can't figure out how to restart only the Logger process.
@LostKobrakai Thanks, I already have this enabled. I get all the extra logs on the console (even when any GenServer's start_link function gets called). I have even tested crashing GenServers manually to verify the logs, and everything is logged properly. However, nothing gets logged during this Logger crash event.
My understanding might be wrong here, but my observation is like this:
So, even if I kill the logger process, everything resumes. However, some event makes this non-recoverable, and only after that event does Process.whereis(Logger) return nil.