Logger "overflow" bringing down the whole system?

I recently run into a situation where my system would eventually stop logging and become very very slow. I traced all this to the mailbox of the Logger process containing around 2 million messages, which means the Logger had switched to “sync” mode and therefore all the calling code had become either extremely slow or just blocked forever.

This was using the default console back that was then handled by systemd on a CentOS VM.

I blogged about it and I proposed that the design of the Logger has severe limitations, which I think I observed. I think instead that the Logger should drop messages if the backend can’t handle them on time (and just add a synthetic message so that people can see it happened).

I’m not sure if I need to open an issue, go to the mailing list, so I start here…

1 Like

This is a very well known problem going back almost 2 decades. ^.^;

It is an issue in any language, if you ‘log too much’, the system can die. That is why you need to tune your logging for your workload.

That isn’t really true. In the Java world, for example, the norm is for configurable fixed-size buffers to be used to drop log messages when the production rate exceeds the rate they can be processed. Nobody wants an application that falls over because of logging.

Log messages should always be considered indicative, not canonical and created with the understanding that they are lossy. If you want transactional semantics, you need to look to another tool.

That’s doable via an ETS atomic counter, and has indeed been done. ^.^

I would argue that having a default Logger that will kill your system just at the time where you are experiencing high load goes against Elixir’s “pit of success” idea.

I got a reply on Twitter about Lager, which is an Erlang/OTP logger that explicitly mentions:

  • Tolerant in the face of large or many log messages, won’t out of memory the node
  • Optional load shedding by setting a high water mark to kill (and reinstall) a sink after a configurable cool down timer

Pinging @josevalim for an opinion…

When we designed Logger, we made sure it wouldn’t run into some of such situations. There is an application that was created when Lager was first designed that reproduces such situations and Logger is resilient to it.

We borrowed a lot from Lager, such as the sync/async behaviour that you blogged about. However, the sync behaviour is only useful if the same processes are logging over and over again. I am imagining that in your case you have newly spawned processes doing the logging and it is clear now that this back-pressure does not really stop new processes from being created.

It seems Lager already has mechanisms to handle those cases and we should probably follow their footsteps once more. Would you be so kind to open up an issue? I want to work on this ASAP and include a solution on v1.6.

5 Likes

Issue opened: https://github.com/elixir-lang/elixir/issues/7192

Thanks!

2 Likes