Logger.Backends.Console hangs and blocks calling process

Hi,
I’ve been learning Elixir for about a month, and a week ago I deployed my first app to a staging environment. Everything worked smoothly for a couple of days, but then the service started returning 500 responses and, to my surprise, there were no new logs. I restarted the app and it worked for a couple of days, but then I observed the same behavior.
These are my Elixir and OTP versions:

Erlang/OTP 24 [erts-12.0.1] [source] [64-bit] [smp:64:64] [ds:64:64:10] [async-threads:1] [jit]
Elixir (1.12.0) 

Upon investigation I found out that one of my GenServers had hung. This is the current stacktrace of the hung GenServer:

iex(ingestor@almighty)3> Process.info(pid_commiter, :current_stacktrace)
{:current_stacktrace,
 [
   {:gen, :do_call, 4, [file: 'gen.erl', line: 214]},
   {:gen_event, :rpc, 2, [file: 'gen_event.erl', line: 290]},
   {Logger.Handler, :log, 2, [file: 'lib/logger/handler.ex', line: 101]},
   {:logger_backend, :call_handlers, 3, [file: 'logger_backend.erl', line: 51]},
   {Admiral.BucketCommiter, :commit, 1,
    [file: 'lib/admiral/bucket_commiter.ex', line: 220]},
   {Admiral.BucketCommiter, :handle_info, 2,
    [file: 'lib/admiral/bucket_commiter.ex', line: 57]},
   {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 695]},
   {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 771]}
 ]}

'lib/admiral/bucket_commiter.ex', line: 220 is just a simple Logger.debug call, and the Logger process is always stuck on this line:

iex(ingestor@almighty)5> Process.info(Process.whereis(Logger), :current_stacktrace)
{:current_stacktrace,
 [
   {Logger.Backends.Console, :await_io, 1,
    [file: 'lib/logger/backends/console.ex', line: 270]},
   {Logger.Backends.Console, :handle_event, 2,
    [file: 'lib/logger/backends/console.ex', line: 153]},
   {:gen_event, :server_update, 4, [file: 'gen_event.erl', line: 627]},
   {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 609]},
   {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 611]},
   {:gen_event, :handle_msg, 6, [file: 'gen_event.erl', line: 351]},
   {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
 ]}

I’m new to Elixir and can’t fully understand what’s going on, but it looks like there is some kind of problem with Logger.Backends.Console, as it eventually gets stuck in the await_io call.

This is my config for Logger’s console backend:

config :logger, :console,
  format: {Admiral.LoggerFormatter, :format},
  metadata: [:file, :function, :line],
  sync_threshold: 1000,
  discard_threshold: 1000

I’ve tried playing with sync_threshold and discard_threshold, but the problem still occurs no matter what the thresholds are set to.
I’m not able to reliably reproduce the issue; sometimes it takes a couple of hours before my GenServer hangs, sometimes a couple of days.
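For reference, this is how the standard IO port’s queue can be inspected from an attached iex session (a sketch; Erlang names the fd-backed ports after their file descriptors, so '0/1' is standard IO and '2/2' is stderr, and the exact output shape may vary by OTP version):

```elixir
# Sketch: find the fd-backed standard IO port on a running node.
stdout =
  Port.list()
  |> Enum.find(fn port -> match?({:name, '0/1'}, Port.info(port, :name)) end)

# A large queue_size that never shrinks would suggest the port's IO is
# blocked, which matches Logger.Backends.Console hanging in await_io.
Port.info(stdout, :queue_size)
```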

I’d appreciate any help, and I can provide any further details.

How do you run your application?

Can’t help with your particular problem, but I see that you use Erlang 24. It is fairly new and has some backwards-incompatible changes (removed functions), and not all libraries have caught up yet.

So your safest bet would likely be to downgrade to Erlang 23.3.4.1 for now. You might want to downgrade Elixir to 1.11.4, too.


I built it with MIX_ENV=prod mix release, and I start it as a systemd service. The systemd unit file is:

[Unit]
Description=Ingesting service
Requires=network.target
After=network.target

[Service]
Type=forking
User=ingestor
Group=ingestor
WorkingDirectory=/var/www/admiral/ingestor
ExecStart=/var/www/admiral/ingestor/_build/prod/rel/ingestor/bin/ingestor daemon_iex
ExecStop=/var/www/admiral/ingestor/_build/prod/rel/ingestor/bin/ingestor stop
StartLimitBurst=10
StartLimitInterval=101
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

I was thinking the same; I’ll try downgrading and see if the issue repeats.

Do not use Type=forking for new services. It is there only for legacy reasons, and IMHO systemd should never have offered that option, instead leaving an external program to watch over forking services. Additionally, if you need the service to background itself (you do not), use daemon rather than daemon_iex.

Start the service in the foreground and use Type=simple, or if you want to be “modern”, Type=notify together with the systemd library.
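For example, the unit above could be reshaped roughly like this (a sketch, reusing the paths from the unit file posted earlier; note that with Type=simple the release has to run in the foreground, so daemon_iex is replaced with start):

```ini
[Service]
Type=simple
User=ingestor
Group=ingestor
WorkingDirectory=/var/www/admiral/ingestor
# `start` keeps the release in the foreground, which is what Type=simple expects
ExecStart=/var/www/admiral/ingestor/_build/prod/rel/ingestor/bin/ingestor start
Restart=on-failure
RestartSec=10
```

Type=notify would additionally require the application to signal readiness to systemd, e.g. via the erlang-systemd library mentioned below.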


Your link was broken. Here’s the correct one: GitHub - hauleth/erlang-systemd: systemd utilities for Erlang applications

As a side note, I have used that library and it works well in my experience :+1:


Yeah, I was on mobile, so I was typing the URL by hand and forgot that there is an erlang- prefix (on my machine I store it in a directory without the prefix, which is the source of the confusion).


Hi, you may try:

  • using another device,
  • also checking your custom formatter; the documentation explains:

It is extremely important that the formatting function does not fail, as it will bring that particular logger instance down, causing your system to temporarily lose messages. If necessary, wrap the function in a rescue and log a default message instead.
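For example, a minimal defensive wrapper could look like this (a sketch; Admiral.SafeLoggerFormatter is a hypothetical name, and it assumes Admiral.LoggerFormatter.format/4 has the standard four-argument formatter signature):

```elixir
defmodule Admiral.SafeLoggerFormatter do
  # Hypothetical wrapper: delegates to the existing formatter but never
  # raises, so a formatting bug cannot take the console backend down.
  def format(level, message, timestamp, metadata) do
    Admiral.LoggerFormatter.format(level, message, timestamp, metadata)
  rescue
    _ -> "could not format: #{inspect({level, message})}\n"
  end
end
```

The backend config would then point at the wrapper instead: `format: {Admiral.SafeLoggerFormatter, :format}`.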

This seems like the same issue we faced [thread] a long time ago. It used to happen about once a month. The pod would run out of memory on older Elixir versions; on newer ones (1.9+), it would just hang. I spent a lot of time trying to understand where the issue was (Docker, the Erlang level, or the Elixir Logger), but I was never able to reproduce it consistently.

There is a Jira ticket in the Erlang issue tracker that closely resembles our case. There was an interesting note that calling the port_info function on the port would unblock it. I added the following Prometheus metrics collector to our services, and the issue stopped occurring after that. I still don’t know what the real problem was, or how calling port_info periodically fixes it:

defmodule Core.Prometheus.StandardIOCollector do
  use Prometheus.Collector
  alias Prometheus.Model

  def collect_mf(_registry, callback) do
    # Erlang names the fd-backed ports after their file descriptors:
    # '0/1' is standard IO (stdin/stdout) and '2/2' is stderr.
    stderr = find_by_name('2/2')
    stdout = find_by_name('0/1')

    if stderr do
      callback.(
        Prometheus.Model.create_mf(
          :erlang_stderr_queue_bytes,
          "STDERR port queue size",
          :gauge,
          __MODULE__,
          stderr
        )
      )
    end

    if stdout do
      callback.(
        Prometheus.Model.create_mf(
          :erlang_stdout_queue_bytes,
          "STDOUT port queue size",
          :gauge,
          __MODULE__,
          stdout
        )
      )
    end

    :ok
  end

  def collect_metrics(metric, port)
      when metric in [:erlang_stdout_queue_bytes, :erlang_stderr_queue_bytes] do
    {:queue_size, bytes} = Port.info(port, :queue_size)
    Model.gauge_metrics([{[], bytes}])
  end

  defp find_by_name(name) do
    Port.list()
    |> Enum.find(fn port -> match?({:name, ^name}, Port.info(port, :name)) end)
  end
end
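For completeness, with the prometheus.ex library the collector still has to be registered so it actually gets scraped; from memory this looked roughly like the following (the :default registry name is an assumption and may differ in your setup):

```elixir
# Sketch: register the custom collector at application start.
Prometheus.Registry.register_collector(:default, Core.Prometheus.StandardIOCollector)
```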

You might have been bitten by this. You can try updating to a fixed OTP version and see if it happens again.
