This looks like the same issue we faced [thread] a long time ago. It used to happen about once a month. On the older Elixir version the pod would run out of memory; on the newer one (1.9+) it would just hang. I spent a lot of time trying to work out where the problem was (Docker, the Erlang level, or the Elixir Logger), but I was never able to reproduce it consistently.
There is a Jira ticket in the Erlang issue tracker that closely resembles our case. It has an interesting note saying that calling the port_info function on the port would unblock it. I added the following Prometheus metrics collector to our services, and the issue stopped occurring after that. I still don't know what the real problem was, or how calling port_info periodically fixes it.
defmodule Core.Prometheus.StandardIOCollector do
  use Prometheus.Collector
  alias Prometheus.Model

  # Called on every scrape: emit one gauge family per standard I/O port found.
  def collect_mf(_registry, callback) do
    stderr = find_by_name('2/2')
    stdout = find_by_name('0/1')

    if stderr do
      callback.(
        Model.create_mf(
          :erlang_stderr_queue_bytes,
          "STDERR port queue size",
          :gauge,
          __MODULE__,
          stderr
        )
      )
    end

    if stdout do
      callback.(
        Model.create_mf(
          :erlang_stdout_queue_bytes,
          "STDOUT port queue size",
          :gauge,
          __MODULE__,
          stdout
        )
      )
    end

    :ok
  end

  # Port.info/2 is the Elixir wrapper around :erlang.port_info/2.
  def collect_metrics(metric, port)
      when metric in [:erlang_stdout_queue_bytes, :erlang_stderr_queue_bytes] do
    {:queue_size, bytes} = Port.info(port, :queue_size)
    Model.gauge_metrics([{[], bytes}])
  end

  # Port names are charlists: '0/1' is stdin/stdout, '2/2' is stderr.
  defp find_by_name(name) do
    Port.list()
    |> Enum.find(fn port -> match?({:name, ^name}, Port.info(port, :name)) end)
  end
end
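For services that don't use Prometheus, the same periodic polling could probably be done with a plain GenServer. The following is only a rough sketch: the module name StdioPortPoller, the interval_ms option, and the 15-second default are made up for illustration, and it assumes that the periodic Port.info call itself is what helps, which is not confirmed.

```elixir
defmodule StdioPortPoller do
  # Hypothetical module, not part of the collector above.
  use GenServer

  # Same port names as in the collector, written with the ~c sigil
  # (equivalent to '0/1' and '2/2').
  @port_names [~c"0/1", ~c"2/2"]

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.get(opts, :name, __MODULE__))
  end

  @impl true
  def init(opts) do
    # interval_ms is an illustrative option; 15 seconds is an arbitrary default.
    interval = Keyword.get(opts, :interval_ms, 15_000)
    schedule(interval)
    {:ok, %{interval: interval}}
  end

  @impl true
  def handle_info(:poll, state) do
    poll()
    schedule(state.interval)
    {:noreply, state}
  end

  # Returns [{port_name, queue_size_in_bytes}] for the standard I/O ports
  # that exist. Ports that close between calls make Port.info/2 return nil,
  # and non-matching generator patterns are skipped.
  def poll do
    for port <- Port.list(),
        {:name, name} <- [Port.info(port, :name)],
        name in @port_names,
        {:queue_size, bytes} <- [Port.info(port, :queue_size)] do
      {name, bytes}
    end
  end

  defp schedule(interval), do: Process.send_after(self(), :poll, interval)
end
```

It could be added to a supervision tree as `{StdioPortPoller, interval_ms: 15_000}`. Unlike the collector, it polls on its own timer rather than on every scrape, so it does not depend on anything actually scraping the service.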