Erlang error: {:timeout, {:gen_server, :call, [#PID<0.7078.0>, {:checkout, #Reference<0.180397287.2702180353.191007>, true, 5000}, 5000]}}

I have a Phoenix project running on Docker Swarm using AWS spot machines and I’m getting this error when the service tries to start

(ErlangError) Erlang error: {:timeout, {:gen_server, :call, [#PID<0.7078.0>, {:checkout, #Reference<0.180397287.2702180353.191007>, true, 5000}, 5000]}}

This error sometimes happens, sometimes not, I didn’t figure out the pattern which triggers the error. Have you guys ever got this error?

I’m using Elixir 1.7.4 and Erlang 21

bash-4.4$ /opt/app/releases/1.0.0/content_proxy.sh remote_console
Erlang/OTP 21 [erts-10.1.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.7.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(content_proxy@127.0.0.1)1>

My O.S is Alpine Linux 3.8.4

My Application module only connect into MongoDB database

defmodule ContentProxy.Application do
  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  @moduledoc false

  use Application

  def start(_type, _args) do
    # List all child processes to be supervised
    database_config = Application.get_env(:content_proxy, :db_config)

    children = [
      ContentProxyWeb.Endpoint,
      %{
        id: Mongo,
        start: { Mongo, :start_link, [database_config] }
      }
    ]

    {:ok, _} = Logger.add_backend(Sentry.LoggerBackend)

    opts = [strategy: :one_for_one, name: ContentProxy.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # Tell Phoenix to update the endpoint configuration
  # whenever the application is updated.
  def config_change(changed, _new, removed) do
    ContentProxyWeb.Endpoint.config_change(changed, removed)
    :ok
  end
end

An important info, my mongodb service is running and connecting very well

NAME           IMAGE             NODE     DESIRED STATE       CURRENT STATE  
mongodb       mongo:latest        web3      Running             Running 7 days ago

Apparently GenServer.call/3 is timed out after 5000ms.

Do you use mongo directly from genserver? like a process -> genserver -> mongo. If so, it seems like the first part is timed out, because genserver is blocked for some reasons. GenServer will serialize all “call” messages, and handles them sequentially so it will be a bottleneck.

mongo process is managed by Supervisor

def start(_type, _args) do
    # List all child processes to be supervised
    database_config = Application.get_env(:content_proxy, :db_config)

    children = [
      ContentProxyWeb.Endpoint,
      %{
        id: Mongo,
        start: { Mongo, :start_link, [database_config] }
      }
    ]

    {:ok, _} = Logger.add_backend(Sentry.LoggerBackend)

    opts = [strategy: :one_for_one, name: ContentProxy.Supervisor]
    Supervisor.start_link(children, opts)
  end

According to Supervisor doc https://hexdocs.pm/elixir/Supervisor.html, when Supervisor shutdown a child process, he waits until 5000 milliseconds before to kill it. This timeout is only for worker or infinity child spec.

an atom that defines how a child process should be terminated (see the "Shutdown values" section below). This key is optional and defaults to 5000 if the type is :worker or :infinity if the type is :supervisor.

Mongodb defines itself as a worker process (https://github.com/ankhers/mongodb/blob/master/lib/mongo.ex)

  @spec start_link(Keyword.t()) :: {:ok, pid} | {:error, Mongo.Error.t() | atom}
  def start_link(opts) do
    opts
    |> UrlParser.parse_url()
    |> Topology.start_link()
  end

  def child_spec(opts, child_opts \\ []) do
    Supervisor.Spec.worker(Mongo, [opts], child_opts)
  end

According to MongoDB source code (https://github.com/ankhers/mongodb/blob/master/lib/mongo.ex), he waits until 5000 for the initial connection, so I can’t say if the problem is when Supervisor tries to terminate mongodb process or when MongoDB process tries to connect the MongoDB host.

1 Like

This has been something I have been trying to track down for a while now. Unfortunately I have never been able to replicate it. There is an issue open with a fair amount of traffic.

What version of the mongodb package are you currently using? You say that it sometimes happens and sometimes not. Can you say roughly how frequently it happens? Something like 20-30%. Would it be possible for me to gain access to your database to see if I am able to replicate the issue? If I can replicate it, there is a possibility I can track down what is happened and fix it.

1 Like

What version of the mongodb package are you currently using?
-> I’m using 0.4.7 version.

Can you say roughly how frequently it happens?
-> rarely. Happened in two days many times (11 times). After these two days, all works very well. I noticed in these two days the error happened when docker swarm shutdown the Phoenix application in a node and starts it in another node.

Would it be possible for me to gain access to your database to see if I am able to replicate the issue?
-> Unfortunately, not. My docker swarm is placed at AWS VPC security groups and I have sure my DevOps Leader wont autorize it :smiley:

I was talking to my DevOps Leader and he said to me it is possible docker swarm starts the service while the network layer doesn’t ready, so connection timeout problem can happen