When to use (or not use) :infinity as the timeout for calls

sezaru · March 20, 2020, 12:15pm

Hello,

Setting the timeout value in my GenServer calls was always something I found uncomfortable. The 5-second default seemed like kind of a “random” number to me, and I was never sure what number I should use.

At the same time, I was scared of using :infinity for it since you rarely see an example using it, so I thought it was not safe (I thought you would be stuck if the callee died or something like that).

Well, digging deeper I found this link: Thoughts on when to use ‘infinity’ timeouts for gen_server:call and friends. It discusses the default call timeout and argues that it should be :infinity.

After that, I did some tests, and indeed it seems to me to be very safe to use it as the default (and only use a finite timeout when it really makes sense). It fixed a lot of timeout issues I had whenever I changed the processing power of the backend machine, which would often trigger these timeouts.

So, my question is a two-part one (sorry for that). The first part is: what is your opinion about that? Maybe the GenServer documentation should be clearer about it (if it already is, I honestly couldn’t find it)? Maybe we should have :infinity as the default timeout for calls, as the link suggests?

The second part of the question concerns other timeout configurations in the system, where it is also not really clear whether :infinity is safe to use.

For example, Ecto.Repo, you have the :timeout parameter for your config, this is what the documentation tells about it:

The time in milliseconds to wait for the query call to finish. :infinity will wait indefinitely (default: 15000)

As you can see, it mentions :infinity, but it is not clear (at least to me) whether the query call is a GenServer.call or the call to the database server. If it is the former, I would consider it safe to use :infinity, since we will not be stuck if the callee dies. But if it refers to the database server, then my guess is that the server could simply crash/disappear/whatever and the call would never return, basically being locked forever.

So, the second part of my question is, is it safe to use :infinity for the case of Ecto.Repo as an example? Do you know other libs that would not be?

Thank you very much.

lucaong · March 20, 2020, 12:34pm

Indeed, in recent Elixir/Erlang versions, setting the GenServer call timeout to :infinity is a good default in my opinion: it is generally safe, does not depend on arbitrary timeouts, does not require special handling of messages that come late if the caller rescues failures, and can provide a back pressure mechanism in some cases. If the GenServer crashes, the caller will be notified, so it won’t be hanging forever. One case when it would hang though, is if your GenServer handles the call with a {:noreply, state} and then never sends a reply for whatever reason.
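To make the "caller will be notified" point concrete, here is a minimal sketch. The module name Crashy and the :boom request are made up for illustration, and the server is started without a link so that only the call itself can affect the caller:

```elixir
defmodule Crashy do
  use GenServer

  # Hypothetical helper: started unlinked so the caller is not taken down by a link.
  def start(), do: GenServer.start(__MODULE__, nil)

  @impl true
  def init(state), do: {:ok, state}

  # Simulates a server that dies while handling the request.
  @impl true
  def handle_call(:boom, _from, _state), do: exit(:boom)
end

{:ok, pid} = Crashy.start()

# The caller waits with :infinity, yet the call exits as soon as the server dies.
result =
  try do
    GenServer.call(pid, :boom, :infinity)
  catch
    :exit, reason -> {:caller_exited, reason}
  end

IO.inspect(result)
```

The call does not hang: the caller gets an exit whose reason wraps the server's exit reason. The {:noreply, state} case is different, because the server stays alive and nothing ever tells the caller to stop waiting.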

As for Ecto.Repo, I am personally not sure about the implication of the timeout.

asummers · March 20, 2020, 12:43pm

IMO :infinity is an antipattern. It could be easily replaced with 3 hours and have the same effect. Should this run for 3h? Probably not. So why say it can run for infinity? If I have an Ecto query that has to e.g. take a whole table lock, and it can’t get the lock (say for TRUNCATE), giving it an infinity timeout will cause it to hold a DB worker forever. Get enough of these and you have major resource contention on your DB.

tiagodavi · March 20, 2020, 12:49pm

It makes sense.

sezaru · March 20, 2020, 12:51pm

I know what you mean, but at the same time, you are referring IMO to a specific situation, for that case you can consider a lower timeout because you want it to not lock something for too much time.

The point is that for the majority of the calls you would make, this would not be the case, and for those cases it doesn’t seem to me that it makes sense to use anything but :infinity, given the guarantees that @lucaong enumerated.

tiagodavi · March 20, 2020, 12:53pm

I needed to use :infinity in my tests because I was waiting for a RabbitMQ server to return. It works fine, but what @asummers said makes more sense.

asummers · March 20, 2020, 12:55pm

If I have a single GenServer there’s only one message queue, so it exhibits the same resource contention as a DB. You can model this differently, of course, but naively using :infinity everywhere has the potential to deadlock your whole app. There are cases where you do need :infinity but I can’t think of any off the top of my head where saying 3 weeks or some equally silly large number would be less appropriate.

lucaong · March 20, 2020, 12:57pm

The thing is, speaking about asynchronous calls in general and not referring to GenServer, it makes sense to time out explicitly when something takes longer than is reasonable. With GenServer though, if the call times out, the caller fails, but the server is still running and trying to produce the result even after the timeout. In other words, the deadlock would still be there, as the GenServer would still be blocked. If one really wants to free up resources when an operation takes too long, custom timeout logic on the GenServer side is better than a timeout on the caller.

Conversely, using a timeout of :infinity would at least ensure that if there is an unreasonable delay, it surfaces immediately. The right action to take is then to enforce a timeout logic that cleans up resources, which is not what the GenServer.call/3 timeout does.

The GenServer timeout was absolutely necessary back when gen_server could crash without the caller knowing about that. Nowadays it’s generally not the case anymore.

sezaru · March 20, 2020, 1:00pm

That is a very good point that I was not aware of but it makes total sense if you think about it. The caller will be free but the callee would be still “locked/blocked” doing the job the caller requested.

sezaru · March 20, 2020, 1:11pm

When you say potential deadlock, do you mean when the caller calls the callee and the callee calls the caller back?

Yeah, I can see that you would get a deadlock forever, but at the same time I don’t see how setting the timeout to some big number would help in this case: the GenServer would be blocked for 3 weeks anyway until the timeout, and then probably not much later a similar call would come in and block it again for 3 more weeks.

Personally I would consider this specific case as a software bug that needs to be fixed in the code side and not mitigated by timeout parameters.
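For reference, a minimal sketch of that call-back deadlock. The module PingPong and its message names are hypothetical, and catching the timeout exit inside the server is done only to keep the demo alive:

```elixir
defmodule PingPong do
  use GenServer

  def start(), do: GenServer.start(__MODULE__, nil)

  @impl true
  def init(state), do: {:ok, state}

  # A asks B to do something and blocks until B answers.
  @impl true
  def handle_call({:relay, peer, timeout}, _from, state) do
    {:reply, GenServer.call(peer, {:ping, self(), timeout}, :infinity), state}
  end

  # B needs to call A back, but A is still blocked inside its own call to B.
  def handle_call({:ping, peer, timeout}, _from, state) do
    reply =
      try do
        GenServer.call(peer, :pong, timeout)
      catch
        :exit, {:timeout, _} -> :peer_timed_out
      end

    {:reply, reply, state}
  end

  def handle_call(:pong, _from, state), do: {:reply, :pong, state}
end

{:ok, a} = PingPong.start()
{:ok, b} = PingPong.start()

# With a finite inner timeout the cycle is broken after 500 ms.
# With :infinity as the inner timeout, a, b and this caller would all wait forever.
result = GenServer.call(a, {:relay, b, 500}, :infinity)
IO.inspect(result)
```

The finite timeout only turns the deadlock into an error after half a second of blocked servers. The cycle itself is still a bug in the call graph.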

shanesveller · March 20, 2020, 3:04pm

I would actually say that IME :infinity timeouts have precisely the opposite effect. They obfuscate the problem because operations which can never succeed (in a reasonable time frame or not) do not result in local, actionable errors that can be handled at the calling site, logged, observed, or otherwise raised for human attention. They just result in deadlocks and upstream timeouts where someone else above you in the logical hierarchy chose not to use :infinity, perhaps at your load-balancing layer in the case of a web service.

The only way I can say :infinity helped me do discovery while understanding a problem was by providing a big red flag that I can search the codebase for to find the likely offender, which I assume is not how you meant this.

lucaong · March 20, 2020, 4:02pm

Well, I understand that timeouts in general are a good thing, but the GenServer.call timeout gives you no chance of cleaning up, and hides the real problem.

Suppose we are in a scenario in which the GenServer operation is very slow. For the sake of the argument, let’s say it hangs forever. If the caller enforces a timeout with GenServer.call/3, even after the timeout elapses the GenServer is still hanging. Any subsequent call will be queued in the mailbox of the hanging GenServer, which keeps growing without bound. The real problem is not solved, because the GenServer is not released.

Using a timeout of :infinity would block the caller forever. It is also not a great course of action, but it reflects the real performance of the GenServer, propagating backpressure at least on that specific caller. I agree that it’s not the solution, but my point is that setting a timeout is not a solution either.

The real solution in such a case is use-case-specific timeout logic inside the GenServer that knows how to clean up resources.
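The mailbox growth described above can be observed directly with Process.info/2. This is only a sketch: the module name Stuck and the 2-second sleep are stand-ins for a real slow operation.

```elixir
defmodule Stuck do
  use GenServer

  def start(), do: GenServer.start(__MODULE__, nil)

  @impl true
  def init(state), do: {:ok, state}

  # Stand-in for a slow operation: every request takes 2 seconds.
  @impl true
  def handle_call(:work, _from, state) do
    Process.sleep(2_000)
    {:reply, :done, state}
  end
end

{:ok, pid} = Stuck.start()

# Five callers in a row, each giving up after 100 ms.
for _ <- 1..5 do
  try do
    GenServer.call(pid, :work, 100)
  catch
    :exit, {:timeout, _} -> :timed_out
  end
end

# The callers are gone, but their requests are still waiting on the server.
{:message_queue_len, queued} = Process.info(pid, :message_queue_len)
IO.puts("Requests still queued on the server: #{queued}")
```

The first request is being processed and the other four sit in the mailbox, even though every caller has already given up. The server will still work through all of them.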

lucaong · March 20, 2020, 6:13pm

Here is a code example of what I mean. Let’s simulate a slow call (also printing a message every second while waiting):

defmodule Slow do
  def call(seconds) do
    for i <- (1..seconds) do
      IO.puts("Waiting #{i}...")
      Process.sleep(1_000)
    end
  end
end

Now create a GenServer with a 3-second call timeout:

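defmodule One do
  use GenServer

  def start_link(), do: GenServer.start_link(__MODULE__, [], [])

  # Client API: the caller gives up after 3 seconds.
  def hang(pid, seconds \\ 10),
    do: GenServer.call(pid, {:hang, seconds}, 3000)

  @impl true
  def init(_), do: {:ok, nil}

  # The server keeps running the slow operation even if the caller has timed out.
  @impl true
  def handle_call({:hang, seconds}, _from, state) do
    reply = Slow.call(seconds)
    {:reply, reply, state}
  end
end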

If we call One.hang(pid), it will hang until the 3-second timeout elapses, then error. As we can see from the printed messages though, the slow operation is still going on, and further calls will just clog the mailbox more and more:

{:ok, pid} = One.start_link()

One.hang(pid)
# Waiting 1...
# Waiting 2...
# Waiting 3...
# ** (exit) exited in: GenServer.call(#PID<0.165.0>, {:hang, 10}, 3000)
#     ** (EXIT) time out
#     (elixir) lib/gen_server.ex:1009: GenServer.call/3
# Waiting 4...
# Waiting 5...

One.hang(pid)
# Waiting 6...
# Waiting 7...
# Waiting 8...
# ** (exit) exited in: GenServer.call(#PID<0.165.0>, {:hang, 10}, 3000)
#     ** (EXIT) time out
#     (elixir) lib/gen_server.ex:1009: GenServer.call/3
# Waiting 9...
# Waiting 10...
# Waiting 1...
# Waiting 2...
# Waiting 3...

The GenServer.call timeout is not really helping, as it only stops the caller, not the callee. What would work is to implement logic on the callee side to stop the slow operation and clean up if a timeout elapses:

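defmodule Two do
  use GenServer

  def start_link(), do: GenServer.start_link(__MODULE__, [], [])

  # Client API: the caller waits as long as the server needs.
  def hang(pid, seconds \\ 10),
    do: GenServer.call(pid, {:hang, seconds}, :infinity)

  @impl true
  def init(_), do: {:ok, nil}

  @impl true
  def handle_call({:hang, seconds}, _from, state) do
    # Run the slow operation in a Task so the server can abandon it.
    task = Task.async(Slow, :call, [seconds])

    # Wait up to 3 seconds; if there is no reply by then, shut the task down.
    case Task.yield(task, 3000) || Task.shutdown(task) do
      {:ok, reply} -> {:reply, reply, state}
      nil -> {:stop, :timeout, state}
    end
  end
end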

In this case, the slow call is wrapped in a Task that is terminated when the 3 seconds timeout elapses:

{:ok, pid} = Two.start_link()

Two.hang(pid)
# Waiting 1...
# Waiting 2...
# Waiting 3...

# 18:57:12.341 [error] GenServer #PID<0.165.0> terminating
# ** (stop) time out
# Last message (from #PID<0.104.0>): {:hang, 10}
# ...
# ** (EXIT from #PID<0.104.0>) shell process exited with reason: time out

No more "Waiting #..." messages are logged, confirming that the slow Task is terminated after the timeout elapses.
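As a variation on the same idea, the server could reply with an error tuple instead of stopping, so it stays available for later calls. This is a sketch under assumptions: the module name Three, the run/3 function, and the {:error, :timeout} reply shape are made up for illustration.

```elixir
defmodule Three do
  use GenServer

  def start_link(), do: GenServer.start_link(__MODULE__, nil)

  # work_ms simulates how long the operation takes, limit_ms is the server-side limit.
  def run(pid, work_ms, limit_ms),
    do: GenServer.call(pid, {:run, work_ms, limit_ms}, :infinity)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:run, work_ms, limit_ms}, _from, state) do
    task =
      Task.async(fn ->
        Process.sleep(work_ms)
        :done
      end)

    # Same yield-or-shutdown pattern, but the server survives the timeout.
    case Task.yield(task, limit_ms) || Task.shutdown(task) do
      {:ok, reply} -> {:reply, {:ok, reply}, state}
      {:exit, reason} -> {:reply, {:error, reason}, state}
      nil -> {:reply, {:error, :timeout}, state}
    end
  end
end

{:ok, pid} = Three.start_link()

fast = Three.run(pid, 10, 500)
slow = Three.run(pid, 5_000, 100)
again = Three.run(pid, 10, 500)

IO.inspect({fast, slow, again})
```

Whether to stop the server or keep it running depends on whether its state can still be trusted after the abandoned operation.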

jc00ke · March 20, 2020, 11:08pm

Just this week I had to switch some streaming queries to use timeout: :infinity because I was streaming several 100k CSV rows. Actually turned out Heroku couldn’t/won’t handle it (kept getting H18 errors) so I’m working on another solution. But I know I can’t get all the data out w/o using :infinity.

dimitarvp · March 20, 2020, 11:30pm

I usually get annoyed by some longer-running requests to 3rd party API providers and just raise the timeout to 1 or 2 minutes and just pray that the callees (the GenServers) aren’t eternally waiting on a response long after the callers have timed out. I am usually extremely conservative and carefully study the timeout and cancellation options of the network services my apps work with. Take special care never to use :infinity there!

As @lucaong excellently demonstrated, the timeouts don’t do much for your application in general if the callees are deadlocked. So it’s best to take special care that your GenServers will always eventually receive some kind of response from the 3rd party service, even if it’s a failure. In that scenario I usually put a timeout of anywhere between 5 and 120 seconds in those GenServers and add 1-2 more seconds on top of that for my callers.
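A minimal sketch of that layering, with scaled-down numbers. FakeApi is a hypothetical stand-in for a network client that enforces its own timeout, and ApiWorker and the 1-second margin are likewise assumptions made for the example:

```elixir
defmodule FakeApi do
  # Hypothetical stand-in for a 3rd party client with its own timeout option.
  def request(latency_ms, timeout_ms) do
    if latency_ms <= timeout_ms do
      Process.sleep(latency_ms)
      {:ok, :response}
    else
      Process.sleep(timeout_ms)
      {:error, :timeout}
    end
  end
end

defmodule ApiWorker do
  use GenServer

  # Extra time the caller allows on top of the request timeout.
  @margin_ms 1_000

  def start_link(), do: GenServer.start_link(__MODULE__, nil)

  def request(pid, latency_ms, timeout_ms) do
    GenServer.call(pid, {:request, latency_ms, timeout_ms}, timeout_ms + @margin_ms)
  end

  @impl true
  def init(state), do: {:ok, state}

  # The inner timeout guarantees the server always gets some answer and replies.
  @impl true
  def handle_call({:request, latency_ms, timeout_ms}, _from, state) do
    {:reply, FakeApi.request(latency_ms, timeout_ms), state}
  end
end

{:ok, pid} = ApiWorker.start_link()

ok = ApiWorker.request(pid, 50, 200)
late = ApiWorker.request(pid, 5_000, 200)

IO.inspect({ok, late})
```

Because the inner timeout fires first, the caller receives an ordinary {:error, :timeout} reply instead of exiting, and the server is free again for the next request.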

Haven’t worked on NASA-level projects yet and can’t say how universally applicable such an approach is but I believe it’s a reasonable tradeoff.