Slow process timeout when called from another GenServer

overture8 · August 31, 2017, 9:07am

I’m having some trouble understanding what the best approach is for dealing with slow GenServer calls. I’ve created an example of what I’m trying to do here: https://github.com/overture8/my_app.

Essentially, I have two GenServers: GenServer1 and GenServer2. In GenServer1 I make a call to GenServer2.slow_thing(), which is mocked to be a slow action - and given the 5 second default timeout of a GenServer call, I get a timeout error.

iex(1)> MyApp.GenServer1.do_somthing()
** (exit) exited in: GenServer.call(MyApp.GenServer1, :do_something, 5000)
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:737: GenServer.call/3
iex(1)>
09:51:20.799 [error] GenServer MyApp.GenServer1 terminating
** (stop) exited in: GenServer.call(MyApp.GenServer2, :slow_thing, 5000)
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:737: GenServer.call/3
    (my_app) lib/my_app/gen_server1.ex:14: MyApp.GenServer1.handle_call/3
    (stdlib) gen_server.erl:636: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:665: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.131.0>): :do_something
State: []
Client #PID<0.131.0> is alive
    (kernel) code_server.erl:140: :code_server.call/1
    (kernel) error_handler.erl:41: :error_handler.undefined_function/3
    (iex) lib/iex/evaluator.ex:226: IEx.Evaluator.print_error/3
    (iex) lib/iex/evaluator.ex:158: IEx.Evaluator.eval/4
    (iex) lib/iex/evaluator.ex:61: IEx.Evaluator.loop/3
    (iex) lib/iex/evaluator.ex:21: IEx.Evaluator.init/4
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I can fix this by changing the default timeout in both GenServer1 and GenServer2 to something greater than 10 seconds (the sleep time mocked in GenServer2). However, I’m not really sure what the best practice is here. I can’t use a cast because I want to have a return value.

orestis · August 31, 2017, 9:53am

I have been struggling with how to model this in my code as well.

As far as I can tell, this implicit timeout in GenServers is a design decision from the OTP team.

In the end, the client (i.e. GenServer1 in your case) knows best how to handle this - and what kind of delay is acceptable. I believe the 5s default timeout is generous enough for most things that should be very fast but might need to go over a network, simplifying the calling code.
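As a minimal sketch of letting the caller pick the timeout (the module name `TimeoutSketch.Slow` and the `delay_ms` argument are made up for illustration, they are not from the linked repo), the client function can simply forward a timeout to `GenServer.call/3`:

```elixir
defmodule TimeoutSketch.Slow do
  use GenServer

  # Hypothetical server: `delay_ms` simulates how long the slow work takes.
  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  # The caller decides how long it is willing to wait; 5_000 ms mirrors
  # the default timeout of GenServer.call/3.
  def slow_thing(pid, timeout \\ 5_000) do
    GenServer.call(pid, :slow_thing, timeout)
  end

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call(:slow_thing, _from, delay_ms) do
    # Stand-in for the real slow action.
    Process.sleep(delay_ms)
    {:reply, :done, delay_ms}
  end
end
```

With a 200 ms delay, `TimeoutSketch.Slow.slow_thing(pid, 1_000)` returns `:done`, while `TimeoutSketch.Slow.slow_thing(pid, 50)` exits in the caller with a `:timeout` reason. When one GenServer calls another from inside `handle_call/3`, the outer call's timeout has to be longer than the inner one, otherwise the outer caller gives up first.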

If you expect your calls to finish within the default timeout, then hitting the timeout is a good thing: it probably points to a problematic component somewhere in your system. Is the DB overloaded? Network slow? CPU on remote node churning? etc. The caller can’t deal with those things, so it promptly crashes. If the default timeout wasn’t there, you’d have to defensively add it yourself at every call site.

If you are actually modelling a known long running process, e.g. transcoding a video, doing a batch processing job etc, then you need to explicitly design your system around it.

One possible design for a long-running job: when you start the job, you get a ticket back, which signifies that the job has been successfully received. You then pass the ticket to the job manager to query the status of your job.
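A rough sketch of the ticket idea, assuming a single manager process that keeps job states in a map (the module name `TicketSketch.JobManager`, the `submit`/`status` functions and the message shapes are all invented for this example):

```elixir
defmodule TicketSketch.JobManager do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Returns {:ok, ticket} right away; the work runs in a separate process,
  # so this call stays well within the default 5 second timeout.
  def submit(server, fun) when is_function(fun, 0) do
    GenServer.call(server, {:submit, fun})
  end

  # Returns :running, {:done, result} or :unknown_ticket.
  def status(server, ticket), do: GenServer.call(server, {:status, ticket})

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:submit, fun}, _from, jobs) do
    ticket = make_ref()
    manager = self()
    # Linked for simplicity: a crashing job takes the manager down with it.
    # A real system would more likely run jobs under a supervisor.
    spawn_link(fn -> send(manager, {:finished, ticket, fun.()}) end)
    {:reply, {:ok, ticket}, Map.put(jobs, ticket, :running)}
  end

  def handle_call({:status, ticket}, _from, jobs) do
    {:reply, Map.get(jobs, ticket, :unknown_ticket), jobs}
  end

  @impl true
  def handle_info({:finished, ticket, result}, jobs) do
    {:noreply, Map.put(jobs, ticket, {:done, result})}
  end
end
```

The client calls `submit/2` once, keeps the ticket, and polls `status/2` whenever it wants to know how the job is doing. Neither call blocks on the slow work itself.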

Additionally/alternatively, register a PID for that ticket/job, and send a message to that process’s mailbox when the job finishes.
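The notification variant could look roughly like this. It is only a sketch: `NotifySketch`, `start_job` and the `{:job_done, ticket, result}` message are hypothetical names, and the job is spawned unsupervised to keep the example short:

```elixir
defmodule NotifySketch do
  # Starts `fun` in its own process and returns a ticket immediately.
  # When the work finishes, `notify_pid` (the caller by default) receives
  # {:job_done, ticket, result} in its mailbox.
  def start_job(fun, notify_pid \\ self()) when is_function(fun, 0) do
    ticket = make_ref()
    spawn(fn -> send(notify_pid, {:job_done, ticket, fun.()}) end)
    ticket
  end
end
```

A plain process would pick the result up with `receive`, matching on `{:job_done, ^ticket, result}`. If the process waiting for the result is itself a GenServer, the same message would arrive in `handle_info/2`, so the server stays free to handle other calls while the job runs.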

overture8 · August 31, 2017, 11:55am

Thanks @orestis - that has cleared things up for me