We deploy our Elixir app on AWS ECS using blue/green deployments. This means that while a deployment is in progress (and for a short time afterwards) the app is running across multiple nodes.
We have Horde set up to hand over supervisors, GenServers, etc. when the old node is taken down. This is all working and we see the correct behaviour, except for the following error that keeps appearing in our logs:
[error] GenServer MyApp.MySupervisor.NodeListener terminating
(horde 0.9.0) lib/horde/node_listener.ex:34: Horde.NodeListener.handle_info/2
(stdlib 6.2) gen_server.erl:2345: :gen_server.try_handle_info/3
** (stop) exited in: GenServer.call(MyApp.MySupervisor, {:set_members, [{MyApp.MySupervisor, :"MyApp@21.60.73.12"}, {MyApp.MySupervisor, :"MyApp@68.18.57.76"}]}, 5000)
** (EXIT) time out
(elixir 1.18.2) lib/gen_server.ex:1128: GenServer.call/3
(horde 0.9.0) lib/horde/node_listener.ex:50: Horde.NodeListener.set_members/1
(stdlib 6.2) gen_server.erl:2433: :gen_server.handle_msg/6
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
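For context, here is a minimal sketch of the relevant part of the setup. It is simplified, and `MyApp.Application` and `MyApp.Supervisor` are placeholder names. It assumes the Horde supervisor is started with `members: :auto`, which, as far as I understand, is the option that makes Horde start the `NodeListener` process seen in the trace above:

```elixir
# Simplified sketch with placeholder module names, not the exact production code.
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Assumption: `members: :auto` lets Horde follow Erlang node up/down
      # events itself, which is why MyApp.MySupervisor.NodeListener exists.
      {Horde.DynamicSupervisor,
       [name: MyApp.MySupervisor, strategy: :one_for_one, members: :auto]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

The `{:set_members, ...}` call that times out in the trace is made by that listener against `MyApp.MySupervisor`, with the 5000 ms timeout shown in the log.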
I can only reproduce this in production; I haven’t managed to trigger it locally or in staging.
My assumption is that the second member in that list has already gone down and become unreachable, but the AWS internal DNS still has a reference to it, and so Horde thinks it’s still there.
Is there a way I can, or should, handle this error case? And is it likely to cause any negative side effects?