I’ve got an issue that is driving me a little crazy. I’m seeing some really odd behavior crop up that I don’t expect. Race conditions that seemingly change the entire logic flow and give me opposite responses from the test run just prior.
The project I’m working on has a rather large code base (all of it mine and most of it crappy I’m sure) so I’ll bring up the relevant sections from an example test-gone-wrong.
The example error I’m going to use to illustrate this is as follows:
1) test Scripts: Running Script RunErrorInterval (Exmud.Engine.Test.ScriptTest)
test/engine/script_test.exs:102
Assertion with == failed
code: assert Script.stop(object_id, RunErrorInterval.name()) == :ok
left: {:error, :script_not_running}
right: :ok
stacktrace:
test/engine/script_test.exs:107: (test)
So I’m getting an error when attempting to stop what should be a running Script. The test that failed this particular run is as follows:
@tag script: true
@tag engine: true
test "Running Script RunErrorInterval", %{object_id: object_id} = _context do
assert Script.start(object_id, RunErrorInterval.name()) == :ok
assert Script.running?(object_id, RunErrorInterval.name()) == true
assert Script.run(object_id, RunErrorInterval.name()) == :ok
Process.sleep(10) # Give Process enough time to actually run
assert Script.stop(object_id, RunErrorInterval.name()) == :ok
end
In this case a Script is a callback module which has its own GenServer worker process wrapping it which handles all of the behind-the-scenes coordination. So starting the script is making a call to a Supervisor to start a new child process. The running?
check pings the Registry module to check if the worker process is registered which obviously passes. Then the call to ‘run’ is made.
def run(object_id, name) do
send_message(:call, object_id, name, :run)
end
defp send_message(method, object_id, name, message) do
try do
apply(GenServer, method, [via(@script_registry, {object_id, name}), message])
catch
:exit, {:noproc, _} -> {:error, :script_not_running}
:exit, {:normal, _} -> :ok
:exit, {:shutdown, _} -> :ok
end
end
On the GenServer worker side the received call to run triggers the following code:
def handle_call(:run, from, state) do
if state.timer_ref != nil do
Process.cancel_timer(state.timer_ref)
end
ref = Process.send_after(self(), :run, 0)
{:reply, :ok, %{state | timer_ref: ref}}
end
def handle_info(:run, state) do
state = %{state | timer_ref: nil}
wrap_callback_in_transaction(fn ->
run(state)
end)
end
defp run(state) do
run_result = apply(state.callback_module,
:run,
[state.object_id, state.deserialized_state])
case run_result do
{:ok, new_state} ->
Logger.info("Script `#{state.script_name}` on Object `#{state.object_id}` successfully ran.")
persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)
{:noreply, %{state | deserialized_state: new_state}}
{:ok, new_state, interval} ->
Logger.info("Script `#{state.script_name}` on Object `#{state.object_id}` successfully ran. Running again in #{interval} milliseconds.")
ref = Process.send_after(self(), :run, interval)
persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)
{:noreply, %{state | deserialized_state: new_state, timer_ref: ref}}
{:error, error, new_state} ->
Logger.error("Error `#{error}` encountered when running Script `#{state.script_name}` on Object `#{state.object_id}.")
persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)
{:noreply, %{state | deserialized_state: new_state}}
{:error, error, new_state, interval} ->
Logger.error("Error `#{error}` encountered when running Script `#{state.script_name}` on Object `#{state.object_id}. Running again in #{interval} milliseconds.")
ref = Process.send_after(self(), :run, interval)
persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)
{:noreply, %{state | deserialized_state: new_state, timer_ref: ref}}
{:stop, reason, new_state} ->
Logger.info("Script `#{state.script_name}` on Object `#{state.object_id} stopping after run.")
stop_result = apply(state.callback_module,
:stop,
[state.object_id, reason, new_state])
script_state =
case stop_result do
{:ok, script_state} ->
Logger.info("Script `#{state.script_name}` on Object `#{state.object_id} successfully stopped.")
script_state
{:error, error, script_state} ->
Logger.error("Error `#{error}` encountered when stopping Script `#{state.script_name}` on Object `#{state.object_id}.")
script_state
end
persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)
{:stop, :normal, %{state | deserialized_state: new_state}}
end
end
The Script/Callback Module being referred to in the test has the following definition for its run callback:
def run(_object_id, _) do
{:error, :error, :ok, 60_000}
end
So what should be happening is that the callback ‘run’ method should return an error which, per the code above, should return {:noreply, %{state | deserialized_state: new_state, timer_ref: ref}}
which should in no way shut down a GenServer process during a handle_info
callback. And yet the final line of the test case, the call to ‘stop’, ends up triggering the following line in the catch :exit, {:noproc, _} -> {:error, :script_not_running}
. And yet just prior we can see that the GenServer is in fact registered with the Registry module and definitely is running. Otherwise I would have gotten an error during the run
call instead of a successful return.
The thing is, this race-condition behavior extends out to things like the supervisor modules, declared as part of the Application supervision tree, not being started when the tests try to run and start worker modules under them…which just seems fundamentally wrong.
Or cases like this:
1) test Scripts: Running Script RunOnce (Exmud.Engine.Test.ScriptTest)
test/engine/script_test.exs:75
** (exit) exited in: GenServer.call(Exmud.Engine.ScriptRunnerSupervisor, {:start_child, [10, "Elixir.Exmud.Engine.Test.Script.RunOnce", nil]}, :infinity)
** (EXIT) shutdown
code: assert Script.start(object_id, RunOnce.name()) == :ok
stacktrace:
(elixir) lib/gen_server.ex:830: GenServer.call/3
(exmud_engine) lib/engine/script.ex:251: Exmud.Engine.Script.start/3
test/engine/script_test.exs:76: (test)
Where somehow during the start up of the script the GenServer decides to shut down when it works the vast majority of the time.
If the above code snippets are not enough, you can find the latest-and-still-broken code by following this link: https://github.com/mononym/exmud/tree/develop/apps/exmud_engine