Race conditions during testing

I’ve got an issue that is driving me a little crazy. I’m seeing some really odd behavior crop up that I don’t expect. Race conditions that seemingly change the entire logic flow and give me opposite responses from the test run just prior.

The project I’m working on has a rather large code base (all of it mine and most of it crappy I’m sure) so I’ll bring up the relevant sections from an example test-gone-wrong.

The example error I’m going to use to illustrate this is as follows:

    1) test Scripts: Running Script RunErrorInterval (Exmud.Engine.Test.ScriptTest)
      test/engine/script_test.exs:102
      Assertion with == failed
      code:  assert Script.stop(object_id, RunErrorInterval.name()) == :ok
      left:  {:error, :script_not_running}
      right: :ok
      stacktrace:
        test/engine/script_test.exs:107: (test)

So I’m getting an error when attempting to stop what should be a running Script. The test that failed this particular run is as follows:

    @tag script: true
    @tag engine: true
    test "Running Script RunErrorInterval", %{object_id: object_id} = _context do
      assert Script.start(object_id, RunErrorInterval.name()) == :ok
      assert Script.running?(object_id, RunErrorInterval.name()) == true
      assert Script.run(object_id, RunErrorInterval.name()) == :ok
      Process.sleep(10) # Give Process enough time to actually run
      assert Script.stop(object_id, RunErrorInterval.name()) == :ok
    end

In this case a Script is a callback module which has its own GenServer worker process wrapping it which handles all of the behind-the-scenes coordination. So starting the script is making a call to a Supervisor to start a new child process. The running? check pings the Registry module to check if the worker process is registered which obviously passes. Then the call to ‘run’ is made.

  def run(object_id, name) do
    send_message(:call, object_id, name, :run)
  end

  defp send_message(method, object_id, name, message) do
    try do
      apply(GenServer, method, [via(@script_registry, {object_id, name}), message])
    catch
      :exit, {:noproc, _} -> {:error, :script_not_running}
      :exit, {:normal, _} -> :ok
      :exit, {:shutdown, _} -> :ok
    end
  end

On the GenServer worker side the received call to run triggers the following code:

  def handle_call(:run, from, state) do
    if state.timer_ref != nil do
      Process.cancel_timer(state.timer_ref)
    end

    ref = Process.send_after(self(), :run, 0)

    {:reply, :ok, %{state | timer_ref: ref}}
  end

    def handle_info(:run, state) do
      state = %{state | timer_ref: nil}

      wrap_callback_in_transaction(fn ->
        run(state)
      end)
    end

    defp run(state) do
      run_result = apply(state.callback_module,
                       :run,
                       [state.object_id, state.deserialized_state])

    case run_result do
      {:ok, new_state} ->
        Logger.info("Script `#{state.script_name}` on Object `#{state.object_id}` successfully ran.")

        persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)

        {:noreply, %{state | deserialized_state: new_state}}
      {:ok, new_state, interval} ->
        Logger.info("Script `#{state.script_name}` on Object `#{state.object_id}` successfully ran. Running again in #{interval} milliseconds.")

        ref = Process.send_after(self(), :run, interval)
        persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)

        {:noreply, %{state | deserialized_state: new_state, timer_ref: ref}}
      {:error, error, new_state} ->
        Logger.error("Error `#{error}` encountered when running Script `#{state.script_name}` on Object `#{state.object_id}.")

        persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)

        {:noreply, %{state | deserialized_state: new_state}}
      {:error, error, new_state, interval} ->
        Logger.error("Error `#{error}` encountered when running Script `#{state.script_name}` on Object `#{state.object_id}.  Running again in #{interval} milliseconds.")

        ref = Process.send_after(self(), :run, interval)
        persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)

        {:noreply, %{state | deserialized_state: new_state, timer_ref: ref}}
      {:stop, reason, new_state} ->
        Logger.info("Script `#{state.script_name}` on Object `#{state.object_id} stopping after run.")

        stop_result = apply(state.callback_module,
                            :stop,
                            [state.object_id, reason, new_state])

        script_state =
          case stop_result do
            {:ok, script_state} ->
              Logger.info("Script `#{state.script_name}` on Object `#{state.object_id} successfully stopped.")
              script_state
            {:error, error, script_state} ->
              Logger.error("Error `#{error}` encountered when stopping Script `#{state.script_name}` on Object `#{state.object_id}.")
              script_state
          end

        persist_if_changed(state.object_id, state.script_name, state.deserialized_state, new_state)

        {:stop, :normal, %{state | deserialized_state: new_state}}
    end
  end

The Script/Callback Module being referred to in the test has the following definition for its run callback:

    def run(_object_id, _) do
      {:error, :error, :ok, 60_000}
    end

So what should be happening is that the callback ‘run’ method should return an error which, per the code above, should return {:noreply, %{state | deserialized_state: new_state, timer_ref: ref}} which should in no way shut down a GenServer process during a handle_info callback. And yet the final line of the test case, the call to ‘stop’, ends up triggering the following line in the catch :exit, {:noproc, _} -> {:error, :script_not_running}. And yet just prior we can see that the GenServer is in fact registered with the Registry module and definitely is running. Otherwise I would have gotten an error during the run call instead of a successful return.

The thing is, this race-condition behavior extends out to things like the supervisor modules, declared as part of the Application supervision tree, not being started when the tests try to run and start worker modules under them…which just seems fundamentally wrong.

Or cases like this:

    1) test Scripts: Running Script RunOnce (Exmud.Engine.Test.ScriptTest)
      test/engine/script_test.exs:75
      ** (exit) exited in: GenServer.call(Exmud.Engine.ScriptRunnerSupervisor, {:start_child, [10, "Elixir.Exmud.Engine.Test.Script.RunOnce", nil]}, :infinity)
        ** (EXIT) shutdown
       code: assert Script.start(object_id, RunOnce.name()) == :ok
       stacktrace:
         (elixir) lib/gen_server.ex:830: GenServer.call/3
         (exmud_engine) lib/engine/script.ex:251: Exmud.Engine.Script.start/3
         test/engine/script_test.exs:76: (test)

Where somehow during the start up of the script the GenServer decides to shut down when it works the vast majority of the time.

If the above code snippets are not enough, you can find the latest-and-still-broken code by following this link: https://github.com/mononym/exmud/tree/develop/apps/exmud_engine

Tests don’t start applications by default anyway, so you need to start things manually. Like your @script_registry may not work if the Registry application is not started (if that is what you are using) or so.

However you want to try not to use global things in tests if at all possible, makes it fairly impossible to multi-core the tests then (and you will get random data races and so forth). If you really need to start something globally then Application.ensure_started/1 is your friend (and be sure tests cannot run concurrently then). :slight_smile:

I said Registry module when I meant process. That’s what you get when you write late at night. In this case I use a couple of Registry processes to facilitate keeping track of different pools of workers:

    children = [
      Exmud.Engine.Repo,
      %{
        id: Cachex,
        start: {Cachex, :start_link, [cache()]}
      },
      {Registry, keys: :unique, name: system_registry()},
      {Registry, keys: :unique, name: script_registry()},
      {DynamicSupervisor, strategy: :one_for_one, name: Exmud.Engine.CallbackSupervisor}
    ]

The Script.running?/2 call made in the test pings the script Registry to check if the process is registered. So I’ve got processes which shouldn’t be erroring, and in fact should effectively be idle, that disappear after successfully being started.

Basically it all comes down to processes dying when I can find no reason they should. And this is all manifesting itself around the Script runner and associated Script tests. Even though the System behavior is very similar.

1 Like

Not sure it’s related but since you use umbrella: I noticed few funny things when I ran tests on a highly concurrent umbrella app from a specific app’s directory, it has in_umbrella dependencies and the misbehaving test uses these. The tests remained green but I got warnings that were possibly related to parallel apps not running as intended. It was never the case when I ran the tests from the top level directory.

So you might want to check if you get this weird behavior when you run tests from the top level dir of your umbrella app.

I thought Race Condition was only Go’s thing. :slight_smile:

I will definitely keep that in mind for the future, but that is not going to be the root problem here. While I do make a few calls to another of the umbrella applications during these tests, it is to call library functions that do not rely on any application state or process is being alive.

Race conditions are a fundamental engineering problem no matter the language. Even in languages which have no parallelism you will deal with race conditions the moment they interact with another program or external service.

4 Likes