Mix test fails while individual test files succeed

IVR · July 14, 2021, 7:22pm

Hi all,

I have a small Phoenix API which you can find here.

I’m running into a bizarre bug: whenever I run mix test I get intermittent failures, which primarily occur in my controllers, the errors say that the passed user IDs are nil – user IDs are pulled from conn.assigns.current_user which is set via Plug.Conn.put_req_header(“authorization”, token), so I guess this part intermittently fails. The strange part is that when I run mix test on individual test files, everything works fine. It appears that the problem is caused by some sort of DB pollution.

Has anyone experienced this? Any ideas on how to fix it?

Many thanks!

benwilson512 · July 14, 2021, 8:26pm

What database are you using, and are you doing any kind of caching in your application?

al2o3cr · July 15, 2021, 4:01am

I can’t pinpoint a cause, but I discovered something interesting. There appears to be a race between putting the user’s data in the Pow session cache and reading it out; I verified this by adding this (very much NOT FOR PRODUCTION USE) hack to the authorization plug:

  def fetch(conn, config, tries \\ 0) do
    with {:ok, signed_token} <- fetch_access_token(conn),
         {:ok, token}        <- verify_token(conn, signed_token, config),
         {user, _metadata}   <- CredentialsCache.get(store_config(config), token) do

      if tries > 0 do
        IO.puts "succeeded after #{tries} tries"
      end

      {conn, user}
    else
      :not_found ->
        if tries <= 5 do
          fetch(conn, config, tries + 1)
        else
          {conn, nil}
        end

      _any ->
        {conn, nil}
    end
  end

When I run mix test with this change applied, even with a fixed --seed, I observe intermittent failures that resolve after retries:

Matts-MacBook-Pro-2:sleep_rescue_api mattjones$ mix test --seed 198643
Excluding tags: [:pending]

.................23:55:32.539 request_id=FpHaoqQA2cBHz4wAAASk [error] Failed to create a user: %{email: ["has invalid format"], password_confirmation: ["does not match confirmation"]}
....succeeded after 2 tries
...................................succeeded after 2 tries
.....

Finished in 65.6 seconds
61 tests, 0 failures

Randomized with seed 198643

Matts-MacBook-Pro-2:sleep_rescue_api mattjones$ mix test --seed 198643
Excluding tags: [:pending]

.......succeeded after 3 tries
..........23:56:46.498 request_id=FpHas9xVG5CAbwAAAAYD [error] Failed to create a user: %{email: ["has invalid format"], password_confirmation: ["does not match confirmation"]}
..succeeded after 1 tries
..........................succeeded after 2 tries
...succeeded after 2 tries
.............

Finished in 68.5 seconds
61 tests, 0 failures

Randomized with seed 198643

IVR · July 15, 2021, 4:40am

Thanks for looking into this!

I’m using a standard Postgres DB and I’m not caching anything myself, but perhaps some of the libraries that I’m using are doing that. The only one that comes to mind is Pow that is used for authentication.

IVR · July 15, 2021, 4:59am

Wow

@al2o3cr thank you so much for looking into this and spotting such a weird bug. I probably wouldn’t have picked it up.

After trying out your hack, all my tests are fixed! As you said it though, this is not a production-grade solution. Perhaps @danschultzer might be able to help here?

The fetch method in question comes from this documentation, but I imagine that the actual bug must be somewhere deeper in Pow’s codebase. I’ve created a GitHub issue here, hopefully, we’ll be able to get that fixed, but I feel we’ll need to do a bit more digging first.

What is most concerning to me is that the bug might not actually be limited to tests only. I believe Pow is a battle-tested solution, so I’d be surprised if I’m the first one to run into this.

Sebb · July 15, 2021, 8:52am

I assume you are using Ecto.Adapters.SQL.Sandbox — Ecto SQL v3.6.2 (just asking because thats where I failed testing an app with DB)

IVR · July 15, 2021, 8:58am

I’ve tried the following inside my test_helpers.exs:

#Ecto.Adapters.SQL.Sandbox.mode(SleepRescue.Repo, :manual) # this
Ecto.Adapters.SQL.Sandbox.mode(SleepRescue.Repo, {:shared, self()}) # and this

And I’ve also tried :ok = Ecto.Adapters.SQL.Sandbox.checkout(Repo) and use ExUnit.Case, async: true in my tests just like the link you’ve shared suggests. Neither of these helped.

Either way, it looks like we’ve narrowed down the problem to Pow’s auth plug, but we still need to figure out where exactly the problem occurs. I’ll try to look at Pow’s source code tomorrow when I should have a bit more time, but I’m an Elixir beginner, so I’d be surprised if I could nail this one.

al2o3cr · July 15, 2021, 12:17pm

Found a useful comment from Pow’s maintainer - note the use of sleep to ensure the cache has caught up:

~~Using the alternative store in that comment reduces the number of “succeeded after 2 tries” messages in the output, but they don’t go away entirely~~ EDIT: I did not have the store plugged in correctly when this message was written - see followup below

IVR · July 15, 2021, 12:21pm

Thanks for sharing! I’m now concerned that this problem might affect the production environment too, so the use of sleep is certainly not a solution and since the alternative store doesn’t resolve the problem either, I think the search goes on

I’ve reached out to people on Slack as well, hopefully, this will get some attention because it seems that it’s a Pow issue.

al2o3cr · July 15, 2021, 1:36pm

I did not have the “mock” store plugged in correctly before, so it wasn’t actually affecting the behavior of the system.

Updating the mock to:

defmodule SleepRescueWeb.EtsCacheMock do
  @moduledoc false
  @tab __MODULE__

  def init, do: :ets.new(@tab, [:ordered_set, :public, :named_table])

  def get(config, key) do
    ets_key = ets_key(config, key)

    @tab
    |> :ets.lookup(ets_key)
    |> case do
      [{^ets_key, value} | _rest] -> value
      []                          -> :not_found
    end
  end

  def delete(config, key) do
    :ets.delete(@tab, ets_key(config, key))

    :ok
  end

  def put(config, record_or_records) do
    records     = List.wrap(record_or_records)
    ets_records = Enum.map(records, fn {key, value} ->
      {ets_key(config, key), value}
    end)

    :ets.insert(@tab, ets_records)
  end

  def all(config, match) do
    ets_key_match = ets_key(config, match)

    @tab
    |> :ets.select([{{ets_key_match, :_}, [], [:"$_"]}])
    |> Enum.map(fn {[_namespace | keys], value} -> {keys, value} end)
  end

  defp ets_key(config, key) do
    [Keyword.get(config, :namespace, "cache") | List.wrap(key)]
  end
end

and adding:

SleepRescueWeb.EtsCacheMock.init()

makes the tests pass consistently!

Unrelated: password hashing is intentionally resource-intensive to prevent attacks, but that’s a problem in tests. Setting config :pow, Pow.Ecto.Schema.Password, iterations: 1 in config/test.exs takes the runtime for mix test from 73s down to 1 second on my machine.

IVR · July 15, 2021, 8:08pm

@al2o3cr thank you so much. I really appreciate you looking into this and you did not just find a solution but you’ve also helped me speed up my tests by a massive factor. I’m quite new to Elixir, so I would have struggled with this one. I wish I could buy you a beer!

EDIT: after running tests repeatedly, I noticed that unfortunately it still fails intermittently

The SleepRescueWeb.EtsCacheMock.init() is supposed to go into test_helpers.exs, right?

al2o3cr · July 15, 2021, 8:43pm

Oh noes! I haven’t been able to reproduce that locally, so I’m out of ideas

test/test_helper.exs, to be precise. That file gets run by ExUnit first and kicks off the whole testing process.

Mine contains:

Ecto.Adapters.SQL.Sandbox.mode(SleepRescue.Repo, {:shared, self()})
SleepRescueWeb.EtsCacheMock.init()

ExUnit.start(exclude: [:pending])
#Ecto.Adapters.SQL.Sandbox.mode(SleepRescue.Repo, :manual)

IVR · July 16, 2021, 5:19am

So it works for you consistently? Strange… Perhaps my ETS cache isn’t configured properly somehow.

I ended up fixing it by adding :timer.sleep(50) to all of my test setups and now it takes a few seconds longer to run, but the tests are passing consistently.

Thank you for helping me find the solution!