Issue with DBConnection.Ownership.Proxy.checkout and GenServer process

We (@frahugo and I) have the following intermittent exception raised when running the test suite:

 1) test handling a BuyableProductListed event given a standalone product, it lists the product (MyApp.StatusHandlerTest)
     test/status_handler_test.exs:14
     ** (exit) exited in: GenServer.call(MyApp.Router, %MyApp.BuyableProductListed{payload: %{buyable_product_seller_sku: ":master-sku:", product_group_seller_sku: "3", seller_id: ":seller-slug:"}}, 5000)
         ** (EXIT) exited in: GenServer.call(#PID<0.740.0>, {:checkout, #Reference<0.0.1.19398>, true, 15000}, 5000)
             ** (EXIT) shutdown: "owner #PID<0.739.0> exited with: shutdown"
     stacktrace:
       (elixir) lib/gen_server.ex:737: GenServer.call/3
       test/status_handler_test.exs:97: MyApp.StatusHandlerTest.notify/1
       test/status_handler_test.exs:21: (test)

08:46:41.609 [error] GenServer MyApp.Router terminating
** (stop) exited in: GenServer.call(#PID<0.740.0>, {:checkout, #Reference<0.0.1.19398>, true, 15000}, 5000)
    ** (EXIT) shutdown: "owner #PID<0.739.0> exited with: shutdown"
    (db_connection) lib/db_connection/ownership/proxy.ex:32: DBConnection.Ownership.Proxy.checkout/2
    (db_connection) lib/db_connection.ex:920: DBConnection.checkout/2
    (db_connection) lib/db_connection.ex:742: DBConnection.run/3
    (db_connection) lib/db_connection.ex:790: DBConnection.transaction/3
    (my_app) lib/my_app/router.ex:54: MyApp.Router.handle_call/3
    (stdlib) gen_server.erl:615: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:647: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: %MyApp.BuyableProductListed{payload: %{buyable_product_seller_sku: ":master-sku:", product_group_seller_sku: "3", seller_id: ":seller-slug:"}}

About MyApp.Router:

  • it is not an HTTP router
  • it is an application-level router that routes events to handler functions
  • it is a GenServer started by a supervisor when the application starts
  • all messages are handled synchronously (i.e. via GenServer.call)
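For readers unfamiliar with the pattern, here is a minimal sketch of such a router. The real module is not shown in this thread; `dispatch/1` is a hypothetical stand-in for the handler lookup, which in the real app runs Ecto transactions (that is where the `DBConnection.checkout` in the stacktrace comes from):

```elixir
defmodule MyApp.Router do
  # Sketch: a named GenServer that dispatches events to handler
  # functions synchronously via GenServer.call/3.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  # All events go through a synchronous call, so the caller blocks
  # until the handler (which may query the database) finishes.
  def notify(event), do: GenServer.call(__MODULE__, event)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call(%{__struct__: _type} = event, _from, state) do
    # In the real app this would run something like
    # Repo.transaction(fn -> handle(event) end).
    {:reply, dispatch(event), state}
  end

  defp dispatch(_event), do: :ok
end
```

Because the router is a single long-lived, named process, every database checkout it performs happens from its own pid, not from the test process that owns the sandbox connection.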

@frahugo investigated the ownership of connections. Here is what he found:

When I set the ownership_log to :info in the Repo configuration, I see some interesting messages that might indicate a concurrency problem in “releasing” the proxy when the test process goes down.

As you can see below, the long-running Router PID is 714. For each test, that PID is “allowed” to use the proxy checked out by the test (#PID<0.714.0> allowed on proxy ...). When the test process ends, db_connection receives a DOWN message (because it is monitoring the test process), and then logs message #PID<0.714.0> lose proxy.... Therefore, on the next test, the Router (pid 714) will have to use the new proxy that the next test will check out.

It looks like the DOWN message is processed asynchronously, and that at some point, a new test starts while the owner_down function in db_connection is busy cleaning up the checkouts/owners map. That would explain why the Router is then trying to use the proxy from the previous test.
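The asynchrony described above is inherent to monitors: a DOWN notification is just a message, delivered and handled some time after the monitored process has already exited, so other work can interleave. A small, self-contained illustration:

```elixir
# Demonstration that DOWN notifications are asynchronous: the
# monitoring process can do other work after the monitored process
# has died but before it handles the DOWN message.
parent = self()

pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

ref = Process.monitor(pid)
send(pid, :stop)

# Give the target time to exit. It is already dead here, but the
# DOWN message sits unhandled in our mailbox until we receive it --
# just as db_connection may not have cleaned up the previous owner
# yet when the next test checks out a proxy.
:timer.sleep(50)
false = Process.alive?(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, :normal} -> :ok
after
  1_000 -> raise "no DOWN received"
end
```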

10:05:00.093 [info] Subscribe #PID<0.714.0> / Elixir.MyApp.Router to test_channel
10:05:02.719 [info] >>>>>>>>>>>>>>>>>>>>>>>
10:05:02.719 [info] TEST pid: #PID<0.756.0>

MyApp.ProductHandlerTest
  * test updating a product in catalog given a non-existing standalone product, it adds the product
10:05:02.726 [info] #PID<0.756.0> owns proxy #PID<0.757.0>
10:05:02.802 [info] #PID<0.714.0> allowed on proxy #PID<0.757.0>
  * test updating a product in catalog given a non-existing standalone product, it adds the product (508.9ms)
10:05:03.228 [info] >>>>>>>>>>>>>>>>>>>>>>>
  * test updating a product in catalog given new image for a regular variant, it adds images to the variant
10:05:03.229 [info] TEST pid: #PID<0.791.0>
10:05:03.229 [info] #PID<0.791.0> owns proxy #PID<0.792.0>
10:05:03.244 [info] #PID<0.756.0>, #PID<0.714.0> lose proxy #PID<0.757.0>
10:05:03.428 [info] #PID<0.714.0> allowed on proxy #PID<0.792.0>
  * test updating a product in catalog given new image for a regular variant, it adds images to the variant (1059.4ms)
10:05:04.288 [info] >>>>>>>>>>>>>>>>>>>>>>>
  * test adding a product to catalog adding a product with just one variant
10:05:04.288 [info] TEST pid: #PID<0.920.0>
10:05:04.288 [info] #PID<0.920.0> owns proxy #PID<0.921.0>
10:05:04.301 [info] #PID<0.791.0>, #PID<0.714.0> lose proxy #PID<0.792.0>
10:05:04.303 [info] #PID<0.714.0> allowed on proxy #PID<0.921.0>
  * test adding a product to catalog adding a product with just one variant (186.2ms)
  * test adding a product to catalog adding a product with many variants
10:05:04.475 [info] >>>>>>>>>>>>>>>>>>>>>>>
10:05:04.475 [info] TEST pid: #PID<0.958.0>
10:05:04.475 [info] #PID<0.958.0> owns proxy #PID<0.959.0>
10:05:04.492 [info] #PID<0.920.0>, #PID<0.714.0> lose proxy #PID<0.921.0>
10:05:04.511 [info] Router terminated: self: #PID<0.714.0> : {{:shutdown, "owner #PID<0.920.0> exited with: shutdown"}, {GenServer, :call, [#PID<0.921.0>, {:checkout, #Reference<0.0.2.22672>, true, 125000}, 125000]}} state: nil
10:05:04.511 [info] Subscribe #PID<0.960.0> / Elixir.MyApp.Router to test_channel
  * test adding a product to catalog adding a product with many variants (36.7ms)
10:05:04.511 [info] >>>>>>>>>>>>>>>>>>>>>>>
10:05:04.511 [info] TEST pid: #PID<0.961.0>
10:05:04.512 [info] #PID<0.961.0> owns proxy #PID<0.962.0>
10:05:04.512 [info] #PID<0.958.0> lose proxy #PID<0.959.0>

  1) test adding a product to catalog adding a product with many variants (MyApp.ProductHandlerTest)
     test/product_handler_test.exs:17
     ** (exit) exited in: GenServer.call(MyApp.Router, %MyApp.ProductGroupAdded{payload: %MyApp.ProductGroup{__meta__: #Ecto.Schema.Metadata<:built, "product_groups">, buyable_products: [%MyApp.BuyableProduct{__meta__: #Ecto.Schema.Metadata<:built, "buyable_products">, created_at: nil, fields: [%{"name" => "price", "value" => "10.00"}, %{"name" => "msrp", "value" => "12.00"}, %{"name" => "item_length", "value" => "5.0"}, %{"name" => "item_width", "value" => "6.0"}, %{"name" => "item_height", "value" => "4.0"}, %{"name" => "item_weight", "value" => "2.0"}, %{"name" => "shipping_cost_standard", "value" => "5.95"}, %{"name" => "shipping_cost_expedited", "value" => "9.95"}, %{"name" => "size", "value" => "Large"}, %{"name" => "style", "value" => "Chevron"}], listing_status: nil, product_group_seller_sku: nil, quantity: nil, seller_id: ":slug:", seller_sku: ":buyable-product-sku-1:", updated_at: nil}, %MyApp.BuyableProduct{__meta__: #Ecto.Schema.Metadata<:built, "buyable_products">, created_at: nil, fields: [%{"name" => "price", "value" => "12.00"}, %{"name" => "msrp", "value" => "14.00"}, %{"name" => "item_length", "value" => "7.0"}, %{"name" => "item_width", "value" => "5.0"}, %{"name" => "item_height", "value" => "6.0"}, %{"name" => "item_weight", "value" => "4.0"}, %{"name" => "shipping_cost_standard", "value" => "5.95"}, %{"name" => "shipping_cost_expedited", "value" => "9.95"}, %{"name" => "size", "value" => "Small"}, %{"name" => "style", "value" => "Chevron"}, %{"name" => "image2", "value" => "https://dummyimage.com/10/09f/fff-2.png"}, %{"name" => "image3", "value" => "https://dummyimage.com/10/09f/fff-3.png"}, %{"name" => "image4", "value" => "https://dummyimage.com/10/09f/fff-4.png"}, %{"name" => "image5", "value" => "https://dummyimage.com/10/09f/fff-5.png"}], listing_status: nil, product_group_seller_sku: nil, quantity: nil, seller_id: 1, seller_sku: ":buyable-product-sku-2:", updated_at: nil}], created_at: nil, fields: [%{"name" => "item_name", "value" => 
":new-name:"}, %{"name" => "item_description", "value" => ":description:"}, %{"name" => "brand", "value" => ":brand:"}, %{"name" => "gender", "value" => "Male"}, %{"name" => "category", "value" => ":category-name:"}, %{"name" => "prop_65", "value" => "true"}, %{"name" => "image1", "value" => "https://dummyimage.com/10/09f/fff.png"}], seller_id: ":slug:", seller_sku: ":master-sku:", updated_at: nil}}, 5000)
         ** (EXIT) exited in: GenServer.call(#PID<0.921.0>, {:checkout, #Reference<0.0.2.22672>, true, 125000}, 125000)
             ** (EXIT) shutdown: "owner #PID<0.920.0> exited with: shutdown"
     stacktrace:

Other possibly helpful information:

  • this occurs when running a small suite of 5 tests
  • we’re running MySQL
  • our tests are configured with Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})
  • tests are run synchronously
  • no issues when calling functions directly, i.e. not going via the Router
  • it appears that setting the repo config to pool_size: 1 resolves the issue
  • interestingly, this setting was removed from the generators: https://github.com/elixir-ecto/ecto/issues/1279
  • the repo in use is defined in another OTP application
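For concreteness, a sketch of the configuration described in the bullets above (app and module names are this thread’s; adjust to your project):

```elixir
# test_helper.exs -- shared sandbox mode: the test process owns the
# connection and other processes (like the long-running Router) are
# implicitly allowed on it.
ExUnit.start()
Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})

# config/test.exs -- the pool_size workaround mentioned above:
#
#   config :my_app, MyApp.Repo,
#     pool: Ecto.Adapters.SQL.Sandbox,
#     pool_size: 1
```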

Any help on how to debug or resolve this would be much appreciated. Thank you in advance.

@frahugo has created an application demonstrating this issue.

Thinking about this further, I can focus this issue down to one question:

In a test suite, should we be able to make multiple synchronous calls (i.e. handle_call) to a GenServer that queries a database?

Thanks again.

If you do so, you will need to disallow the GenServer from using the test connection at the end of your tests. But that will still have a race condition because the GenServer can still try to perform a query, except this time it won’t have a connection to use. So I believe the best option is to actually kill the GenServer at the end of your tests and guarantee proper shutdown from there. This will become slightly easier in Elixir v1.5 as we will allow processes to be associated to ExUnit tests.
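A sketch of that approach in an ExUnit case (hypothetical setup; it assumes the Router is a permanent child of its supervisor, so it is restarted before the next test runs):

```elixir
defmodule MyApp.StatusHandlerTest do
  use ExUnit.Case, async: false

  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(MyApp.Repo)
    Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})

    on_exit(fn ->
      # Stop the Router so it cannot attempt a checkout against the
      # connection of a test process that has already exited. The
      # supervisor brings it back up for the next test.
      case Process.whereis(MyApp.Router) do
        nil -> :ok
        pid -> GenServer.stop(pid, :shutdown)
      end
    end)

    :ok
  end
end
```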

If you cannot properly kill or shut down the GenServer, because it is named or because it is supervised, then it is global state, so db_connection is rather telling you that you have global state and therefore concurrent tests won’t work.

Thank you for the help, @josevalim.

I’m not sure I understood correctly, but I thought that specifying async: false would stop tests from running concurrently? Still, I get the same bug as described here (the test cases all still pass; it’s just that the error messages appear in the console, and when I run only a single test, everything’s fine).

I’ve tried different proposed solutions such as

  • Setting Ecto.Adapters.SQL.Sandbox.mode(App.Repo, {:shared, self()}) in test_helper.exs
  • Writing
          parent = self()
          Task.start(fn ->
            Ecto.Adapters.SQL.Sandbox.allow(Repo, parent, self())
            apply(mod, func, args)
          end)
    
  • Using Elixir 1.8.0 and DBConnection 2.0.5 (it did fix another bug with tests, though)

But none of them worked.

I wonder if it has something to do with the usage of Task.start.

I was trying to adapt the code posted by Chris McCord at https://stackoverflow.com/questions/33934029/how-to-detect-if-a-user-left-a-phoenix-channel-due-to-a-network-disconnect. I wonder if that code itself is a bit outdated.

I’ve found the immediate cause of the problem, but I’m still not sure of the underlying reason: in this example, the supervisor triggers a callback whenever a channel is closed. Apparently, whenever a test ends, the channel associated with that test is closed, so the callback gets triggered for every test case during testing, and this leads to the errors. I perform some DB actions in the callback; I’m not sure why the errors occur, though.

My current workaround:

  def handle_info({:EXIT, pid, reason}, state) do
    # Whenever a test ends, the following path matches.
    # When it doesn't match, a genuine exit happened in the production
    # environment, so the `else` clause holds our real code.
    with :test <- Application.get_env(:app, :environment),
         :shutdown <- reason do
      {:noreply, drop_participant(state, pid)}
    else
      _ ->
        # Real code goes here; it must return a {:noreply, state} tuple.
        {:noreply, state}
    end
  end

It occurs for the reason mentioned above: this is triggered when the test finishes and since the test process owns the connection, when the channel gets to run this code there is no connection. You can fix this by explicitly closing the channel on each test.
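A sketch of closing the channel explicitly inside the test, so the EXIT handling above runs while the test process (and therefore the sandbox connection it owns) is still alive. Module and topic names are hypothetical; `MyAppWeb.ChannelCase` stands for the usual generated case template that imports the Phoenix.ChannelTest helpers:

```elixir
defmodule MyAppWeb.RoomChannelTest do
  use MyAppWeb.ChannelCase, async: false

  test "cleans up the participant when the channel closes", %{socket: socket} do
    {:ok, _reply, socket} = subscribe_and_join(socket, "room:lobby", %{})

    # ... exercise the channel ...

    # Explicitly leave before the test process exits, so the
    # {:EXIT, pid, :shutdown} callback that touches the database
    # still has a connection to use.
    Process.unlink(socket.channel_pid)
    ref = leave(socket)
    assert_reply ref, :ok
  end
end
```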

You have to be really careful with code such as the above, because the handle_info callback is not guaranteed to run, for example if the node suddenly shuts down for some reason.
