Tests trigger `unknown registry: MyApp.EventStore.EventStore.PubSub` while migrating to Elixir 1.15

Hi,

I am trying to migrate an application from Elixir 1.14.5 to 1.15.7.
All my tests pass but they trigger additional errors related to commanded and eventstore:

Most of the errors are: unknown registry: MyApp.EventStore.EventStore.PubSub as shown here:

16:05:55.218 [error] GenServer {MyApp.EventStore.EventStore.Subscriptions.Registry, {"$all", "MyProcessManager"}} terminating
** (ArgumentError) unknown registry: MyApp.EventStore.EventStore.PubSub
    (elixir 1.15.7) lib/registry.ex:1391: Registry.info!/1
    (elixir 1.15.7) lib/registry.ex:1007: Registry.register/3
    (eventstore 1.4.7) lib/event_store/pubsub.ex:38: EventStore.PubSub.subscribe/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription_fsm.ex:70: EventStore.Subscriptions.SubscriptionFsm.transition/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription.ex:221: EventStore.Subscriptions.Subscription.handle_call/3
    (stdlib 4.3.1.6) gen_server.erl:1149: :gen_server.try_handle_call/4
    (stdlib 4.3.1.6) gen_server.erl:1178: :gen_server.handle_msg/6
    (stdlib 4.3.1.6) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.16668.1>): {:connect, #PID<0.16668.1>, [retry_interval: 60000, hibernate_after: 15000, partition_by: nil, concurrency_limit: 1, name: MyApp.EventStore, mapper: &Commanded.EventStore.Adapters.EventStore.Mapper.from_recorded_event/1, conn: MyApp.EventStore.Postgrex, event_store: MyApp.EventStore, query_timeout: 15000, schema: "public", serializer: Commanded.Serialization.JsonSerializer, stream_uuid: "$all", subscription_name: "MyProcessManager", start_from: 0]}
domain=otp file=gen_server.erl line=1402 mfa=:gen_server.error_info/8 pid=<0.16687.1>

Less often I also have: 1st argument: the table identifier does not refer to an existing ETS table as shown here:

16:05:56.398 [error] GenServer {MyApp.EventStore.EventStore.Subscriptions.Registry, {"$all", "MyProcessManager"}} terminating
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

    (stdlib 4.3.1.6) :ets.insert(#Reference<0.4090088613.484835331.67685>, {#PID<0.17183.1>, "$all", #Reference<0.4090088613.484835331.67684>, -576460752299835900})
    (elixir 1.15.7) lib/registry.ex:1018: Registry.register/3
    (eventstore 1.4.7) lib/event_store/pubsub.ex:38: EventStore.PubSub.subscribe/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription_fsm.ex:70: EventStore.Subscriptions.SubscriptionFsm.transition/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription.ex:221: EventStore.Subscriptions.Subscription.handle_call/3
    (stdlib 4.3.1.6) gen_server.erl:1149: :gen_server.try_handle_call/4
    (stdlib 4.3.1.6) gen_server.erl:1178: :gen_server.handle_msg/6
    (stdlib 4.3.1.6) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.17166.1>): {:connect, #PID<0.17166.1>, [retry_interval: 60000, hibernate_after: 15000, partition_by: nil, concurrency_limit: 1, name: MyApp.EventStore, mapper: &Commanded.EventStore.Adapters.EventStore.Mapper.from_recorded_event/1, conn: MyApp.EventStore.Postgrex, event_store: MyApp.EventStore, query_timeout: 15000, schema: "public", serializer: Commanded.Serialization.JsonSerializer, stream_uuid: "$all", subscription_name: "MyProcessManager", start_from: 0]}
domain=otp file=gen_server.erl line=1402 mfa=:gen_server.error_info/8 pid=<0.17183.1>

The 2 error stacks have a lot in common. Somehow eventstore is trying to do something with what appears to be a registry linked to a dead process.

I have a lot of tests that run in parallel. If I try to run 1 test file at the time, no error occur. But if I run 2 test files in parallel, error sometime occur. It is difficult to reproduce because it is not predictable.

I do not know how to troubleshoot this. I have re-read the Commanded documentation in particular regarding Application and Process managers and I believe I do what the documentation recommends.

Any idea? @slashdotdash perhaps?

For information, the issue appears from Elixir version 1.15.0.

At this point the only advice I could give you is to create a small git repository that reproduces the issue. I’d think people would be more willing to help.

Why are you upgrading to that specific version and what happens if you go higher, for instance latest version, or 1.16? Also what about erlang version(s)?

1 Like

The issue is not reproducible in a deterministic way so it might be difficult to create such repository. I will try anyway.

I was just updating my app one version at a time but I went on and noticed the issue also happens with:

  • Erlang 25 and Elixir 1.15.0
  • Erlang 25 and Elixir 1.15.7
  • Erlang 25 and Elixir 1.16.3
  • Erlang 25 and Elixir 1.17.3
  • Erlang 25 and Elixir 1.18.1
  • Erlang 27 and Elixir 1.18.1

I did a little digging. There is this stackoverflow question with similar issue.

Turns out the issue was in my children array. Since application order DOES MATTER, my PubSub child should be before Endpoint:

But I don’t see why would version bump be related to this. :thinking:
Also somebody in the comment’s comment referenced this elixirforum thread
where they resolve issues with phoenix changes.

So if your app has phoenix consider bumping that as well and research what changes were made. Just an idea. :bug:

Thank you but this is not helpful.

My children array is in order.

Few days ago when I posted the issue, all my elixir dependencies were up-to-date (phoenix 1.7.19).

You are right. Perhaps this then?
Do you have start_from: :current?

Again thank you very much, but I do not have :start_from specified. The issue you linked seems like the same as mine.

Okay, maybe try out some configurations of that option and see if anything changes.
Other than that it’s what @dimitarvp mentioned: make a repo that reproduces this issue, it’s the best way for help in debugging.

Adding start_from: :current made the problem more reproducible but I could not find a solution.

I built a short repository that reproduces the issue when running mix test. On my side, error messages show up more than 50% of the time I run mix test.

The repo is available at: GitHub - datafoo/elixir-1.15-registry-issue

Thank you for your help so far.

1 Like

Extra advice: include a .tool-versions file in the repo, and put some little README.md file to describe what is this repository reproducing as a problem. A screenshot or copied logging output should orient any visitor.

1 Like

I have managed to drastically improve the reproducibility of the issue by creating tests that execute very fast:

defmodule MyApp.Facility.EventStoreRegistryTest do
  use MyApp.CqrsCase

  describe "fast test" do
    @tag :integration
    test "1" do
      assert true
    end

    @tag :integration
    test "2" do
      assert true
    end

    @tag :integration
    test "3" do
      assert true
    end
  end
end

Here are the results:

# mix test
.15:57:48.962 [error] GenServer {MyApp.EventStore.EventStore.Subscriptions.Registry, {"$all", "CascadeSelectorMarkedAsNonexistent"}} terminating
** (ArgumentError) unknown registry: MyApp.EventStore.EventStore.PubSub
    (elixir 1.15.7) lib/registry.ex:1391: Registry.info!/1
    (elixir 1.15.7) lib/registry.ex:1007: Registry.register/3
    (eventstore 1.4.7) lib/event_store/pubsub.ex:38: EventStore.PubSub.subscribe/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription_fsm.ex:70: EventStore.Subscriptions.SubscriptionFsm.transition/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription.ex:221: EventStore.Subscriptions.Subscription.handle_call/3
    (stdlib 4.3.1.6) gen_server.erl:1149: :gen_server.try_handle_call/4
    (stdlib 4.3.1.6) gen_server.erl:1178: :gen_server.handle_msg/6
    (stdlib 4.3.1.6) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.367.0>): {:connect, #PID<0.367.0>, [retry_interval: 60000, hibernate_after: 15000, partition_by: nil, concurrency_limit: 1, name: MyApp.EventStore, mapper: &Commanded.EventStore.Adapters.EventStore.Mapper.from_recorded_event/1, conn: MyApp.EventStore.Postgrex, event_store: MyApp.EventStore, query_timeout: 15000, schema: "public", serializer: Commanded.Serialization.JsonSerializer, stream_uuid: "$all", subscription_name: "CascadeSelectorMarkedAsNonexistent", start_from: 0]}
domain=otp file=gen_server.erl line=1402 mfa=:gen_server.error_info/8 pid=<0.368.0> 

.15:57:49.090 [error] GenServer {MyApp.EventStore.EventStore.Subscriptions.Registry, {"$all", "CascadeSelectorMarkedAsNonexistent"}} terminating
** (ArgumentError) unknown registry: MyApp.EventStore.EventStore.PubSub
    (elixir 1.15.7) lib/registry.ex:1391: Registry.info!/1
    (elixir 1.15.7) lib/registry.ex:1007: Registry.register/3
    (eventstore 1.4.7) lib/event_store/pubsub.ex:38: EventStore.PubSub.subscribe/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription_fsm.ex:70: EventStore.Subscriptions.SubscriptionFsm.transition/3
    (eventstore 1.4.7) lib/event_store/subscriptions/subscription.ex:221: EventStore.Subscriptions.Subscription.handle_call/3
    (stdlib 4.3.1.6) gen_server.erl:1149: :gen_server.try_handle_call/4
    (stdlib 4.3.1.6) gen_server.erl:1178: :gen_server.handle_msg/6
    (stdlib 4.3.1.6) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.428.0>): {:connect, #PID<0.428.0>, [retry_interval: 60000, hibernate_after: 15000, partition_by: nil, concurrency_limit: 1, name: MyApp.EventStore, mapper: &Commanded.EventStore.Adapters.EventStore.Mapper.from_recorded_event/1, conn: MyApp.EventStore.Postgrex, event_store: MyApp.EventStore, query_timeout: 15000, schema: "public", serializer: Commanded.Serialization.JsonSerializer, stream_uuid: "$all", subscription_name: "CascadeSelectorMarkedAsNonexistent", start_from: 0]}
domain=otp file=gen_server.erl line=1402 mfa=:gen_server.error_info/8 pid=<0.430.0> 

.
Finished in 0.4 seconds (0.00s async, 0.4s sync)
3 tests, 0 failures

Randomized with seed 521679

What is relevant is in MyApp.CqrsCase which follows the documentation:

defmodule MyApp.CqrsCase do
  @moduledoc false

  use ExUnit.CaseTemplate

  using do
    quote do
      import Ecto
      import Ecto.Changeset
      import Ecto.Query
    end
  end

  setup do
    {:ok, _} = Application.ensure_all_started(:myapp)

    on_exit(fn ->
      # Uncomment to prevent `unknown registry: MyApp.EventStore.EventStore.PubSub` error messages
      # :timer.sleep(100)

      :ok = Application.stop(:myapp)

      MyApp.Storage.reset!()
    end)

    :ok
  end
end

Somehow, since Elixir 1.15, things are going too fast and slowing down the test cleanup by adding :timer.sleep(100) right before :ok = Application.stop(:myapp) make the error messages disappear.

The test results are predictable. There is exactly a total of N-1 error messages with N being the amount of tests. And it is the first test which does not trigger an error message.

Interesting,

Elixir v1.15 is a smaller release with focused improvements on compilation and boot times.

I also see commanded uses elixir 1.12

I believe Elixir 1.12 is just the minimum version required.

@slashdotdash @drteeth Do you know what could trigger this issue?

oh my bad :sweat_smile:

Could you submit your test case and setup in an issue on github and I’ll take a look?

1 Like