Confusing error when trying to start GenServer for multi-node testing

aeturnum · April 6, 2021, 1:00am

Hello,

I’m getting an odd error in a test I’m writing for what I imagined would be a simple multi-node test setup. In production we run a few different nodes, and I’m trying to write some tests to verify that the module works. Unfortunately, I ran into a problem where the GenServer (App.Cluster.Tasks) was not running on the slave nodes, so I tried to start it. That is where I ran into the odd error.

Here is the relevant part of my code (in a test function):

defmodule App.Cluster.TaskTests do
 use ExUnit.Case
 alias App.Cluster.Tasks

 setup_all do
   :net_kernel.monitor_nodes(true)
   :os.cmd('epmd -daemon')
   Node.start(:test@localhost, :shortnames) |> IO.inspect()

   children = ['child_1', 'child_2', 'child_3']

   for child <- children do
     IO.inspect(child)
     {:ok, node} = :slave.start_link(:localhost, child) |> IO.inspect()
     Node.spawn(node, Supervisor, :init, [[{App.Cluster.Tasks, {}}], strategy: :one_for_one]) |> IO.inspect()
   end

   on_exit(fn ->
     [node() | Node.list()]
     |> Enum.each(fn node -> Node.disconnect(node) end)
     :net_kernel.monitor_nodes(false)
   end)
 end

end

And here is the confusing error I get:

17:41:46.361 [error] Process #PID<37495.88.0> on node :child_1@localhost raised an exception
** (UndefinedFunctionError) function Supervisor.init/2 is undefined or private
    (elixir) Supervisor.init([{App.Cluster.Tasks, {}}], {:strategy, :one_for_one})

I get this error on each child, which of course causes any calls to the Tasks module to fail. I’m sure I’ve made some boneheaded mistake in setting this up (I’ve never worked with Nodes before), but I thought I would ask here and hope someone more experienced could spot my error.

hubertlepicki · April 6, 2021, 6:12am

Can you try this instead?

 Node.spawn(node, Supervisor, :init, [[{App.Cluster.Tasks, {}}], [strategy: :one_for_one]]) |> IO.inspect()

I.e. put [] around the second argument of init, i.e. around strategy: :one_for_one.

I suspect that, the way you wrote the list literal, it ends up passing the tuple {:strategy, :one_for_one} as the second argument instead of a keyword list of options, i.e. [strategy: :one_for_one] (which can also be written as [{:strategy, :one_for_one}]), so init/2 is not being called with the correct arguments… although the error is kind of weird for this case, so maybe I am not right. Can you try it and let me know if it helps?
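To make the difference concrete, here is a minimal sketch of how the two list literals are parsed. The :child atom is just a placeholder standing in for the real child spec; nothing here touches nodes or supervisors.

```elixir
# Trailing `key: value` pairs inside a list literal become tuple elements
# of that same list, not a separate nested keyword list.
args_without_brackets = [[:child], strategy: :one_for_one]
args_with_brackets = [[:child], [strategy: :one_for_one]]

IO.inspect(args_without_brackets)
# prints [[:child], {:strategy, :one_for_one}]

IO.inspect(args_with_brackets)
# prints [[:child], [strategy: :one_for_one]]

# Both lists have two elements, so both lead to a call to init/2,
# but only the second one passes a keyword list as the second argument.
IO.inspect(List.last(args_without_brackets) |> is_tuple())
IO.inspect(List.last(args_with_brackets) |> Keyword.keyword?())
```

This matches the first error message, where the second argument shows up as {:strategy, :one_for_one}.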

aeturnum · April 6, 2021, 6:25am

Hey @hubertlepicki, thank you for the suggestion.

Here’s an updated snippet (rest unchanged):

    :os.cmd('epmd -daemon')
    Node.start(:test@localhost, :shortnames) |> IO.inspect(label: "main")

    children = ['child_1', 'child_2', 'child_3']

    for child <- children do
      {:ok, node} = :slave.start_link(:localhost, child) |> IO.inspect(label: "start #{child}")
      Node.spawn(node, Supervisor, :init, [
        [{App.Cluster.Tasks, {}}],
        [strategy: :one_for_one]
      ]) |> IO.inspect(label: "spawn #{child}")
    end

And the result:

main: {:ok, #PID<0.999.0>}
start child_1: {:ok, :child_1@localhost}
spawn child_1: #PID<37450.88.0>

23:19:48.050 [error] Process #PID<37450.88.0> on node :child_1@localhost raised an exception
** (UndefinedFunctionError) function Supervisor.init/2 is undefined or private
    (elixir) Supervisor.init([{Distributor.Cluster.Tasks, {}}], [strategy: :one_for_one])

It seems particularly odd because the error is complaining that Supervisor.init/2 doesn’t exist. If there was a match error I’d have something to work on. Does this code somehow fail to load my modules in the other nodes?

hubertlepicki · April 6, 2021, 6:30am

Yeah, this is odd. I am also not sure this is the way to set it up in the first place, but the error is puzzling indeed…

aeturnum · April 6, 2021, 4:50pm

I don’t know exactly what I was doing wrong in the code I posted, but I found an example of how to start up other nodes for tests in an older Elixir library that let me get my test working: swarm/cluster.ex at 4aee63d83ad5ee6ee095b38b3ff93a4dbb7c3400 · bitwalker/swarm · GitHub

Hopefully this will be helpful to anyone else in the same fix.

rvirding · April 7, 2021, 5:32pm

I think the main problem is that you are using the Supervisor.init/2 function in the wrong way. If you check the docs, you will see it is intended to be used to return the supervisor spec from the init/1 callback function. That callback is called when you start a supervisor with Supervisor.start_link; you don’t explicitly spawn it yourself. This is true for all the OTP behaviours.
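As a rough sketch of that intended usage: the module names Demo.Worker and Demo.Supervisor below are made up for illustration, and the Agent-based worker just stands in for a real GenServer such as App.Cluster.Tasks.

```elixir
# Hypothetical worker; `use Agent` generates the child_spec/1 the supervisor needs.
defmodule Demo.Worker do
  use Agent

  def start_link(_arg), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
end

# Hypothetical module-based supervisor.
defmodule Demo.Supervisor do
  use Supervisor

  def start_link(arg) do
    # start_link spawns the supervisor process, which then calls init/1 below.
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    # Supervisor.init/2 only builds the return value for this callback;
    # it does not start any process by itself.
    Supervisor.init([{Demo.Worker, []}], strategy: :one_for_one)
  end
end

{:ok, sup} = Demo.Supervisor.start_link([])
IO.inspect(Supervisor.count_children(sup))
```

Calling Supervisor.init/2 directly from Node.spawn would, at best, return a spec tuple and exit without supervising anything.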

aeturnum · April 7, 2021, 6:01pm

You are correct about Supervisor.init @rvirding, but I also tried it with Supervisor.start_link/2 and got an identical error. The solution, I believe, started with calling Erlang’s :code.add_paths/1 on each node, where the list of paths comes from the master node’s :code.get_path(). The reason it couldn’t find a definition for init/2 or start_link/2 is that neither function was loaded into the child nodes.
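One way to check this kind of thing, sketched under the assumption that the target node is reachable: ask the node where it would load a module from. NodeCheck and module_available?/2 are names invented for this example; :code.which/1 and :rpc.call/4 are the standard Erlang functions.

```elixir
defmodule NodeCheck do
  # Returns true if `module` can be found on the code path of `node`.
  def module_available?(node, module) do
    case :rpc.call(node, :code, :which, [module]) do
      # The module is not on that node's code path.
      :non_existing -> false
      # The node could not be reached or the call failed.
      {:badrpc, _reason} -> false
      # A path to the .beam file, or :preloaded / :cover_compiled.
      _found -> true
    end
  end
end

# Run against the local node here so the sketch works without a cluster;
# in the test you would pass the child node's name instead.
IO.inspect(NodeCheck.module_available?(node(), Supervisor))
IO.inspect(NodeCheck.module_available?(node(), NoSuch.Module))
```

On a freshly started slave node I would expect the check for Supervisor to come back false until the code paths have been added.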

The details on that (and more setup to do) can be found in the swarm test helper I linked before: swarm/cluster.ex at 4aee63d83ad5ee6ee095b38b3ff93a4dbb7c3400 · bitwalker/swarm · GitHub

aceraxx · April 30, 2021, 12:56am

I ran into a similar issue while running tests in a similar environment.

What I think is happening is that the slave node is being started, but none of the modules from your master are loaded onto the code server for the slave. Here’s the blog post I read (in two parts) where I uncovered and figured out the solution: Intro: Slave Nodes and Remote Code Loading | by Sean Stavropoulos | Medium

Basically, you have to point the slave node to the path where the .beam files live, so it can load them into its code server. This is done using :code.add_paths/1.

Here’s what I did that seems to have solved the issue:

defmodule MyAppTest do
  use ExUnit.Case, async: true

  test "distributed test" do
    # Make the test node distributed so it can talk to the slave node.
    :net_kernel.start([:node1, :shortnames, 3000])
    {:ok, slave} = start_slave()

    # rest of the test
  end

  defp start_slave do
    {:ok, hostname} = :inet.gethostname()
    {:ok, slave} = :slave.start(hostname, 'slave')
    # Give the slave the same code path as this node so it can find the .beam files.
    :rpc.block_call(slave, :code, :add_paths, [:code.get_path()])
    {:ok, slave}
  end
end