Difficulty figuring out DynamicSupervisor :invalid_child_spec error

Hello all…

I’ve run into a problem that I can’t figure out and I was hoping to get a second set of eyes on. I imagine it’s something silly that I’m just missing due to confirmation biases or similar.

I’m trying to start a GenServer under a dynamic supervisor and when I call DynamicSupervisor.start_child/2 using a child spec, I end up with an :invalid_child_spec error. Naturally I assume something is wrong with my child spec… but I can’t see it. Let me provide some details and the steps I’ve already taken. To start, some versions: Elixir 1.16.2 with Erlang 26.2.3.

The child spec and GenServer start is part of a larger function which does some other setup/startup stuff based on options passed to the function. The child spec is generated using the following line:

settings_service_spec =
  %{
    id: opts[:childspec_id],
    start:
      {MscmpSystSettings, :start_link,
       [opts[:service_name], opts[:datastore_context_name], []]}
  }

Which in turn produces a child spec that looks like:

%{
  id: MscmpSystSettings.Runtime.DevSupport,
  start: {MscmpSystSettings, :start_link,
   [
     :"MscmpSystSettings.TestSupportService",
     {:via, Registry,
      {MscmpSystSettings.TestRegistry, "0000mt.tstspt.MscmpSystDb.ContextRole"}},
     []
   ]}
}

Eventually this ends up at DynamicSupervisor.start_child/2:

DynamicSupervisor.start_child(resolved_supervisor_name, settings_service_spec)

…and it’s the start_child/2 call that ends with an :invalid_child_spec error:

 {:invalid_child_spec,
  {{MscmpSystSettings, :start_link,
    [
      :"MscmpSystSettings.TestSupportService",
      {:via, Registry,
       {MscmpSystSettings.TestRegistry, "0000mt.tstspt.MscmpSystDb.ContextRole"}},
      []
    ]}, :permanent, 5000, :worker, [MscmpSystSettings]}}}

As far as I can tell, the child spec is structured correctly… I have an :id value, and a valid :start mfa tuple.

So part of what I’ve done is trace how this code runs through the DynamicSupervisor source. I can see which specific function in the source is producing the error, as it appears to be the only way to get an :invalid_child_spec, but what I can’t see is how the spec shown in the error message is even possible.

After a few steps in Supervisor which really don’t look like they do anything for the story here, we end up at DynamicSupervisor.validate_child/1:

defp validate_child(%{id: _, start: {mod, _, _} = start} = child) do
  restart = Map.get(child, :restart, :permanent)
  type = Map.get(child, :type, :worker)
  modules = Map.get(child, :modules, [mod])
  significant = Map.get(child, :significant, false)

  shutdown =
    case type do
      :worker -> Map.get(child, :shutdown, 5_000)
      :supervisor -> Map.get(child, :shutdown, :infinity)
    end

  validate_child(start, restart, shutdown, type, modules, significant)
end

Presumably this is where the values I didn’t provide in the submitted child spec, but which appear in the error message, came from… though interestingly the significant value doesn’t appear in the error message version of the spec. This function ends with a call to validate_child/6. However, the child spec reported in the error message didn’t have 6 values, it only had 5, and the error message apparently comes from validate_child/1 when no prior match happens:

defp validate_child(other) do
  {:invalid_child_spec, other}
end

Which wouldn’t respond to a call to validate_child/6.

And ultimately this is where my confusion steps in… I can’t see how I could get far enough along to get values I didn’t provide defaulted into the child spec given in the error message… but end up at that error message. Sure, there could be lots of other failure modes, but I can’t see where I’d be taking any different path than the one I described.

Again, I have to be missing something simple and otherwise obvious in the spec, the path through Supervisor and DynamicSupervisor that the code takes, etc.

Thanks in advance,
Steve

Really? Where is the :id in the error message?

If those other values I didn’t add really are coming from the validate_child/1 call matching the map version… I wouldn’t expect to see :id. The signature of that function is:

validate_child(%{id: _, start: {mod, _, _} = start} = child)

The :id value is explicitly ignored and I wouldn’t expect to see it in the error. In the child spec that I’m passing, the :id value is there.

But this is the issue. In reality I expect that I’m not hitting that validate_child/1 function which matches on the child spec map… but somewhere I’m getting most of the extra values which aren’t in my child spec defaulted in, and that’s the only place I see where they might come from… so it’s as though I’m hitting it… I just don’t see how that’s possible.

Interestingly, the error child spec is given as a tuple, not a map. So I’m digging right now to see how that might be possibly happening. My sense is that maybe I’m not finding all the places where :invalid_child_spec may be coming from.

My apologies… this is a silly statement on my part considering the match on child is what is used in the meat of that function… but in the end it doesn’t do anything with :id anywhere anyway… it doesn’t pass it on into the subsequent calls.

Look at the back trace. You are not passing the right child_spec and the function balked. Add logging at the right place if necessary.

I’d suggest tracing the functions involved and see if the correct values reach the places you expect them to reach.
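
A minimal sketch of this kind of tracing, using Erlang’s built-in :dbg rather than any particular tracing library. The unnamed supervisor, the :demo child spec, and the Agent worker are stand-ins for illustration, not the code from the original post. It assumes the runtime_tools application is available (a Mix project may need it listed under extra_applications).

```elixir
# Make sure the module is loaded so the trace pattern can match it.
{:module, DynamicSupervisor} = Code.ensure_loaded(DynamicSupervisor)

# Start the default tracer, which prints trace messages to the console.
{:ok, _tracer} = :dbg.tracer()

# Enable call tracing for all processes.
{:ok, _} = :dbg.p(:all, :c)

# tpl/3 also matches private (local) functions, unlike tp/3.
# The :x match spec additionally reports return values and exceptions.
{:ok, _} = :dbg.tpl(DynamicSupervisor, :validate_child, :x)

# Stand-in supervisor and child spec (hypothetical, for illustration only).
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
spec = %{id: :demo, start: {Agent, :start_link, [fn -> 0 end]}}

# The trace output shows what validate_child receives and returns.
{:ok, child} = DynamicSupervisor.start_child(sup, spec)

# Give the tracer a moment to print, then turn tracing off.
Process.sleep(100)
:dbg.stop()
```

If the trace shows the validation functions returning an {:ok, ...} tuple, the child spec itself is not the problem and the error must come from further along.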


This is peculiar - it looks like the result of validate_child/6 has somehow ended up as the input of validate_child/1 :thinking:

I don’t understand how this could be happening - the only place that calls validate_child/1 is start_child, and that’s guarded to only allow in 6-tuples or maps. Passing a 5-tuple to start_child would give an ArgumentError in Supervisor.child_spec/2 before it even made it to validate_child.


Right. I assume that’s the case. But I can’t see how that original child spec is invalid. My looking through the Elixir code was an effort to figure out what I’m doing wrong in the child spec… but I can’t line up what the child spec is with the result that Elixir is producing.

Good tip. I know that my understanding of Elixir/Erlang debugging, tracing, and monitoring is underdeveloped… and this is very helpful. I wasn’t aware of this tool and I think it could be what I need, at least to get moving again.

Naturally, I couldn’t understand it either… but the extrace tip from @LostKobrakai is getting me going in the right direction. What I’ve found so far is that the child spec I’m submitting appears to actually be passing the validate_child/1 (map match) and validate_child/6 successfully. I don’t appear to actually be getting to:

defp validate_child(other) do
  {:invalid_child_spec, other}
end

The error is coming from somewhere else.

The reason we see a 5-tuple is because that’s what validate_child/6 produces:

defp validate_child(start, restart, shutdown, type, modules, significant) do
  with :ok <- validate_start(start),
       :ok <- validate_restart(restart),
       :ok <- validate_shutdown(shutdown),
       :ok <- validate_type(type),
       :ok <- validate_modules(modules),
       :ok <- validate_significant(significant) do
    {:ok, {start, restart, shutdown, type, modules}}
  end
end

This ultimately gets passed to DynamicSupervisor.call/2 which in turn calls GenServer.call/3. So in the trace I see:

GenServer.call(:"MscmpSystSettings.TestSupportSupervisor", {:start_child,
 {{MscmpSystSettings, :start_link,
   [
     :"MscmpSystSettings.TestSupportService",
     {:via, Registry,
      {MscmpSystSettings.TestRegistry, "0000mt.tstspt.MscmpSystDb.ContextRole"}},
     []
   ]}, :permanent, 5000, :worker, [MscmpSystSettings]}}, :infinity)

… and that’s as far as I’ve gotten. The key point is that, at least up to this point, my child spec passes validation well enough for the GenServer.call/3 attempt to be made.

I’m not done troubleshooting it yet, but at this point I’m not stuck.

You already had good answers and they sadly have not helped – IMO the next logical step is to make a reproducible example in a singular file that’s using Mix.install?

The other place that :invalid_child_spec could come from is in :supervisor.handle_call’s :start_child clause.

The only way I could see this happening is calling DynamicSupervisor.start_child with a name or PID that’s actually running :supervisor code.
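
A small sketch of that scenario, for anyone who wants to see it happen. The name :collision_demo_sup and the Agent-based child spec are made up for illustration. A regular Supervisor is started under the name, then DynamicSupervisor.start_child/2 is called against that same name.

```elixir
# A regular Supervisor (backed by Erlang's :supervisor) owns the name.
{:ok, _sup} =
  Supervisor.start_link([], strategy: :one_for_one, name: :collision_demo_sup)

# A perfectly valid child spec (hypothetical worker, for illustration only).
spec = %{id: :demo_worker, start: {Agent, :start_link, [fn -> 0 end]}}

# DynamicSupervisor validates the map, converts it to its internal tuple form,
# and sends that to the named process. The process is running :supervisor
# code, which does not recognise the tuple and rejects it.
result = DynamicSupervisor.start_child(:collision_demo_sup, spec)
IO.inspect(result, label: "via DynamicSupervisor")

# The same spec is accepted when sent through the matching API.
{:ok, worker} = Supervisor.start_child(:collision_demo_sup, spec)
IO.inspect(worker, label: "via Supervisor")
```

The first call should come back as an {:error, {:invalid_child_spec, ...}} tuple carrying the converted spec rather than the original map, which is the same shape as the error in the original post.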


Quite the contrary. I agree I’ve gotten good answers… and they have helped me on this problem immensely. Enough so that I think I can solve the problem without additional help. This is why I mentioned I’m no longer stuck.

Benjamin’s advice regarding tracing and pointing me to that extrace utility was sufficient to break the mental block that I couldn’t get past. I assumed my child spec had to be wrong and that the place where it had to fail was in the validate_child functions of DynamicSupervisor. extrace showed me that my child spec was not failing where I was stubbornly believing it must be; the ability to trace through Elixir private functions has made the most difference. While I haven’t found the problem yet, that’s now more about my personal stamina to pursue it further today than missing the knowledge to figure it out… I’m quite sure I’ve got what I need now to solve the problem.

And while Matt’s answer didn’t directly get me on the right path… he at least gave me some comfort that I might not be a complete moron in my reading of the issue. Matt’s answers in this community are regular and of consistently high quality… that his reading was similar to mine gave me hope that I was being somewhat reasonable.

You’re no slouch either :slight_smile: … but this time I have to both agree and disagree with you.

If I get to the root cause, I’ll post what I find here for the record in case it might help others. I have some suspicions about what might be wrong, but I’ll wait until I’ve completed the troubleshooting.


So I figured out the issue: it was a naming collision in the supervisory tree.

I had a regular Supervisor created in an early step and given a certain name. Later, in a different module, I accidentally created a DynamicSupervisor with the same name.

This is exactly what was happening: I was calling DynamicSupervisor.start_child/2 using a name that was also associated (and associated first) with a regular Supervisor.
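
For the record, a minimal sketch of how a collision like this can go unnoticed. The name :shared_demo_name is invented for the example, and this is a guess at the general shape of the mistake rather than the actual project code. The second start_link does not crash; it returns an error tuple that is easy to ignore.

```elixir
name = :shared_demo_name

# First, a regular Supervisor registers the name.
{:ok, sup_pid} = Supervisor.start_link([], strategy: :one_for_one, name: name)

# Later, a DynamicSupervisor is started with the same name. Nothing new is
# started; the call reports the process that already holds the name.
second = DynamicSupervisor.start_link(strategy: :one_for_one, name: name)
IO.inspect(second, label: "second start_link")

# The name still resolves to the regular Supervisor.
^sup_pid = Process.whereis(name)
```

Matching on {:ok, pid} at the call site, instead of discarding the result, would turn the silent collision into an immediate MatchError.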

The extrace utility eventually got me to the code that Matt had called out, and that was where things were failing. It wasn’t clear that there was a naming collision, but after stepping back with all of the information, I traced through the whole process across modules, and with fresh eyes I saw the collision. This is all test/development support code that wouldn’t be running in production and so I’ve taken liberties and short-cuts here and there… and in this case it bit me. <sigh/>

I wish I could mark two solutions because both @LostKobrakai and @al2o3cr really contributed to getting me across the finish line with this.

Thanks to all that commented!
