How to detect wich process is getting unexpected messages?

Fl4m3Ph03n1x · June 14, 2019, 2:52pm

Background

We have an app the keeps running. This is a OTP application that has other OTP applications as dependencies. It uses GenServers and Supervisors as do many of its dependencies.

Problem

The app doesn’t crash, but it constantly prints the following error:

=ERROR REPORT==== 14-Jun-2019::14:31:35.089496 ===
Unexpected message: {#Ref<0.2607974300.3493068808.175359>,badarg}

Actions taken

To find out where the issue could be comming from we added the following code to every GenServer and Supervisor we could find (even going into the deps folders and adding it there as well):

  def handle_info(msg, state) do
    Logger.error("Unhandled message -> module: #{__MODULE__}, msg: #{inspect msg}, state: #{inspect state}")
    {:noreply, state}
  end

The idea here is to identify which module is having issues so we can fix it. However, nothing we do seems to be revealing the issue. The message keeps appearing.

Question

A few weeks ago @sasajuric posted a talk named The soul of erlang and elixir . The talk was interesting but what truly astonished me was the ease with which he could find critical pieces of code from just a pid or a reference.

I was wondering if someone could help me implement his strategies so we can find out where this error is coming from. This reference must be useful for something after all. I just don’t know how to use it.

Given this information, how can I find out which module has the failing code?

NobbZ · June 14, 2019, 3:02pm

After you changed your dependencies sources, have you made sure to recompile them by deleting _build?

This could also be some task returning it’s results, it had a similar shape as far as I remember.

Fl4m3Ph03n1x · June 14, 2019, 3:05pm

Yes. Always

Is there a way to know what exactly this reference is? It could be a reference to an ETS table, to a Stream, to anything. I suspect it may be connected to an ETS table because of the badarg reason, but I don’t know of any way to disprove or prove my theory.

Is there a function in ETS that allows me to know if a given reference is pointing to a table?

NobbZ · June 14, 2019, 3:10pm

A reference is just a random value, guaranteed to be unique among all connected node when it gets created. It does not know anything about what it is used for.

peerreynders · June 14, 2019, 3:20pm

I’d start looking at the logger and meta data configuration to capture additional information.

Before:

11:58:49.837 [error] Unexpected message: :ping

After:

config :logger, :console,
  format: "\n$time $metadata[$level] $levelpad$message\n",
  metadata: [:application, :module, :function, :pid, :registered_name, :file, :line]

2:03:19.400 application=demo module=Demo.Server function=handle_info/2 pid=<0.166.0> file=lib/demo/server.ex line=16 [error] Unexpected message: :ping

Not all loggers support this but it could be worthwhile retrofitting this.

tty · June 14, 2019, 4:34pm

The 4 major correlation points from error logs are:

file and line number. By default SASL error logs give you the offending module and line number.
registered pids. As registered names are unique to a node or application (global registration) this makes it easy to pin point the offending process. However not all pids are registered.
process state. All processes have a state and this is the most reliable correlator but can be a complicated for larger state structures.
process parent. All processes have a parent process and checking ancestry can help. This is especially true if started via a supervisor.

2-4 would be process specific, although 1 is usually sufficient to point to the error.

sasajuric · June 14, 2019, 5:33pm

Warning: what I’m about to show is not production safe! Try it out on local development or on some staging server.

You could use Erlang tracing to find out the processes sending and receiving these messages. Here’s a simple sketch using low-level Erlang tracing functions (code is copy-pastable to iex).

Suppose we have two following modules which are powering processes simulating your situation:

defmodule Receiver do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def init(_arg), do: {:ok, nil}
end

defmodule Sender do
  use Task

  def start_link(_arg), do: Task.start_link(&loop/0)

  defp loop() do
    Process.sleep(2000)
    send(Process.whereis(Receiver), {make_ref(), :bad_arg})
    loop()
  end
end

You could add another process which traces {_, :badarg} messages sent from any process. Here’s a very simple version based on Erlang tracing BIFs:

defmodule Tracer do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil)

  def init(_) do
    :erlang.trace_pattern(:send, [{[:_, {:_, :bad_arg}], [], []}], [])
    :erlang.trace(:processes, true, [:send])
    {:ok, nil}
  end

  def handle_info({:trace, sender, :send, {_ref, :bad_arg}, receiver}, state) do
    IO.puts([
      "sender info:\n#{inspect(Process.info(sender), pretty: true)}\n\n",
      "receiver info:\n#{inspect(Process.info(receiver), pretty: true)}"
    ])

    {:noreply, state}
  end
end

Now we can start all these processes:

Supervisor.start_link(
  [Tracer, Sender, Receiver],
  strategy: :one_for_one
)

The output (snipped a bunch of noise):

19:19:27.949 [error] Receiver Receiver received unexpected message in handle_info/2: {#Reference<0.3363507600.1371275267.244393>, :bad_arg}

sender info:
[
  ...
  dictionary: [
    ...
    "$initial_call": {Sender, :"-start_link/1-fun-0-", 0},
    ...
  ],
  ...
]

receiver info:
[
  registered_name: Receiver,
  ...
  dictionary: [
    "$initial_call": {Receiver, :init, 1},
    ...
  ],
  ...
]

Depending on your particular situation, you might (but not necessarily will) be able to pinpoint the modules powering these processes. AFAIK, it’s not possible to get the stack trace, so you’ll need to analyze the code. But hopefully, the info above might help you track down the problematic process/module.

For production-safe tracing, check out recon_trace, and redbug (there’s also Elixir wrapper rexbug). You might also want to read this recent post on tracing in Elixir.

Fl4m3Ph03n1x · June 17, 2019, 10:52am

First of all, thank you so much for your reply. It is always a huge pleasure reading your answers and comments as they are full of insight.

I did try out your Tracer module, but unfortunately it didn’t catch the culprit. Does the Tracer module need to be at the same level of the other process in the process tree for it to catch them?

For example, if I have the Tracer module in my app’s supervisor, and the error message is being received 10 levels down the process tree inside a process from a dependency, from another dependency, will Tracer be able to view it?

I am not versed in erlang nor in tracing tools at all, this is all new territory, so I apologize if my question is senseless.

I am also trying recon and I am reading the post on erlang solutions you suggested. Can’t wait to learn more!

sasajuric · June 17, 2019, 11:08am

In theory it should work, regardless of the supervision tree shape. You could try using :all instead of :processes. Another option is to trace receives, instead of sends. In that case you should replace :send with :receive. For more info, check out the docs of trace and trace_pattern.

You can try both approaches on my simple simulation, and once they work, you can move to do the same in your project.

Other than that, I don’t have much to advise ATM.

jweinkam · February 28, 2020, 5:08pm

Fl4m3Ph03n1x,
I am also getting this

07:46:15.376 [error] Unexpected message: {#Reference<0.3581458342.1708130305.15268>, :badarg}

logged by some unknown part of code and have not been able to find out where it is coming from. I have tried the tracer and it did not give anything. I would like to know if you ever found the cause in your project and if you did what you did to be able to find it.

jweinkam · February 29, 2020, 3:57pm

By using a custom formatter for Logger, I was able to determine that in my case this is getting logged by gun. After which I was able to find https://github.com/ninenines/gun/issues/193. I updated to Erlang 22.2.8 and now am no longer getting the Unexpected message.

Fl4m3Ph03n1x · March 3, 2020, 2:32pm

gun was also the cause of the unexpected messages I was getting. Quite interesting