How to deal with detached telemetry handlers caused by errors

DriesCruyskens · May 24, 2023, 10:00am

I am trying to handle Oban telemetry events in my specific case, but my question applies to telemetry handlers in general. During development, the telemetry handler would crash and be detached, logging the following error:

Handler "oban-logger" has failed and has been detached

The handler does not re-attach itself automatically, and the server needs to be restarted in order to continue development. This would be disastrous if it happened in production.

Is there a way to achieve similar robustness as with supervisors (auto restart on failure)? What is the correct way of handling errors in telemetry handlers? My current understanding is you have to make sure that your function 100% can’t crash, but this kind of goes against the ‘let it crash’ mantra of Erlang and Elixir.

Below is a part of my source code for context.

telemetry.ex

@impl Supervisor
  def init(_arg) do
    children = [
      {:telemetry_poller, measurements: periodic_measurements(), period: 10_000},
      {FacadeScan.ObanLogger,
       events: [[:oban, :job, :start], [:oban, :job, :stop], [:oban, :job, :exception]]}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

oban_logger.ex

@impl GenServer
  def init(events) do
    Process.flag(:trap_exit, true)

    # https://hexdocs.pm/oban/Oban.html#module-instrumentation-error-reporting-and-logging
    :telemetry.attach_many("oban-logger", events, &FacadeScan.ObanLogger.handle_event/4, [])

    {:ok, events}
  end

  @impl GenServer
  def terminate(_reason, _events) do
    # The handler was attached under the single ID "oban-logger" in init/1,
    # so it has to be detached by that same ID.
    :telemetry.detach("oban-logger")

    :ok
  end

def handle_event(
        [:oban, :job, :exception],
        measure,
        %{worker: "FacadeScan.ImageProcessor"} = meta,
        _
      ) do
    # This might crash 
  end

LostKobrakai · May 24, 2023, 10:23am

This is by design. Handler code runs inline wherever events are emitted, in the same process that emits them, which is the reason for the tight constraints around failure. The upside of this approach is that telemetry doesn’t involve message passing, which can become a problem when the event context is large or the volume of events is high. Automatically reattaching failing event handlers could be devastating to a system’s stability, so this is not available.
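
To make the inline execution concrete, here is a minimal sketch (not taken from the original post). The module name InlineDemo, the handler ID "demo-handler" and the [:demo, :event] event are made up for illustration, and Mix.install/1 is only there so the script can run on its own. It assumes the :telemetry package at version 1.x.

```elixir
# Assumption: run as a standalone script; Mix.install/1 fetches and starts :telemetry.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule InlineDemo do
  # Hypothetical handler: reports which process executed it, then raises on demand.
  def handle_event([:demo, :event], _measurements, %{reply_to: pid} = meta, _config) do
    send(pid, {:handled_in, self()})
    if meta[:crash], do: raise("boom")
    :ok
  end
end

:ok = :telemetry.attach("demo-handler", [:demo, :event], &InlineDemo.handle_event/4, nil)

# The handler runs in the process that calls execute/3, so the reported pid is our own.
:telemetry.execute([:demo, :event], %{}, %{reply_to: self()})

receive do
  {:handled_in, pid} -> IO.inspect(pid == self(), label: "ran in emitting process")
end

# A raising handler does not take the emitting process down with it;
# telemetry catches the error, logs it, and detaches the handler.
:telemetry.execute([:demo, :event], %{}, %{reply_to: self(), crash: true})
IO.inspect(:telemetry.list_handlers([:demo, :event]), label: "handlers after crash")
```

After the second call the handler list for that event should be empty, which is the same situation as the "has failed and has been detached" log line above.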

Those constraints, however, don’t prevent your handler from moving code that might fail into safer places within the system instead of running it directly in the handler callback.
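
One possible way to do that, sketched under assumptions rather than given as the definitive approach: keep the handler callback down to a single GenServer.cast/2 and do the risky work inside the supervised process. SafeLogger, the "safe-logger" handler ID, the [:demo, :job, :exception] event and the process/3 helper are all hypothetical names. The Mix.install/1 call is only there to make the script self-contained.

```elixir
# Assumption: run as a standalone script; Mix.install/1 fetches and starts :telemetry.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule SafeLogger do
  # Hypothetical module; a sketch, not the original FacadeScan.ObanLogger.
  use GenServer

  @handler_id "safe-logger"

  def start_link(events), do: GenServer.start_link(__MODULE__, events, name: __MODULE__)

  @impl GenServer
  def init(events) do
    # Trap exits so terminate/2 also runs when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    # Drop a stale attachment, if any (returns {:error, :not_found} otherwise).
    :telemetry.detach(@handler_id)
    :ok = :telemetry.attach_many(@handler_id, events, &__MODULE__.handle_event/4, nil)
    {:ok, events}
  end

  # Runs inline in the emitting process: it only forwards the event.
  # GenServer.cast/2 returns :ok even if the named process is not running.
  def handle_event(event, measurements, metadata, _config) do
    GenServer.cast(__MODULE__, {:event, event, measurements, metadata})
  end

  # Runs in the GenServer process. A crash here restarts the process under
  # its supervisor, and init/1 attaches the handler again.
  @impl GenServer
  def handle_cast({:event, event, measurements, metadata}, state) do
    process(event, measurements, metadata)
    {:noreply, state}
  end

  @impl GenServer
  def terminate(_reason, _state) do
    :telemetry.detach(@handler_id)
    :ok
  end

  # Placeholder for the work that might crash.
  defp process(_event, _measurements, %{crash: true}), do: raise("boom")
  defp process(event, _measurements, _metadata), do: IO.inspect(event, label: "processed")
end

{:ok, _sup} =
  Supervisor.start_link([{SafeLogger, [[:demo, :job, :exception]]}], strategy: :one_for_one)

# The first event crashes the GenServer, not the handler; the second one is
# handled by the restarted process.
:telemetry.execute([:demo, :job, :exception], %{}, %{crash: true})
Process.sleep(100)
:telemetry.execute([:demo, :job, :exception], %{}, %{})
Process.sleep(100)
IO.inspect(length(:telemetry.list_handlers([:demo, :job, :exception])), label: "handlers attached")
```

The trade-off is that this brings back the message passing that telemetry avoids by default. Events are copied into the GenServer's mailbox, and anything queued or emitted while the process is restarting is lost, so it fits low-volume events such as job exceptions better than hot paths.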