Best practices on exception handling and tracking (let it crash vs. try catch tradeoffs)

I have some code that allows me to run a series of functions. I currently surround the execution itself by a try-catch because I don’t want one execution crashing to cause all future executions to crash.

defmodule ES.Queue do
  use GenServer

  def start_link(base_opts, opts \\ []) do
    defaults = [log_success: &noop/1, log_failure: &noop/1, timeout: 10_000]

    {queue_opts, genserver_opts} =
      defaults
      |> Keyword.merge(base_opts)
      |> Keyword.merge(opts)
      |> Keyword.split([:queue_name, :timeout, :log_success, :log_failure])

    GenServer.start_link(__MODULE__, Map.new(queue_opts), genserver_opts)
  end

  def run(pid, func), do: GenServer.call(pid, {:run, func}, 60_000)

  def init(%{timeout: timeout}=state) do
    {:ok, state, timeout}
  end

  def handle_call({:run, func}, _from, %{queue_name: queue_name, timeout: timeout, log_success: log_success, log_failure: log_failure}=state) do
    resp =
      try do
        t1 = :erlang.monotonic_time()
        val = execute(func)
        t2 = :erlang.monotonic_time()
        ms_taken = :erlang.convert_time_unit(t2 - t1, :native, :millisecond)
        log_success.({queue_name, func, ms_taken})
        val
      catch
        type, error ->
          log_failure.({type, error, System.stacktrace()})
      end
    {:reply, resp, state, timeout}
  end
  def handle_call(:inspect, _from, %{timeout: timeout}=state) do
    {:reply, Kernel.inspect(state), state, timeout}
  end

  def execute(func) when is_function(func), do: func.()
  def execute({mod, func}), do: apply(mod, func, [])
  def execute({mod, func, args}), do: apply(mod, func, args)

  def noop(_), do: nil

  def handle_info(:timeout, state) do
    {:stop, :normal, state}
  end

end

However there are a few problems with this approach:

  • When an exception occurs during a mix test, the errors get swallowed. It becomes much harder to track down the error vs. when I remove the try catch.
  • If this code is in production, exceptions won’t get logged out or sent to an error reporting service like Rollbar

Basically, I still want exceptions to be loud. I still want the red error with the stacktrace to be printed out to the logs when I am running tests. I still want errors to get reported to an error reporting service while the app is running in production. What are some good options for this?

Thanks!

If this is a supervised GenServer, it would be restarted if it crashed for any reason. You are also only doing casts, which means it is all done synchronously (the caller waits for the result), meaning that you won’t lose anything if the GenServer gets restarted.

Also, without knowing your requirements, I’m not really sure what this buys you over just a normal module/function interface. If you have a lot of processes trying to use this GenServer at the same time, it will actually be a bottleneck. Especially if it gets used for longer tasks/calculations.

Thanks for the feedback.

The reason for the GenServer is to guarantee serial execution. For example, we have an accounting/billing service where things must be processed FIFO. If I fire off 5 things, executing them in parallel would result in incorrect data.

The GenServer call time in our situation is small compared to the overall time. For example, a typical execution might take 10ms vs. the cost of a call/cast.

The GenServer is supervised, but if it’s restarted, wouldn’t I lose mailbox the messages?

How about doing the execution in another process? It is not uncommon to do this for things that can fail. Your GenServer spawns and waits for the result from the execution. It also monitors the newly spawned process for failures. The execute are still synchronized, the monitoring GenServer will not fail and continues with the other cases even if one fails.

I’m not sure what sort of error logging you are looking for when something crashes but perhaps if the execution is done according to OTP standards (either a GenServer or a process started with proc_lib) I think you may get SASL logging which Logger may pick up.

3 Likes

Thanks @cmkarlsson I really like this idea. I’ll look more into this approach. Or perhaps there’s a way for my process to log the error directly in the catch clause?

I wasn’t aware that the Logger was responsible for capturing errors.

It is the OTP SASL application that takes care of logging crashed OTP processes. Logger can capture these and output in elixir format. You have to configure it to handle_otp_reports and handle_sasl_reports for this to happen though. These options redirects the erlang logger to Logger.

Well you should always be able to log them but they will pass the testing though as you are catching the errors.

Your test should check for a failed execution. The resp from your handle_call could be {:ok, val} or {:error, reason} then your test will fail if execute fails.

Your normal logging through something like Logger should handle this? I assume error reporting is just a configurable Logger handle so if you do something like Logger.error shouldn’t that still work?

You can log the error directly like this:

try do
  # ...
catch
  type, error ->
    Logger.error(Exception.format(type, error, __STACKTRACE__))
end

That said, I agree with @cmkarlsson’s proposal. If you want to ensure that “execution” doesn’t take the GenServer down, doing it in a separate process would give you the strongest guarantees of that.

2 Likes

Thanks I will go the route of isolating each execution into its own process. Not only do the guarantees make sense, but it also does look like it will plays well with the rest of the system. A real exception is actually raised so it will get tracked and printed out properly.

I am having trouble figuring out which OTP building blocks to use. As I understand GenServer.call and Task.await will both bring down the calling process.

I was tempted to use trap_exit but then read the following in the Task.async docs:

Setting :trap_exit to true - trapping exits should be used only in special circumstances as it would make your process immune to not only exits from the task but from any other processes.

Do I need to use something like GenServer.cast and reply manually? Or is it acceptable in this case to set trap_exit to true?