Capturing timeouts with FLAME

barrelltechgh · November 4, 2024, 7:15am

I’m using FLAME to run some code that has a tendency to fail (external libraries that are out of my control). When FLAME works, it works great; when it fails, it will normally work if I just rerun it.

I’m having two main issues:

1. Sometimes the boot time for a machine can be several minutes. I would say about half of the time it’s less than 1 minute, a quarter of the time it’s between 1 and 2 minutes, and a quarter of the time it’s between 2 and 10 minutes (I think 8 minutes is the longest I’ve seen it take).
2. If the process fails, it fails in a bunch of spectacular and varied ways. OOM, process crashed, process hangs, infinite loops… fun stuff.

What I’ve found works best is to wrap the call in a bunch of try/catch/rescue blocks, give the FLAME pool a long boot_timeout, and give it a normal timeout for the call itself.
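
As a rough sketch of that pool setup: the pool name MyApp.StanzaPool and the min/max sizes below are placeholders, not taken from the real application. The :boot_timeout and :timeout values are sized from the numbers above (boots of up to about 8 minutes, a 60-second call budget).

```elixir
# Child spec fragment for the application's supervision tree.
# MyApp.StanzaPool, min, and max are hypothetical placeholders.
children = [
  {FLAME.Pool,
   name: MyApp.StanzaPool,
   min: 0,
   max: 4,
   # Allow for the slowest observed boots (about 8 minutes) with some headroom.
   boot_timeout: :timer.minutes(10),
   # Per-call execution budget once a runner is available.
   timeout: :timer.seconds(60)}
]
```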

However, when FLAME times out the process, I get this message:

** (EXIT from #PID<0.1418.0>) shell process exited with reason: killed

And no response comes back to the parent caller.

The only way I know to resolve this is to wrap the call in my own Task, give that Task a timeout, and handle the timeout myself, but that leaves me no way to allow extra time for a slow boot.

This is all the control and error logic I have wrapped around a single call, where really all I care about is reliably returning an {:ok, val} or {:error, reason} tuple:

  @timeout :timer.seconds(60)

  def timeout(), do: @timeout

  # returns {:ok, {spans, words}} or {:error, reason}
  def spanwords(text, language) do
    lang = Lang.find(language)
    pool = to_pool(lang["stanza"])

    FLAME.call(pool, fn ->
      try do
        Producer.spanwords(text, lang["xxx"])
      catch
        :exit, reason -> {:error, {:exit, reason, :runner}}
      end
    end)
  catch
    :exit, reason -> {:error, {:exit, reason, :caller}}
  end

  def spanwords(text, language, retries) do
    res = spanwords_async(text, language)
    base_case = is_nil(retries) || retries < 1

    if spanwords_success?(res) || base_case do
      res
    else
      spanwords(text, language, retries - 1)
    end
  end

  def spanwords_async(text, language) do
    task =
      Task.Supervisor.async_nolink(Gambit.TaskSupervisor, fn ->
        spanwords(text, language)
      end)

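    # Task.await/2 exits the caller on timeout (caught below); the task itself
    # keeps running because it was started with async_nolink.
    Task.await(task, @timeout)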
  rescue
    error -> {:error, error}
  catch
    :exit, {:timeout, _} -> {:error, :timeout}
    :exit, reason -> {:error, {:exit, reason, :task}}
  end

I guess I’m really asking two questions:

1. Is there a more sophisticated way to capture a timeout from a process? I want FLAME to handle the timeout so that boot times are properly accounted for, but still be able to capture it as an {:error, :timeout} tuple.
2. Error handling is still very murky for me in Elixir. Is there a more straightforward way to just say “no matter what, return either an {:ok, val} or {:error, reason} tuple” (or a list thereof) instead of various different catches and rescues and Task wrapping etc.?
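
One possible shape for the “always a tuple” wrapper in the second question, as a minimal sketch that uses only Task and Task.Supervisor from the standard library. SafeCall and run/3 are hypothetical names, not part of FLAME or of the code above. The sketch runs the function in an unlinked task so that a crash or kill cannot take the caller down, converts raises, throws, and exits inside the task into error tuples, and uses Task.yield/2 with Task.shutdown/2 so that a timed-out task is stopped rather than left running. It does not by itself solve the boot-time question: using it around FLAME.call with a budget of boot_timeout plus timeout is an untested assumption.

```elixir
defmodule SafeCall do
  # Hypothetical helper: always returns {:ok, value} or {:error, reason}.
  def run(supervisor, fun, timeout) when is_function(fun, 0) do
    task =
      Task.Supervisor.async_nolink(supervisor, fn ->
        try do
          fun.()
        rescue
          error -> {:error, {:raised, error}}
        catch
          :throw, value -> {:error, {:throw, value}}
          :exit, reason -> {:error, {:exit, reason}}
        end
      end)

    # Task.yield/2 returns nil on timeout; Task.shutdown/2 then stops the task
    # and returns a reply if one arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, {:ok, _} = ok} -> ok
      {:ok, {:error, _} = error} -> error
      # Any other return value is wrapped so callers only see two shapes.
      {:ok, other} -> {:ok, other}
      # The task died without replying (for example it was killed).
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end
```

With a supervisor such as the Gambit.TaskSupervisor used above, a call would look like SafeCall.run(Gambit.TaskSupervisor, fn -> spanwords(text, language) end, budget), where budget is whatever total time the caller is willing to wait.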