I’m using FLAME to run some code that has a tendency to fail (external libraries that are out of my control). When FLAME works, it works great; when it fails, it will normally work if I just rerun it.
I’m having two main issues:
- Sometimes the boot time can take several minutes for a machine. I would say 1/2 of the time it’s less than 1 minute, 1/4 of the time it’s between 1 and 2 minutes, and 1/4 of the time between 2 and 10 minutes (I think 8 minutes is the longest I’ve seen it take).
- If the process fails, it fails in a bunch of spectacular and varied ways. OOM, process crashed, process hangs, infinite loops… fun stuff.
What I’ve found works best is to wrap it in a bunch of try/catch/rescue blocks, give the FLAME pool a long boot_timeout, and give the FLAME pool a standard timeout for the calls themselves.
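For reference, a pool configured along those lines might look roughly like the sketch below. The pool name `MyApp.StanzaPool` and every numeric value are placeholders chosen for illustration, not the actual configuration; `:boot_timeout` and `:timeout` are the `FLAME.Pool` options being discussed.

```elixir
# Sketch of a FLAME pool entry in the application's supervision tree.
# MyApp.StanzaPool and all values below are hypothetical placeholders.
children = [
  {FLAME.Pool,
   name: MyApp.StanzaPool,
   min: 0,
   max: 4,
   max_concurrency: 2,
   # generous allowance for slow machine boots (up to ~10 minutes observed)
   boot_timeout: :timer.minutes(10),
   # time allowed for the function itself once a runner is available
   timeout: :timer.seconds(60),
   idle_shutdown_after: :timer.seconds(30)}
]
```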
However, when FLAME times out the process, I get this message:
** (EXIT from #PID<0.1418.0>) shell process exited with reason: killed
And no response comes back to the parent caller of the process.
The only way I know how to resolve this is to wrap it in my own Task and give it a timeout and handle that, but this does not give me the ability to give it extra time for a boot timeout.
This is all the control and error logic I have wrapped around a single call, where really all I care about is reliably returning an {:ok, val} or {:error, reason} tuple:
@timeout :timer.seconds(60)
def timeout(), do: @timeout

# returns {:ok, {spans, words}} or {:error, reason}
def spanwords(text, language) do
  lang = Lang.find(language)
  pool = to_pool(lang["stanza"])

  FLAME.call(pool, fn ->
    try do
      Producer.spanwords(text, lang["xxx"])
    catch
      :exit, reason -> {:error, {:exit, reason, :runner}}
    end
  end)
catch
  :exit, reason -> {:error, {:exit, reason, :caller}}
end

def spanwords(text, language, retries) do
  res = spanwords_async(text, language)
  base_case = is_nil(retries) || retries < 1

  if spanwords_success?(res) || base_case do
    res
  else
    spanwords(text, language, retries - 1)
  end
end

def spanwords_async(text, language) do
  task =
    Task.Supervisor.async_nolink(Gambit.TaskSupervisor, fn ->
      spanwords(text, language)
    end)

  Task.await(task, @timeout)
rescue
  error -> {:error, error}
catch
  :exit, {:timeout, _} -> {:error, :timeout}
  :exit, reason -> {:error, {:exit, reason, :task}}
end
I guess I’m really asking two questions:
- Is there a more sophisticated way to capture a timeout from a process? I want FLAME to handle the timeout to properly account for boot times, but still be able to capture that as an {:error, :timeout} tuple.
- Error handling is still very murky for me in Elixir. Is there a more straightforward way to just say “no matter what, return either an {:ok, val} or {:error, reason} tuple” (or a list thereof) instead of various different catches and rescues and Task wrapping etc.?
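To make the second question concrete, the kind of catch-all wrapper being asked about could be sketched roughly as follows. `SafeCall` and `run/3` are hypothetical names, not part of FLAME or the standard library. The sketch assumes a `Task.Supervisor` is available and wraps the function's return value in `{:ok, value}`. Its timeout starts counting as soon as the task starts, so it does not address the boot-time question.

```elixir
defmodule SafeCall do
  # Hypothetical helper: run `fun` in an unlinked task under `supervisor`
  # and normalize every outcome to {:ok, value} or {:error, reason}.
  def run(supervisor, fun, timeout) when is_function(fun, 0) do
    task =
      Task.Supervisor.async_nolink(supervisor, fn ->
        try do
          {:ok, fun.()}
        rescue
          # raised exceptions
          error -> {:error, {:exception, error}}
        catch
          # throws and exits raised inside the task
          kind, reason -> {:error, {kind, reason}}
        end
      end)

    # Task.yield/2 returns nil on timeout; Task.shutdown/2 then kills the
    # task and returns a reply only if one arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      # the task died in a way it could not catch (for example, it was killed)
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end

# Usage, with a throwaway supervisor started just for the example:
{:ok, sup} = Task.Supervisor.start_link()
SafeCall.run(sup, fn -> 1 + 1 end, 1_000)
SafeCall.run(sup, fn -> exit(:oops) end, 1_000)
SafeCall.run(sup, fn -> Process.sleep(:infinity) end, 50)
```

Because the task is not linked to the caller, a runner that is killed outright shows up as an `{:error, {:exit, reason}}` value rather than taking the calling process down with it.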