Retrying an operation, but how?

marcuslankenau · June 23, 2016, 9:34am

Hi there,

I have a long-running operation (about 4 seconds) that is invoked by a Phoenix controller. I need to support retries, so if the action fails after some seconds, I have to retry it. Right now I use a bad style: I throw an error when something goes wrong, catch the error at the bottom, and recursively retry (a couple of times).

That works, but it feels wrong. I think I would prefer to put the operation inside a new process, and if that process fails, start one again. I could think of something like the Task module, but with a retry option. Is there something like that?

Cheers
Marcus

kaqqao · June 23, 2016, 10:56am

I know you can start Tasks through Task.Supervisor with restart: :transient, which will effectively retry whatever the Task was doing, but I'm not sure how you’d limit the number of retries.
I got the idea from http://blog.danielberkompas.com/2016/04/05/background-jobs-in-phoenix.html

lucidstack · June 23, 2016, 11:06am

With the :max_restarts and :max_seconds options: http://elixir-lang.org/docs/stable/elixir/Supervisor.Spec.html#supervise/2

marcuslankenau · June 23, 2016, 11:11am

Thanks for the pointers. I will give it a try.

The Task.Supervisor is started in the supervision tree like any other Supervisor, right?

Cheers

marcuslankenau · June 23, 2016, 12:28pm

I tried to get a retry working but failed. I used this little example:

{:ok, sup} = Task.Supervisor.start_link(restart: :transient, max_restarts: 3)
fun = fn ->
  IO.puts "enter fun"
  :timer.sleep(500)
  raise "Bang"
end
Task.Supervisor.async_nolink(sup, fun)
|> Task.await

I would expect to see “enter fun” three times and then a failure. But I get one “enter fun” and then it fails. What am I doing wrong?

Cheers

lucidstack · June 23, 2016, 1:00pm

You have to use Task.Supervisor.start_child/2 instead of Task.Supervisor.async_nolink/2.

Swapping async_nolink out for start_child yielded 3 bangs in my console.

marcuslankenau · June 23, 2016, 1:08pm

I tried that and I got 3 bangs as well. But then I cannot await the result. I mean, I don't get back the %Task struct that I could use to await.

sasajuric · June 24, 2016, 8:13am

It is unclear how the action might fail, but if it’s something you expect might happen (for example a request to an external service), then I’d say rescuing the expected exception and retrying, perhaps with some delay, would be the way to go.

Supervisors are more appropriate to recover from unexpected bugs, and I’d say they make more sense for server processes (GenServer and friends). Such processes are more like internal services which respond to various requests. Due to some bug, they might occasionally fail, but after restarting they will probably work again.

In contrast, what you describe is more of a one-off job. It takes some input, does some processing, produces the output and stops. Hence, if there’s a bug, restarting won’t really help you because you’ll start with the same input which will lead you to the same failure.

However, as I said, there might exist some expected failures, such as database or some other external service not responding because of a brief network outage or overload of the other service. By rescuing the expected error, you can explicitly retry and even implement growing retry delays.
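As a minimal sketch of that idea: the module name `RetrySketch`, its option names (`:max_attempts`, `:base_delay`, `:expected`), and the choice of `RuntimeError` as the "expected" failure are all assumptions made for illustration, not something from this thread. Rescuing only the expected exceptions and doubling the delay between attempts could look roughly like this:

```elixir
defmodule RetrySketch do
  # Hypothetical helper for illustration; not part of Elixir or any library.
  # Calls `fun` up to `:max_attempts` times. Only exceptions whose struct is
  # listed in `:expected` are retried; anything else is re-raised untouched.
  def retry(fun, opts \\ []) do
    max_attempts = Keyword.get(opts, :max_attempts, 3)
    base_delay = Keyword.get(opts, :base_delay, 100)
    expected = Keyword.get(opts, :expected, [RuntimeError])
    attempt(fun, 1, max_attempts, base_delay, expected)
  end

  defp attempt(fun, n, max_attempts, delay, expected) do
    result =
      try do
        {:ok, fun.()}
      rescue
        error ->
          if error.__struct__ in expected do
            {:error, error}
          else
            # Unexpected bug: let it crash as usual.
            reraise error, __STACKTRACE__
          end
      end

    case result do
      {:ok, _value} = ok ->
        ok

      {:error, _error} = failure when n >= max_attempts ->
        # Out of attempts: hand the last expected error back to the caller.
        failure

      {:error, _error} ->
        # Growing retry delay: wait, then try again with twice the delay.
        Process.sleep(delay)
        attempt(fun, n + 1, max_attempts, delay * 2, expected)
    end
  end
end
```

A call such as `RetrySketch.retry(fn -> call_service() end, max_attempts: 5)` (where `call_service/0` is a placeholder for the real request) returns `{:ok, result}` on success or `{:error, exception}` once the attempts are used up, and it runs entirely in the calling process.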

It’s also unclear whether the phoenix controller needs to wait for the result of the job. If yes, then I’d just run the job in the same process. Otherwise, I’d start a Task under some supervisor and immediately return the response (e.g. status: :queued) from the controller action.
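For the "start a Task and return immediately" case, a rough sketch might look like the following. `QueueSketch.enqueue/2` is a hypothetical stand-in for the body of a controller action, and the supervisor is started inline here only to keep the example self-contained; in an application it would normally live in the supervision tree.

```elixir
defmodule QueueSketch do
  # Hypothetical stand-in for the body of a controller action.
  # `sup` is an already running Task.Supervisor; `job` is a zero-arity function.
  def enqueue(sup, job) do
    # start_child/2 is fire-and-forget: it returns as soon as the task process
    # has been started and gives back no %Task{} to await.
    {:ok, _pid} = Task.Supervisor.start_child(sup, job)
    %{status: :queued}
  end
end

{:ok, sup} = Task.Supervisor.start_link()

# The caller gets its answer right away while the job keeps running.
%{status: :queued} = QueueSketch.enqueue(sup, fn -> Process.sleep(4_000) end)
```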

marcuslankenau · June 26, 2016, 5:38pm

Sasa,
thx for taking the time for the detailed analysis. The action is using other services that fail on a regular basis. Retrying actually does help, since the failing requests do not fail because of invalid input but just because of overload or whatever.

For my main error group, which happens in about 25% of my requests, I do a manual retry. For the other reasons, which happen seldom, I let the process die and start a new one. Being a total Elixir newbie, I tried to follow what I read: don't handle exceptions, but let processes die.

Right now the service that is provided is synchronous, but I will go to async later. Maybe then it is better to go with a supervision tree.

I ended up writing a RetryTask module that does the job without using a supervisor but using more or less the interface of Task. I am not happy with writing such infrastructure code. In my opinion Task and Task.Supervisor should support retry in combination with await. But it works, and I hope to use standard components later.
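The RetryTask module itself is not shown in the thread. As a rough sketch of how such a thing could be built (the name `RetryTaskSketch` and its functions are hypothetical and not the poster's code), one option is to run the whole retry loop inside a single Task, so the caller can await it as usual:

```elixir
defmodule RetryTaskSketch do
  # Hypothetical module for illustration; NOT the RetryTask described above.
  # The retry loop runs inside one Task process, so the pid and monitor
  # reference held by the caller stay valid across attempts.
  def async(fun, max_attempts \\ 3) do
    Task.async(fn -> run(fun, 1, max_attempts) end)
  end

  def await(task, timeout \\ 5_000), do: Task.await(task, timeout)

  defp run(fun, attempt, max_attempts) do
    result =
      try do
        {:ok, fun.()}
      rescue
        # Assumption: every exception counts as retryable in this sketch.
        error -> {:error, error}
      end

    case result do
      {:error, _error} when attempt < max_attempts ->
        run(fun, attempt + 1, max_attempts)

      final ->
        # Either {:ok, value} or the error from the last allowed attempt.
        final
    end
  end
end
```

With the `fun` from the earlier example, `fun |> RetryTaskSketch.async() |> RetryTaskSketch.await()` runs the function three times and then returns `{:error, %RuntimeError{message: "Bang"}}` instead of crashing the caller. Exits and throws are not rescued here and would still take the task down.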

emerleite · August 12, 2016, 9:10pm

I’m using GenRetry. It works perfectly. https://github.com/appcues/gen_retry

mbuhot · March 2, 2017, 5:37am

This issue was raised on the Elixir GitHub project.

github.com/elixir-lang/elixir

restart: :transient does not affect Task.Supervisor.async/async_nolink

opened 06:18PM - 13 Apr 16 UTC

closed 06:46AM - 14 Apr 16 UTC

sikanhe

For task supervisors, the processes can only be restarted when using start_child… method. We cannot currently await on a task that we want to retry.

```
{:ok, tas} = Task.Supervisor.start_link(restart: :transient)
Task.Supervisor.async_nolink(tas, fn -> 1 = 2 end) # does not restart
```

I understand there is a limitation that a Task is tied to a process, so it cannot possibly know about the new process that could get started. But this was unclear in the documentation. Do you think there is a way to await on a supervised task that can be restarted on crash? If so, we should implement it, because there are many cases where this could be helpful. If not, I think we should clarify this in the docs more.

It is not possible to await on a restarted task because the monitor reference cannot be reused, and the caller cannot detect the new pid.
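A small illustration of that limitation, using only the standard Task API (the variable names are arbitrary): the struct returned to the caller carries exactly one pid and one monitor reference, and awaiting is tied to that reference.

```elixir
task = Task.async(fn -> :result end)

# The caller only ever holds this one pid and this one monitor reference.
%Task{pid: pid, ref: ref} = task
IO.inspect({pid, ref})

# Task.await/1 waits for a reply (or a :DOWN message) tagged with `ref`.
# A process restarted by a supervisor would have a different pid, and the
# caller would hold no monitor for it.
:result = Task.await(task)
```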

BogdanHabic · March 2, 2017, 8:53am

But can’t we achieve this behavior with Registry?
So basically instead of a pid, you could get a ref under which the task would register itself?