Failure in an async task does not properly restart application

I have two applications in an umbrella project:

apps
   source
   web

Here, web is a web server built using Plug and Cowboy and source is a collection of “data-sources” each with a fetch function.

The source application has a simple cache based on a single process that writes to and reads from an ETS table. Additionally, each source is a GenServer data-source process that serialises operations/writes.
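For context, a minimal sketch of the kind of cache described above. The module name CacheSketch, the table name and the get/put API are hypothetical, not the actual code. Note that an ETS table is destroyed when its owning process terminates, so the cached data lives and dies with this process.

```elixir
defmodule CacheSketch do
  use GenServer

  # Hypothetical table name, for illustration only.
  @table :cache_sketch

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  # Both reads and writes go through the single owning process,
  # so all operations on the table are serialised.
  def get(key), do: GenServer.call(__MODULE__, {:get, key})
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  @impl true
  def init(:ok) do
    # The table is owned by this process; if the process dies,
    # the table and everything in it is deleted.
    table = :ets.new(@table, [:set, :protected])
    {:ok, table}
  end

  @impl true
  def handle_call({:get, key}, _from, table) do
    reply =
      case :ets.lookup(table, key) do
        [{^key, value}] -> {:ok, value}
        [] -> :error
      end

    {:reply, reply, table}
  end

  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(table, {key, value})
    {:reply, :ok, table}
  end
end
```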

Inside the web application, I want to asynchronously call fetch on each of the data-source processes and merge their results.

Most importantly, I want my application to restart itself if there is a catastrophic failure (i.e. I have already accounted for the normal error flow).

Each application has a supervisor, and the workers are simple enough to be trivially restarted.

However, when I fetch each of the sources like this (the code is in the web application):

@data_sources
|> Enum.map(&Task.async(fn -> &1.fetch(member_id) end))
|> List.foldl(%{}, &Map.merge(&2, Task.await(&1)))
|> Poison.encode!

and spoof an error inside one of the data sources (i.e. raise "error"), the whole application crashes and does not restart as I expect it to. That is, the web server restarts but the source application does not, and I get errors that look like this:

Task #PID<0.360.0> started from #PID<0.357.0> terminating
** (stop) exited in: GenServer.call(DataSource.Foo, {:fetch, "my_id_data"}, 5000)
    ** (EXIT) no process

However, using a task supervisor with async_nolink fixes this issue:

@data_sources
|> Enum.map(&Task.Supervisor.async_nolink(Web.TaskSupervisor, fn -> &1.fetch(member_id) end))
|> List.foldl(%{}, &Map.merge(&2, Task.await(&1)))
|> Poison.encode!

This is what I think is happening:

  • Task.async links the task to the calling process. When a data-source process terminates, the failure propagates and brings down both the source application and the Cowboy/Plug web server. The supervisor restarts the web server but not the source application. (In observer, the cache process in the source application is also gone; in fact, the entire source application is brought down.)
  • Task.Supervisor.async_nolink does not link the task to the calling process, so the Cowboy/Plug web server is left alone, the data-source process restarts gracefully, and the cache process keeps running.
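To check the linking part of this in isolation, here is a small sketch. LinkDemo and its function names are made up for illustration, and it only shows how a crashing task affects its caller, not the application-level restart behaviour. Each caller runs in its own monitored process so we can see how it ends.

```elixir
defmodule LinkDemo do
  # Caller uses Task.async: the task is linked to the caller,
  # so the raise inside the task takes the caller down as well.
  def linked do
    run_caller(fn ->
      task = Task.async(fn -> raise "error" end)
      Task.await(task)
    end)
  end

  # Caller uses Task.Supervisor.async_nolink: no link, so the caller
  # survives. Task.yield returns {:exit, reason} for a crashed task
  # instead of exiting the caller.
  def not_linked(supervisor) do
    run_caller(fn ->
      task = Task.Supervisor.async_nolink(supervisor, fn -> raise "error" end)
      {:exit, _reason} = Task.yield(task, 1_000)
    end)
  end

  # Runs `fun` in a fresh, monitored process and returns its exit reason.
  defp run_caller(fun) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> reason
    after
      5_000 -> :timeout
    end
  end
end
```

With a supervisor started via Task.Supervisor.start_link(), LinkDemo.linked() should return the crash reason of the task, while LinkDemo.not_linked(sup) should return :normal. One caveat: Task.await on an async_nolink task still exits the caller if the task dies, which is why the sketch uses Task.yield.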

Note, running everything synchronously works as expected:

@data_sources
|> Enum.map(fn i -> i.fetch(member_id) end)
|> List.foldl(%{}, &Map.merge(&2, &1))
|> Poison.encode!

I’m trying to understand exactly what the issue could be here and whether Task.Supervisor.async_nolink is the proper approach.

Do you want all of the data-source GenServers to be restarted when any single one dies, or just the one that dies? You do have them all under a supervisor, yes? What is your application restart type set to? :permanent (I think) sounds like what you want. Why create a task for each? Why not marshal the calls through the data sources' supervisor: send a message to each GenServer with a return ref, wait for all of them to respond and/or time out, then return (in other words, get the data work out of the controller)?
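A rough sketch of that ref-based approach, to make the idea concrete. Everything here is hypothetical (GatherSketch, the {:fetch, caller, ref, id} message shape, the toy Source server); the real data sources would need a matching handle_cast clause. It goes straight to the GenServers rather than through a supervisor, and it simply drops any source that does not answer before the deadline.

```elixir
defmodule GatherSketch do
  # Toy stand-in for a data source: answers a fetch request with a small map.
  defmodule Source do
    use GenServer

    def start_link(name), do: GenServer.start_link(__MODULE__, name)

    @impl true
    def init(name), do: {:ok, name}

    @impl true
    def handle_cast({:fetch, caller, ref, id}, name) do
      # Reply directly to the caller, tagged with the caller's ref.
      send(caller, {ref, self(), %{name => id}})
      {:noreply, name}
    end
  end

  # Cast to every source, then merge replies until all have answered
  # or the overall timeout has passed.
  def fetch_all(sources, id, timeout \\ 5_000) do
    ref = make_ref()
    Enum.each(sources, &GenServer.cast(&1, {:fetch, self(), ref, id}))
    deadline = System.monotonic_time(:millisecond) + timeout
    collect(ref, length(sources), deadline, %{})
  end

  defp collect(_ref, 0, _deadline, acc), do: acc

  defp collect(ref, pending, deadline, acc) do
    remaining = max(deadline - System.monotonic_time(:millisecond), 0)

    receive do
      {^ref, _pid, result} ->
        collect(ref, pending - 1, deadline, Map.merge(acc, result))
    after
      # Deadline reached: return whatever has arrived so far.
      remaining -> acc
    end
  end
end
```

Because GenServer.cast never fails on a dead process, a crashed source just shows up as a missing key after the timeout, and the caller is never linked to anything. Late replies would still land in the caller's mailbox, so a long-lived caller should flush or ignore them.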