I have two applications in an umbrella project:
apps
  source
  web
Here, web is a web server built using Plug and Cowboy, and source is a collection of “data-sources”, each with a fetch function.
The source application has a simple cache based on a single process writing to and reading from an ETS table. Additionally, each of the sources is a GenServer data-source process that serialises operations/writes.
Inside the web application, I want to asynchronously call fetch on each of the data-source processes and merge their results.
Most importantly, I want my application to restart itself if there is a catastrophic failure (i.e. I have already accounted for the normal error flow).
Each of the applications has its own supervisor, and the workers are simple enough to be trivially restarted.
However, when I fetch each of the sources like this (the code is in the web application):
@data_sources
|> Enum.map(&Task.async(fn -> &1.fetch(member_id) end))
|> List.foldl(%{}, &Map.merge(&2, Task.await(&1)))
|> Poison.encode!
and simulate an error inside one of the data sources (i.e. raise "error"), then the whole application crashes and does not restart as I expect it to. That is, the web server restarts but the source application does not, and I get errors that look like this:
Task #PID<0.360.0> started from #PID<0.357.0> terminating
** (stop) exited in: GenServer.call(DataSource.Foo, {:fetch, "my_id_data"}, 5000)
** (EXIT) no process
However, using a task supervisor with async_nolink fixes this issue:
@data_sources
|> Enum.map(&Task.Supervisor.async_nolink(Web.TaskSupervisor, fn -> &1.fetch(member_id) end))
|> List.foldl(%{}, &Map.merge(&2, Task.await(&1)))
|> Poison.encode!
This is what I think is happening:
- Task.async links to the calling process. When a data-source process terminates, it also causes both the source application and the cowboy-plug web server to terminate. The supervisor restarts the cowboy-plug web server but it does not restart the source application. (In observer, the cache process in the source application is also destroyed; in fact, the entire source application is brought down.)
- Task.Supervisor.async_nolink does not link to the calling process, leaving the cowboy-plug web server alone, and the data-source process gracefully restarts and keeps the cache processes running.
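The linking difference described in the two points above can be reproduced in isolation. The following is only a minimal sketch, not the code from the applications: Demo.TaskSupervisor is a hypothetical name, and exit(:boom) stands in for a failing data-source call. It assumes nothing beyond the standard Task and Task.Supervisor modules.

```elixir
# Demo.TaskSupervisor is a hypothetical name used only in this sketch.
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

# Case 1: Task.async links the task to its caller, so a crashing task
# takes the caller down with it. The caller runs in its own monitored
# process so that this script survives to observe the outcome.
{caller, ref} =
  spawn_monitor(fn ->
    task = Task.async(fn -> exit(:boom) end)
    Task.await(task)
  end)

linked_reason =
  receive do
    {:DOWN, ^ref, :process, ^caller, reason} -> reason
  after
    2_000 -> :timeout
  end

# Case 2: async_nolink only monitors the task. The caller stays alive and
# can inspect the failure with Task.yield instead of exiting.
parent = self()

spawn(fn ->
  task = Task.Supervisor.async_nolink(Demo.TaskSupervisor, fn -> exit(:boom) end)
  send(parent, {:nolink_result, Task.yield(task, 1_000)})
end)

nolink_result =
  receive do
    {:nolink_result, result} -> result
  after
    2_000 -> :timeout
  end

IO.inspect(linked_reason, label: "caller of Task.async exited with")
IO.inspect(nolink_result, label: "Task.yield after async_nolink returned")
```

In the first case the caller always dies with a non-normal reason. In the second, Task.yield hands back an {:exit, reason} tuple. Note that Task.await behaves differently from Task.yield here: even without a link, Task.await exits the caller when the awaited task has died.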
Note, running everything synchronously works as expected:
data_sources
|> Enum.map(fn i -> i.fetch(member_id) end)
|> List.foldl(%{}, &Map.merge(&2, &1))
|> Poison.encode!
I’m trying to understand exactly what the issue could be here and whether Task.Supervisor.async_nolink is the proper approach.
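For comparison, here is a hedged sketch of how the async_nolink variant might collect results without the request process exiting when one source fails. It replaces Task.await with Task.yield_many and skips failed or timed-out sources. Sketch.TaskSupervisor and the anonymous functions in sources are hypothetical stand-ins for Web.TaskSupervisor and the real fetch calls, and whether silently dropping a failed source is acceptable is an assumption that depends on the application.

```elixir
# Sketch.TaskSupervisor is a hypothetical name used only in this sketch.
{:ok, _sup} = Task.Supervisor.start_link(name: Sketch.TaskSupervisor)

# Hypothetical stand-ins for the data sources: two succeed, one crashes.
sources = [
  fn -> %{a: 1} end,
  fn -> exit(:boom) end,
  fn -> %{b: 2} end
]

merged =
  sources
  |> Enum.map(&Task.Supervisor.async_nolink(Sketch.TaskSupervisor, &1))
  |> Task.yield_many(5_000)
  |> Enum.reduce(%{}, fn
    # The task finished normally: merge its map into the accumulator.
    {_task, {:ok, result}}, acc ->
      Map.merge(acc, result)

    # The task crashed: skip it and keep the other results.
    {_task, {:exit, _reason}}, acc ->
      acc

    # The task did not reply in time: stop it and skip it.
    {task, nil}, acc ->
      Task.shutdown(task, :brutal_kill)
      acc
  end)

IO.inspect(merged, label: "merged results")
```

With this shape, a crash in one source shows up as an {:exit, reason} entry rather than as an exit of the calling process, so the remaining results can still be merged and encoded.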