Why should I use Tasks instead of Pools?

Background

A book I am reading makes the following statement:

Tasks will allow us to do one-off jobs and still rely on the rich OTP library. Connection pooling libraries let us share long-running connections across processes.

So, my understanding of this is as follows:

  • if I have a one off task, I should use tasks
  • if I need persistent connections to something, I should use pools
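
To make the contrast concrete, here is a rough sketch of the two styles as I understand them (assuming poolboy as the pool library; :my_pool and the {:query, ...} message are made-up names):

```elixir
# One-off job: spawn a fresh process, await the result; the process exits afterwards.
task = Task.async(fn -> :crypto.hash(:sha256, "some one-off work") end)
result = Task.await(task)

# Pooled, long-lived worker: check it out, use it, check it back in.
# The pool itself would be started elsewhere under a supervisor via :poolboy.child_spec/3.
result =
  :poolboy.transaction(:my_pool, fn worker ->
    GenServer.call(worker, {:query, "SELECT 1"})
  end)
```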

Why not always use pools?

And this is my question. Even if I have one-off tasks, I still think pools are superior, because they don't have to create a new process to do the work: everything is already set up.

I could understand the argument if using pools were significantly harder than using one-off Tasks, but with libraries like poolboy and books like Elixir in Action I feel it is easy to understand, create and use pools these days.

Questions

  • What are the major benefits of using one-off Tasks instead of a Pool?
  • In what scenario would I be better off using a one-off Task instead of a Pool?

The only two reasons I could spontaneously come up with are these:

  1. Pools also always introduce some kind of rate limit or upper bound for concurrent processing, while you can spawn as many tasks as you want. Of course, there are pools which can handle this by dynamically spawning processes when exhausted (see the config sketch after this list), but if that happens often, you are basically back to spawning tasks, only slower.
  2. Task is part of the standard library, pools aren’t.
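
For reference, here is roughly what the bounded pool from point 1 looks like with poolboy (pool name and worker module are made up; :max_overflow is the knob for spawning extra workers on demand):

```elixir
# Hypothetical poolboy child spec: at most 5 permanent workers, plus up to
# 10 temporary "overflow" workers spawned on demand when the pool is exhausted,
# which brings back the cost of spawning processes anyway.
pool_spec =
  :poolboy.child_spec(
    :my_pool,
    [name: {:local, :my_pool}, worker_module: MyApp.Worker, size: 5, max_overflow: 10],
    []
  )

Supervisor.start_link([pool_spec], strategy: :one_for_one)
```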
2 Likes

Also, pooled processes must have a cleanup step if they will perform many independent, unrelated tasks.

A dedicated Task process will have everything cleaned up upon completion.
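
A minimal illustration of that difference (worker module and message names are made up): a pooled worker has to reset whatever it accumulated, while a Task simply exits and takes its heap, process dictionary and links with it.

```elixir
defmodule MyApp.PooledWorker do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_call({:job, input}, _from, state) do
    result = do_job(input, state)
    # Explicit cleanup step: anything left over from this job (cached data,
    # temp files, ETS entries, ...) must be cleared before the next, unrelated job.
    {:reply, result, %{}}
  end

  defp do_job(input, _state), do: {:done, input}
end

# A Task needs no such step: the whole process is thrown away after the job.
Task.async(fn -> {:done, :input} end) |> Task.await()
```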

1 Like

I use pools for related things that can overload something external to the BEAM VM. Otherwise you still use a pool, just not directly: the BEAM VM's pool of schedulers. Adding an additional layer of indirection is, IMHO, a little bit pointless.

1 Like

Can you give me an example of when I need to do cleanup? I have trouble seeing a scenario where this matters.

When a worker gets a new task, it simply replaces its state with the new state that’s needed.

So, if I understand correctly, even if I use Tasks, I am still using a pool: the BEAM pool of schedulers. I infer from this that there shouldn't be any difference between using one-off Tasks and using Pools, except that the latter adds an unnecessary level of indirection.

So, why are there Pool libraries in the first place? Why do we need them if simple one-off Tasks already use the BEAM VM's pool of schedulers?

Well, for various reasons people could use the process dictionary, for random-number-related stuff for instance. Or they could register the process in a registry, use a library that links/monitors the process from another process, or create ETS tables that would never be destroyed.
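
A sketch of the kind of leftovers I mean (all names hypothetical; the registry is assumed to be started elsewhere). In a reused pool worker these survive into the next job, whereas a one-off Task takes them down with it when it exits:

```elixir
# Inside a reused pool worker, all of this outlives the job it belongs to:
:rand.uniform(100)                                # seeds :rand state in the process dictionary
Process.put(:scratch, %{})                        # explicit process-dictionary entry
Registry.register(MyApp.Registry, "job-42", nil)  # registry entry tied to this pid
:ets.new(:scratch_table, [:set, :private])        # ETS table owned by this pid

# In a one-off Task, the process dictionary, registrations, links/monitors and
# owned ETS tables all disappear automatically when the process exits.
```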

2 Likes

Because often we pool other resources as well. Sometimes those resources are limited (RAM when you do image processing: you want to limit the number of images processed at once to keep RAM usage within reasonable bounds), sometimes setup is long and it is better to do it once and then keep the resource alive for a longer time (HTTP keep-alive connections), and sometimes there is a mix of both (for example DB connections).
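
For example, a sketch of bounding concurrent image processing with a pool (names are made up; the point is that at most `size` images are in memory at once):

```elixir
defmodule MyApp.Images do
  # Hypothetical pool of 4 workers started elsewhere as :image_pool
  # (size: 4, max_overflow: 0): no matter how many callers show up,
  # at most 4 images are being decoded in RAM at any moment.
  def process(path) do
    :poolboy.transaction(
      :image_pool,
      fn worker -> GenServer.call(worker, {:process, path}, :timer.minutes(1)) end,
      :timer.minutes(1)
    )
  end
end
```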

2 Likes

Well, yes, I could configure Poolboy (for example) to have a maximum number of workers and thus make sure RAM usage is within reasonable limits.
But can’t I do the same with the BEAM VM pool (using one-off Tasks)?

Another reason is that with one-off processes you have less to worry about; with long-running processes you must be a bit more careful not to introduce memory leaks, e.g. by “touching” a big binary that is then referenced from a process that never dies.
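
A minimal sketch of that binary leak (names made up): the long-lived process keeps only a small slice, but that slice references the large binary, so the whole thing stays alive for as long as the process holds it.

```elixir
defmodule MyApp.Cache do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:remember, key, big_binary}, _from, state) do
    # This 100-byte slice is a sub-binary that still points into big_binary,
    # so the long-lived process keeps the whole large binary alive:
    leaky = binary_part(big_binary, 0, 100)

    # Copying breaks the reference; only the 100 bytes are retained:
    safe = :binary.copy(leaky)

    {:reply, :ok, Map.put(state, key, safe)}
  end
end
```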

Here are a few links:

https://andy.wordpress.com/2012/02/13/erlang-is-a-hoarder/

Generally it’s not something that happens often, but on the other hand creating and killing a process is very cheap, so I’d default to that unless I really want a long lived process.

1 Like

For me the whole point of the BEAM is that processes are lightweight and creating processes is inexpensive.

This property makes it very attractive to view processes as disposable, much in the same way that memory is “disposable” in a garbage-collected language.

This is also helpful when it comes to resilience. If a process is disposed of after a single use, it is much less likely that you’ll end up with a situation where some weird edge case is corrupting your process state over time.

If the process lifetime is short enough, memory may never have to be garbage collected and is simply returned wholesale to the system after process termination.

So, personally I would favour short lived processes over long lived ones - until there is a good reason for making it long lived.

if I need persistent connections to something, I should use pools

More generally, if your process works with resources that are expensive to create and reusable then it should be a worker in a pool. Connection pools are built around that concept because database connections are expensive to create and there is an effective, finite limit to the quantity of simultaneous database connections.

The typical tradeoff is that long-lived processes should be as simple as possible to minimize the danger of state corruption over the long term. So it wouldn’t be that unusual for a long-lived process to delegate a lot of its work to short-lived processes while the long-lived process only manages simple state transitions.
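
A sketch of that delegation pattern (module and message names are made up): the long-lived GenServer only tracks simple state and hands the actual work to short-lived processes under a Task.Supervisor.

```elixir
defmodule MyApp.Coordinator do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{jobs: 0}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:run, fun}, state) do
    # The heavy lifting happens in a throwaway process; if it crashes or
    # leaks, the coordinator's own (tiny) state is untouched.
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fun)
    {:noreply, %{state | jobs: state.jobs + 1}}
  end
end
```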

Furthermore, from the resilience perspective, it may be worthwhile making worker processes “perishable” - terminate them once they have been used too often or aged too much and replace them with a fresh worker (and resources).
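
A rough sketch of such a “perishable” pool worker (names and limits are made up): it counts how many jobs it has served and stops normally once it hits the limit, so the pool's supervisor replaces it with a fresh process and fresh resources.

```elixir
defmodule MyApp.PerishableWorker do
  use GenServer

  @max_uses 1_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts), do: {:ok, %{uses: 0}}

  @impl true
  def handle_call({:work, input}, _from, %{uses: uses} = state) do
    result = {:done, input}

    if uses + 1 >= @max_uses do
      # Stop normally; the supervisor starts a fresh worker in its place.
      {:stop, :normal, result, state}
    else
      {:reply, result, %{state | uses: uses + 1}}
    end
  end
end
```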

4 Likes

You can, but that will be applied globally (by reducing the number of schedulers). Alternatively, you could use the :max_children option in Task.Supervisor or DynamicSupervisor, but that has a problem: there is no queuing mechanism, so it will simply return an error when trying to start a new job. So it all depends on your use case. In the end we have DynamicSupervisor, which in most cases will be more than enough, and pools are similar to that (when :max_children is applied), but with additional features (like queuing and ending tasks that take too long).
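
For example, a minimal sketch of the :max_children behaviour mentioned above (supervisor name is made up): once the limit is reached, start_child simply returns an error instead of queuing the job.

```elixir
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSup, max_children: 2)

# The first two jobs start fine:
{:ok, _} = Task.Supervisor.start_child(MyApp.TaskSup, fn -> Process.sleep(10_000) end)
{:ok, _} = Task.Supervisor.start_child(MyApp.TaskSup, fn -> Process.sleep(10_000) end)

# The third is rejected outright; there is no built-in queue, which is
# exactly what a pool library adds on top.
{:error, :max_children} =
  Task.Supervisor.start_child(MyApp.TaskSup, fn -> Process.sleep(10_000) end)
```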

2 Likes