Concurrent Architecture - Task or GenStage/Flow?

bryansray · March 28, 2017, 6:14pm

Let’s say I have a list of names:

names = ["Bryan", "Jose", "Chris", "Dave"] # This list could be somewhat large (100 - 1000's), but omitted for clarity

Pretty simple. Now let’s say I have an API endpoint that takes in a name as an argument and returns you back details about said person. Easy enough … I can certainly make that happen…

Now my question is this: what if I wanted to start up N number of processes to go fetch each users details concurrently?

Is it typical to start up GenServer’s for a small task like this? Do you start them up, let them do their work, and then immediately shut them down? Do GenServer’s stop on their own after a while? Is something like Task better for this? Is GenStage / Flow a viable option now? Is there something better that I’m not aware of?

Any input would be greatly appreciated. Thanks!

idi527 · March 28, 2017, 6:32pm

I would probably use Task for this. GenServers don’t stop on their own after a while, they can be stopped and not restarted though.

Qqwy · March 28, 2017, 6:43pm

I believe that this is exactly what Task.async_stream is for. It will handle the process starting/stopping for you behind the scenes, and make sure to only start as much processes as is useful for your problem (by default it starts as many as your computer has CPU cores). This is better than starting 100 - 1000’s of tasks at once and then collecting the results because these task will start waiting for each-other nevertheless. And besides, most APIs don’t like it if they get too much requests from one person in a small timespan.

Task.async_stream can be seen as ‘Flow lite’ and is, in my opinion, enough to handle this problem.

bbense · March 29, 2017, 3:09am

What Flow buys you over Task.async_stream is that with some tweaking Flow can be “self-scheduling”, however if you already know what is an acceptable rate of queries, then simply setting max_concurrancy in Task.async_stream is a good first step. As a rough rule of thumb, for an I/O bound problem 10x the number of schedulers works well if you are not rate-limited at the remote endpoint. It’s what I use to do ssh scanning of many machines. For a single endpoint, you’d likely want it lower.

Also, if this kind of thing is the primary point of what you are doing you can get significant speed up by tweaking the BEAM startup variables. --erl ‘+A 75’ is an example. See http://erlang.org/doc/man/erl.html

This increases the number of threads in the async IO pool.

A lot also depends on how much you care about failure rates and retrys. Task.async_stream works best for “fire and forget”, if I get a result in the timeout fine, if not I’ll just move on.