How to handle a process that spawns a thousand other processes

Hi,

I would like to know how I can spawn one process that in my system represents an “order”. Each order has 1000+ images, so I would like to process each image in a separate process, but I can only say that the order “has been processed” if all of those 1000+ images were processed successfully. Is there a way to keep track of these “child processes”?

Functions like Task.await_many or Task.yield_many could do what you’re looking for.
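For example, a rough sketch with Task.yield_many, assuming a process_image/1 function and an images list (both hypothetical names used throughout this thread):

tasks = Enum.map(images, fn img -> Task.async(fn -> process_image(img) end) end)

# Wait up to 30 seconds; each reply is {task, {:ok, value} | {:exit, reason} | nil},
# so you can tell which tasks finished, crashed, or are still running
replies = Task.yield_many(tasks, 30_000)

all_done? = Enum.all?(replies, fn {_task, reply} -> match?({:ok, _}, reply) end)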

HOWEVER

Consider carefully if you actually want to launch 1000 processes all at once, versus using a pool of workers. Unless you’re using a VERY large server, most of those 1000 processes will be waiting for their turn to run most of the time.

Using Task would probably work for you.

I agree with the previous post, though: instead of spawning 1000 processes, a wiser option would be to chunk the images and spawn fewer processes, each processing many images.
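A minimal sketch of that chunking idea, assuming the same hypothetical images list and process_image/1 function, with an arbitrary chunk size of 50:

results =
  images
  |> Enum.chunk_every(50)
  |> Enum.map(fn chunk ->
    Task.async(fn -> Enum.map(chunk, &process_image/1) end)
  end)
  |> Task.await_many(:infinity)
  |> List.flatten()

This spawns around 20 tasks instead of 1000+, each working through its chunk sequentially.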

First of all, thanks for your reply! Unfortunately, I don’t think I quite get what you said. Using a pool of workers, would I be able to open an “order_process” and process all of those images, or just a few of them? Would I be able to spawn order_1, order_2, order_3 and have each of them “wait” for their images?

I can’t speak to the part about “order” processes; that’s going to depend on where orders come from and what cares about that “order has been processed” status update.

For a particular order, there are tradeoffs between concurrency and parallel processing overhead. For concreteness let’s assume there’s a list of 1000 images named images and a function process_image that takes an image, does the thing, and returns a result.

The “maximum concurrency” approach would be to start up a new process for every element of images and then wait for all the results:

results =
  images
  |> Enum.map(fn img -> Task.async(fn -> process_image(img) end) end)
  # await_many defaults to a 5-second timeout; give a big batch longer
  |> Task.await_many(:infinity)

If process_image does a lot of things that involve waiting for external resources, this may speed things up a lot.

On the other hand, if process_image does a lot of things that need CPU time, things won’t speed up much more than the number of schedulers in the system.
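By default the BEAM starts one scheduler per CPU core; you can check the count at runtime:

System.schedulers_online()
#=> 8 on an 8-core machine, for example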


That last situation is common enough that there’s a standard function to handle it better by only starting enough processes to keep the schedulers busy: Task.async_stream. Using it would look like:

results = images |> Task.async_stream(&process_image/1) |> Enum.to_list()
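Note that async_stream is lazy, so the Enum.to_list/1 call is what actually runs the work, and each element comes back wrapped as {:ok, result}. It also accepts options that matter at this scale: max_concurrency (defaults to System.schedulers_online/0) and a per-item timeout (defaults to 5000 ms), which slow image jobs can easily exceed. A sketch that checks every image succeeded, still assuming the hypothetical process_image/1:

all_ok? =
  images
  |> Task.async_stream(&process_image/1,
    max_concurrency: System.schedulers_online(),
    timeout: 60_000,
    on_timeout: :kill_task
  )
  |> Enum.all?(&match?({:ok, _}, &1))

With on_timeout: :kill_task, a timed-out image shows up as {:exit, :timeout} instead of crashing the caller.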

Both of these approaches will use all the processing resources available when given enough images.

If you’re expecting to handle multiple orders, this is a problem: what happens when many arrive at once? Processes are cheap on the BEAM, but not free.

This is where worker pools like :poolboy or job-queuing systems like Oban are useful; they let you define how many workers should run simultaneously and then balance the work across them. Oban Pro’s batching would be a particularly good fit for this requirement.
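To tie this back to the original question about order processes: a hypothetical sketch using only Task primitives (not poolboy or Oban), where each order runs as its own task and counts as processed only once every image succeeds. The process_image argument is a 1-arity function like the one assumed throughout this thread, and mark_order_processed/2 is a placeholder:

defmodule OrderProcessor do
  # Hypothetical sketch: one task per order, with bounded
  # concurrency for that order's images.
  def process_order(order_id, images, process_image) do
    Task.start(fn ->
      ok? =
        images
        |> Task.async_stream(process_image,
          max_concurrency: System.schedulers_online(),
          timeout: 60_000,
          on_timeout: :kill_task
        )
        |> Enum.all?(&match?({:ok, _}, &1))

      mark_order_processed(order_id, ok?)
    end)
  end

  # Placeholder: report the status however your system expects.
  defp mark_order_processed(order_id, ok?) do
    IO.puts("order #{order_id} processed? #{ok?}")
  end
end

Note this caps concurrency per order, not globally; many simultaneous orders would still multiply the load, which is exactly what a pool or an Oban queue would control.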


Thank you!! It helped a lot