Should I limit the number of tasks I use?

I’m currently reviewing some old Elixir code I wrote back in the 1.0.5 days, and I’m questioning a technique I used at the time. The code uses Task.async to chunk through a list of files to compress (gzip). It works, though I’m wondering if the code I wrote to limit the number of concurrent tasks was idiomatic or even necessary. Here’s the part I’m questioning:

task_await_timeout = Application.get_env(:files_compressor, :task_await_timeout)
min_simultaneous_compressors = Application.get_env(:files_compressor, :min_simultaneous_compressors)
logical_processors = :erlang.system_info(:logical_processors)
n_and_steps = Enum.max([logical_processors, min_simultaneous_compressors])

At the time, my reasoning was to use the maximum of Erlang’s reported logical_processors and a min_simultaneous_compressors configuration value. However, I noticed that on my home laptop, which has 2 cores, Erlang reports 4 logical processors. Should I not conflate the number of cores with logical processors? If so, what is the most logical and idiomatic way to limit async processes? Or should I not even bother?

Incidentally, here’s what the task code looks like:

# Compress files in groups of n_and_steps, awaiting each group of
# tasks before starting the next one.
Enum.chunk(filenames, n_and_steps, n_and_steps, []) # https://hexdocs.pm/elixir/1.0.5/Enum.html#chunk/4
|> Enum.flat_map(fn chunk ->
  chunk
  |> Enum.map(fn filename -> Task.async(fn -> {_, 0} = gzip(filename) end) end)
  |> Enum.map(fn t -> Task.await(t, task_await_timeout) end)
end)

In general, don’t even bother:

  • It makes sense to start optimizing for efficiency only after benchmarking, because otherwise you’re off on a wild goose chase.
  • While processes are not free (memory-wise), they are very cheap. Having many does not matter all that much in the grand scheme of things (as long as you don’t, on average, spawn new ones faster than old ones finish); see the quick sketch after this list.
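To get a feel for just how cheap, here is a quick sketch you can paste into iex (the 100_000 figure is arbitrary, and these are deliberately trivial tasks, no gzip involved):

# Spawn and await 100_000 trivial tasks, timing the whole run.
# :timer.tc/1 returns {microseconds, result}.
{micros, _results} =
  :timer.tc(fn ->
    1..100_000
    |> Enum.map(fn i -> Task.async(fn -> i * 2 end) end)
    |> Enum.map(&Task.await/1)
  end)

IO.puts("Ran 100_000 tasks in #{micros / 1_000} ms")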

That said, newer Elixir versions introduced a building block that makes most of your boilerplate unnecessary, because it makes those decisions you’re questioning for you (based, IIRC, on some rigorous benchmarks of common Elixir use cases): Task.async_stream.

With it, your code would become something like this:

filenames
|> Task.async_stream(fn filename ->
  {_, 0} = gzip(filename)
end)
|> Enum.to_list()
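It also gives you an explicit knob if you ever do want one: Task.async_stream accepts a :max_concurrency option (which defaults to System.schedulers_online/0) and a :timeout option, so the limit you computed by hand becomes a one-liner. A sketch, reusing the gzip/1 and task_await_timeout from your question:

filenames
|> Task.async_stream(
  fn filename -> {_, 0} = gzip(filename) end,
  # Passing the default explicitly, just for illustration:
  max_concurrency: System.schedulers_online(),
  timeout: task_await_timeout
)
|> Enum.to_list()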

EDIT: Refactoring is awesome, and please do refactor! In general, though, refactor to make code more readable and to reduce its mental complexity, rather than for computational efficiency, because chasing the latter quickly devolves into premature optimization.


A side note about physical vs logical processors: it is very common to see twice as many logical processors as physical cores. This is not a bug, but a feature most modern CPUs have, known as Simultaneous Multithreading (SMT) or, in Intel’s implementation, Hyper-Threading.
The basic idea is that the CPU can (almost) double its instruction throughput by fetching the next instruction for an unrelated hardware thread while the current instruction executes, so that executing and fetching happen side by side rather than sequentially.
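For illustration, here is what a hypothetical iex session on a 2-core laptop with SMT might look like (the exact :cpu_topology shape varies by machine):

iex> :erlang.system_info(:logical_processors)
4
iex> :erlang.system_info(:cpu_topology)
[processor: [core: [thread: {:logical, 0}, thread: {:logical, 1}],
             core: [thread: {:logical, 2}, thread: {:logical, 3}]]]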

🙂


Thank you for the detailed explanation and information! Looks like I should definitely upgrade this project to the latest version.


Instead of :logical_processors I would use :schedulers_online. This tells me how many parallel threads I am actually using right now, not how many I might be using. When I start the system, I can specify the maximum number of threads I want to use with :schedulers; the default is the number of cores. The interesting thing is that at runtime I can actually change how many I want to use [*], and the BEAM will move processes around while running. It is quite fun to watch in observer.

[*] Using the call :erlang.system_flag(:schedulers_online, n) where n is not greater than the maximum number.
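A minimal iex sketch of what that looks like (values assume a machine with 4 logical processors):

iex> :erlang.system_info(:schedulers)           # maximum, fixed at VM start
4
iex> :erlang.system_info(:schedulers_online)    # currently in use
4
iex> :erlang.system_flag(:schedulers_online, 2) # returns the old value
4
iex> :erlang.system_info(:schedulers_online)
2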


Thanks @rvirding! I saw some of the other options but wasn’t clear which to use. This makes sense. Thanks for the information!
