Blog Post: Careful what data you send or how to tank your performance with Task.async

I ran into an interesting problem recently where simple concurrency on the BEAM via Task.async made my application a lot slower and a lot more memory hungry. This blog post illustrates the issue with a short example, where processing 3 non-trivial actions on a list in parallel is slower than doing it sequentially, and then explains why this happens and what can or can’t be done about it.

6 Likes

Using the term parallel in this context is incorrect; the correct term would be concurrent, and this has some very important implications:

  1. You are not guaranteed that the spawned tasks will be running on separate physical cores, as this is decided by the scheduler. There are ways to configure this manually, but you would be breaking the concurrency abstraction used in Elixir and potentially introducing locks (see the sketch after this list);
  2. As the scheduler switches context between running processes, every process gets a slice of execution time; naturally, having more processes on the same thread will make them run slower.
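
For point 1, a minimal sketch (my own, not from the article) that asks each task which scheduler it landed on; `:erlang.system_info(:scheduler_id)` reports the scheduler of the calling process:

```elixir
# The runtime, not the caller, decides scheduler placement; the
# assignment below can differ from run to run.
tasks =
  for i <- 1..4 do
    Task.async(fn -> {i, :erlang.system_info(:scheduler_id)} end)
  end

IO.inspect(Task.await_many(tasks), label: "task => scheduler")
```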
2 Likes

@D4no0 I disagree; parallel is the correct term here, or why wouldn’t it be? Yes, there is no guarantee that they will all run at the same time, but with 24 available cores and the schedulers otherwise unoccupied I’m pretty certain they will. Unless you intend to make the argument that nothing on the BEAM ever runs in parallel, what am I missing? :thinking:

There is definitely nothing that absolutely stops them from running at the same time; that is what makes them concurrent.

If you are not 100% certain, then you cannot claim that your tasks are running in parallel. IMO a setup where parallel execution is enforced in all cases is needed to validate parallelism claims and benchmarks.

To be fair, that wasn’t very surprising. Parallelizing in such a manner only saves time when (1) there’s lots of work (not just 3 tasks) and (2) the data is not being carried around but is crunched into much smaller pieces and/or sent off to other systems (Kafka, Postgres et al.).

But the article was informative and interesting, and I thank you for it.

2 Likes

Come on now. Unless you have a strict one-thread-pinned-per-core runtime, you can’t claim that for any runtime, Go’s and Rust’s Tokio included.

Fact is that most parallel runtimes do parallelization on a best-effort basis, and they do a damn good job at it. There was a rather hilarious article reposted on HN a while ago about how, for some years, the Linux kernel never used more than 4 (or 8?) cores. What can a runtime do if the kernel is lying to it? But that’s a separate topic.

So… 100% guarantee? No, but it’s at least 90%.

3 Likes

It still doesn’t feel right to me to use this term so freely; there is a reason parallel programming (for example, CUDA development) is an entirely different paradigm with completely different concepts.

Maybe, but my goal is not a strictly academic discussion. :person_shrugging:

1 Like

Unbounded parallelism is generally not a good tactic. Your workload and underlying hardware capabilities should be well understood to help determine how concurrent you want a set of tasks to be.
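
For illustration, a minimal sketch of bounding concurrency with Task.async_stream/3; `heavy_work` is a made-up stand-in, and capping `max_concurrency` at the number of online schedulers is just one reasonable starting point:

```elixir
# Bounded concurrency: at most one in-flight task per online scheduler.
heavy_work = fn n -> Enum.sum(1..(n * 100_000)) end

1..100
|> Task.async_stream(heavy_work, max_concurrency: System.schedulers_online())
|> Enum.map(fn {:ok, result} -> result end)
```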

2 Likes

@D4no0 Technically it executed in parallel if, at any point during that time, 2 bits of code ran at the same time. If you think that this didn’t happen over almost 4 minutes of runtime, with 24 schedulers available to Erlang (and 12 physical cores), then that’s some next-level skepticism. Of course it happened far more often than that, and you can look at CPU utilization to confirm. You can always argue that “maybe it didn’t actually execute in parallel”, even with the OS scheduler.

Most of that is beside the point though; what’s important is showing how the runtime handles these workloads by default, not with some weird configuration, and what to do accordingly.

@warmwaffles The parallelism here wasn’t unbounded though, right? I mean, the first example explicitly has 3 tasks.

That said, of course you’re right, which is why I always advocate benchmarking things instead of assuming what’s faster/better, as there are many surprises along the way :slight_smile:

1 Like

Totally! It was just a contrived example I made up to easily illustrate in a benchmark the problem I ran into with benchee, showcasing it without all the context and overload of benchee itself. :slight_smile:

edit: sorry for the multi-reply; I tried to fold it into one again but couldn’t see how to delete this one

1 Like

Using generic terms and not nitpicking on details:

Concurrency is when the system does different things at the same time, which can create race conditions, transactional problems, etc. People generally have conceptual problems with it because it requires a higher-level mental model.

Parallelism is when the system does the same thing, multiple times, at the same time. This is very easy to understand conceptually: take a list, run the same function for each list item in its own thread, then collect the results in a new list at the end. But it’s very hard to do well at a low level, when we are talking about compilers, hardware, etc. (I know little about it, but I know it’s a field of the discipline.) At a higher level, though, the word totally applies: got this Advent of Code problem, it’s slow, let’s use Task.async_stream :smiley:

And of course parallelism implies concurrency.
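
As a tiny sketch of that parallel-map idea (my own example, not from the thread):

```elixir
# The same function applied to each list item in its own process.
parallel_map = fn list, fun ->
  list
  |> Enum.map(&Task.async(fn -> fun.(&1) end))
  |> Task.await_many()
end

parallel_map.([1, 2, 3], fn x -> x * x end)
#=> [1, 4, 9]
```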

Nice article. I wish I knew a way to dump to the console whenever the data copied from one process to another exceeds N bytes.

2 Likes

Not completely at the same time though - at one moment it has multiple tasks in progress, but it may only ever work on one thing at a given moment. When on stage I like to demonstrate this by either speaking or moving at one point in time. Both things are going on, but I’m only doing one at a time. Whereas when I speak, walk around and gesticulate, I’m doing these things in parallel.

That part is important to me because, for example, Ruby for a long time (before Ractors) could only do concurrency and not parallelism (unless you forked multiple OS processes).

Interesting thought! With all the tracing and introspection available, I’d think there should be a way to do this - however, I’m not familiar with one.

Theoretically speaking though (as an easy, hacky “fix”), you could write your own wrappers around send, GenServer.call, etc. that run :erts_debug.size/1 only in the dev env and then dump the info when it crosses a certain threshold. That should work, not that I recommend doing it (for big inputs, that function can also take an extraordinary amount of time).

Probably better ways of doing this though :slight_smile:
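
A minimal sketch of that wrapper idea, assuming a made-up DebugSend module and an arbitrary threshold; note that :erts_debug.size/1 reports machine words, not bytes:

```elixir
defmodule DebugSend do
  # Hypothetical threshold: warn when a message exceeds ~1000 machine words.
  @threshold_words 1_000

  def send_traced(dest, msg) do
    words = :erts_debug.size(msg)

    if words > @threshold_words do
      # Convert words to bytes using the VM's word size (8 on 64-bit).
      bytes = words * :erlang.system_info(:wordsize)
      IO.puts("large message (~#{bytes} bytes) to #{inspect(dest)}")
    end

    send(dest, msg)
  end
end
```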

Not sure I follow what you mean. If you are talking about single-threaded concurrency à la Node.js, I am not sure I would call that concurrency. Just “async” stuff.

And yes, I’m not going to replace all my send() calls with a custom function. (Or maybe using a copy of the source code and doing it automatically with a script, that could work :slight_smile:)

I think of concurrency as an abstract idea that I can have multiple execution contexts, i.e. processes, making progress concurrently.

Whether those processes achieve that progress in parallel or not depends on the available schedulers and logical processors in the system, noting that you can limit the amount of parallel work the runtime can do by constraining the number of schedulers (the +S and +SP options).

By default the BEAM will create a scheduler for every logical processor, so it is fair to say that processes will be executed in parallel; however, exactly which processes get executed in parallel is up to the schedulers.
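
For reference, a quick sketch of inspecting the scheduler counts from IEx; constraining them happens at VM startup, e.g. elixir --erl "+S 2" script.exs:

```elixir
# How many schedulers the VM created vs. how many are currently online.
IO.inspect(System.schedulers(), label: "schedulers")
IO.inspect(System.schedulers_online(), label: "schedulers online")
```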

1 Like

This is the beauty of the design of concurrency in OTP. For me, this video made it crystal clear from day one how concurrency is handled in Elixir; I recommend it to everyone who is not 100% sure:

1 Like

Yeah, it’s a classic!

1 Like

Been reading about Rust closures recently, which gave me a fresh perspective on closures in Elixir.

One question - if the closure does not use send or spawn under the hood, does the copying problem still matter?

No; as long as you stay in the same process you basically do not need to care, as the data is not copied. The closure will use the same pointer to the data as the parent scope does.
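
A small sketch of the difference (sizes are arbitrary): nothing is copied while the closure runs in the same process, but handing it to a task copies the captured data into that process’s heap:

```elixir
big = Enum.to_list(1..1_000_000)

# Same process: the closure only references `big`, no copy happens.
in_place = fn -> length(big) end
in_place.()

# Different process: `big` is copied into the task's heap along with
# the closure when it is sent to the spawned process.
Task.async(fn -> length(big) end) |> Task.await()
```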

3 Likes