Flow and partitions

This is probably a basic question about flows, but:

At the partition stage, where you can specify the number of partitions, does this number only relate to physical processors and nodes?

Or, if the number of partitions exceeds that value, does it also partition across OTP processes?

At the reducer stage, you should be able to run the computations in parallel as much as possible, right? So it would be an advantage to create as many partitions as possible (assuming you are not copying too much data around)?

You can pass the number of partitions (stages) you want as an option. By default, Flow uses one stage per core on the machine, and it's not designed to work in a distributed fashion (https://hexdocs.pm/flow/Flow.html#from_enumerables/2-options).

Flow.from_enumerable([1, 2, 3], stages: 3) # To use 3 processes
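
If it helps to see where that default comes from, it is tied to the number of schedulers the VM runs (normally one per core). The value below is just an illustration:

System.schedulers_online() # number of schedulers, typically one per core
#=> 8                      # illustrative value for an 8-core machine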

If I’m not mistaken, the reducer stage is where you join everything together. So the computations run in parallel in the mapping stages, not in the reducer stage. Every time you call Flow.partition/2, a new group of processes is spawned for the computations that follow it. For example:

[1, 2, 3]
|> Flow.from_enumerable()
|> Flow.map(fn x -> x + 1 end)                         # [2, 3, 4]
|> Flow.partition()                                    # a new group of processes is spawned here
|> Flow.map(fn x -> x * 2 end)                         # [4, 6, 8]
|> Flow.partition()                                    # a new group of processes is spawned here
|> Flow.reduce(fn -> 0 end, fn x, acc -> x + acc end)  # each partition sums its own events
|> Flow.on_trigger(fn acc -> {[acc], acc} end)         # emit each partition's accumulated sum
|> Enum.sum()                                          # combine the per-partition sums

#=> 18
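
The stage counts can also be given explicitly, both at the source and at each partition step, so they are not limited to the number of cores. A variation of the pipeline above with purely illustrative counts could look like:

[1, 2, 3]
|> Flow.from_enumerable(stages: 2)                     # 2 mapper processes (illustrative)
|> Flow.map(fn x -> x + 1 end)                         # [2, 3, 4]
|> Flow.partition(stages: 4)                           # 4 reducer processes for this step
|> Flow.reduce(fn -> 0 end, fn x, acc -> x + acc end)
|> Flow.on_trigger(fn acc -> {[acc], acc} end)         # emit each partition's sum
|> Enum.sum()

#=> 9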


Hi, thanks for your reply!

Perhaps I’m thinking about Flow the wrong way:

I imagined it was meant to work like MapReduce. So the partition stage would be like the shuffle stage, i.e. some kind of mapping has already been performed (preferably in parallel), giving us pairs:

{key, values}

Then, these are sent to different nodes (bucketed by key) and the values are reduced in parallel on each of the nodes. Once this has been done, the result can be obtained.

Looking at the way this is done in Flow, we have a mapping stage (operations on a collection), then a partition stage that partitions by some hash of the key, which determines where the values should go, and then they are finally reduced. If the operation you want to perform on the values is associative, for example, there is no reason why it can't be done in parallel. This is why I was thinking it might spread the work across OTP processes in addition to partitioning by processors/nodes.
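
Concretely, the shape I have in mind is something like the classic word count. This is just an illustrative sketch, with the key-based routing done through Flow.partition/2's :key option:

~w[the quick brown fox jumps over the lazy dog the end]
|> Flow.from_enumerable()
|> Flow.map(&{&1, 1})                                # map: emit {word, 1} pairs
|> Flow.partition(key: {:elem, 0})                   # "shuffle": route pairs by hash of the word
|> Flow.reduce(fn -> %{} end, fn {word, n}, acc ->   # reduce: runs in each partition in parallel
  Map.update(acc, word, n, &(&1 + n))
end)
|> Enum.to_list()

#=> [{"the", 3}, {"fox", 1}, ...] (order depends on how the words hash)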

Would it necessarily be a bad idea to have a version of Flow that could do this? I.e., if you know the reducer operations are associative, and if the map stage splits the input data into tmp files indexed by key that can only be read by the process associated with each key, for example, then you could also partition by OTP processes?

Bear in mind that Flow does not work in a distributed way; it only works on a single node.

I’m not sure I understand your question. Can you elaborate on what you want to achieve?


Sorry, I have to revive this old topic because I have new… needs around Flow.
I know that as of 2017 Flow didn't work in a distributed way.
Is it now possible to make it run on multiple nodes, in such a way that partitions are distributed among them?
Is there an alternative for distributed MapReduce in Elixir?
