How to efficiently send data between nodes?

Hi everyone.

In my system, I have two separate nodes connected to each other:

  • Node A has access to a database containing a lot of data.
  • Node B doesn’t have access to that database, so instead it asks node A to run the query and send the data back.

This works fine if the amount of data is small, but if there is a lot of data to be sent, it will basically consume all my RAM during the transfer and crash the node with OOM.

If I were doing this within a single node, an obvious solution would be to just use streams, but AFAIK I can’t send a reference to a stream to another node and expect to be able to consume that stream there.

For now, what I’m considering is using cursor pagination and making a series of requests to the database until it returns no more data; that way I can chunk the result and send a controlled amount per request from node A to node B.
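
Roughly like this (just a sketch; `MyApp.Data.fetch_page/2` on node A and `handle_row/1` are hypothetical names):

```elixir
defmodule MyApp.ChunkedFetch do
  # Node B repeatedly asks node A for the next page over :erpc and
  # handles each chunk before fetching the next one, so only one page
  # is held in memory at a time.
  def process_all(node_a, cursor \\ nil) do
    # fetch_page/2 is assumed to return {rows, next_cursor}, with
    # next_cursor == nil once the data is exhausted
    {rows, next_cursor} =
      :erpc.call(node_a, MyApp.Data, :fetch_page, [cursor, 100])

    Enum.each(rows, &handle_row/1)

    if next_cursor, do: process_all(node_a, next_cursor), else: :ok
  end

  # placeholder for whatever node B does with each row
  defp handle_row(_row), do: :ok
end
```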

I was wondering if there is a better way to do this: how do people normally tackle this kind of problem in the Erlang/Elixir world when they want to send huge amounts of data between nodes? Maybe there is a version of Stream that works between nodes that I’m not aware of?

It would be important to quantify concretely what “huge” means here, and also to know how much memory each node has.

There are two parts to the Stream API: the construction of new lazy enumerables from non-enumerable inputs, and the composition of operations on top of existing (lazy) enumerables.
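
For example (a runnable toy sketch, not tied to your use case):

```elixir
# Construction: build a lazy enumerable from a non-enumerable input
# (here, a counter that yields one chunk of numbers per step).
stream =
  Stream.resource(
    fn -> 0 end,                                   # start state
    fn
      n when n < 30 -> {[n, n + 1, n + 2], n + 3}  # emit a chunk, advance
      n -> {:halt, n}                              # stop
    end,
    fn _n -> :ok end                               # cleanup
  )

# Composition: stack lazy operations on an existing enumerable.
stream
|> Stream.map(&(&1 * 2))
|> Enum.take(5)
#=> [0, 2, 4, 6, 8]
```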

You’re trying to use the second part on node B, but want to create the input on node A. You can, however, do both things on node B, still benefit from the Stream API, and have node B be responsible for setting up the initial lazy enumerable, e.g. using Stream.transform to query node A lazily, in chunks.

You still need a way to lazily query data between node A and node B, where cursor-based pagination might be the solution, but you don’t need to stop using Stream.
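
A minimal sketch of the idea, assuming node A exposes a (hypothetical) `MyApp.Data.fetch_page/2` that returns `{rows, next_cursor}`, with `nil` as the cursor once the data is exhausted:

```elixir
defmodule MyApp.RemoteStream do
  @page_size 100

  # Builds a lazy enumerable on node B; each page is fetched from
  # node A on demand via :erpc, so only one page is held in memory.
  def rows(node_a) do
    Stream.resource(
      fn -> {:cont, nil} end,                 # nil cursor = first page
      fn
        :halt ->
          {:halt, :halt}

        {:cont, cursor} ->
          case :erpc.call(node_a, MyApp.Data, :fetch_page, [cursor, @page_size]) do
            {rows, nil} -> {rows, :halt}          # last page reached
            {rows, next} -> {rows, {:cont, next}}
          end
      end,
      fn _acc -> :ok end
    )
  end
end
```

Everything downstream then sees an ordinary stream and doesn’t need to know the data is remote.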

Ah, I will need to check how big the structure is, but my VM has 4 GB of RAM and the kernel will kill my instance with OOM if I get more than 250 results in a list. (The number of rows is not the issue; it’s the joins I do inside the structure that make it so big…)

I see. So if I got this right, what you mean is that I can use Stream.resource/3 to create a stream directly on node B that abstracts away multiple queries to node A, using cursor pagination to get the data in chunks?
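
Something like this on node B, continuing the sketch above (the node name is illustrative):

```elixir
:"node_a@host"                     # illustrative node name
|> MyApp.RemoteStream.rows()
|> Stream.each(&IO.inspect/1)      # replace with real per-row handling
|> Stream.run()
```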

I just validated that idea and it works great! Thanks @LostKobrakai for the suggestion!