Stream over Process Borders - how to hand data over between process boundaries?

Hi

Old dog (OO) trying to get his feet wet with Elixir. I am wondering if I am overthinking my problem here.

I am developing an application like https://reclass.pantsfullofunix.net/ . In short, there is a directory with an unknown number of files in a subtree, and I chose to implement that as a GenServer. It encapsulates the functionality to read the files, to subscribe to filesystem events when they get updated by someone else, and to make calculations with the files. I use Streams there and such. Everything works fine.

Now I want to implement some user interfaces that use this GenServer: a command-line interface and some Phoenix-based web apps. And that is where I stumble and probably overthink: how do I hand this data over across process boundaries?

For example, I want to ask the GenServer for a list of defined nodes and classes. Can I return a stream from the handle_call in my GenServer to the caller? Will that work? Will that work only sometimes? Do you just send the whole list at once? Some implementation utilizing a next() function?

Eventually I need to ask the GenServer for the resolved definition of the nodes, which can be fairly big maps with encoded binaries in them.

What are the patterns for sending large amounts of data between processes?

Just send keywords or articles my way. That would help a lot. My Google-fu is failing me hard.

Thanks

Michael Jansen


As Joe Armstrong used to say: send your functions to the data!

A message can be literally any Erlang term, including an anonymous function. So the client process can send a function to the server process which it can run on its data and send back just the result.
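A minimal sketch of that idea (module and function names here are made up for illustration): the GenServer holds the node definitions in its state, the client sends an anonymous function, and only the function's (usually small) result is copied back to the caller.

```elixir
defmodule NodeStore do
  use GenServer

  def start_link(nodes), do: GenServer.start_link(__MODULE__, nodes)

  # The client sends a function of arity 1; the server runs it on its
  # state and replies with just the result.
  def query(pid, fun) when is_function(fun, 1), do: GenServer.call(pid, {:query, fun})

  @impl true
  def init(nodes), do: {:ok, nodes}

  @impl true
  def handle_call({:query, fun}, _from, nodes) do
    {:reply, fun.(nodes), nodes}
  end
end

# Usage: ask only for the node names instead of copying all definitions.
{:ok, pid} =
  NodeStore.start_link(%{
    "web1" => %{classes: [:nginx]},
    "db1" => %{classes: [:postgres]}
  })

names = NodeStore.query(pid, fn nodes -> Map.keys(nodes) end)
```

The caller never receives the full map, only the list of keys, so the amount of data crossing the process boundary stays small.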

PS, welcome to Elixir Forum!


Can’t answer your bigger questions, but: it might look like you are copying those binaries when you send them, yet the runtime keeps big binaries (larger than 64 bytes) reference-counted and shares them between processes on the same node. So just go wild and don’t worry much; the GC will take care of it.


Supporting the argument provided by @dimitarvp.

We solve business problems with Elixir and OTP; we don't delve into low-level stuff that was already solved for us. Doing things from scratch is not encouraged in this ecosystem unless you truly understand the drawbacks and what you stand to gain.

Create a working solution first, then investigate the performance and bottlenecks (OTP is one of the best runtimes for doing this), improve the bottlenecks (if it makes sense), rinse and repeat.


Another option: store the node + class definitions in an ETS table (or several), so that other processes can read them without interacting with the GenServer (which writes to the table(s)) at all.
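A quick sketch of that pattern (the table name is an assumption): the writer process creates a protected, named ETS table, so any other process can read from it directly without a round trip through the GenServer.

```elixir
# Protected: only the owning process may write, but every process may read.
# read_concurrency optimizes for many concurrent readers.
:ets.new(:node_definitions, [:set, :protected, :named_table, read_concurrency: true])

# The owner (normally the GenServer) writes the resolved definitions.
:ets.insert(:node_definitions, {"web1", %{classes: [:nginx]}})
:ets.insert(:node_definitions, {"db1", %{classes: [:postgres]}})

# Any process (CLI, Phoenix request, ...) can read without messaging the owner.
[{_key, definition}] = :ets.lookup(:node_definitions, "web1")
```

One caveat worth knowing: an ETS table dies with its owner, so in a real application the table is usually created in the GenServer's init/1 (or owned by a dedicated supervisor-started process) so its lifetime is managed.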


Thanks. I kind of suspected I was overthinking a bit.

Got quite a lot of nice information here. I guess my solution will use all of the answers:

  1. Just return lists (e.g. lists of tuples) as-is between processes until I see a real problem coming up. I guess that will be either when the list gets REALLY big or when more than one node is involved, which is a whole different design problem anyway.
  2. When I need filtered or reduced versions of the data, I can send the function to the data and just return the result.
  3. When I just need to iterate over the whole data, returning a stream will work, provided the stream operates on a protected :ets table and all the required data is already in there (i.e. no lazy loading) or can be computed from the stored data.
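Point 3 can be sketched with Stream.resource/3 over a named ETS table (the table name is again an assumption). The stream itself carries no data; each step reads the table via :ets.first/1 and :ets.next/2, so it is safe to hand the stream to another process as long as the table still exists and is readable there.

```elixir
:ets.new(:node_definitions, [:set, :protected, :named_table])
:ets.insert(:node_definitions, [{"web1", :webserver}, {"db1", :database}])

# Lazily walk the table key by key; nothing is copied until enumeration.
stream =
  Stream.resource(
    fn -> :ets.first(:node_definitions) end,
    fn
      :"$end_of_table" -> {:halt, nil}
      key -> {:ets.lookup(:node_definitions, key), :ets.next(:node_definitions, key)}
    end,
    fn _acc -> :ok end
  )

result = Enum.sort(stream)
```

Since ETS traversal order is unspecified for :set tables, consumers should not rely on the order of elements (hence the Enum.sort here).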

Thanks


One related consideration I didn’t really see mentioned, but that is worth keeping in the back of your mind, is that the GenServer could become a point of unwanted serialization.

The GenServer will accept requests from other processes, but it will process those requests serially, one after the other. In the case of handle_call, this will look a lot like blocking. So if you send the function to the data in this case, and the data is sufficiently big or the function sufficiently computationally intensive (or some combination thereof), you could find yourself dealing with unexpected operational bottlenecks if the GenServer receives other requests while it’s working on your data.

In cases like those you start to look at the previously mentioned :ets approach, or at spawning tasks (etc.) into other processes to handle the heavy, data-intensive processing.
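One common way to keep the GenServer responsive is to do the expensive work in a Task and reply to the caller later with GenServer.reply/2; handle_call returns {:noreply, state} immediately, so other requests are not stuck behind the heavy one. A minimal sketch (module and function names are made up):

```elixir
defmodule HeavyStore do
  use GenServer

  def start_link(data), do: GenServer.start_link(__MODULE__, data)

  # Generous timeout, since the reply arrives only when the Task finishes.
  def heavy_query(pid, fun), do: GenServer.call(pid, {:heavy, fun}, 30_000)

  @impl true
  def init(data), do: {:ok, data}

  @impl true
  def handle_call({:heavy, fun}, from, data) do
    # Run the expensive work elsewhere and reply from there; the server
    # is immediately free to handle the next request.
    Task.start(fn -> GenServer.reply(from, fun.(data)) end)
    {:noreply, data}
  end
end

{:ok, pid} = HeavyStore.start_link(Enum.to_list(1..1_000))
sum = HeavyStore.heavy_query(pid, fn data -> Enum.sum(data) end)
```

Note that the state is still copied into the Task; if the data lives in an ETS table instead, the Task can read it there and even that copy goes away.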

You might have thought of this already, but I wanted to be sure it was part of the discussion nonetheless.
