Does elixir provide good support for streaming large amounts of data?

Languages like nodejs and golang have built-in language and standard library support for streaming data. You can read data from streams and write data to streams, and connect streams together end-to-end to do interesting things that would be very painful to do otherwise. It becomes easy to stream gigabytes of data to and from endpoints without exhausting server resources. I would have thought that erlang/elixir would also provide great support for this, but when I go looking it seems more like everyone is trying, with varying levels of success, to put together their own solution.

Admittedly it is hard to search for on google, since you get a lot of false positives from the Stream module. Is there a standard way that Elixir supports this kind of thing? Something like a readable protocol and a writable protocol, so that streaming solutions from different libraries are composable without having to be intentionally designed to work together?

1 Like

Just as a quick example of what I mean… I recently built an image processing server in nodejs that compresses images on the fly as the users request them. It was very easy to stream the data from various libraries. I could pipe the request (a readable stream) to the AWS S3 library, because they gave me a writable stream to pipe it into. Then I got back the image data from the AWS library in the form of a readable stream, which I was able to pipe into an image processing library, which provided me a writable stream for doing so. Then I was able to take the readable stream given to me by the image processing library, and pipe it into the response to the original end user. I’m sure anyone that has dealt with nodejs or golang has seen the usefulness of this kind of thing. Is there a standard way of accomplishing similar things in elixir?

1 Like

I’ve used the Stream module in the elixir standard library for this. Ecto has Repo.stream to pull data from a DB. We used it to create large reports in CSV and XLSX.
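Roughly, the pipeline looked like this (a minimal sketch; the module names, fields, and file name are placeholders, and note that Repo.stream/2 has to run inside a transaction):

```elixir
# Stream rows out of the DB into a CSV file without loading everything into memory.
# MyApp.Repo, MyApp.Order and "orders.csv" are placeholders.
MyApp.Repo.transaction(fn ->
  MyApp.Order
  |> MyApp.Repo.stream()
  |> Stream.map(fn order -> Enum.join([order.id, order.total], ",") <> "\n" end)
  |> Stream.into(File.stream!("orders.csv"))
  |> Stream.run()
end)
```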

http://ananthakumaran.in/2017/11/28/stream.html is a good post covering the Enumerable and Collectable protocols.

2 Likes

Thanks. I had seen some posts about using the Stream module that way, but it puzzled me that some libraries still seemed to have issues streaming data anyway.

Do you think this kind of thing could be used to act as a streaming reverse proxy that starts streaming large responses back to the user before the entire response is loaded into memory? And if so, do you think there is a way to do this somewhat easily with Plug?

edit: I see here that there might be issues with streaming in Plug? I’m not sure whether this was ever resolved.

According to the docs, Plug.Conn implements Collectable, allowing you to pipe a stream into the Conn, and each item is sent as a chunk.
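Something along these lines (an untested sketch; the file name is a placeholder, and depending on your Plug version you may prefer calling Plug.Conn.chunk/2 yourself instead of relying on the Collectable implementation):

```elixir
# Stream a large file back to the client in chunks instead of reading it all
# into memory first.
conn = Plug.Conn.send_chunked(conn, 200)

"priv/big_file.bin"
|> File.stream!([], 64_000)
|> Enum.into(conn)
```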

For more advanced usage there are also GenStage and Flow, which are aimed at pushing data between concurrent processes with back pressure.
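To give a feel for GenStage, here is a minimal producer/consumer pair (module names made up, roughly in the spirit of the example from its docs). The producer only generates events when the consumer asks for them, which is where the back pressure comes from:

```elixir
# A producer that emits integers on demand and a consumer that prints them.
defmodule Counter do
  use GenStage

  def init(start), do: {:producer, start}

  # Only produce as many events as the consumer asked for.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def init(:ok), do: {:consumer, :ok}

  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end

{:ok, counter} = GenStage.start_link(Counter, 0)
{:ok, printer} = GenStage.start_link(Printer, :ok)
GenStage.sync_subscribe(printer, to: counter, max_demand: 10)
```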

The largest file I’ve used the standard Stream module for was 30 gigs. It was a CSV I processed using Flow, and there didn’t seem to be any problem with the streaming part of it at all. For the situation mentioned in your second comment, Frogglet, it doesn’t seem like a stretch at first glance to do it with the standard libraries.

I’m not sure you’d want to use Flow for that, but it’s a great choice in general for processing streams of medium-sized data.
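The shape of it was roughly this (a sketch only; the file name, columns, and the counting step are stand-ins for what we actually did):

```elixir
# Stream a huge CSV line by line and spread the work across Flow stages,
# counting rows per category. Back pressure is handled by Flow/GenStage.
"huge.csv"
|> File.stream!()
|> Flow.from_enumerable(max_demand: 1_000)
|> Flow.map(&String.split(&1, ","))
|> Flow.partition(key: fn [_id, category | _rest] -> category end)
|> Flow.reduce(fn -> %{} end, fn [_id, category | _rest], acc ->
  Map.update(acc, category, 1, &(&1 + 1))
end)
|> Enum.to_list()
```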

The primitives are there to do this, but there could be more ready-built streaming interfaces, sure. I recently used Stream.unfold to un-page a remote JSON API and also used a Task to fetch one page ahead of what the caller was consuming. It pulls about 150K records daily and bulk upserts into our local db. I’ve been meaning to blog about it. Generalizing it to a reusable library will be a bit harder because different sources use different paging styles.
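The core of it looks something like this (a sketch under assumptions: fetch_page is a stand-in for whatever HTTP client call you use, and it is assumed to return {records, next_url_or_nil}; the prefetch is just a Task started for the next page while the current one is handed to the caller):

```elixir
# Un-page a remote JSON API as a lazy stream, fetching one page ahead.
defmodule PagedStream do
  def stream(first_url, fetch_page) do
    Task.async(fn -> fetch_page.(first_url) end)
    |> Stream.unfold(fn
      nil ->
        nil

      task ->
        case Task.await(task, 30_000) do
          # Last page: emit its records and stop on the next iteration.
          {records, nil} -> {records, nil}
          # More pages: emit records and start fetching the next page now.
          {records, next_url} -> {records, Task.async(fn -> fetch_page.(next_url) end)}
        end
    end)
    |> Stream.flat_map(& &1)
  end
end

# Usage sketch; MyClient.fetch_page/1 and MyApp.Records.upsert_batch/1 are placeholders.
PagedStream.stream("https://api.example.com/records?page=1", &MyClient.fetch_page/1)
|> Stream.chunk_every(500)
|> Stream.each(&MyApp.Records.upsert_batch/1)
|> Stream.run()
```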

4 Likes

Please update us if you write a blog post about it. Much appreciated.

1 Like