Streaming a very large file over HTTP, processing it, and handling reconnection errors

I need to process a bunch of large CSV files on a regular basis. They can be multiple gigabytes in size.

Ideally, I would stream the file over HTTP (it seems I can do that with several HTTP clients), turn it into an Elixir Stream, and then consume it in Elixir code the same way I consume any other stream of bytes. I am doing that with Finch at the moment using its stream function: Finch — Finch v0.10.2

The problem is that the processing can be interrupted by HTTP connection errors. The server I am streaming the data from seems to have a connection time limit of 30s, after which we’re out of luck and have to start over.

I need to re-connect to the same endpoint, make a new request with an HTTP “Range” header to resume from the point where we stopped, and carry on.
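A rough sketch of what I have in mind, to make it concrete (the `MyFinch` pool name and the `process_chunk` callback are placeholders, and error handling is simplified — in particular, a real version should verify the server answers a resumed request with status 206):

```elixir
defmodule ResumableDownload do
  @max_retries 5

  # process_chunk is any 1-arity fun that consumes a binary chunk.
  def run(url, process_chunk) do
    Process.put(:dl_offset, 0)
    do_run(url, process_chunk, 0)
  end

  defp do_run(url, process_chunk, retries) do
    offset = Process.get(:dl_offset, 0)
    # On a retry, ask the server to resume from the first unprocessed byte.
    headers = if offset > 0, do: [{"range", "bytes=#{offset}-"}], else: []
    req = Finch.build(:get, url, headers)

    result =
      Finch.stream(req, MyFinch, :ok, fn
        {:data, data}, acc ->
          process_chunk.(data)
          # Track progress outside the fold so it survives an error return.
          Process.put(:dl_offset, Process.get(:dl_offset, 0) + byte_size(data))
          acc

        _status_or_headers, acc ->
          acc
      end)

    case result do
      {:ok, _acc} ->
        :ok

      {:error, reason} when retries >= @max_retries ->
        {:error, reason}

      {:error, _reason} ->
        # Reconnect and resume from the last byte we processed.
        do_run(url, process_chunk, retries + 1)
    end
  end
end
```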

I can implement it myself, but I suspect someone may have done it already and I just can’t find a library like that. Anyone?


Interesting question. I don’t know of any library, but I’ve recently built something similar - in my case the source of the data is a REST API providing data about invoices and invoice lines. My code abstracts away the API’s details and creates an Elixir stream that can be consumed. In my case the stream generates a series of invoice lines (each one is a map), whereas in your case it’s generating bytes.

Also, if I understand your use case correctly, each stream generates a single HTTP request. In my case however, it takes a large number of calls to different endpoints to extract all the data that I need. Similar to your case, sometimes some calls may fail (for different reasons) and I want the stream to protect the consumer from this. Therefore, I built a retry mechanism into the stream that retries failed requests a number of times. As long as the maximum number of retries is not exceeded, the consumer won’t notice anything and will get its data in the end.
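Sketched very roughly, the retry wrapping looks like this (`fetch_page` and the page-number cursor stand in for my API-specific code, so treat the names as placeholders):

```elixir
defmodule RetryingStream do
  @max_retries 3

  # fetch_page.(page) is assumed to return {:ok, items, next_page_or_nil}
  # or {:error, reason}.
  def build(fetch_page) do
    Stream.resource(
      fn -> 1 end,
      fn
        nil ->
          {:halt, nil}

        page ->
          case fetch_with_retry(fetch_page, page, 0) do
            {:ok, items, next} -> {items, next}
            {:error, reason} -> raise "giving up on page #{page}: #{inspect(reason)}"
          end
      end,
      fn _ -> :ok end
    )
  end

  defp fetch_with_retry(fetch_page, page, attempts) do
    case fetch_page.(page) do
      {:ok, _items, _next} = ok ->
        ok

      {:error, _reason} = err when attempts + 1 >= @max_retries ->
        err

      {:error, _reason} ->
        # Simple linear backoff before retrying the same request.
        Process.sleep(100 * (attempts + 1))
        fetch_with_retry(fetch_page, page, attempts + 1)
    end
  end
end
```

The consumer just sees a flat stream of items; failed requests are retried transparently inside `Stream.resource`.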

In my case the retry mechanism was simple - just retry the failed request. I guess your case is more complicated because each request fetches a large amount of data and you don’t want to throw away what you’ve already downloaded if the request fails in the middle.

I’m sure you already have good ideas on how to implement the error recovery if you haven’t found a library.

For my part, a possible idea could be to split the large download into smaller requests that use Range requests from the start, so that none of them exceeds the connection timeout limit.
