Streaming a very large file over HTTP, processing it, and handling reconnection errors

I need to process a bunch of large CSV files on a regular basis. They can be multiple gigabytes in size.

Ideally, I would stream the file over HTTP (it seems I can do that with several HTTP clients), turn it into an Elixir Stream, and then consume it in Elixir code the same way I consume any other stream of bytes. I am doing that with Finch at the moment using its stream function: Finch — Finch v0.10.2

The problem is that the processing can be interrupted by HTTP connection errors. The server I am streaming the data from seems to have a connection time limit of 30s, after which we’re out of luck and have to start over.

I need to re-connect to the same endpoint, make a new request with an HTTP “Range” header to resume from the point where we stopped, and carry on.
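A rough sketch of what I have in mind, to make it concrete (the `MyFinch` pool name and the `process_chunk` callback are placeholders, and error handling is simplified — in particular, a real version should verify the server answers a resumed request with status 206):

```elixir
defmodule ResumableDownload do
  @max_retries 5

  # process_chunk is any 1-arity fun that consumes a binary chunk.
  def run(url, process_chunk) do
    Process.put(:dl_offset, 0)
    do_run(url, process_chunk, 0)
  end

  defp do_run(url, process_chunk, retries) do
    offset = Process.get(:dl_offset, 0)
    # On a retry, ask the server to resume from the first unprocessed byte.
    headers = if offset > 0, do: [{"range", "bytes=#{offset}-"}], else: []
    req = Finch.build(:get, url, headers)

    result =
      Finch.stream(req, MyFinch, :ok, fn
        {:data, data}, acc ->
          process_chunk.(data)
          # Track progress outside the fold so it survives an error return.
          Process.put(:dl_offset, Process.get(:dl_offset, 0) + byte_size(data))
          acc

        _status_or_headers, acc ->
          acc
      end)

    case result do
      {:ok, _acc} ->
        :ok

      {:error, reason} when retries >= @max_retries ->
        {:error, reason}

      {:error, _reason} ->
        # Reconnect and resume from the last byte we processed.
        do_run(url, process_chunk, retries + 1)
    end
  end
end
```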

I can implement it myself, but I suspect someone may have done it already and I just can’t find a library like that. Anyone?


Interesting question. I don’t know of any library, but I’ve recently built something similar - in my case the source of the data is a REST API providing data about invoices and invoice lines. My code abstracts away the API’s details and creates an Elixir stream that can be consumed. In my case the stream generates a series of invoice lines (each one is a map), whereas in your case it’s generating bytes.

Also, if I understand your use case correctly, each stream generates a single HTTP request. In my case however, it takes a large number of calls to different endpoints to extract all the data that I need. Similar to your case, sometimes some calls may fail (for different reasons) and I want the stream to protect the consumer from this. Therefore, I built a retry mechanism into the stream that retries failed requests a number of times. As long as the maximum number of retries is not exceeded, the consumer won’t notice anything and will get its data in the end.
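Sketched very roughly, the retry wrapping looks like this (`fetch_page` and the page-number cursor stand in for my API-specific code, so treat the names as placeholders):

```elixir
defmodule RetryingStream do
  @max_retries 3

  # fetch_page.(page) is assumed to return {:ok, items, next_page_or_nil}
  # or {:error, reason}.
  def build(fetch_page) do
    Stream.resource(
      fn -> 1 end,
      fn
        nil ->
          {:halt, nil}

        page ->
          case fetch_with_retry(fetch_page, page, 0) do
            {:ok, items, next} -> {items, next}
            {:error, reason} -> raise "giving up on page #{page}: #{inspect(reason)}"
          end
      end,
      fn _ -> :ok end
    )
  end

  defp fetch_with_retry(fetch_page, page, attempts) do
    case fetch_page.(page) do
      {:ok, _items, _next} = ok ->
        ok

      {:error, _reason} = err when attempts + 1 >= @max_retries ->
        err

      {:error, _reason} ->
        # Simple linear backoff before retrying the same request.
        Process.sleep(100 * (attempts + 1))
        fetch_with_retry(fetch_page, page, attempts + 1)
    end
  end
end
```

The consumer just sees a flat stream of items; failed requests are retried transparently inside `Stream.resource`.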

In my case the retry mechanism was simple - just retry the failed request. I guess your case is more complicated because each request fetches a large amount of data and you don’t want to throw away what you’ve already downloaded if the request fails in the middle.

I’m sure you already have good ideas on how to implement the error recovery if you haven’t found a library.

For my part, a possible idea could be to split the large download into smaller requests that use Range requests from the start, so that none of them exceeds the connection timeout limit.
