Streaming from S3 to client with constant memory?

ryanwinchester · March 15, 2021, 5:43pm

Greetings fellow humans,

First, I’ll tell you why I’m not using pre-signed URLs: this is for sensitive files that I want to control access to at a much more granular level than signed URLs allow.

I’ve been looking at aws-elixir and ex_aws_s3.

For now, I’m planning to use cloud functions, and want to keep memory limits as low as possible.

So, my objective is to stream a file from S3, to the client, while keeping memory usage constant and not running out.

It looks like with ex_aws_s3 I can do:

    ExAws.S3.download_file(..., :memory)
    |> ExAws.stream!()
    |> Enum.reduce_while(Plug.Conn.send_chunked(), &()...)

but I’m worried about not being able to control the memory usage. Could that be a problem?
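
For readers following along, here is a minimal sketch of that pipeline with the reducer spelled out. `ChunkedRelay` is a made-up name, and `bucket` and `key` in the commented wiring are placeholders; the only library calls assumed are `ExAws.S3.download_file/3`, `ExAws.stream!/1`, `Plug.Conn.send_chunked/2` and `Plug.Conn.chunk/2`. The sender is passed in as a function so the reduce logic can be exercised without a real connection.

```elixir
defmodule ChunkedRelay do
  # Hypothetical helper, not part of ex_aws_s3 or Plug.
  # `send_chunk` is any function that takes (conn, binary) and returns
  # {:ok, conn} or {:error, reason}; Plug.Conn.chunk/2 has that shape.
  def relay(chunks, conn, send_chunk) do
    Enum.reduce_while(chunks, conn, fn chunk, conn ->
      case send_chunk.(conn, chunk) do
        {:ok, conn} -> {:cont, conn}
        # The write failed (for example the client went away): stop pulling chunks.
        {:error, _reason} -> {:halt, conn}
      end
    end)
  end
end

# Assumed wiring inside a Plug/Phoenix action (`bucket` and `key` are placeholders):
#
#   conn = Plug.Conn.send_chunked(conn, 200)
#
#   ExAws.S3.download_file(bucket, key, :memory)
#   |> ExAws.stream!()
#   |> ChunkedRelay.relay(conn, &Plug.Conn.chunk/2)
```

Halting on `{:error, _}` matters for memory as much as for correctness: once the reduce stops, nothing pulls further chunks from the stream.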

Another option @philss mentioned in aws-elixir in this issue is using a range option, which I guess would give more control.

What do you fine elixir folks think I should do?

evadne · March 15, 2021, 6:31pm

You can use a pre-signed URL with a signature that has an insanely short period of validity. Sign the URL every time. Should be much simpler. Having traffic go through your host can create a massive headache when the time comes to pay your bills!

ryanwinchester · March 15, 2021, 7:33pm

Another issue with the pre-signed URL is that SSE-C (server-side encryption with customer-provided keys) requires headers to be sent to AWS with the request, so you can’t just redirect them to the URL or give the link out (it also exposes the key to the downloader, although that’s not the end of the world).
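
To make "requires headers" concrete, this is a small sketch of the three request headers S3 documents for SSE-C. `SSEC.headers/1` is a hypothetical helper, not something from ex_aws_s3 or aws-elixir, and it assumes you already hold the raw 32-byte key.

```elixir
defmodule SSEC do
  # Hypothetical helper: builds the three request headers S3 expects for SSE-C.
  # `key` is the raw 32-byte (256-bit) AES key, not a Base64 or hex string.
  def headers(key) when is_binary(key) and byte_size(key) == 32 do
    [
      {"x-amz-server-side-encryption-customer-algorithm", "AES256"},
      # The key itself, Base64-encoded.
      {"x-amz-server-side-encryption-customer-key", Base.encode64(key)},
      # Base64 of the binary MD5 digest of the raw key (not of the Base64 text).
      {"x-amz-server-side-encryption-customer-key-MD5",
       Base.encode64(:crypto.hash(:md5, key))}
    ]
  end
end
```

Something like `SSEC.headers(:crypto.strong_rand_bytes(32))` would produce headers for a fresh random key; the same key has to be presented again on every GET of that object.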

I haven’t found a good way to initiate that in the browser without some janky web workers acting as a proxy or something, but my JS expertise is severely lacking these days. I’ve fallen so far behind on frontend stuff

evadne · March 15, 2021, 7:46pm

Another idea:

1. Generate a pre-signed URL but keep it server-side
2. Serve a stream to the client
3. When the client wants more data, you go and download a few megs from AWS and serve that down the stream

This way you can use whatever reads chunks off HTTP, and you have more control over how big each chunk sent to the user should be, etc.
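
One way to picture the "few megs at a time" part is as a list of byte ranges, one ranged GET per entry. The sketch below only does the arithmetic; `ByteRanges` is a hypothetical module, and it assumes you already know the object size (for example from a HEAD request).

```elixir
defmodule ByteRanges do
  # Hypothetical helper: splits an object of `total_size` bytes into
  # inclusive {first, last} byte offsets of at most `chunk_size` bytes each.
  def split(total_size, chunk_size) when total_size >= 0 and chunk_size > 0 do
    0
    |> Stream.iterate(&(&1 + chunk_size))
    |> Stream.take_while(&(&1 < total_size))
    |> Enum.map(fn first -> {first, min(first + chunk_size, total_size) - 1} end)
  end

  # Formats one range as the value of an HTTP `Range` header.
  def header({first, last}), do: "bytes=#{first}-#{last}"
end
```

Each range can then be fetched and forwarded before the next one is requested, so only one chunk needs to be held at a time.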

Also it should be possible to specify the key even in a pre-signed GET request for S3.

ryanwinchester · March 15, 2021, 8:02pm

I’ll pay $200 USD to anybody who can tell me how to get a working pre-signed GET URL with SSE-C that requires no headers. I’ve wasted so much of my spare time on this already

evadne · March 15, 2021, 8:19pm

TBF…

ExAws.S3.download_file/4: Defaults to a concurrency of 8, chunk size of 1MB, and a timeout of 1 minute.

So you should be ok. The stream isn’t started until the customer starts downloading anyway.
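
As a back-of-the-envelope sketch (an estimate, not a measurement): if up to `concurrency` chunks of `chunk_size` bytes are in flight at once, the buffered chunk data is bounded by roughly their product. `DownloadBudget` is a hypothetical helper; the option names in the trailing comment are ex_aws_s3’s download options as I understand them, so check the docs for your version.

```elixir
defmodule DownloadBudget do
  # Hypothetical helper: rough upper bound, in bytes, on chunk data held in
  # flight when `concurrency` chunks of `chunk_size` bytes are fetched at once.
  # It ignores process overhead and any copies made while sending.
  def in_flight_bytes(concurrency, chunk_size), do: concurrency * chunk_size
end

one_mb = 1024 * 1024

# Defaults quoted above: concurrency of 8, 1MB chunks.
IO.puts(DownloadBudget.in_flight_bytes(8, one_mb))

# Assumed tuning for a small cloud function (`bucket` and `key` are placeholders):
#
#   ExAws.S3.download_file(bucket, key, :memory, max_concurrency: 2, chunk_size: one_mb)
```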

If you want to avoid the server-side round-trip, then you will have to expose the headers and start the download on the client with some JavaScript. You can use XMLHttpRequest to arrange the headers and get a blob back, then make a Data URI out of the blob, and serve it to the user… this works as long as the file is not large.

You can also try eligrey/FileSaver.js (an HTML5 saveAs() FileSaver implementation) for the saving part.

ryanwinchester · March 15, 2021, 8:39pm

I’m probably going to start with a cap of 1GB filesize.

Related to FileSaver, I was looking at jimmywarting/StreamSaver.js (which writes streams to the filesystem directly and asynchronously), but thought that a service worker seemed a bit janky. Maybe it’s better than streaming through my own server, though.

evadne · March 15, 2021, 8:50pm

The bottom line IMO is whether you MUST use SSE-C and if it is acceptable in your use case to expose the headers to the client.

ryanwinchester · March 15, 2021, 8:53pm

[Re: SSE-C] It’s a must.
[Re: headers] It’s not ideal, but I think it’s acceptable in this case.

hauleth · March 15, 2021, 9:17pm

Not an ideal solution, but use Nginx as a reverse proxy for S3 and extract header values from the URL params (or any other part of the query string).

ryanwinchester · March 16, 2021, 12:38am

Streaming 2.4GB zip from S3 through locally running app, through browser:

Using

    ExAws.S3.download_file/3
    |> ExAws.stream!/1
    |> Enum.reduce_while(Plug.Conn.send_chunked/3, &(...))

The memory usage wasn’t frightening, so might be okay, for now.

This is on a 6-core 2019 iMac with 32GB RAM. Not sure if there are any significant tradeoffs happening with memory usage if I go to a 1-2 vcpu environment…
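
If you want a number rather than an impression, one option is to sample the VM’s own memory counter while the stream is consumed. This is only a sketch: `MemoryWatch` is a hypothetical helper, and `:erlang.memory(:total)` reports what the BEAM has allocated, which is not the same as the resident size the OS or a cloud function’s limit sees.

```elixir
defmodule MemoryWatch do
  # Hypothetical helper: consumes `stream`, calling `sink` on each chunk,
  # and returns the highest total VM memory (in bytes) seen along the way.
  def peak_while_consuming(stream, sink) do
    Enum.reduce(stream, :erlang.memory(:total), fn chunk, peak ->
      sink.(chunk)
      max(peak, :erlang.memory(:total))
    end)
  end
end
```

Running the same download with different concurrency settings and comparing the returned peaks should show whether memory stays flat as the file size grows.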

luizdamim · March 16, 2021, 1:48am

Are the SSE-C keys a requirement from your customers or your own architecture? And how many concurrent downloads are you expecting?

A crazy idea, but it’s worth sharing… (only valid depending on your responses above)

You can spin up new S3 buckets on demand, copying files using your full security model and deleting those buckets after the download ends or after a regular interval.

This way, the clients can download files using a much more relaxed security model (but still secure) pre-signed URLs.

evadne · March 16, 2021, 5:01am

There is a default limit of 100 buckets per account unless additional quota is requested. Creation of a new S3 bucket takes a few seconds to a few minutes.

josevalim · March 16, 2021, 6:25am

The concurrency of 8 for download_file is going to increase the memory size (and likely the amount of data references). If the goal is to stream, I would try to go with Finch (and/or Mint) + Range queries.
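
A rough sketch of what one ranged request through Finch could look like. `RangeStreamer` is a hypothetical module; `finch` is the name of a Finch pool you have started, and `url` and `headers` (signature, SSE-C, and so on) are placeholders supplied by the caller. It assumes Finch’s `Finch.build/3` and `Finch.stream/4`, and it does not check the response status, which a real version should (a ranged GET normally answers 206).

```elixir
defmodule RangeStreamer do
  # Hypothetical module: relays the body of one ranged GET to `on_data`.
  # `on_data` takes (binary, acc) and returns the new acc.
  def fetch_range(finch, url, headers, {first, last}, acc, on_data) do
    request = Finch.build(:get, url, [{"range", "bytes=#{first}-#{last}"} | headers])

    # Finch calls the function once per response part, as the data arrives.
    Finch.stream(request, finch, acc, fn part, acc ->
      handle_part(part, acc, on_data)
    end)
  end

  # Only body data is forwarded; status and header parts leave acc untouched.
  def handle_part({:data, data}, acc, on_data), do: on_data.(data, acc)
  def handle_part(_status_or_headers, acc, _on_data), do: acc
end
```

With Plug, `acc` could be the `conn` and `on_data` a function that calls `Plug.Conn.chunk/2` and returns the updated `conn`. Fetching ranges one after another keeps a single request in flight, unlike the concurrent download.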

luizdamim · March 16, 2021, 12:29pm

You can increase the maximum limit to 1000 by sending them an email (at least that’s stated in the documentation).

Depending on the implementation, you can work in the background and notify the user when everything’s ready. I know it’s far from ideal, but it’s a possible solution given the constraints.

philss · March 16, 2021, 5:49pm

Cool! I made a small PoC using aws-elixir and Finch that kept the memory usage even lower (ends in 4m05). It’s not a fair comparison because it’s not streaming to the client.

Here is the code: lib/download_manager.ex in philss/aws-s3-stream-download-poc on GitHub (there is a custom AWS.HTTPClient implementation using Finch inside the project).

ryanwinchester · March 17, 2021, 2:40am

This is excellent!

I’m definitely going to try this out soon, too.

alexandre · March 17, 2021, 1:12pm

Some months ago I wrote a stream downloader with Mint.

I had the opposite problem: stream from clients to S3. I needed to control the chunk size, since S3 multipart uploads require a minimum part size of 5MB (for every part but the last).

That was accomplished with:

    url
    |> Downloader.stream_body!()
    |> Downloader.chunk_bytes(5_000_000)
    |> ExAws.S3.upload(s3_bucket, filename, opts)
    |> ExAws.request!()
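
For anyone curious what a `chunk_bytes`-style step might do, here is a guess at the idea in plain Elixir. It is not the code from the downloader above: `Rechunk.by_bytes/2` is a hypothetical stand-in that regroups a stream of arbitrary-sized binaries into fixed-size ones.

```elixir
defmodule Rechunk do
  # Hypothetical stand-in: regroups a stream of binaries into binaries of
  # exactly `size` bytes (only the last one may be shorter).
  def by_bytes(stream, size) when is_integer(size) and size > 0 do
    stream
    # A sentinel marks the end so the leftover buffer can be flushed.
    |> Stream.concat([:eof])
    |> Stream.transform(<<>>, fn
      :eof, <<>> -> {[], <<>>}
      :eof, buffer -> {[buffer], <<>>}
      data, buffer -> take(buffer <> data, size, [])
    end)
  end

  # Peels off as many full-size chunks as the buffer holds.
  defp take(buffer, size, acc) when byte_size(buffer) >= size do
    <<chunk::binary-size(size), rest::binary>> = buffer
    take(rest, size, [chunk | acc])
  end

  defp take(buffer, _size, acc), do: {Enum.reverse(acc), buffer}
end
```

The buffer never holds more than one incoming chunk plus a remainder shorter than `size`, so memory stays bounded regardless of the total length of the stream.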

dimitarvp · March 17, 2021, 7:29pm

Really cool, thanks for sharing this.

ryanwinchester · April 3, 2021, 5:19am

Is this how I use AWS.S3 with SSE-C?

I’m getting a 403, and can’t tell if I’m using the function arguments wrong or if I’m deriving the key incorrectly, so I don’t know where to waste my time