Alternate :multipart plug parser - S3 storage instead of tmp/

Hi folks,

I was able to solve the problem of proxying and encrypting uploads to S3 via NodeJS in a memory-efficient stream. This is easier in NodeJS because an Express controller can receive the file contents as a stream, and that stream can be transformed and piped up to S3 with predictable memory consumption regardless of file size (see busboy).

I want the same behavior in Elixir - but this requires a custom multipart parser!

I have created a POC where, instead of the uploaded byte chunks being written to tmp/ via Plug.Upload, the file is uploaded to S3.
I started from the built-in Plug.Parsers.MULTIPART and modified it to achieve the above. It is structurally very similar to the normal multipart parser.

If the file is < 5 MB it is persisted with a simple S3 put_object; when it is > 5 MB it is persisted via an S3 multipart upload.

One change I made to more easily handle the S3 multipart upload is altering the read_length option from 1_000_000 bytes (~1 MB) to 5_242_880 bytes (5 MB). Are there potential negative side effects from doing this (aside from the obvious ~5x memory consumption)?

This was done because S3 multipart upload parts must be at least 5 MB (except the last part), so chunks can be sized by read_length with no extra chunking logic.
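
For reference, the two persistence paths look roughly like this with ExAws (bucket and key are placeholders, and the real parser feeds in the 5 MB chunks it reads from the conn rather than splitting a binary; the response shapes assume sweet_xml is present so ExAws parses the XML):

defmodule S3Persist do
  # Hypothetical helper illustrating the two paths described above.
  @bucket "my-bucket"
  @five_mb 5 * 1024 * 1024

  # Small files: a single put_object call.
  def persist(key, body) when byte_size(body) < @five_mb do
    ExAws.S3.put_object(@bucket, key, body) |> ExAws.request!()
  end

  # Large files: initiate a multipart upload, send each >= 5 MB chunk
  # as a numbered part, then complete with the collected ETags.
  def persist(key, body) do
    %{body: %{upload_id: upload_id}} =
      ExAws.S3.initiate_multipart_upload(@bucket, key) |> ExAws.request!()

    parts =
      body
      |> chunk_every_5mb()
      |> Enum.with_index(1)
      |> Enum.map(fn {chunk, n} ->
        %{headers: headers} =
          ExAws.S3.upload_part(@bucket, key, upload_id, n, chunk) |> ExAws.request!()

        {_, etag} = Enum.find(headers, fn {k, _} -> String.downcase(k) == "etag" end)
        {n, etag}
      end)

    ExAws.S3.complete_multipart_upload(@bucket, key, upload_id, parts) |> ExAws.request!()
  end

  # Split into 5 MB parts; the final part may be smaller, which S3 allows.
  defp chunk_every_5mb(<<chunk::binary-size(@five_mb), rest::binary>>),
    do: [chunk | chunk_every_5mb(rest)]

  defp chunk_every_5mb(rest), do: [rest]
end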

Anyway, here are some questions I have for the community:

Could an Elixir Stream be used here similar to how it is done in NodeJS?

  • I.e., Plug.Upload actually returns a list of streams to files that are lazily parsed from the body
  • I was not sure how to create a stream from the uploaded chunks without introducing some memory leak, as I cannot control the upload pace
    • I need to buffer those bytes somewhere, right?

Is this a good use case for Broadway/GenStage?

  • Thinking something like this (p: producer, c: consumer); a rough skeleton is sketched after this list:
  • Multipart Parser ( p ) → Hash?(transform1)(p/c) → Encrypt(transform2) (p/c) → S3Upload ( c )
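
To make that topology concrete, a minimal GenStage skeleton of the same shape might look like this (module names are made up, the Encrypt stage is omitted for brevity, and the producer is a stub since the real one would pull chunks from the conn):

defmodule Pipeline.ChunkProducer do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts), do: {:producer, :no_state}

  # In a real parser this would read body chunks on demand; stubbed here.
  def handle_demand(_demand, state), do: {:noreply, [], state}
end

defmodule Pipeline.Hasher do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts),
    do: {:producer_consumer, :crypto.hash_init(:sha256), subscribe_to: [Pipeline.ChunkProducer]}

  # Fold each chunk into the running hash, then pass the chunks downstream.
  def handle_events(chunks, _from, hash_state) do
    new_state = Enum.reduce(chunks, hash_state, &:crypto.hash_update(&2, &1))
    {:noreply, chunks, new_state}
  end
end

defmodule Pipeline.S3Uploader do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts), do: {:consumer, [], subscribe_to: [Pipeline.Hasher]}

  # Accumulate until a 5 MB part is ready, then upload it as an S3 part (omitted).
  def handle_events(chunks, _from, acc) do
    {:noreply, [], acc ++ chunks}
  end
end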

Should more flexibility be introduced to the built-in parser, enabling its behavior to be altered to allow this type of functionality?

Or

Should this be its own Hex package, used as an alternative to the built-in multipart parser?

Thanks for taking the time to read this; I am very open to suggestions and ideas!

If anyone would like to help make this into a published Hex package, let's collaborate!

6 Likes

This is perhaps not what you want to hear, but if the ultimate destination is S3, it's generally considered better practice to have the client send the content straight to S3. This is done using presigned URLs (loosely supported but under-documented in ExAws), which removes your server compute from the equation altogether. This approach does require that you're willing to defer any post-processing you would ordinarily do in-line during the upload, and to put in the work to make that happen asynchronously after the fact.
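
For reference, generating a presigned PUT URL with ExAws looks roughly like this (bucket, key, and expiry are placeholders):

# Build the S3 config and sign a URL the client can PUT the file to directly.
config = ExAws.Config.new(:s3)

{:ok, url} =
  ExAws.S3.presigned_url(config, :put, "my-bucket", "uploads/some-file.bin",
    expires_in: 900
  )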

It's not universally appropriate for all use cases, and if uploads aren't a core focus of the work in question it may not be worth it, since it can take significantly more engineering time than just doing it naively with the application server, especially if it's the first time a dev/team is implementing this technique.

3 Likes

I anticipated this response :slight_smile:

Yea, this approach is most useful when post-processing is involved (which is quite common).

I would not go so far as to call my suggested server-side approach naive, though I would agree that if the final destination is some object storage, using the built-in Plug.Upload would be naive. And if you are deploying to the cloud and the final destination is not some native object storage, it probably should be…

This will consume more resources on your server, but as we all know, dev time is more valuable than compute time. Depending on what else the server is doing and the post-processing involved, the server compute required is pretty negligible (I think? See the original questions), and because it is kind of a stream, memory consumption will be predictable regardless of file size.

Signed URLs have their place. My focus was more on extending the behavior, or standardizing on an approach that extends Plug.Upload, or otherwise making this less effort. This was an easy problem to solve in NodeJS with Express middleware, and I think the idea of alternative plug parsers to help solve these kinds of problems is also viable in this ecosystem.

Not sure how the LiveView file upload work is coming along, but these ideas could be applied to that somehow, if they are not already… At least the idea of exposing the upload to the server as an IO stream, allowing the consumer to stream it to a tmp file or to S3, and to perform post-processing in between!

1 Like

While I am not at all versed in Plug.Upload, I'd still expect what you require to exist in one form or another, and it's a bit disappointing that it doesn't. I've done this in Java and Go years ago; however, the problem was the shared memory buffers. You can limit their size so that memory usage doesn't explode, sure, but moving synchronised buffers around can get pretty computationally expensive as well.

So I think your idea of having N streams pointing to files inside Plug.Upload might require too much work. But regarding your other point, and if you are not willing to use S3 pre-signed URLs, making a mediator on your server is quite trivial in lower-level languages like Go and Rust; however, I've never tried doing it in Elixir. Maybe Erlang's :inet or :socket modules can help?

Yea, I am trying to represent the file parsing as a stream, but due to some contention on the conn, or probably because it's being accessed in multiple places, this does not seem to work and the stream just hangs. Here is some code:

Stream.resource(
  fn ->
    IO.inspect("Start Streaming")
    conn
  end,
  fn conn ->
    # NOTE: a full multipart reader also calls Plug.Conn.read_part_headers/2
    # to advance to each part before reading that part's body.
    Plug.Conn.read_part_body(conn, opts)
    |> IO.inspect(label: "read this part!")
    |> case do
      # Stream.resource/3 expects the emitted element(s) wrapped in a list.
      {:ok, tail, conn} -> {[tail], conn}
      {:more, tail, conn} -> {[tail], conn}
      {:done, conn} -> {:halt, conn}
    end
  end,
  fn _conn ->
    IO.inspect("streaming completed!")
  end
)

The previous attempt, going to object storage directly from the parser and blocking the conn, worked OK and behaves more like Plug.Upload. But as you mention, N streams to N files inside a Plug.Upload-like struct is pretty tricky, and that is what I am going for, drawing parallels to NodeJS body parsers like busboy.

Going lower level may be the only option, but it's still just a multipart form we are parsing here, and this would hopefully still work as a simple plug parser…

After looking more at the Plug.Conn docs, this stands out and is likely the issue I am having:

Because the request body can be of any size, reading the body will only work once, as Plug will not cache the result of these operations. If you need to access the body multiple times, it is your responsibility to store it. Finally keep in mind some plugs like Plug.Parsers may read the body, so the body may be unavailable after being accessed by such plugs.
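
Which suggests the streaming has to happen inside the parser that reads the body. For reference, a custom parser module can be slotted into Plug.Parsers; a minimal sketch (module names are hypothetical):

defmodule S3MultipartParser do
  # A parser module implements the Plug.Parsers behaviour.
  @behaviour Plug.Parsers

  @impl true
  def init(opts), do: opts

  # Only handle multipart requests; let other parsers try anything else.
  @impl true
  def parse(conn, "multipart", _subtype, _params, _opts) do
    # Read the body part by part here (read_part_headers/read_part_body)
    # and push chunks to S3 instead of tmp files, returning the parsed params.
    {:ok, %{}, conn}
  end

  def parse(conn, _type, _subtype, _params, _opts), do: {:next, conn}
end

defmodule MyApp.ParserPipeline do
  use Plug.Builder

  # Replaces the built-in :multipart parser with the custom one.
  plug Plug.Parsers,
    parsers: [S3MultipartParser, :urlencoded],
    pass: ["*/*"]
end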

2 Likes

I encountered a similar situation when I was working for a storage company implementing an S3-compatible storage layer. I think you're very likely going to have to go lower level. Plug typically runs on Cowboy, which does things in an Erlang-y way, which usually is the Elixir-y way… but streams are an Elixir concept, and I remember giving up and deciding that mixing the two was like a square peg in a round hole. But that could be down to my deficiencies as a programmer (and a BEAM programmer) at the time.

1 Like

I have kind of given up on the stream idea and instead built a more configurable multipart parser that accepts a module + behaviour and some config, to allow customizing what happens to the file as the body is being parsed.

I implemented both a Temp and an S3 adapter.
Temp is nearly identical to the behavior of the built-in Plug.Upload.
S3 implements some of the first POC, chunk-uploading to S3 rather than writing to temp.

I designed it so adapters could implement some custom logic like encryption before writing the file…

It might not be as elegant as using streams would be, but seems to work pretty well!
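
Roughly, the shape of the idea is something like this (callback and module names are made up for illustration, not the actual Minne API):

defmodule UploadAdapter do
  # Hypothetical adapter contract: the parser hands each body chunk to the
  # adapter, which decides where the bytes go (tmp file, S3 part, etc.).
  @callback init(opts :: keyword()) :: state :: term()
  @callback write_chunk(state :: term(), chunk :: binary()) :: state :: term()
  @callback close(state :: term()) :: {:ok, upload :: struct()} | {:error, term()}
end

defmodule UploadAdapter.Temp do
  @behaviour UploadAdapter

  # Requires the :plug application, since Plug.Upload owns the tmp files.
  @impl true
  def init(_opts) do
    path = Plug.Upload.random_file!("multipart")
    {path, File.open!(path, [:write, :binary])}
  end

  @impl true
  def write_chunk({_path, io} = state, chunk) do
    IO.binwrite(io, chunk)
    state
  end

  @impl true
  def close({path, io}) do
    File.close(io)
    {:ok, %Plug.Upload{path: path}}
  end
end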

3 Likes

I went ahead and packaged this up a bit better and named it Minne, which is a Swedish word for storage. (I am not Swedish; naming things is just hard.)

https://github.com/harmon25/minne

Going to publish it to Hex, but I would appreciate more feedback on the code to ensure my approach is reasonable before other people start relying on it. It should work in all Plug/Phoenix apps.

Not 100% sure of my approach to adapters, and how I access the x.__struct__ field to perform an apply/3.
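
For anyone unfamiliar with the pattern, the dispatch in question looks roughly like this (illustrative names, not Minne's actual modules):

defmodule Dispatch do
  # Each adapter returns its own upload struct; the parser then dispatches
  # back to whichever adapter module defined that struct.
  def write_chunk(%module{} = upload, chunk) do
    # Equivalent to apply(upload.__struct__, :write_chunk, [upload, chunk])
    apply(module, :write_chunk, [upload, chunk])
  end
end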

It should be pretty simple to create more complex adapters that do things like encryption, hashing, post-processing, etc., and have those calculated values passed into the controller so your controller logic can be free of such concerns.

Probably just going to maintain the basic adapters and let people include more complex adapters in their own app code. I have also borrowed some tests from Plug to ensure the multipart functionality of Minne.Adapter.Temp is essentially the same as the built-in multipart parser, apart from the different Upload struct.

2 Likes