Phoenix app locks up on browser request only

Hey folks, having an odd issue with one of my Phoenix controllers. I’m serving files out of S3-compatible storage at paths like /storage/<id>/<filename>. Because we may eventually want to authenticate access to documents, I’m not using pre-signed URLs with my storage subsystem (Minio), since those can be shared. Instead, the controller retrieves the file from storage using ExAws and serves it itself.

The issue I’m hitting is that, for large files (77 MB in local testing) my Phoenix process locks up entirely, but only if I request the file from a browser session, and only after the first request concludes. This doesn’t happen for smaller files (12 MB or so). It also doesn’t happen via wget/curl. I can wget these files until I’m sick of hitting up-arrow/enter, and they download very fast, but the instant I load them in a browser, I lock up. Further, once the process locks, wget/curl don’t work. So it isn’t a browser issue. When I visit URLs in a browser, I’m hitting URLs like:

http://localhost:8080/storage/2cc4d1d4-8754-46c6-8c2f-5809fb6f7e6d/ig-ag302.mp3

I.e. I’m hitting the file directly, so this isn’t an issue with layouts or the broader rendering pipeline.

My assumption is that sessions are involved for some reason, but I can’t think of why this might be. I’m using database-backed sessions, so I can see the data they contain. Nothing looks out of the ordinary or particularly large. I’d also understand if 77 MB was being held in memory for some reason, but then hammering my server with wgets should trigger that far quicker than my browser does.

Any thoughts as to what may be happening here? I have a few options if I can’t make this work. I can expose Minio externally and redirect to pre-signed URLs with a short TTL, or maybe stream the file internally. But this is mostly working code, and I don’t much like solving problems without understanding them, so I’d rather not put in a fix without understanding why this is broken. Anyhow, my controller code:

defmodule ScribeWeb.StorageController do
  use ScribeWeb, :controller

  alias Scribe.Documents

  def show(conn, %{"id" => id, "filename" => filename} = params) do
    document =
      try do
        Documents.get_document(id)
      rescue
        # May not need this anymore but it doesn't trigger in this scenario anyway.
        # Return nil so the 404 is sent exactly once, in the case below.
        _ -> nil
      end

    case document do
      nil ->
        send_resp(conn, :not_found, "")

      _document ->
        bucket = Application.get_env(:waffle, :bucket)

        # Note: this reads the whole object into memory before responding.
        case ExAws.request(ExAws.S3.get_object(bucket, "#{id}/#{filename}")) do
          {:ok, %{body: body}} ->
            extension = filename |> Path.extname() |> String.trim_leading(".")
            conn = put_resp_content_type(conn, MIME.type(extension))

            conn =
              if Map.has_key?(params, "download") do
                put_resp_header(conn, "content-disposition", "attachment; filename=\"#{filename}\"")
              else
                conn
              end

            send_resp(conn, 200, body)

          {:error, _} ->
            send_resp(conn, :not_found, "Not found")
        end
    end
  end
end

Thanks.

Have you already tried using send_download/3 instead of the ‘manual’ put_resp_header + send_resp combination? Maybe it does something subtly different that the browser chokes on while curl/wget are more permissive.

Another thing you could do to narrow down where things go wrong is to leave out the AWS get_object call and serve a 77 MB file straight from disk, to see whether that AWS request is to blame.
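Both suggestions can be combined in one throwaway action. This is only a sketch: the module name ScribeWeb.StorageDebugController and the /tmp path are made up for illustration, and it assumes Phoenix’s send_download/3 with a {:file, path} source, which takes the filename and content type from the path when none are given.

```elixir
defmodule ScribeWeb.StorageDebugController do
  use ScribeWeb, :controller

  # Hypothetical location of a large test file on local disk.
  @test_file "/tmp/large-test-file.mp3"

  def show(conn, params) do
    # Mirror the original behaviour: attachment only when ?download is present.
    disposition = if Map.has_key?(params, "download"), do: :attachment, else: :inline

    # No ExAws call at all, so storage is taken out of the picture.
    send_download(conn, {:file, @test_file}, disposition: disposition)
  end
end
```

If this version survives repeated browser requests, the storage fetch becomes the more likely suspect. To keep the storage fetch but still use send_download/3, the body from ExAws could be passed as send_download(conn, {:binary, body}, filename: filename).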

Thanks, ran out of time for this project this week but will try it out
on Monday. Not sure why it would work differently, but I’ll report back
on whether it does.

And in case it wasn’t clear, AWS proper isn’t in the equation.
Everything is local, with files being stored in Minio. So it shouldn’t
be a case of failures between me and Amazon.

Thanks for the suggestion.

Just FYI, I created an empty Phoenix app, pasted more or less your code into an empty controller, and can download a 125 MB file over and over (in this case a large PDF book/manual). I also replaced the AWS get part with a plain File.read!(filename) of that 125 MB file. So maybe there’s something else at play. One thing I know send_download does extra is encode the filename, and browsers are known to choke on badly formed names, but that doesn’t seem logical here since you describe the first download going through.

Maybe there’s something else inside your plug pipeline that interferes? Is it only with that 77 MB file? I would also try a different large file; maybe it’s something specific to that file (still not logical, given the first download went through). And does it happen in all browsers? (It should not matter, but still.)

Anyway good luck for when you’ve got the time again to work on it!

Have you checked your browser developer tool’s network tab? Does it get a timeout?

EDIT: Sorry if this is a stupid question, it looks like you know what you are doing.

EDIT2: to clarify this, I have had issues where a file upload did not seem to finish, very similar to what you described. In my case it was simply an issue of the client thinking a request was over, but the server was still doing something.

Does it time out after 60 seconds? See this Phoenix issue/fix: https://github.com/phoenixframework/phoenix/issues/3190

It’s also possible that get_object itself times out (perhaps through the underlying hackney config?): https://github.com/ex-aws/ex_aws_s3/blob/v2.0.2/lib/ex_aws/s3.ex#L464

I’m purely fishing here, so timeouts might not be the issue.
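If timeouts do turn out to matter, both knobs can be raised from config. A rough sketch only: it assumes ExAws is using its default hackney client, that the endpoint runs on Cowboy (whose idle timeout defaults to 60 seconds), and that the OTP app is named :scribe, which is a guess based on the module names. The values are arbitrary.

```elixir
# config/dev.exs (sketch; values are arbitrary examples)
import Config

# Assumption: ExAws talks to Minio through its default hackney HTTP client.
config :ex_aws, :hackney_opts,
  recv_timeout: 120_000

# Assumption: the endpoint is served by Cowboy and the app is called :scribe.
config :scribe, ScribeWeb.Endpoint,
  http: [protocol_options: [idle_timeout: 120_000]]
```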

I’m investigating this now, but here’s the oddest thing to me, and I’m
not sure if I included this in my original post.

It isn’t just the download of my large file that hangs. It is every
request to phoenix, authenticated or not, from the browser or curl. It’s
as if hitting a large file URL in my browser hangs the entire app for
everyone. There are no errors in my console output, and the process
appears to still be responsive when I connect to iex. I’m still
investigating my storage controller, but it’s strange to me that the
entire app would hang. Or maybe some process is crashing and not
recovering. If the issue was only in my controller, I’d expect just that
request to fail and everything else to continue.

Does this suggest any other debugging steps? Bothers me that this is
failing and not producing errors.

OK, I’ve spent all week banging my head against this. Here’s where I am now.

  • All I’m literally doing is generating a pre-signed URL and sending
    out a 302 redirect to that. The asset is hosted in Minio, so Phoenix is
    only generating the URL, handing it back, and that’s it. The client then
    hits Minio directly.

  • My requests do time out after 30 seconds. Curl reports “* Recv
    failure: Connection reset by peer” so it seems to be a lower-level
    networking error. There are no HTTP headers, so this isn’t an HTTP response.

  • All attempts to connect to Phoenix hang after this occurs.

  • The iex console in which I’m running my app is still available and
    responding.

  • I’m seeing no errors in the logs. In fact, the logs report that the
    request that hangs my app succeeded.

  • FWIW, this page is rendered by a LiveView. The LiveView otherwise
    works very well, and I don’t have any reason to suspect that it would
    hang the entire app, but I mention it only for completeness’ sake.

  • The large file is also in an <audio/> tag. Given that LiveViews
    are rendered server-side, I’m wondering if the audio tag is consuming
    the content of its src for some reason? I don’t have any reason to
    suspect that it would, but I don’t know what to think at this point.
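For reference, the redirect described in the first bullet can be sketched roughly as below. This is not the actual project code: the controller name and the 60-second TTL are assumptions, and it relies on ExAws.S3.presigned_url/5 and Phoenix’s redirect/2 with the :external option.

```elixir
defmodule ScribeWeb.StorageRedirectController do
  use ScribeWeb, :controller

  def show(conn, %{"id" => id, "filename" => filename}) do
    bucket = Application.get_env(:waffle, :bucket)
    # Build the S3 config from the application environment.
    config = ExAws.Config.new(:s3)

    # expires_in is in seconds; 60 is an arbitrary short TTL.
    case ExAws.S3.presigned_url(config, :get, bucket, "#{id}/#{filename}", expires_in: 60) do
      {:ok, url} ->
        # Sends a 302 with the Location header set to the pre-signed URL.
        redirect(conn, external: url)

      {:error, _reason} ->
        send_resp(conn, :not_found, "Not found")
    end
  end
end
```

With this shape, Phoenix never touches the file body, which is why a hang after such a request points away from memory use in the controller.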

If this file ate my pod’s RAM or something, I’d understand if the entire
app crashed. But it doesn’t. It’s bugging me that the app as a whole
hangs vs. just the individual request.

And with that I’ve hit the end of my Elixir/Phoenix debugging abilities.
Given that so many things are processes, it seems like I should be able
to figure out what specifically is hanging. How would I go about that?
Console/command line methods please, as I’m a blind screen reader user
and don’t think the Observer GUI is accessible. I do have observer_cli
installed, so if that would help me to debug this, then I’ll happily run
any incantations requested of me and post the results somewhere.
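One console-only way to look for a stuck process is to list the processes with the longest message queues and then print the stack of any that look suspicious. A minimal sketch using only the standard library; the ProcDebug module name is made up, and it would be pasted into the iex session attached to the running app.

```elixir
defmodule ProcDebug do
  @moduledoc "Hypothetical helper for inspecting processes from iex."

  @keys [:registered_name, :message_queue_len, :current_function, :status]

  # Returns the `count` processes with the longest message queues.
  def busiest(count \\ 10) do
    Process.list()
    |> Enum.map(fn pid -> {pid, Process.info(pid, @keys)} end)
    # Process.info/2 returns nil for processes that exited in the meantime.
    |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
    |> Enum.sort_by(fn {_pid, info} -> info[:message_queue_len] end, :desc)
    |> Enum.take(count)
  end

  # Returns the current stacktrace of one process as a printable string.
  def stack(pid) do
    case Process.info(pid, :current_stacktrace) do
      {:current_stacktrace, trace} -> Exception.format_stacktrace(trace)
      nil -> "process is no longer alive"
    end
  end
end
```

Running ProcDebug.busiest() |> IO.inspect() prints plain text, and ProcDebug.stack(pid) |> IO.puts() shows where a given process is currently sitting. A process that is blocked usually shows a growing queue or a stack parked in a receive or a socket call.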

Thanks a bunch. This bug is driving me batty.

Is there code you could share (maybe even in private) that we can look at to help you?

OK, I don’t yet know the root cause, but I know enough to claim it
almost certainly isn’t Phoenix/Elixir’s fault.

We’re developing in a Minikube cluster because it’s just easier to bring
up all of our services in a local VM than it is to do so on the host
system. Unfortunately, something we’re doing isn’t interacting well with
any mechanism I’ve used to forward traffic into the cluster. Skaffold
port-forwarding, kubectl, everything is reporting timeouts and
connection resets. Meanwhile, Elixir happily keeps serving content if I
connect directly to the pod. Skaffold was also entirely silent about its
failures and about the errors it received from whatever it uses to do
its own forwarding, so I never suspected that the forwarding mechanism
was what crapped out on me. Using kubectl port-forward lays it all bare,
and I can see the failures as they happen.

So this seems like a lower-level networking issue, and I’ll adjust my
debugging efforts accordingly. Thanks to those of you who tried helping
me get to the bottom of it. Sorry it never occurred to me that Google’s
code, rather than my own, would so spectacularly fail to do something so
spectacularly simple as serve a dumb TCP pipe. sigh

Are there any ingresses in between? If so, also check their client-max-body-size setting.

Another thing to look at is top / resource usage. I see a lot of workloads on Kubernetes that are underpowered and therefore get killed and restarted, often without anyone noticing. So you could also try allocating double the resources and running the test again.

Good move to test it first with a direct connection to your single app/pod to rule that out.

Good luck!

It’s actually likely this issue. If I retrieve the IP and hit it
directly, everything seems to work perfectly.

Man, wish we didn’t need K8s, or something similar enough, for this project.

Thanks for all the tips.
