Leaking file descriptors in Plug.Static through a Phoenix app

Hi, I am leaking file descriptors in my Phoenix 1.4.5 app.

The situation is that I have a Phoenix app with a bunch of assets in /priv/static that has had “mix phx.digest” run against it, and also an “uploads” dir symlinked in which has not.
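
For context, the static plug in my endpoint looks roughly like this (a sketch - :my_app and the only: list are illustrative, not my exact config; “uploads” is the symlinked dir):

plug Plug.Static,
  at: "/",
  from: :my_app,
  gzip: false,
  # "uploads" was not present when mix phx.digest ran, so it has no digest entries
  only: ~w(css fonts images js uploads favicon.ico robots.txt)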

I have a bot called “AhrefsBot/6.1” which appears to be crawling my site. It requests image files and then terminates the TCP connection before downloading the complete file (only for images). Nginx sits in front of my Phoenix app and is logging the disconnect (and referrer).

These early disconnects are causing a file handle leak, and eventually the Elixir app stops being able to serve assets due to lack of file handles…

lsof shows a ton of open file handles, but ONLY for specific files which are:

  • served by the Plug.Static module
  • in the path symlinked into the ./priv/static/ dir
  • not there when mix phx.digest was run (although this may not be significant)
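
For reference, this is how I’ve been watching the fd count from an attached IEx session (a Linux-only sketch; it counts every fd the BEAM holds, not just files):

# Count the open file descriptors of this BEAM node via /proc (Linux only).
pid = System.pid()
{out, 0} = System.cmd("ls", ["/proc/#{pid}/fd"])
out |> String.split("\n", trim: true) |> length()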

Restarting nginx doesn’t lower the file handle usage, so it’s not that nginx is keeping connections open. Restarting the Elixir process sets the file handle count back to zero.

I’m not certain that the use of the symlink is causing the issue, but it sure looks like dropping the connection without downloading the whole file is; i.e. I’m not seeing any other client cause a file handle to leak when viewing these assets.

Eyeballing the code, I think this implies some leak around Plug.Conn.send_file? Any ideas?

2 Likes

Hmm, this sounds fascinating… Any chance of making an SSCCE of it that we can clone and just run with something like mix deps.get && mix test to try it out?

OK, I’m struggling to repro any part of this against my live server… If I use curl with a speed throttle so that it’s downloading just a few bytes per second, I do NOT see any file handle get opened??

So, given Plug.Static serving files in a symlinked dir, with those files not in the digest manifest (cache_manifest.json), what sequence of headers/events might cause a file handle to be opened by the BEAM on that file? (The server is being restarted before each test.)

Any thoughts?!
(Yes, perhaps I need to try to get a network trace of what the bot is doing.)
Driving me mad having to restart the live server every few hours to keep it alive…
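
For anyone who wants to poke at this locally, the kind of probe I’ve been trying looks like this (a sketch against a local dev server on port 4000; the path is a placeholder for one of the symlinked uploads):

# Request a file, read only a little of the response, then hang up early.
{:ok, sock} = :gen_tcp.connect(~c"localhost", 4000, [:binary, active: false])
:ok = :gen_tcp.send(sock, "GET /uploads/some-image.jpg HTTP/1.1\r\nHost: localhost\r\n\r\n")

# Read just the first chunk (headers plus a bit of body), then disconnect.
{:ok, _chunk} = :gen_tcp.recv(sock, 0, 5_000)
:ok = :gen_tcp.close(sock)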

1 Like

I am experiencing a similar issue with a live app.

I have a leak of around 100 file descriptors every 24h, and all of them are for exactly the same file: our JS bundle, app-{digest}.js, which is served by Plug.Static.

As far as I know, I don’t use symlinks for anything. On a local dev machine, I build the file with webpack, run mix phx.digest, build a release of the Phoenix app with Distillery, and deliver it to the live host with edeliver.

I have an nginx-based load balancer in front of my app, but that’s managed by an external company, so I don’t have direct access to its logs.

Based on the app logs, I get a lot of bogus traffic from bots.

I tried recreating this situation against a fresh Phoenix app with static files locally, but without any success.

However, I was able to reproduce this against my staging environment. Running this code once reliably gives me +2 leaked file descriptors every time (as counted with ls /proc/$PID/fd | wc -l):

require 'socket'
require 'openssl'

sock = TCPSocket.new('my-staging.com', 443)
sslsock = OpenSSL::SSL::SSLSocket.new(sock)
sslsock.connect
sslsock.write("GET /js/app-{digest}.js?vsn=d HTTP/1.1\r\nHost: my-staging.com\r\nConnection: close\r\n\r\n")
sslsock.close
sock.close

(It’s in Ruby, sorry).
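
A rough Elixir equivalent of the same probe, for anyone who’d rather test from IEx (host and path are placeholders; verify is disabled only because this is a throwaway probe):

:ok = :ssl.start()

{:ok, sock} =
  :ssl.connect(~c"my-staging.com", 443, [:binary, active: false, verify: :verify_none])

:ok =
  :ssl.send(
    sock,
    "GET /js/app-{digest}.js?vsn=d HTTP/1.1\r\nHost: my-staging.com\r\nConnection: close\r\n\r\n"
  )

# Close without ever reading the response - the step that appears to trigger the leak.
:ok = :ssl.close(sock)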

When I test like that against my staging, the leak also happens:

  • for one more .js file, js/admin-{digest}.js (but bots don’t have access to pages that load this file)

It doesn’t happen:

  • for the other three .js files I have besides app and admin
  • for my CSS bundle, /css/app-{digest}.css
  • if I read the socket before closing (sslsock.read)

4 Likes

great work replicating it…

can you post the file sizes of said files? I presume the leaking ones are above some size threshold that the non-leaking ones are under…
(e.g. cowboy keeps the file open awaiting further reads; for smaller files cowboy just reads it all in one go and doesn’t leak)

also what cowboy version are you on and does it leak on latest as well?

also for cowboy2 have you by any chance changed “idle_timeout”?
protocol_options: [idle_timeout: xx], which defaults to 60 secs - see New default timeout of 60s in Cowboy 2 · Issue #3190 · phoenixframework/phoenix · GitHub
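
For reference, that option lives in the endpoint’s HTTP config, something like this (app and module names are placeholders; 60_000 ms is just the Cowboy 2 default made explicit):

# config/prod.exs - illustrative only.
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    # Cowboy 2 default; shown explicitly to make the knob visible.
    protocol_options: [idle_timeout: 60_000]
  ]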

1 Like

can you post the file sizes of said files? I presume the leaking ones are above some size threshold that the non-leaking ones are under…

Looks like it!

File sizes:

leak:

1.5M js/app-{digest}.js
761K js/admin-{digest}.js

no leak:

285K css/app-{digest}.css
88K js/otherfile1-{digest}.js
24K js/otherfile2-{digest}.js
15K js/otherfile3-{digest}.js
236 robots.txt

also what cowboy version are you on and does it leak on latest as well?

Cowboy version: 2.6.3 (latest, so yes, leaks on latest)

also for cowboy2 have you by any chance changed “idle_timeout”?

No, I didn’t change any protocol_options.

1 Like

Aha. Yes, I think it might be a file size threshold. I’m struggling to connect your results with mine, but it seems likely the problem is similar.

I’m seeing hardly any incidents of the issue at present (one or two a day, tops). I managed to get a TCP dump of the bot which caused my issue, but it’s unexciting - something like:

GET /mearra.nieida/106/image/jpgSFZfiMhRxB.jpg HTTP/1.1
Host: blog.mailasail.com
User-Agent: Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
Accept: */*
Accept-Encoding: deflate, gzip

What I’m not seeing is why this same bot was causing open files before, but far fewer now. I’ve actually got Phoenix stuck with 2 open file handles right now, and I can access a remote console or use Observer if there is something worth digging into?

The latest open file handle is for a file 305,161 bytes in size, so not so large (305KB). As above, I have an nginx proxy in front (might not be relevant), and nothing changed in Cowboy. The most recent open file handle looks like it was not from a named bot; the probable lines from the nginx access log are:

194.81.239.147 - - [03/Jul/2019:11:37:26 +0100] “GET /peregrina/65/image/jpgWO37GDgVdz.jpg HTTP/1.1” 499 0 “-” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299” “-”

194.81.239.147 - - [03/Jul/2019:11:37:35 +0100] “GET /peregrina/65/image/jpgWO37GDgVdz.jpg HTTP/1.1” 200 67160 “-” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299” “-”

Another open file handle is for this request (actual file size: 2,097,865 bytes):
54.36.150.105 - - [03/Jul/2019:00:40:27 +0100] “GET /lunaquest38/148/image/jpgJTGeV7uVAD.jpg HTTP/1.1” 200 1069091 “-” “Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)” “-”

I’m struggling to repro whatever is happening using only curl though… It does look like some bytes get downloaded…? And I can’t see how my TCP window could be wide enough to send nearly 1MB without any ACK from the client or similar?

I wonder if it’s something to do with the connection not being closed down in an orderly way? Very odd…
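
One way I might test that unclean-shutdown theory is to force an abrupt close (a TCP RST rather than an orderly FIN) by setting SO_LINGER to zero - a sketch against a local dev server, with the path a placeholder:

{:ok, sock} =
  :gen_tcp.connect(~c"localhost", 4000, [:binary, active: false, linger: {true, 0}])

:ok = :gen_tcp.send(sock, "GET /uploads/some-image.jpg HTTP/1.1\r\nHost: localhost\r\n\r\n")

# With linger set to {true, 0}, close/1 resets the connection instead of FIN-ing it.
:ok = :gen_tcp.close(sock)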

1 Like

Update: I’m seeing files as small as 41KB leaking file descriptors… For me it’s ONLY those files which are NOT in the static digest manifest. However, I just cannot figure out how to cause it…

Primarily it’s a bot called “ahrefs”. It’s NOT dropping the connection instantly, and the headers it uses don’t seem to cause any file to be opened if I simply replay them with curl??? I just can’t figure this out…
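
For what it’s worth, here’s how I’m checking whether a given path is in the digest manifest from IEx (a sketch: :my_app and the path are placeholders, and Jason is assumed since Phoenix 1.4 apps ship with it):

# Load the manifest produced by mix phx.digest and look up a logical path.
manifest =
  :my_app
  |> Application.app_dir("priv/static/cache_manifest.json")
  |> File.read!()
  |> Jason.decode!()

Map.has_key?(manifest["latest"], "uploads/some-image.jpg")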

Assuming you have an encrypted link up to a gateway (like nginx) and an unencrypted link between the gateway and Elixir, I’d hook up a packet tracer and record anything accessing those specific files. If we can get the specific packets and their full lifetime, it would become pretty trivial to replicate the issue and see why it leaks. :slight_smile:

Hi, good idea. Did you see earlier where I posted one side of that connection? From the nginx log files it doesn’t appear that the robot hangs up the link on the first packet; as near as I can see from the one-sided trace, that holds there too. However, my trace only covers the direction from the robot to me, not the replies. Nothing in the trace jumps out as tickling this (or perhaps it only happens once?)

The challenge is that it’s happening infrequently enough that I don’t think I can leave a full packet trace running on both sides and in both directions - I may hit disk space limits before I capture the event…

The frequency has dropped significantly. It’s as though it only happens the very first time a file is accessed (though equally I can’t see that the static module builds any cache or similar). The robot is hitting me with a lot of requests for HTML files and also for images, but none of those are causing this effect…

Does that show how odd this is? The robot requests a whole bunch of JPEGs and only a few are (now) causing stale file handles. But earlier, when the site first went up, we were seeing 500+ stale file handles within a few hours. Unscientifically, it seems like only the very first access tickles it (but perhaps not).

I can attach to the running phoenix application. Nothing is jumping out at me though - any thoughts on where I could look using Observer to get more info?

1 Like

This may sound stupid, but the bot appears to come from here: https://ahrefs.com.

Maybe knowing that can help you block it with a firewall or something similar.

I also found an article about domain filtering: firewall - How to do domain filtering in Linux? - Unix & Linux Stack Exchange

1 Like

Thanks. I’m keen to find the problem rather than plaster over it! Occasionally I have seen some other machines trigger this - so far only things claiming to be Safari (so it could be a bot using a different client string, but I don’t know).

Either ahrefs isn’t asking for the same files again in the same way, or something else has changed, but the problem has become less frequent over time (ahrefs still seems to be requesting JPEGs, only the effect doesn’t happen for all requests… Perhaps only once, the very first time? But if so, why?)

This seems like a crawler, but modified somehow - perhaps its goal is to disrupt your work, like a DDoS attack?

Also, do you have a robots.txt file in your project?

If not, maybe the crawler got lost in a loop trying to index your website, and the only parts it can recognize are the occasional JPEGs.

Also, this behavior (of the bot) feels like someone who has had too much to drink and is searching for his way.

It waits a couple of seconds, then remembers that this looks familiar and goes one way, then decides that perhaps not, and stops to think again about which way to go.

Also, those requests to the JPEGs that it is making - does it have time to extract the metadata from them?