Request metrics, time series databases and potential DoS risk

Hey,

I think it’s quite common to send request metrics to a time series database like Prometheus or InfluxDB. In Elixir, there is the handy prometheus_plugs package that collects request metrics and stores them in the prometheus_ex registry, from where you can expose them at e.g. /metrics.

We have a little Elixir service at work that, for most of its life, ran only on an IP address and had no subdomain of its own. It also ran on a non-default port (7000 in this case). Yet I got a ton of shady crawler requests trying to access .env files, git configs, WordPress logins, Bitcoin wallets, you name it. (I can make the list of requested paths public if someone is interested.)

Since the default behavior of prometheus_plugs (and I guess this is also true for implementations in other languages) is to just expose metrics (a counter + a histogram) for all of these 404 requests, you end up with a ton of time series. My Prometheus keeps data for only 7 days, and right now I have time series for 108 paths with a 404 response.
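To illustrate the blow-up, this is roughly what the exposed /metrics output ends up looking like when the raw path is used as a label (a hypothetical excerpt; the exact label names and bucket boundaries depend on the instrumenter configuration). Every probed path gets its own full set of bucket, sum and count series:

http_request_duration_microseconds_bucket{path="/.env",status_code="404",le="1000"} 3
http_request_duration_microseconds_bucket{path="/.env",status_code="404",le="+Inf"} 3
http_request_duration_microseconds_count{path="/.env",status_code="404"} 3
http_request_duration_microseconds_bucket{path="/wp-login.php",status_code="404",le="1000"} 1
http_request_duration_microseconds_bucket{path="/wp-login.php",status_code="404",le="+Inf"} 1
http_request_duration_microseconds_count{path="/wp-login.php",status_code="404"} 1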

Just yesterday, I executed the following query in Grafana for a time span of only 1 day and it froze the entire VPS.

histogram_quantile(0.9, rate(http_request_duration_microseconds_bucket[20s]))

The solution is to exclude requests with a 404 status.

histogram_quantile(0.9, rate(http_request_duration_microseconds_bucket{status_code!~"404"}[20s]))

But this is a good example of how easily someone can DoS your system in a way you wouldn’t guess right away.

A quick Google search didn’t turn up anything useful. It seems like nobody thinks about this, or at least nobody talks about it. There is general advice not to put highly dynamic values into time series labels/tags, e.g. user IDs. But storing request metrics, including the path, is a very common use case.

I also cannot think of any sane solution. Just not storing metrics for 404 requests (or even the whole 4xx range) seems wrong, as they are still very useful information.

I am interested in the thoughts of the community about this. This is not an Elixir specific topic but affects every system in any language.

2 Likes

There are a few options available:

  • Store 404 metrics, but without the path - IMHO the most reasonable way of handling the issue, as it will not pollute your metrics with unneeded and potentially harmful paths (see the sketch after this list)
  • Filter all “suspicious” paths before they even hit the prometheus_plug
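
For the first option, a minimal sketch of how that could look with prometheus_plugs, assuming :request_path is configured as a label and that label_value/2 is called in the before_send hook, where conn.status is already set (module name is made up):

defmodule MyApp.PipelineInstrumenter do
  use Prometheus.PlugPipelineInstrumenter

  # Collapse the path label for 404 responses so that all probed,
  # non-existing paths end up in a single time series.
  def label_value(:request_path, %Plug.Conn{status: 404}), do: "[unmatched]"
  def label_value(:request_path, conn), do: conn.request_path
end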

Then it would be better not to store them at all. Having just an empty string or / as the path will confuse the people who look at these metrics.

That is a good idea. It would require some adjustments to the prometheus_plugs package though, because it sits directly in the Endpoint and I don’t think you can programmatically exclude those requests from outside the package.

I am working on a honeypot to collect and analyze those requests and will publish the list of suspicious paths. Maybe it would be useful as an Elixir lib too.

1 Like

No, it would not, as you can just add a new plug before the Prometheus instrumenter, cut the pipeline there, and return a 404.
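
A minimal sketch of such a plug (module name and path list made up); in the Endpoint it would be placed before the Prometheus instrumenter plugs so they never observe these requests:

defmodule MyApp.Plugs.BlockSuspiciousPaths do
  import Plug.Conn

  # Paths that are only ever requested by vulnerability scanners.
  @suspicious_paths ["/.env", "/.git/config", "/wp-login.php"]

  def init(opts), do: opts

  # Cut the pipeline here: reply with a 404 and halt, so the plugs
  # further down (including the instrumenter) never run.
  def call(%Plug.Conn{request_path: path} = conn, _opts) when path in @suspicious_paths do
    conn
    |> send_resp(404, "Not Found")
    |> halt()
  end

  def call(conn, _opts), do: conn
end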

But in the end there will not be much of a difference between these two solutions, especially since storing each path as a tag can also be problematic: it will cause a metrics blow-up when you have any user-generated paths (and almost all applications have those). Instead it would be better to use the route as a tag and keep related paths in one bucket.

1 Like

Could work. Will investigate that. Maybe as part of the lib with the suspicious paths.

Good point. Currently I don’t have user-generated paths. But a long time ago, when setting up Google Analytics on a page, I did that.

The problem is when you actually need those precise values (e.g. the username in the path) to debug something. If the response time for the public user profiles suddenly increases, it would be helpful to see whether it happens for all of them or only for some specific users. One has to balance the tradeoffs here.

Then you use logs. Metrics and logs are different beasts and serve different purposes. Do not use metrics as logs or you will have a bad time.

Using logs for average response times or for response times in general is the wrong way to go too.

Using logs for abnormally high response times is probably ok.

1 Like

Yes, I agree there, but we are talking about investigating abnormal request times for a small number of users. Then logs are the way to go. In your metrics you gather not only the average but also quantiles, and you generate some kind of histogram. Then you will notice the enormous response times for that particular user while not polluting the tag space with an enormous number of different values.

I am, and I suspect many others are too. Please publish that when you are able.

A quick export of 404 requests from Prometheus, deduplicated:

iex(10)> SuspiciousPathAnalyzer.analyze
["/GponForm/diag_Form", "/.bitcoin/.env", "/.bitcoin/wallet.dat",
"/.bitcoin/wallet/wallet.dat", "/.env", "/.ftpconfig", "/.git/config",
"/.remote-sync.json", "/.vscode/ftp-sync.json", "/.vscode/sftp.json",
"/.well-known/security.txt", "/Lists/admin.php", "/MikroTik/", "/SEP/",
"/Sep/", "/aastra/", "/admin.php", "/algo/", "/api/.env",
"/api/v1/overview/default", "/api/v1/pod", "/api/v1/pods", "/app/.env",
"/app/provision/", "/asterisk/", "/atacom/", "/baFirmware/", "/backup/.env",
"/backup/wallet.dat", "/bitcoin/.env", "/bitcoin/wallet.dat",
"/bitcoin/wallet/wallet.dat", "/boot/", "/btc/wallet.dat", "/bub/", "/cfg/",
"/cisco/", "/cnf/", "/coin/wallet.dat", "/conf/", "/config/", "/configs/",
"/core/wallet.dat", "/crypto/wallet.dat", "/deployment-config.json",
"/devicecfg/", "/digium/", "/dumpmdm.cmd", "/fanvil/", "/firmware",
"/firmwares", "/ftpsync.settings", "/fw/", "/gateway", "/gateways/",
"/gigaset/", "/grandstream/", "/gs/", "/gswave/", "/hidden/wallet.dat",
"/htek/", "/html/.env", "/laravel/.env", "/linksys/", "/login_sid.lua",
"/mitel/", "/node/wallet.dat", "/obihai/", "/overides/", "/panasonic/",
"/patton/", "/phone-devices/", "/prov/", "/provision/", "/provisioner/",
"/provisioning/", "/reg", "/sangoma/", "/sftp-config.json", "/sip.conf/",
"/sip.config/", "/sip/", "/sipphone/", "/site/.env", "/sitemap.xml", "/smart/",
"/smarty/", "/snom/", "/spa/", "/spectralink/", "/sys/", "/temp", "/tftp/",
"/tftpboot/", "/voice/", "/voip/", "/vpn/", "/wallet.dat",
"/wallet/wallet.dat", "/xml/", "/api/tracking/position", "/", "/index.html",
"/wp-login.php", "/dana-na/jam/querymanifest.cgi", "/dns-query",
"/nice ports,/Trinity.txt.bak"]
107

As I said, I am going to set up some honeypots which will log these requests, so I will have more data after some time.

I will also bundle these in a nice Elixir package.

3 Likes

What I have done in the past, when it comes to getting metrics on which routes are getting hit and at what rate, is to normalize the request path by cross-referencing the request with the output of App.Router.__routes__(). For anything that can’t be found in the result of __routes__() I log a warning, to also keep track of possible probing requests. If anyone is interested I can post the sample code for that.

2 Likes

Please do. Many of us around here are interested in security and proper monitoring.

Interesting, keep us updated please!

Recently, Phoenix route information was added to the Plug structure, which allows you to use that directly instead of Router.__routes__/0.

1 Like

That would make life a lot easier. Could you point me in the direction of that field in the Plug.Conn struct? For now I have been extracting the following fields: %Plug.Conn{private: %{phoenix_controller: controller, phoenix_action: action}}

Phoenix.Router.route_info/4.
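
A minimal sketch of the lookup (assuming the return shape is a map with a :route key, or :error when nothing matches; router name as in the example below):

# Inside label_value/2, instead of scanning the __routes__() list:
case Phoenix.Router.route_info(MyAppWeb.Router, conn.method, conn.request_path, conn.host) do
  %{route: route} -> route          # e.g. "/posts/:id"
  :error -> "invalid_route"
end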

3 Likes

Sure thing. The following code is not optimal, as it needs to iterate over the routes list until it finds the relevant route, but I think it demonstrates what I am talking about. Using the Prometheus plug https://github.com/deadtrickster/prometheus-plugs, I do the following when defining the PipelineInstrumenter:

defmodule MyApp.PipelineInstrumenter do
  use Prometheus.PlugPipelineInstrumenter
  require Logger

  # Resolve the matched Phoenix route for this request and use its
  # path pattern (e.g. "/posts/:id") as the :request_path label value.
  def label_value(:request_path, conn) do
    # Plug.Conn does not implement the Access behaviour, so read the
    # private assigns through conn.private instead of conn[:private].
    phx_controller = conn.private[:phoenix_controller]
    phx_action = conn.private[:phoenix_action]

    route =
      MyAppWeb.Router.__routes__()
      |> Enum.find(fn route ->
        route.plug == phx_controller and route.plug_opts == phx_action
      end)

    case route do
      %Phoenix.Router.Route{path: path} ->
        path

      _ ->
        # No route matched (e.g. a scanner hitting a non-existing path):
        # log it and collapse all such requests into a single label value.
        Logger.warn("Could not resolve Phoenix Route for '#{conn.request_path}'")
        "invalid_route"
    end
  end
end

The nice thing about this is that it normalizes the request path label in Prometheus to the route pattern, with the concrete values replaced by their parameter names. For example, /posts/e6f9b384-5f74-4d8e-a123-36557d6a307a gets normalized to /posts/:id, which is important in Prometheus because you don’t want labels with a high cardinality of values. In addition, since we log the invalid requests, we can turn to our logging solution whenever we see spikes of suspicious requests in Grafana.
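
With the normalized label in place, the latency query from the beginning of the thread can then be scoped to a single route without running into the cardinality problem (a sketch using the request_path label from the code above; adjust the metric name and range to your setup):

histogram_quantile(0.9, rate(http_request_duration_microseconds_bucket{request_path="/posts/:id"}[5m]))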

1 Like

That’s awesome! Thanks!!

I created a simple honeypot that logs all requests. I also bootstrapped the Elixir lib that will contain them. I will keep you updated on the progress. Right now I have over 120 suspicious requests and hope I can capture a lot more.

1 Like