Handling invalid UTF-8 strings from URL path params

Here’s the problem I’m working on: bots probing my site for vulnerabilities will try injecting special character sequences into params, like /articles/abc%DE~%C7%1FY, which becomes the binary <<97, 98, 99, 222, 126, 199, 31, 89>>, which is not a valid string.
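To make the decoding step concrete, here is a small sketch using only the standard library. It decodes just the last path segment from the example above and checks the result with `String.valid?/1`; the `caf%C3%A9` segment is an invented counter-example, not something from the logs.

```elixir
# Percent-decode the last segment of the probe path.
decoded = URI.decode("abc%DE~%C7%1FY")

# The result is a binary, but not a valid UTF-8 string.
IO.inspect(decoded)                 # <<97, 98, 99, 222, 126, 199, 31, 89>>
IO.inspect(String.valid?(decoded))  # false

# A well-formed percent-encoded segment decodes to a valid string.
IO.inspect(String.valid?(URI.decode("caf%C3%A9")))  # true
```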

No doubt this attack targets Oracle Server 2003 or something, I don’t know. It’s not going to cause any harm to my app, but it does end up triggering a Postgrex error because the invalid binary makes it all the way into the SELECT query before being rejected as invalid UTF-8.

I’d like to catch this earlier and return an appropriate 4xx error for invalid input rather than a 500 error when the DB query fails. Plug.Parsers has an option to validate UTF-8 in body and query params, so that a request like /articles/abc?a=b%DE~%C7%1FY will throw a relevant exception, but it seems like the path params aren’t checked in the same way.
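For reference, the option in question is `:validate_utf8` on `Plug.Parsers`. The fragment below is only a sketch of where it usually sits; the parser list and the `:pass` value are assumptions for illustration, not my actual configuration.

```elixir
# Inside an endpoint, or any module that does `use Plug.Builder`.
plug Plug.Parsers,
  parsers: [:urlencoded, :multipart],
  pass: ["*/*"],
  # Shown explicitly for illustration; this applies to body and query params.
  validate_utf8: true
```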

I’m not sure how to attack this problem. I don’t want to add a check individually to every controller, since this is an application-wide need. Should the path params be run through the same parser checks as other params, or is there a reason they aren’t?

You could add a plug which checks the :request_path. Something like this:

plug :check_request_path
def check_request_path(conn, _opts) do
  if String.valid?(conn.request_path) do
    conn
  else
    conn |> Plug.Conn.send_resp(:im_a_teapot, "") |> Plug.Conn.halt()
  end
end

This is a quick draft based on the docs. You might want to adjust parts of it, add some content, make it a module-based plug, or change the status code it sends :smiley:

Also, this code assumes that the :request_path is already decoded at this point. If it is not, you can use URI.decode/1 to decode it first.

Thanks! I realized I could modify your suggestion to use the Plug validation function and throw a relevant exception, mimicking the behavior of the other param parsing:

  plug :validate_path_utf8
  defp validate_path_utf8(conn, _opts) do
    conn.request_path
    |> URI.decode()
    |> Plug.Conn.Utils.validate_utf8!(Plug.BadRequestError, "path params")

    conn
  end

Phoenix handles this well, returning a 400 in production. Doing it in a plug feels a bit like a workaround, but it seems to be working well enough.
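If this ever needs to be shared across apps, the same idea could be packaged as a module plug, as suggested earlier. The following is a rough sketch rather than a tested production plug: the module and exception names are hypothetical, it assumes `conn.request_path` is still percent-encoded, and it relies on Plug mapping an exception's `:plug_status` field to the response status. It deliberately avoids calling any Plug functions, so it can be tried in isolation with a plain map standing in for the conn.

```elixir
defmodule MyAppWeb.InvalidPathError do
  # Hypothetical exception; Plug's default exception handling reads
  # the :plug_status field to choose the HTTP status.
  defexception message: "invalid UTF-8 in request path", plug_status: 400
end

defmodule MyAppWeb.Plugs.ValidatePathUTF8 do
  # A module plug only needs init/1 and call/2.
  def init(opts), do: opts

  def call(%{request_path: path} = conn, _opts) do
    # Decode the percent-encoded path, then check the resulting bytes.
    if path |> URI.decode() |> String.valid?() do
      conn
    else
      raise MyAppWeb.InvalidPathError
    end
  end
end
```

It would then be wired in with `plug MyAppWeb.Plugs.ValidatePathUTF8` ahead of the router.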

As to a different part of your question, I don’t think the standard requires the path to be valid UTF-8. If I’m right, it wouldn’t be appropriate for the default behavior to require it; leaving it up to devs to check the path against stricter requirements would be the right call.

(After all, the path used to commonly be a path to a file on disk, and Windows would have supported paths in the local interpretation of an 8-bit extension to ASCII…)

The current URL Standard seems to be based on valid UTF-8 encoding.

I cede to your superior google-fu

Agreed, that’s how I read this sentence in Section 1.3:

Sequences of percent-encoded bytes, once percent-decoded, should not cause “UTF-8 decode without BOM or fail” to return failure.

I imagine this is why Plug.Parsers offers the :validate_utf8 option and sets it to true by default. Most of the time, we should expect valid UTF-8 input, but there are situations where people might want to bypass it.

I had never heard of that standard; I used RFC 3986 as a reference instead, which explicitly states that it does not define any particular encoding, although it uses US-ASCII throughout the document.

Section 2, first paragraph:

[…] This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. […]

These “Living Standards” seem to be quite new. But the W3C does say that, for HTML, the WHATWG standard is the current standard:

https://html.spec.whatwg.org/multipage/ is the current HTML standard. It obsoletes all other previously-published HTML specifications.

And the WHATWG HTML Standard refers to this URL Standard.

In its Goals section, the URL Standard also says that one of its goals is to obsolete RFC 3986 and RFC 3987.