Uploaded file security inspection in Elixir/Phoenix

Users upload files. A file may claim to be benign through its MIME type (“content-type”). Heck, its name can even end with some well-known, trustworthy-looking letters after the last dot. Yet if you live long enough, you know that neither the file name nor the “content-type” (most probably derived from the name itself) is something you should rely on. Therefore I always try to check what the uploaded thing actually is and whether I am happy to accept it. I’ve used various methods in the past and in other environments, like passing the upload through file or using packages that were probably wrappers around the mechanisms used by file.

How do the Elixir and Phoenix experts do such things these days?

There are these two:

Do you have hands-on experience with either of the two? At first glance I’d probably opt for “ExMarcel”, as it does not depend so heavily on the underlying OS configuration and generally looks more straightforward to use. But it has no releases, can’t be installed simply by adding a dependency, and neither the package nor the docs are on “hex” as of today, …

The other one seems more mature and should probably deliver more in terms of performance (?), but I am not so happy about installing additional OS dependencies, deciding which runtime options/configuration to use, etc.

There is another one if you are on Linux…

Indeed, I would use gen_magic (depends on libmagic which is ubiquitous and well understood) for content introspection, and within the firm we have an internal service which does more — we would use ffmpeg to inspect media streams and validate codecs used for each track, we would check metadata in PDFs and Word documents, we would call out to ClamAV for virus scanning, etc. The operations in this service are run in a DAG so everything is done concurrently and therefore quickly.

We also have custom magics written to detect special file types where the OS bundled magics would not detect these. Further we also have code to check contents of MSOOXML bundles, MP3 files with extra padding at the start (where they were recorded to enable gapless playback), etc… All together, it is an integrated package with basic detection powered by gen_magic and advanced detection by custom code + other components as required.

The gist is that gen_magic is part of a production system which has run for several years and has undergone extensive soak testing prior to deployment; libmagic is ubiquitous as long as you deploy on Linux or macOS.

As for using ex-marcel, you could easily specify it as a dependency with {:ex_marcel, github: "chaskiq/ex-marcel"}; the caveat is that Git must be installed and available whenever you need to retrieve this dependency.

Development is on Linux and Mac. Production is on Linux. True, it says “linux only”, but the lib looks very similar to some of my early home-brewed solutions: it invokes the file command and parses its output. It should therefore work on Mac too.
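
For reference, the “invoke file and parse its output” approach can be sketched roughly like this. It is a minimal illustration, not the code of any library mentioned in this thread: the FileSniffer module and its error atoms are made-up names, and it assumes a file command that understands --brief and --mime-type (the ones shipped with common Linux distributions and with macOS do).

```elixir
defmodule FileSniffer do
  @moduledoc "Hypothetical helper that asks the `file` command for a MIME type."

  @spec mime_type(Path.t()) :: {:ok, String.t()} | {:error, term()}
  def mime_type(path) do
    cond do
      # `file` may exit with status 0 even for missing paths, so check first.
      not File.regular?(path) ->
        {:error, :not_a_regular_file}

      exe = System.find_executable("file") ->
        run(exe, path)

      true ->
        {:error, :file_command_not_found}
    end
  end

  defp run(exe, path) do
    # Arguments are passed as a list, so no shell is involved; "--" keeps a
    # path that starts with "-" from being read as an option.
    args = ["--brief", "--mime-type", "--", path]

    case System.cmd(exe, args, stderr_to_stdout: true) do
      {output, 0} -> {:ok, String.trim(output)}
      {output, status} -> {:error, {status, String.trim(output)}}
    end
  end
end
```

The returned string can then be compared against an allow-list of accepted types, regardless of what the upload’s file name or declared content-type says.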

@evadne - thank you very much for your valuable insight. I haven’t checked the sources thoroughly but judging from what I read I believe you have a C-compiled kind of “interface” process, which receives identification requests, calls upon libmagic to do the detection and returns results back to Elixir somehow, right?

Yes, that is indeed correct; the apprentice process talks with the host using a plain-text protocol.

Others have had good input on shelling out to something that gives you more confidence about the contents of the file, so it really depends on how confident you need to be about the bits you’re going to store. In LiveBeats, we accept mp3 uploads, and we parse the mp3s, decoding all frames to calculate the duration, so at the end we know with absolute confidence that each one is a valid mp3:

In this particular case it is about images so I need to check:

  • file size
  • file type (that it’s in fact a supported image format)
  • pixel dimensions (yeah, “back in the days” I also indulged myself in breaking systems by giving them “explosive gifs” to chew) :wink:

I typically do these things in a “short circuiting” way (bailing out on the first failure) and try to make each step lightweight and light on dependencies, especially external ones. Thus for now I’d:

  • limit the max upload size for multipart in Plug.Parsers (I recall the default being 8 MB or so), unless there’s a better way to do this
  • quickly check the uploaded file with File.stat (these particular images will have a lower file size limit than the general max upload size)
  • quickly check the actual file type by “shelling out to something”, as you mentioned
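
For the first bullet, the multipart limit can be set where Plug.Parsers is plugged in, typically in the endpoint. A sketch of what that might look like; the 5_000_000 value is an arbitrary example, and the remaining options mirror what a generated Phoenix endpoint usually contains, so adapt them to your own:

```elixir
# In the endpoint module, e.g. lib/my_app_web/endpoint.ex (hypothetical path).
plug Plug.Parsers,
  # :length caps the number of bytes read for a multipart request body.
  parsers: [:urlencoded, {:multipart, length: 5_000_000}, :json],
  pass: ["*/*"],
  json_decoder: Phoenix.json_library()
```

Requests exceeding the limit are rejected by the parser before your controller or LiveView code ever sees the file.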

That’s where I am now. If all of those pass, I’ll have to find some lightweight tool that can give me the images’ pixel dimensions. In other words, I probably don’t need or want to fully parse each file, even though I know that anything less is not 100% bullet-proof.
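
The short-circuiting checks described above could be chained with a with expression, roughly as follows. This is only a sketch under a few assumptions: UploadCheck, its error atoms and the 2_000_000 byte default are invented for illustration, and the type step compares the first bytes against a tiny built-in signature table instead of shelling out, purely to keep the example self-contained.

```elixir
defmodule UploadCheck do
  @moduledoc "Hypothetical sketch of short-circuiting upload checks."

  @max_bytes 2_000_000

  # Leading bytes ("magic numbers") of the accepted image formats.
  @signatures [
    {"image/png", <<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A>>},
    {"image/jpeg", <<0xFF, 0xD8, 0xFF>>},
    {"image/gif", "GIF87a"},
    {"image/gif", "GIF89a"}
  ]

  # `with` stops at the first step that does not return {:ok, _} and
  # returns that step's error tuple unchanged.
  def validate(path, max_bytes \\ @max_bytes) do
    with {:ok, size} <- check_size(path, max_bytes),
         {:ok, mime} <- check_type(path) do
      {:ok, %{size: size, mime: mime}}
    end
  end

  defp check_size(path, max_bytes) do
    case File.stat(path) do
      {:ok, %File.Stat{type: :regular, size: size}} when size <= max_bytes -> {:ok, size}
      {:ok, %File.Stat{type: :regular}} -> {:error, :too_large}
      {:ok, _other} -> {:error, :not_a_regular_file}
      {:error, reason} -> {:error, reason}
    end
  end

  defp check_type(path) do
    # Read only the first 8 bytes; that is enough for the signatures above.
    case File.open(path, [:read, :binary], &IO.binread(&1, 8)) do
      {:ok, head} when is_binary(head) -> match_signature(head)
      {:ok, _eof_or_error} -> {:error, :unsupported_type}
      {:error, reason} -> {:error, reason}
    end
  end

  defp match_signature(head) do
    Enum.find_value(@signatures, {:error, :unsupported_type}, fn {mime, magic} ->
      if prefix?(head, magic), do: {:ok, mime}
    end)
  end

  defp prefix?(bin, magic) do
    byte_size(bin) >= byte_size(magic) and binary_part(bin, 0, byte_size(magic)) == magic
  end
end
```

A pixel-dimension check would slot in as one more clause of the with, after the type is known.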

For images, there is ExImageInfo (docs: ExImageInfo v0.2.4; GitHub: Group4Layers/ex_image_info). Its description reads: “ExImageInfo is an Elixir library to parse images (binaries) and get the dimensions (size), detected mime-type and overall validity for a set of image formats. It is the fastest and supports multiple formats.”

It looks for specific byte sequences in the headers of a variety of image types and extracts width & height. It will guess the file type if none is provided. It is pure Elixir, which helps with managing dependencies.
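
To illustrate the general idea of reading dimensions straight from header bytes, here is a small sketch using Elixir binary pattern matching. It is not ExImageInfo’s code or API; ImageHeader is a hypothetical module that handles only PNG and GIF, relying on their documented layouts (PNG: 8-byte signature followed by the IHDR chunk; GIF: 6-byte signature followed by the logical screen size).

```elixir
defmodule ImageHeader do
  @moduledoc "Hypothetical sketch: width and height from PNG and GIF headers."

  # PNG: 8-byte signature, 4-byte chunk length, "IHDR", then width and
  # height as 32-bit big-endian integers.
  def dimensions(
        <<0x89, "PNG", 0x0D, 0x0A, 0x1A, 0x0A, _len::32, "IHDR", width::32, height::32,
          _rest::binary>>
      ),
      do: {:ok, "image/png", width, height}

  # GIF: "GIF87a" or "GIF89a", then logical screen width and height as
  # 16-bit little-endian integers.
  def dimensions(
        <<"GIF", version::binary-size(3), width::little-16, height::little-16, _rest::binary>>
      )
      when version in ["87a", "89a"],
      do: {:ok, "image/gif", width, height}

  def dimensions(_other), do: {:error, :unrecognized}
end
```

Only the first 24 bytes of the file are needed for these two formats, so the check stays cheap even for large uploads. Keep in mind that the header merely states the dimensions; a crafted file can claim anything, so this is a sanity check rather than proof of validity.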

ExMarcel author here, well… kind of. I’ve just ported the implementation from the Marcel Ruby gem, so all the kudos go to the Rails team. I love this implementation because it has various strategies, not just the magic bytes, and it works not only for files but for strings too. You can also declare your own types by extending the dictionary, and the best part is that it does not depend on any external service or another package.
Also, FYI, I’ve released the package on Hex.
