Store, sanitize and parse Markdown

snap · February 15, 2018, 2:43pm

Our app has a textarea where users can input Markdown. We store the Markdown in a database, then parse and render it on show. I’m currently using earmark and phoenix_html_sanitizer; I sanitize first and then parse to HTML:

use PhoenixHtmlSanitizer, :strip_tags

def markdown(md) do
  md
  |> sanitize_md()
  |> Earmark.as_html!()
  |> raw()
end

def sanitize_md(md) do
  {:safe, sanitized_md} = sanitize(md)

  sanitized_md
end

This works except for blockquotes, because > is sanitized to &gt;.

To get blockquotes to work I could sanitize after parsing to html using the :basic_html mode.
But this would also mean allowing basic HTML as user input, which I’m trying to avoid.

use PhoenixHtmlSanitizer, :basic_html

def markdown(md) do
  md
  |> Earmark.as_html!()
  |> sanitize_md()
  |> raw()
end

So in short, I’d like to support Markdown blockquotes, but all HTML should be escaped.
Does anyone have some ideas on this?

Also, what do you consider best practice when storing, sanitizing and parsing Markdown? Would you sanitize markdown before storing it in the DB or before output?

Thanks in advance!

OvermindDL1 · February 15, 2018, 5:11pm

I personally store markdown as-defined into the database (I may upgrade things later and I want the original source), and for displaying I don’t use earmark (because of such issues as you mention) and instead I just put the markdown in a CDATA tag inside a markdown tag of a webcomponent (which is used just like any other element). Without javascript it will show the raw markdown, but with javascript it formats it well (and can be rich if needed).

However Earmark I think is already sanitized (assuming inline HTML is disallowed)? And if not (like not being able to disallow inline HTML) that sounds buggy to me? ^.^;

snap · February 15, 2018, 5:50pm

Thanks for sharing your approach. I’ve used marked in the past, but I’m aiming for a non-js solution this time.

We also store the markdown as-is; I just wasn’t sure if I should sanitize/strip HTML like script tags and iframes before storing. But it made more sense to do this before rendering the parsed markdown (raw HTML).

FYI, Earmark does not sanitize out of the box:

Please be aware that Markdown is not a secure format. It produces HTML from Markdown and HTML. It is your job to sanitize and or filter the output of Earmark.as_html if you cannot trust the input and are to serve the produced HTML on the Web.

If anyone else has other insights I’ll be happy to hear them.

OvermindDL1 · February 15, 2018, 5:51pm

I’d probably pipe it out through an external shell program then and store both the original and the sanitized html from the external program into the database then. ^.^;

amnu3387 · February 15, 2018, 6:04pm

I think it makes more sense to sanitize it prior to saving to the DB - usually when saving you’re not as worried with performance as when serving something from the DB, plus when saving you can “background” the sanitization easily - sanitizing when serving also means it has to be sanitized whenever requested, while when saving it only has to be once.

Phoenix html sanitizer uses html_sanitize_ex underneath, and html_sanitize_ex has a markdown option, so you can probably use that directly (?) before saving, and then just parse it regularly when serving. I think I would probably sanitize & parse into HTML before saving, but that would need some benchmarking to make sure it’s more performant (although again, it means those things would only happen once per save, and not once per request, which may or may not matter depending on what this is being used for?)

OvermindDL1 · February 15, 2018, 6:41pm

That is why I’d save both versions in the DB, the sanitized one for fast processing to the webpage, and the original in case you need to re-parse it later when the markup processor is updated or has new features or a bug is found or something.

amnu3387 · February 15, 2018, 6:58pm

Or the user wants to edit it
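
To make the "save both versions" idea concrete, here is a minimal sketch in plain Elixir. The struct, its field names (body_md, body_html) and the render function argument are all hypothetical; in a real app these would likely be columns on an Ecto schema, and render would be the sanitize-and-parse pipeline discussed in this thread.

```elixir
defmodule StoredPost do
  # Hypothetical shape: the original Markdown plus a cached HTML rendering.
  defstruct [:body_md, :body_html]

  # Build a record from user input. `render` is any function that turns
  # Markdown into sanitized HTML (assumed to be supplied by the caller).
  def build(markdown, render) when is_binary(markdown) and is_function(render, 1) do
    %__MODULE__{body_md: markdown, body_html: render.(markdown)}
  end

  # Re-render the cached HTML from the untouched source, e.g. after the
  # Markdown processor or the sanitizer has been upgraded.
  def rerender(%__MODULE__{body_md: markdown} = post, render) when is_function(render, 1) do
    %{post | body_html: render.(markdown)}
  end
end
```

Because body_md is never modified, it can be handed back to the user for editing, and body_html can be regenerated at any time.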

snap · February 15, 2018, 11:58pm

We had the same thoughts. But the only reason I need to sanitize is that I render the parsed markdown using raw/1, as Phoenix escapes strings by default (correct?). So my idea was that the safest place to sanitize is where I render raw HTML, and not to assume the content has been sanitized beforehand.

Yea I’ve noticed html_sanitize_ex is used under the hood, and it’s possible to use the :markdown_html mode directly. But I have the feeling it’s meant to be used after markdown is parsed, only stripping the unsafe html and leaving the rest intact.

Either way, sanitizing markdown always seems to come with a downside (see my initial problem). Either I lose support for blockquotes in markdown (using strict :strip_tags mode), or I allow users to use basic HTML as a markup language (using :markdown_html or :basic_html mode).
Maybe I’m missing something here, but somehow both options are unsatisfying. Also please correct me if my logic fails here.

Thanks for your input guys

amnu3387 · February 16, 2018, 11:46am

Indeed, the absolute safest point to sanitise is right before outputting it to the request - if you sanitise prior, a DB compromise could render all stored “sanitised” data unsafe, if you sanitise just before outputting it with raw then even a DB compromise would not create problems.

So the problem you’re facing is that > is a markdown token that also has a safe HTML representation (&gt;), so sanitising replaces it and renders it useless as a markdown token?

Which means you would have > Some blockquoted content before sanitising and then &gt; Some blockquoted content after, prior to passing it into your markdown parser?

Perhaps you can just wrap sanitize() in a function where you first replace > with a unique token (SNAP_BQUOTE), and after the sanitise (prior to passing it to earmark) you replace the token with > again. If > only counts as a blockquote token at the beginning of a line, then you can anchor the regex to match > only at the beginning of a line, and that should keep you free from pain. When replacing back you probably don’t need to worry about anchoring as long as your token is pretty unique, but since you’re already at it you might just anchor it as well.
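
A rough sketch of that wrapper, under a few assumptions: the module and function names are made up for illustration, SNAP_BQUOTE is assumed not to appear at the start of a line in real user input, and escape_html/1 is only a stand-in for the real sanitize step so the example runs without dependencies. In the app you would pass your own sanitizer function instead.

```elixir
defmodule BlockquoteSafe do
  # Placeholder token suggested above; assumed not to occur at the start
  # of a line in genuine user input.
  @token "SNAP_BQUOTE"

  # Stand-in sanitizer (an assumption for this sketch): escapes the
  # HTML-significant characters, which is what breaks ">" in the first place.
  def escape_html(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
  end

  # Protect leading ">" markers, run the sanitizer, then restore the markers.
  def sanitize_keeping_blockquotes(md, sanitizer \\ &escape_html/1) do
    md
    |> protect()
    |> sanitizer.()
    |> restore()
  end

  # Only ">" markers at the start of a line (after at most three spaces)
  # are swapped for the token; a ">" elsewhere is left for the sanitizer.
  defp protect(md) do
    Regex.replace(~r/^( {0,3})((?:> ?)+)/m, md, fn _whole, indent, markers ->
      indent <> String.replace(markers, ">", @token)
    end)
  end

  # Anchored the same way, so a token in the middle of a line is not touched.
  defp restore(md) do
    Regex.replace(~r/^( {0,3})((?:#{@token} ?)+)/m, md, fn _whole, indent, markers ->
      indent <> String.replace(markers, @token, ">")
    end)
  end
end
```

The result can then be piped into Earmark.as_html!() as in the original markdown/1 function. Note that a user who types the token at the start of a line gets a blockquote, which is harmless but worth knowing.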