What's the most balanced way to store multiple text formats?

devilray · February 19, 2025, 9:10pm

While building small in-house LiveView applications, I’ve asked myself the same question with each new entity created: what’s the best way to store formatted text fields?

For example, many developers like to write in Markdown, most general users will want to write in HTML (using a GUI WYSIWYG widget). I prefer the power of a richer markup language such as ASCIIDOC or Org-Mode, and may even like to support others, including plain text, pointing to the need to accept multiple input formats, which need to be stored and presented back to users to later edit in the same language, as required.

For the benefit of fast server response, and low server overhead, I translate Markdown in a validation function, to HTML, saving this in a separate field. And, for the benefit of clean (clean of formatting characters) search results, I want to save a plain-text version to be used for full-text indexing and results presentation.

This makes the database schema quite messy as even with the minimal three choices, there is a variable amount of storage needed, from one to possibly three fields:

Input format	Output format(s)
HTML	plain text
Markdown/Org-mode/Asciidoc/other markup	HTML, plain text
Plain text	n/a

So, if the user fills in a textarea field in plain text, there is no need for further formats to be saved; if they write in HTML, only an additional plain text field should be stored; if Markdown, an HTML translation and a plain text one should be stored, for three fields in total.

While I could hardcode a markup, HTML and plain text field in every table, this seems wasteful as they won’t always be used. If I delegate them to an external table with a one-to-many relationship, I’m imposing an additional join for every query. Arguably the cleanest and most balanced solution would be to denormalise, creating an Ecto virtual field, storing a JSON column to store any translation as necessary, supporting any number of translations according to need.

I’m sure many of you have come across the same issue and have dealt with it in different ways. I’d love to learn how some of you have decided to handle this in your use case and why, with regards to performance, data redundancy, storage tradeoffs mentioned above?

garrison · February 19, 2025, 10:08pm

Here’s what I would do:

Store the raw input from the user in a :content column, and the type (:plaintext, :markdown, :asciidoc, etc) in a :content_type column. Compute and store a :content_plaintext column which contains plaintext optimized for search.

On inserts, you send the original content (say markdown), the type, and the markdown with formatting stripped for search.

Then on render you pull the markdown and render it for the user. You could cache the rendered HTML if you need to, but I don’t think you will if this is an internal app. If this is LiveView you would render once and store it in assigns of course.

For search, simply search on the plaintext column.

derek-zhou · February 20, 2025, 1:48am

I store text in (sanitized) HTML. Markdown etc. are formats primarily for authoring. It makes a bad choice as an inter-change format.

devilray · February 20, 2025, 9:52pm

Thanks @garrison, that’s pretty much what I’ve been doing so far. As I’ve done this more and more, it started to look messy, from a database modelling perspective, creating additional fields that may go unused.

devilray · February 20, 2025, 9:54pm

Thanks @derek-zhou, I agree that markup languages aren’t an ideal interchange format, but they offer the element of least surprise to the author, when they come back to editing, even though they may not be friendly to other users of the system, if any.

garrison · February 20, 2025, 10:15pm

From your post I got the impression that you were storing multiple formats (i.e. Markdown, plus HTML, plus Asciidoc) in different columns in every table.

What I was proposing is to store the raw content in one column (regardless of its type) and then render it to your desired output on-demand. Then keep a second column (an Enum, say) to store the type, and a third column in plaintext (for search). But that would be it, no more columns if you add more types.

Yeah, this is why I wouldn’t go that route - I would be afraid of destroying the original content during an e.g. (Markdown → HTML → Markdown) conversion, which is unlikely to be lossless.

There are definitely cases where you would just store HTML, but for something which can be edited I wouldn’t.

derek-zhou · February 21, 2025, 2:24am

If re-edit is a requirement, then you have little choice but to store the original format, like markdown, somewhere. I would still only store html in the main table, and use an auxiliary table for raw input (and input format). Editing should be a rare event, so one join is no too bad.