While building small in-house LiveView applications, I’ve asked myself the same question with each new entity created: what’s the best way to store formatted text fields?
For example, many developers like to write in Markdown, most general users will want to write in HTML (using a GUI WYSIWYG widget). I prefer the power of a richer markup language such as ASCIIDOC or Org-Mode, and may even like to support others, including plain text, pointing to the need to accept multiple input formats, which need to be stored and presented back to users to later edit in the same language, as required.
For the benefit of fast server response, and low server overhead, I translate Markdown in a validation function, to HTML, saving this in a separate field. And, for the benefit of clean (clean of formatting characters) search results, I want to save a plain-text version to be used for full-text indexing and results presentation.
This makes the database schema quite messy as even with the minimal three choices, there is a variable amount of storage needed, from one to possibly three fields:
Input format | Output format(s) |
---|---|
HTML | plain text |
Markdown/Org-mode/Asciidoc/other markup | HTML, plain text |
Plain text | n/a |
So, if the user fills in a textarea field in plain text, there is no need for further formats to be saved; if they write in HTML, only an additional plain text field should be stored; if Markdown, an HTML translation and a plain text one should be stored, for three fields in total.
While I could hardcode a markup, HTML and plain text field in every table, this seems wasteful as they won’t always be used. If I delegate them to an external table with a one-to-many relationship, I’m imposing an additional join for every query. Arguably the cleanest and most balanced solution would be to denormalise, creating an Ecto virtual field, storing a JSON column to store any translation as necessary, supporting any number of translations according to need.
I’m sure many of you have come across the same issue and have dealt with it in different ways. I’d love to learn how some of you have decided to handle this in your use case and why, with regards to performance, data redundancy, storage tradeoffs mentioned above?