Escaping variables for LaTeX in EEx templates, should I create a format encoder?

KungPaoChicken · June 21, 2021, 9:45am

Hello all,

I am trying to generate PDFs from EEx templates with variables from a user request. The variables need escaping before being inserted into the template, but I am stuck on the approach. I can substitute the strings in the view, but it doesn’t feel “proper” when compared with Phoenix.HTML.Safe. I have read the part about format encoders and tried to understand how Phoenix.HTML works by reading the source code, without much success. Iona adapts Phoenix.HTML for TeX, but its documentation states that it is not ready for production.

Long story short, should I go through the effort to create a format encoder (by modifying Phoenix.HTML) or is something like Regex.replace/4 enough? Thank you!

EDIT: If it helps, you can assume that the variables are plain UTF-8 strings, and they do not contain HTML/TeX code.
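
For reference, the Regex.replace/4 route from the question could look roughly like the sketch below. The module name TexEscapeSketch is made up for illustration, and the replacement table covers only the ten classic TeX special characters, not everything a production escaper might need. It assumes the input is a plain UTF-8 string, as stated in the edit above.

```elixir
defmodule TexEscapeSketch do
  # Hypothetical module; maps each special character to a LaTeX-safe form.
  @replacements %{
    "\\" => "\\textbackslash{}",
    "&" => "\\&",
    "%" => "\\%",
    "$" => "\\$",
    "#" => "\\#",
    "_" => "\\_",
    "{" => "\\{",
    "}" => "\\}",
    "~" => "\\textasciitilde{}",
    "^" => "\\^{}"
  }

  # Escapes a plain UTF-8 string in a single pass over the input.
  def escape(text) when is_binary(text) do
    # The function form of the replacement receives the matched character.
    # Because the input is scanned only once, the backslashes and braces
    # introduced by a replacement are never escaped a second time.
    Regex.replace(~r/[\\&%$#_~^{}]/, text, fn char ->
      Map.fetch!(@replacements, char)
    end)
  end
end
```

The single pass is the main design point: chaining several String.replace/3 calls would re-escape the braces that an earlier replacement such as \textbackslash{} had just inserted.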

Sebb · June 21, 2021, 4:54pm

For generating documents from a template I’d always prefer HTML+CSS over LaTeX.

With CSS3 Paged Media there is little that can’t be done. O’Reilly, for example, publishes its books using CSS3. Until recently CSS3 processors were very expensive (O’Reilly Atlas uses Antenna House, which costs $5k per server CPU), but there are some good open source tools now:

A very good, ultra-easy and affordable (pay as you go) tool is https://docraptor.com/ which uses https://www.princexml.com/

Here is a very good introduction to print-CSS: https://print-css.rocks/ and https://print-css.com/

KungPaoChicken · June 22, 2021, 9:33am

Thanks for the reply! I don’t mean to dismiss your solutions but we have tried some of them and unfortunately the results are not very good. Here is a brief overview:

We started with HTML → PDF with a prototype using WeasyPrint. Because its renders were inconsistent with a browser’s (IIRC flexbox and tables, cf. issues on GitHub), we tried pdfkit (a wrapper for wkhtmltopdf). We ran into issues there too (IIRC something about fonts, but in general wkhtmltopdf is outdated) and ended up with Puppeteer + Chrome. There the render finally looks alright, but margin boxes aren’t supported. We are aware of Paged.js and used it with headless Chrome, but for some reason it doesn’t work well enough: we need something to stay on a particular page, and while break-before/inside/after works, it also creates a massive empty space where an element could fit. Page numbers didn’t work either.

After investing quite some effort rewriting the layout with older CSS and fighting with renderers to get an arguably acceptable output, we switched to LaTeX. There are some issues (distro packaging), but in general it went very well: the performance is comparable with headless Chrome, yet the output is way better. We had a prototype built with Python3+Jinja2, and I am integrating it into our Phoenix application to simplify the codebase. DocRaptor and PrinceXML are great solutions, but I will need to go through some company process first. In the meantime I would like to make EEx work with LaTeX.

Anyway, I created a format encoder based on the one in Phoenix.HTML and changed the list of escaped characters. When I look at the output in the browser, the whole template is escaped, not just the EEx snippets. What did I miss?

Controller:

conn
|> put_resp_content_type("text/plain")
|> render("invoice.tex", example_invoice)

List of escaped characters (taken from PyLaTeX):

escapes = [
  {?{, "\\{"},
  {?}, "\\}"},
  {?\\, "\\textbackslash{}"},
  {?#, "\\#"},
  {?$, "\\$"},
  {?%, "\\%"},
  {?&, "\\&"},
  {?^, "\\^{}"},
  {?_, "\\_"},
  {?~, "\\textasciitilde{}"},
  {?\n, "\\newline%\n"},
  {?-, "{-}"},
  {0xA0, "~"},
  {?[, "{[}"},
  {?], "{]}"}
]

The rest is the same as in Phoenix.HTML’s engine.ex and safe.ex, apart from renaming the modules and renaming html_escape/1 to tex_escape/1.

config.exs:

config :phoenix, :format_encoders,
  html: Phoenix.HTML.Engine,
  tex: MyApp.TeX.Engine
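
To make the escape list concrete, here is a rough sketch of how a tex_escape/1 could be generated from such a list by walking the binary one codepoint at a time. This is not the actual Phoenix.HTML code; the module name MyApp.TeX.EscapeSketch is hypothetical, and only a subset of the list above is included to keep it short. It assumes valid UTF-8 input and raises otherwise.

```elixir
defmodule MyApp.TeX.EscapeSketch do
  # Subset of the escape list from the post, as {codepoint, replacement} pairs.
  @escapes [
    {?\\, "\\textbackslash{}"},
    {?&, "\\&"},
    {?%, "\\%"},
    {?_, "\\_"},
    {?\n, "\\newline%\n"},
    {0xA0, "~"}
  ]

  # Escapes a UTF-8 binary and returns a binary.
  def tex_escape(text) when is_binary(text) do
    text |> escape([]) |> IO.iodata_to_binary()
  end

  # One function clause per escaped codepoint, generated at compile time.
  # Matching with ::utf8 matters for 0xA0, which is two bytes in UTF-8.
  for {char, replacement} <- @escapes do
    defp escape(<<unquote(char)::utf8, rest::binary>>, acc) do
      escape(rest, [acc, unquote(replacement)])
    end
  end

  # Any other codepoint is copied through unchanged.
  defp escape(<<char::utf8, rest::binary>>, acc), do: escape(rest, [acc, <<char::utf8>>])
  defp escape(<<>>, acc), do: acc
end
```

Building iodata and converting once at the end avoids repeated binary concatenation; a real encoder might return the iodata directly instead.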

Sebb · June 22, 2021, 12:17pm

I never used WeasyPrint, but from the samples at https://print-css.rocks/ it seems to be able to cover basic requirements. If you look at the samples of the different processors, you see that complex layouts are possible:

https://www.pagedjs.org/examples/
Samples - PDFreactor
Prince - Sample Documents (docraptor uses this)
Antenna House Formatter | Convert XML or HTML to PDF

I never had anything I could not fix (with prince-engine). Some CSS knowledge required.
If you can’t invest in a paid product and paged.js is not enough, LaTeX is the most potent option. But it’s most definitely more expensive in the long run.

KungPaoChicken · June 22, 2021, 2:33pm

I never had anything I could not fix (with prince-engine).

Good to hear from another happy user of Prince. It was at the top of our list of tools to try until everyone saw the price tag for a commercial license. The samples at https://print-css.rocks/ are impressive, but they also highlight the inconsistencies among renderers. Implementing a renderer for HTML+CSS is very difficult for a small team; while working on this I encountered a bug where the PDF text won’t display in Chrome but does in Firefox.

It feels like HTML → print is not an important use case to the standards bodies, and Prince is one of the few renderers that implement HTML+CSS+CSS Paged Media correctly. It is going to be difficult convincing the higher-ups to pay $3800 for page numbers, but without it we have to resort to crazy hacks (I seriously considered passing the PDF from headless Chrome to LaTeX for page numbers…).

Playground (Try switching the backends)
Generated PDFs from the lessons

/rant

Anyway, I haven’t figured out why the whole template is escaped instead of only the Elixir parts, so I moved the Safe protocol to the view and now pipe all EEx tags through the escape function. Thank you very much @Sebb!
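
A minimal sketch of that workaround, under some assumptions: the names TexSafeSketch and InvoiceViewSketch are invented here, the protocol is only loosely modeled on the idea behind Phoenix.HTML.Safe, and the template is evaluated with plain EEx.eval_string/2 rather than through a Phoenix view. The escaping covers just the basic special characters.

```elixir
defprotocol TexSafeSketch do
  @doc "Converts a value into a LaTeX-safe string."
  def to_tex(value)
end

defimpl TexSafeSketch, for: BitString do
  @replacements %{
    "\\" => "\\textbackslash{}", "&" => "\\&", "%" => "\\%", "$" => "\\$",
    "#" => "\\#", "_" => "\\_", "{" => "\\{", "}" => "\\}",
    "~" => "\\textasciitilde{}", "^" => "\\^{}"
  }

  # Strings are escaped in a single pass.
  def to_tex(text) when is_binary(text) do
    Regex.replace(~r/[\\&%$#_~^{}]/, text, &Map.fetch!(@replacements, &1))
  end
end

defimpl TexSafeSketch, for: Integer do
  # Integers contain no TeX special characters, so no escaping is needed.
  def to_tex(number), do: Integer.to_string(number)
end

defmodule InvoiceViewSketch do
  # Helper that every EEx tag in the template is piped through.
  def tex(value), do: TexSafeSketch.to_tex(value)
end

# The literal TeX in the template is left alone; only the assigns are escaped.
template =
  ~S"\textbf{<%= InvoiceViewSketch.tex(@customer) %>} owes <%= InvoiceViewSketch.tex(@amount) %> EUR"

rendered = EEx.eval_string(template, assigns: [customer: "Smith & Sons", amount: 100])
IO.puts(rendered)
```

The downside compared with a real engine is that escaping is opt-in: a tag that forgets the helper goes into the TeX source unescaped.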