ChromicPDF - PDF generator

silverdr · June 13, 2022, 9:39pm

@maltoe Oh, and BTW - when trying to convert_to_pdfa/2 I am getting:

** (RuntimeError)   /usr/local/bin/gs exited with status 1!

GPL Ghostscript 9.56.1: Unrecoverable error, exit code 1


    (chromic_pdf 1.2.0) lib/chromic_pdf/utils.ex:53: ChromicPDF.Utils.system_cmd!/3
    (chromic_pdf 1.2.0) lib/chromic_pdf/pdfa/ghostscript_worker.ex:77: ChromicPDF.GhostscriptWorker.pdfinfo/1
    (chromic_pdf 1.2.0) lib/chromic_pdf/pdfa/ghostscript_worker.ex:44: ChromicPDF.GhostscriptWorker.create_pdfa_def_ps!/3
    (chromic_pdf 1.2.0) lib/chromic_pdf/pdfa/ghostscript_worker.ex:29: ChromicPDF.GhostscriptWorker.convert/3
    (chromic_pdf 1.2.0) lib/chromic_pdf/pdfa/ghostscript_pool.ex:35: anonymous fn/5 in ChromicPDF.GhostscriptPool.convert/4
    (nimble_pool 0.2.6) lib/nimble_pool.ex:349: NimblePool.checkout!/4
    (chromic_pdf 1.2.0) lib/chromic_pdf/api.ex:88: anonymous fn/4 in ChromicPDF.API.do_convert_to_pdfa/4
    (chromic_pdf 1.2.0) lib/chromic_pdf/api/telemetry.ex:8: anonymous fn/2 in 
[…]

Does that ring any bells?

evadne · June 13, 2022, 10:09pm

Possible to post your test code?

Given:
for _ <- 1 .. 1024, do: elem(:timer.tc(fn -> ChromicPDF.print_to_pdf({:url, "file://test/integration/fixtures/test.html"}) end), 0)

It is pretty good for the simple case: it prints in 10ms (binary returned inline, encoded). With output to file it runs a bit slower when the document is small.
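
The one-liner above returns a raw list of microsecond timings. A minimal sketch of a helper that summarises such timings could look like the following; `PdfBench` and `measure/2` are hypothetical names, not part of ChromicPDF, and only `:timer.tc/1` from the standard library is used.

```elixir
defmodule PdfBench do
  # Hypothetical helper, not part of ChromicPDF: runs `fun` `runs` times and
  # returns the min, mean, and max wall-clock time in milliseconds.
  def measure(fun, runs \\ 100) when is_function(fun, 0) and runs > 0 do
    times_us =
      for _ <- 1..runs do
        # :timer.tc/1 returns {elapsed_microseconds, result}.
        {elapsed_us, _result} = :timer.tc(fun)
        elapsed_us
      end

    %{
      runs: runs,
      min_ms: Enum.min(times_us) / 1000,
      mean_ms: Enum.sum(times_us) / runs / 1000,
      max_ms: Enum.max(times_us) / 1000
    }
  end
end

# Assumed usage, with ChromicPDF already started:
#   PdfBench.measure(fn ->
#     ChromicPDF.print_to_pdf({:url, "file://test/integration/fixtures/test.html"})
#   end, 1024)
```

Reporting min and max next to the mean makes a high deviation visible, which a single average would hide.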

maltoe · June 14, 2022, 7:28am

Hey @silverdr

Yeah, unfortunately that is about the amount of error information Ghostscript usually gives us. It’s a blast. Not sure what is wrong exactly; it could be an incompatibility with the version of Ghostscript you’re using, or some setup thing. Unfortunately that part of ChromicPDF is rather fragile, though we are using the feature ourselves, currently with Ghostscript 9.55 in an Alpine 3.15-based container. I will take a look at Ghostscript 9.56 when I find the time; I created a ticket for it.

Regarding your speed & size concerns: As you said, these are out of ChromicPDF’s influence unfortunately. Still good to know, of course. If people are looking for minimum PDF file size, rendering with Chrome is likely not the way to go.

Size: We use it exclusively for 1-2 pager text documents, which usually clock in at around 30kb. I suspect that the size of the document is dominated by included images & fonts, and image quality.
Speed: As @evadne said, it would be nice to see your benchmarks. Usually it’s blazing fast for us, and orders of magnitude faster than anything that starts a fresh Chrome instance for each PDF. Of course it is possible that wkhtmltopdf is still faster, perhaps due to a faster rendering engine. But I have doubts, and the “2 to 2.5 times slower” you quote seems like a lot.

Thanks for your feedback though!
malte

silverdr · June 14, 2022, 12:12pm

Not really test code. I print_to_pdf/2 HTML strings of real-life documents and output the PDF to file. I am not entitled to share those docs, but while they are not trivial like the “Hello, world!” type of fixture you refer to, they’re not overly complex either. Less than three pages “Letter”, single font face, an image or two etc. As mentioned, this type is worth about 40 to 50 KiB for wkhtmltopdf, which renders and saves them consistently in a tad under one second. With Chrome it is much more unpredictable (higher deviation) but not less than two seconds so far.

silverdr · June 14, 2022, 12:27pm

Thank you for coming back on it. I’ll see if I can get an earlier Ghostscript version on the dev machine. As for the speed, I am wondering what the reason might be if you say ChromicPDF might even be faster than wkhtmltopdf. Do I understand correctly that /unless/ I set the on_demand: true option, the default setup is pooled with some default pool sizes mentioned in the docs? Maybe something lies there, because I haven’t noticed any significant difference between the two setups. But I didn’t specify any pool options. And yes, zombies invaded my machine in this setup, so I took it that it worked.

maltoe · June 14, 2022, 1:41pm

“Do I understand correctly that /unless/ I set on_demand: true option, the default setup is pooled”

Yes, on_demand essentially bypasses the entire supervision tree booting, and instead starts the relevant ChromicPDF processes as well as the external Chrome process when you call print_to_pdf. So, if you’re testing this for example in a .exs script and only print a single PDF, these two modes of operation will in fact appear to behave the same. You should notice a drastic difference when you perform manual tests on the console and print multiple PDFs.
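
As a rough sketch of the two modes described above, the difference comes down to the child spec placed in the application's supervision tree. `MyApp.PdfChildren` is a hypothetical helper name; the `on_demand: true` option is the one discussed in this thread.

```elixir
defmodule MyApp.PdfChildren do
  # Hypothetical helper: returns the ChromicPDF child spec list for an
  # application supervisor, depending on the desired mode.

  # Pooled (default): the supervision tree and Chrome are started at boot
  # and reused across print_to_pdf calls.
  def children(:pooled), do: [ChromicPDF]

  # On-demand: the ChromicPDF processes and the external Chrome process are
  # started when print_to_pdf is called.
  def children(:on_demand), do: [{ChromicPDF, on_demand: true}]
end

# Assumed usage inside Application.start/2:
#   Supervisor.start_link(MyApp.PdfChildren.children(:pooled), strategy: :one_for_one)
```

Printing several PDFs in a row from an IEx session is the simplest way to see the difference, since a single print in a script pays the Chrome startup cost in both modes.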

In order to debug this further though, it would be great if you could provide a minimum working example, i.e. some benchmark script with a PDF template that shows the slowness/unpredictability you’re experiencing.

AUGERClement · June 14, 2022, 2:15pm

Do you think your project may evolve and become more versatile? I still can’t find an easy-to-use Elixir library to just extract the words from a PDF.

silverdr · June 14, 2022, 3:04pm

Roger. I’ll check it all step by step in a day or two and if nothing helps I’ll make a “dummy” HTML document of the type in question and provide it for checking. Tnx so far once more.

silverdr · June 14, 2022, 3:08pm

And what exactly do you mean by that?

evadne · June 14, 2022, 4:20pm

If OCR → tesseract. Write your own integration. Otherwise be careful with deceptive glyphs.

outlog · June 14, 2022, 4:41pm

you can call pdftotext (PyPI) using System.cmd or Rambo etc., then parse line by line…
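
A minimal sketch of the System.cmd route, assuming the `pdftotext` command-line tool from poppler-utils is installed and on the PATH (the PyPI package mentioned above is a Python binding, which is not used here). `PdfWords` and its functions are hypothetical names.

```elixir
defmodule PdfWords do
  # Hypothetical module. Assumes poppler's `pdftotext` binary is on the PATH;
  # System.cmd/3 raises if the executable cannot be found.
  def extract_text(pdf_path) when is_binary(pdf_path) do
    # "-" as the output file tells pdftotext to write the text to stdout.
    case System.cmd("pdftotext", ["-layout", pdf_path, "-"], stderr_to_stdout: true) do
      {text, 0} -> {:ok, text}
      {output, status} -> {:error, {status, output}}
    end
  end

  # Splits extracted text into words on any run of whitespace.
  def words(text) when is_binary(text) do
    String.split(text, ~r/\s+/u, trim: true)
  end
end

# Assumed usage:
#   {:ok, text} = PdfWords.extract_text("invoice.pdf")
#   PdfWords.words(text)
```

This only works for PDFs that contain a text layer; scanned documents need OCR, as noted earlier in the thread.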

silverdr · June 16, 2022, 6:22pm

OK, so I was able to get back to this and scrutinise my setup details. I thought I had found the culprit when I managed to get below 300ms on average. The reason for the previous lack of performance (I thought) was that ChromicPDF, instead of the current “Chromium” browser, picked a two-year-old “Chrome” I had forgotten was still installed. And since I had blocked Google’s malware updater software from running, it hadn’t been updated for over two years. So everything looked great… for a moment. Once things started to work well on the dev machine running “macos”, I moved to the one running GNU/Linux. In the end, this is what the production env runs. Here I also made sure that the very same, latest “Chromium” version[*] is run in place of the previously installed packaged one, and still got times over two seconds (pooled). Yes, the Linux machine has similar hardware capabilities, so hardware definitely does not explain the up to 10x factor I observe. And yes, that machine clocks similar times with wkhtmltopdf (around 800ms) as the “macos” computer I normally use for dev work.

Shall return to this and do some more “tracing”.

* - in both cases I downloaded the same version directly off the “Chromium” project rather than using packaged versions
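
To rule out ChromicPDF picking an unexpected browser, the binary can be pinned explicitly. The sketch below assumes ChromicPDF's `chrome_executable` start option; `MyApp.ChromeConfig` and the example path are hypothetical.

```elixir
defmodule MyApp.ChromeConfig do
  # Hypothetical helper: builds ChromicPDF start options that pin the browser
  # binary, failing early if the path does not exist.
  def chromic_opts(path) when is_binary(path) do
    if File.exists?(path) do
      # `chrome_executable` is assumed to be the ChromicPDF option name.
      [chrome_executable: path]
    else
      raise ArgumentError, "no Chrome executable at #{path}"
    end
  end
end

# Assumed usage in the supervision tree (the path is an example only):
#   {ChromicPDF, MyApp.ChromeConfig.chromic_opts("/opt/chromium/chrome")}
```

Failing at boot when the path is wrong avoids silently falling back to whatever Chrome happens to be installed.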

outlog · June 17, 2022, 9:27pm

if you are benchmarking/looking for speed, give weasyprint a go… I simply use it by calling the CLI using rambo (quick copy/paste code below), but you can set it up with ports (see cmdarek.com - Generate PDFs in Elixir) and have the weasyprint instance running at any time, including caching fonts/images for fast response…

might be quite the rabbit hole though:/

code for calling the CLI (which obviously incurs a startup penalty); populate_html simply calls EEx.eval_file with the template path and data

defmodule RamboPdf do
  def create_pdf(item) do
    css = Path.join([PDFfile.path(), "templates/invoice", "invoice.css"])

    html =
      PDF.gen_data(item)
      |> PDF.populate_html("templates/invoice", "invoice.html")

    safe_name =
      PDFfile.fix_name(item.name)
      |> String.trim()
      |> String.replace(" ", "_")
      |> String.to_charlist()
      |> Enum.filter(&(&1 in 0..127))
      |> List.to_string()

    output_file = Path.join([PDFfile.path(), "output", "#{item.room}_#{safe_name}.pdf"])
    # https://doc.courtbouillon.org/weasyprint/stable/api_reference.html#command-line-api

    task =
      Task.async(fn ->
        Rambo.run("weasyprint", ["--encoding", "utf8", "-q", "-s", css, "-", "-"],
          in: html,
          log: false,
          timeout: 20_000
        )
      end)

    rambo = Task.await(task, :infinity)

    case rambo do
      {:ok, %Rambo{err: _err, out: output, status: 0}} ->
        File.write(output_file, output, [:binary])
        output

      error ->
        IO.inspect("error")
        IO.inspect(error)
    end
  end
end

silverdr · June 18, 2022, 11:31am

Thank you for the suggestion. The main reason for going with ChromicPDF was that I need the output from browser-printed and server-generated PDFs to look the same, and that without synchronising and maintaining two different templates. Of course there may be some minor differences between how each of the three major browsers renders the printout, but tests have shown that those are negligible. IOW I am not really benchmarking for speed but rather trying to understand the reasons for the severe underperformance I experience /in some cases/.

silverdr · March 2, 2023, 8:25pm

Coming back after tuning the project with ChromicPDF, which eventually went into production. The good news is that in production, the application, containerised with a recent Chromium and the Debian-packaged Ghostscript 9.53.3, works (pooled rather than “on demand”) faster than wkhtmltopdf while producing similarly sized output files (about 50 KiB)! The [very] long story short: both the performance and especially the output size depend heavily on what fonts are available for Chromium to pick for the given CSS. Using only non-commercial fonts (my GNU/Linux laptop) makes a big difference when compared to a macos desktop with lots of various fonts available to the browsers. In order to get the look and output size I am satisfied with, I eventually removed all preinstalled fonts from the container and added only those non-commercial ones I found to give good visual results, then picked the ones that give the smallest file size. All in all, after spending lots of time fine-tuning the container, I am very satisfied with the results. The only (acceptable) issue is that PDFs saved from the browser’s window and generated on the server can (and often do) exhibit some visual differences due to different fonts being used. A known problem of course and, unless one wants to supply one's own fonts for both cases, something to live with. Summing up: once more, thank you @maltoe, and kudos!

Bumbus · March 15, 2023, 2:02pm

Hi, is there a way to print the total number of pages somehow?
I need to achieve some kind of pagination info in the footer, e.g. Page 1 of 10.

Bumbus · March 15, 2023, 2:12pm

ok found it myself:

"<span class='totalPages'></span>"

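For a “Page 1 of 10” style footer, Chrome fills in elements with the classes `pageNumber` and `totalPages` in header and footer templates. A minimal sketch of such a footer string follows; `PageFooter` is a hypothetical name and the inline styling is only an example.

```elixir
defmodule PageFooter do
  # Hypothetical helper: returns a footer template in which Chrome injects
  # the current page number and the total page count.
  def html do
    """
    <p style="font-size: 10px; width: 100%; text-align: center;">
      Page <span class="pageNumber"></span> of <span class="totalPages"></span>
    </p>
    """
  end
end

# Assumed usage with ChromicPDF.Template, as elsewhere in this thread:
#   [content: html, footer: PageFooter.html(), size: :a4]
#   |> ChromicPDF.Template.source_and_options()
#   |> ChromicPDF.print_to_pdf()
```
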
nseaSeb · January 5, 2024, 4:17pm

First of all, thank you and Chromic is really cool!
I’m encountering a rather silly problem when deploying on Fly.io: chrome isn’t present. Do I just need to add it to my yaml (docker)? Has anyone had any experience of this?
I’d like to take this opportunity to wish everyone a happy new year

nseaSeb · January 5, 2024, 4:57pm

I try what I read in this post

nseaSeb · January 6, 2024, 1:55pm

Hello, the link shared above is ok, it works for deploying chromic in a fly.io project.

Just in case, I’m trying to pass params to define an author, topic etc in my PDF but that doesn’t seem to be the right approach, any ideas?

    params = [
      info: %{
        author: "Me myself & I",
        subject: "Try to pass params"
      }
    ]

    {_, blob} =
      [
        content: """
        <!doctype html>
        <html>
          <head>
            <meta charset="UTF-8" />
            <meta name="viewport" content="width=device-width, initial-scale=1.0" />
            <script src="https://cdn.tailwindcss.com">
            </script>

        <style>
        table {
        border-collapse: collapse;
        font-size: 10px
        }
        p {
          padding: 8px;
        }
        th {
        font-weight: 700;
        }
        td, th {
        border: 1px solid #999;
        padding: 0.5rem;
        text-align: left;
        }
        </style>
          </head>
          <body>
          <div class="p-4">
           #{table}
           </div>
          </body>
        </html>
        """,
        header: """
        <p style="padding: 8px;">#{filename}</p>
        """,
        footer: """
        <p style="padding: 8px;">Gesti'up</p>
        """,
        size: :a4,
        header_height: "40mm",
        footer_height: "40mm"
      ]
      |> ChromicPDF.Template.source_and_options()
      |> ChromicPDF.print_to_pdf(params)

    conn
    |> put_resp_content_type("application/pdf")
    |> put_resp_header("content-disposition", "attachment; filename=#{filename}.pdf")
    |> put_root_layout(false)
    |> send_resp(200, Base.decode64!(blob))
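
One possible explanation, stated here as an assumption rather than a confirmed answer: in ChromicPDF the `info` option appears to belong to the PDF/A functions (`convert_to_pdfa` and `print_to_pdfa`), which go through Ghostscript, and not to plain `print_to_pdf`. A sketch under that assumption, with `PdfMeta` as a hypothetical helper:

```elixir
defmodule PdfMeta do
  # Hypothetical helper: builds the option list carrying document metadata.
  # Assumes `info` is honoured by the PDF/A functions only.
  def pdfa_opts(author, subject) when is_binary(author) and is_binary(subject) do
    [info: %{author: author, subject: subject}]
  end
end

# Assumed usage (requires ChromicPDF and a working Ghostscript; not verified here):
#   {:ok, blob} =
#     [content: html, size: :a4]
#     |> ChromicPDF.Template.source_and_options()
#     |> ChromicPDF.print_to_pdfa(PdfMeta.pdfa_opts("Me myself & I", "Try to pass params"))
```

If PDF/A output is not wanted, the metadata would have to be set by other means, for example a `<title>` element in the HTML for the document title.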