PDFTron + Elixir

Hi all,

I have been building a PDFTron NIF for a work project, and it is pretty awesome.
We are leveraging both the webview JavaScript library and the C++ code integrated as a NIF (nifcpp) for server-side PDF and Office document processing. PDFTron exposes possibly the best (non-Java) headless docx/pptx/xlsx to PDF conversion I have come across. The only downside is that it is not free; however, they offer a very generous trial/demo to use for development.

Has anyone tried it before?

I posted some questions to Slack, but figured I would also ask them here for more exposure.

  1. Are operations on a PDF (creation, modification, conversion) dirty, i.e. should they run on a dirty scheduler? If so, are they CPU bound or IO bound?
    • Got some confirmation on Slack that this is CPU bound, which was also my hypothesis.
    • If the NIF itself reads/writes the PDF file, would that make it IO bound?
      • I am not sure how one would determine whether a NIF is both IO and CPU bound, or which scheduler option to choose in that case; for example, does CPU take precedence over IO if the work is equally dirty in both respects?
  2. Is a NIF even the right tool for this? Would a port suffice?
    • I have a NIF working, but I am a bit concerned about safety…
    • Would wrapping the C++ in some Rust and using Rustler be of any benefit?
  3. We are passing strings from Elixir -> C++ that represent the contents of a PDF file, and returning the contents of a new PDF file as a string.
    • Would using a resource binary be better for the input/output (more performant/safer)?
    • I was having some trouble getting this working; if it is the best way, I might need some help.
  4. I am having a bit of trouble dynamically linking the PDFTron compiled .so file in the context of a (portable) mix package. Any tips?
    • I currently need to do something like this: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"$(pwd)/c_src/Lib"
    • Could I just set something in the NIF init function, e.g. System.put_env("LD_LIBRARY_PATH", ...)?
    • Where should I put this .so file? I have it in c_src/Lib of the mix project; should it be in priv/ with my compiled .so?

Thanks in advance for taking the time to reply!

For finding the .so file, the most common pattern is using :code.priv_dir/1.

See https://mintcore.se/blog/2017/10/reading-files-from-your-priv-directory-with-elixir

Though with the updated build setup in Elixir 1.9+ it’s more stable to put it in the _build folder. I just copy the Makefile from https://github.com/elixir-circuits/circuits_gpio.

  1. I would personally wrap this in an OS thread, using enif_thread_create and friends.
    1b. If you’re actually reading/writing the file inside the NIF (which I don’t think you should do unless you need mmap or O_DIRECT, which you shouldn’t for PDFs), then it probably depends on the contents of the PDF.
  2. I would probably pick a port instead of a NIF, because PDF is inherently unsafe. You could also do something like set up a c_node.
    2a. The inherently unsafe parts of PDF have nothing to do with the forms of safety that Rust gives you, so wrapping in Rustler will do nothing for you except make your calls at the boundary safe. (I developed Zigler, which lets you wrap in Zig; similarly, there would be no safety benefits beyond making sure your NIF interface is correct, marshalling into C types, and raising ArgumentError if you mess up.)
  3. Resources are only useful if you are passing either a. mutable binaries or b. unserialized data between NIF functions. If your functions aren’t “in place” binary functions, then it’s much better to just marshal into Erlang’s binaries.
  4. Drop it into priv, and get the priv directory using :code.priv_dir; your life will be so much simpler. I believe Rustler and Zigler do this by default.
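
To make the port option from point 2 a bit more concrete, here is a minimal sketch of what the C side of a port could look like. It assumes the Elixir side opens the port with the {:packet, 4} option, so every message is prefixed by a 4-byte big-endian length. The names serve_port, read_packet, write_packet and convert_document are all made up for illustration, and convert_document is only a placeholder that copies its input; it stands in for whatever PDFTron call you would actually make.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical placeholder for the real conversion (e.g. a PDFTron call).
   Here it only copies the input so the framing can be exercised.
   Returns a malloc'd buffer the caller must free, or NULL on failure. */
static unsigned char *convert_document(const unsigned char *in, uint32_t in_len,
                                       uint32_t *out_len) {
    unsigned char *out = malloc(in_len ? in_len : 1);
    if (out == NULL) return NULL;
    memcpy(out, in, in_len);
    *out_len = in_len;
    return out;
}

/* Read one frame: 4-byte big-endian length, then that many payload bytes.
   Returns 0 on success, 1 on EOF before a new frame, -1 on error. */
static int read_packet(FILE *in, unsigned char **payload, uint32_t *len) {
    unsigned char hdr[4];
    size_t got = fread(hdr, 1, sizeof hdr, in);
    if (got == 0) return 1;            /* the Erlang side closed the port */
    if (got != sizeof hdr) return -1;  /* truncated header */
    *len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
           ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
    *payload = malloc(*len ? *len : 1);
    if (*payload == NULL) return -1;
    if (fread(*payload, 1, *len, in) != *len) {
        free(*payload);
        return -1;                     /* truncated payload */
    }
    return 0;
}

/* Write one frame with the same 4-byte big-endian length prefix. */
static int write_packet(FILE *out, const unsigned char *payload, uint32_t len) {
    unsigned char hdr[4] = {
        (unsigned char)(len >> 24), (unsigned char)(len >> 16),
        (unsigned char)(len >> 8), (unsigned char)len
    };
    if (fwrite(hdr, 1, sizeof hdr, out) != sizeof hdr) return -1;
    if (fwrite(payload, 1, len, out) != len) return -1;
    return fflush(out) == 0 ? 0 : -1;  /* flush so the BEAM sees the reply */
}

/* Request/response loop: one document in, one document out, until EOF.
   Returns 0 on clean shutdown, -1 on any framing or allocation error. */
int serve_port(FILE *in, FILE *out) {
    for (;;) {
        unsigned char *req = NULL;
        uint32_t req_len = 0;
        int rc = read_packet(in, &req, &req_len);
        if (rc == 1) return 0;
        if (rc != 0) return -1;

        uint32_t resp_len = 0;
        unsigned char *resp = convert_document(req, req_len, &resp_len);
        free(req);
        if (resp == NULL) return -1;

        rc = write_packet(out, resp, resp_len);
        free(resp);
        if (rc != 0) return -1;
    }
}
```

A real port executable would only need a main that calls serve_port(stdin, stdout), and on the Elixir side something like Port.open({:spawn_executable, path}, [:binary, {:packet, 4}]) would pair with it. The point of the design is that a crash inside the conversion only kills this OS process, and the owning Elixir process sees the port exit instead of the whole BEAM going down.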

Thanks!
I might set up a script that downloads the C dependencies into the priv directory over HTTPS before building for the first time, so as to better keep them up to date…

Thanks @ityonemo,
I have seen some of your Zig related posts - pretty neat stuff!

  1. I like the idea of wrapping it in a thread, but I cannot seem to find any simple sample of enif_thread_create. Have you seen any good examples? I think I will post an example of my NIF code to help the discussion.
    If the NIF call is wrapped in a thread, I presume that helps ensure that a crash is isolated and the BEAM can keep ticking without much issue? Also, maybe the VM’s memory space is not as exposed? I presume the trade-off between NIF and port becomes less relevant when adding the overhead of a new thread.

  2. If I had it running in a thread, would you still suggest a port? (to continue the thought from above)
    I am passing full PDF files as arguments to the functions, which I would think can be optimized better in a NIF, as the memory can just be shared via a pointer as opposed to being copied around, or at least copied around more efficiently? I may be wrong…

  3. In the case of the PDF blob passed to the NIF, it is unserialized, I think, and can be read-only. I would create a new binary for the response, so I am not really benefiting from marshalling the data; it is really just a pointer to some bytes in memory.

  4. Yeah, priv seems to be the right place. I will play around with setting the LD_LIBRARY_PATH environment variable before loading the NIF to see how that works. …

  1. Spawning in a thread offers you no protection whatsoever; a segfault will still crash the entire BEAM.

  2. That’s why ports are still safer; you probably aren’t going to run both a port and a nif. But even ports don’t give you supervision, so zombie processes can be a thing.

  3. That’s exactly the NIF vs port trade-off. One last thing to remember: if you do read a binary in a separate thread, read it from the environment of the new thread, otherwise the GC could take the binary away from under you and cause a segfault… I think. This is one of the reasons why NIFs are hard.

  4. You don’t typically need to set LD_LIBRARY_PATH for NIFs; you can give an absolute path to the NIF load call, which I believe is typically made from an @on_load callback.