DOC, PDF, XLS word/character counting libraries

vadimshvetsov · December 18, 2019, 12:19pm

I’m building a real-time CRM for language service companies, and I need to count the words and characters in documents that are going to be translated.

Does anybody know of libraries for processing or counting the text in DOC, XLS and PDF files?

NobbZ · December 18, 2019, 12:23pm

Counting words in a PDF is hard: a lot of the human-readable text may not be accessible computationally, because of DRM, because the text is not embedded at all, or for many other reasons.

For DOC or XLS I’m not sure about libraries, but at least for DOC and DOCX there are CLI tools that try to convert them to plain text; you can combine one of those with wc.

I’d try to solve this entirely outside of Elixir, as word counting done wrong can keep the full document in memory for a long time.
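To make the "convert, then count" idea concrete, here is a rough sketch in Python of a counter that runs outside the Elixir app and reads the converter's output line by line, so the whole document never sits in memory. The function names `count_stream` and `count_converted` are made up for illustration, and counting characters without whitespace is just an assumption; adjust it to whatever counting rules you actually need.

```python
import subprocess
from typing import Iterable, List, Tuple


def count_stream(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (words, characters) for an iterable of text lines.

    Only one line is held at a time. Words are whitespace-separated
    tokens, like `wc -w`. Characters exclude whitespace here, which is
    an assumption and not a standard.
    """
    words = 0
    chars = 0
    for line in lines:
        tokens = line.split()
        words += len(tokens)
        chars += sum(len(token) for token in tokens)
    return words, chars


def count_converted(command: List[str]) -> Tuple[int, int]:
    """Run a converter that writes plain text to stdout and count its output.

    `command` is whatever CLI tool you pick; it must print the extracted
    text to standard output.
    """
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",  # don't crash on badly encoded bytes
    ) as proc:
        counts = count_stream(proc.stdout)
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"converter exited with status {proc.returncode}")
    return counts
```

With poppler's pdftotext installed, something like `count_converted(["pdftotext", "input.pdf", "-"])` should work, since a trailing `-` makes it write to stdout; `["antiword", "input.doc"]` would be the equivalent for old DOC files. Both tools are only examples, and neither can recover text that is not embedded in the file.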

vadimshvetsov · December 18, 2019, 12:54pm

Thanks. Are there also OCR engines for turning images into text?

As for word counting outside of Elixir: it could be a GenServer-based service running on another server, not in the app, so memory wouldn’t be a problem.