Nice! Let’s wait and see what comes out of your issue.
Hey!
I recently had the same issue with extractous. My solution was to copy libtika_native.so and any other libs used by extractous to /app/priv/native/ and then point LD_LIBRARY_PATH at that directory.
I'm doing it like this:
RUN find /app/_build -name libtika_native.so -exec sh -c 'cp $(dirname {})/*.so /app/priv/native/' \;
ENV LD_LIBRARY_PATH=/app/priv/native
It's working for me in my dev env, and I'm currently trying to make it work with mix release as well.
Just tested, and the same thing works for mix release: I copied all the needed .so files from the build layer into the runner image and pointed LD_LIBRARY_PATH at that directory.
COPY --from=builder --chown=nobody:root /app/native native
ENV LD_LIBRARY_PATH=/app/native:$LD_LIBRARY_PATH
I'm starting to realize that if you need text for GenAI workflows, getting content out of pdfs/docx is seamless. In addition to the pdf extraction above with linux pdftotext, docx (doc and txt also) is also easily solved (though perhaps I haven't touched steel yet):
defp extract_docx_content(file_path) do
  try do
    file_charlist = String.to_charlist(file_path)

    case :zip.extract(file_charlist, [
           {:file_list, [~c"word/document.xml"]},
           :memory
         ]) do
      {:ok, [{~c"word/document.xml", document_xml}]} ->
        parse_docx_xml(document_xml)

      {:ok, []} ->
        {:error, "No document.xml found in DOCX file"}

      {:error, reason} ->
        {:error, "Failed to read DOCX file: #{inspect(reason)}"}
    end
  rescue
    error in File.Error ->
      {:error, "File error: #{error.reason}"}

    error in ArgumentError ->
      {:error, "Invalid file format or corrupted DOCX: #{Exception.message(error)}"}

    error ->
      {:error, "Unexpected error reading DOCX: #{inspect(error)}"}
  catch
    {:error, reason} -> {:error, reason}
    error -> {:error, "DOCX extraction failed: #{inspect(error)}"}
  end
end
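The snippet above calls parse_docx_xml/1 but the post doesn't show it, so here is a rough sketch of what such a helper could look like. Everything in it is my assumption rather than the original code: the module name DocxTextSketch, the {:ok, text} return shape, and the choice to pull text out with regexes instead of a real XML parser. It collects the text of each <w:t> run and emits one line per <w:p> paragraph, which is usually enough for feeding plain text into a GenAI pipeline.

```elixir
defmodule DocxTextSketch do
  # Hypothetical helper: the original post references parse_docx_xml/1
  # without showing it. The {:ok, text} return shape is an assumption.
  def parse_docx_xml(document_xml) when is_binary(document_xml) do
    text =
      # One match per <w:p>...</w:p> paragraph. The [ >] after "<w:p"
      # keeps tags such as <w:pPr> from being treated as paragraphs.
      Regex.scan(~r{<w:p[ >].*?</w:p>}s, document_xml)
      |> Enum.map(fn [paragraph] -> paragraph_text(paragraph) end)
      |> Enum.reject(&(&1 == ""))
      |> Enum.join("\n")

    {:ok, text}
  end

  # Concatenates the contents of every <w:t> run inside one paragraph.
  defp paragraph_text(paragraph) do
    Regex.scan(~r{<w:t(?:\s[^>]*)?>(.*?)</w:t>}s, paragraph, capture: :all_but_first)
    |> Enum.map_join(fn [run] -> unescape(run) end)
  end

  # Decodes the five predefined XML entities; &amp; goes last so that
  # something like "&amp;lt;" is not decoded twice.
  defp unescape(text) do
    text
    |> String.replace("&lt;", "<")
    |> String.replace("&gt;", ">")
    |> String.replace("&quot;", "\"")
    |> String.replace("&apos;", "'")
    |> String.replace("&amp;", "&")
  end
end
```

This ignores tabs, line breaks inside a run, headers, footers and footnotes, and a self-closing empty paragraph that carries attributes can get merged with the paragraph after it. If that matters for your documents, a proper XML parser (for example :xmerl from OTP) is probably the safer route.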