Today I learned about an interesting quirk when parsing DOCX files using Saxy in Elixir. I was trying to parse the word/document.xml file from a DOCX document using Saxy.parse_stream/3, but kept running into a specific error.
The Problem
When I used Unzip.file_stream!/2 to read the XML content and passed the result directly to Saxy.parse_stream/3, I encountered the following error:
** (FunctionClauseError) no function clause matching in Saxy.Parser.Stream."-inlined-parse_prolog/5-"/5
    (saxy 1.5.0) Saxy.Parser.Stream."-inlined-parse_prolog/5-"/5
    (saxy 1.5.0) lib/saxy.ex:312: Saxy.reduce_stream/2
    (elixir 1.17.2) lib/stream.ex:1079: Stream.do_transform_each/3
    (elixir 1.17.2) lib/enum.ex:4858: Enumerable.List.reduce/3
    (elixir 1.17.2) lib/stream.ex:1027: Stream.do_transform_inner_list/7
    (elixir 1.17.2) lib/stream.ex:1052: Stream.do_transform_inner_enum/7
    (elixir 1.17.2) lib/enum.ex:2585: Enum.reduce_while/3
However, when I joined the stream into a single string and used Saxy.parse_string/3, it worked fine.
The Cause
The error occurs because Unzip.file_stream!/2 returns a stream of iodata (each chunk appears as a list when inspected) rather than a stream of binaries, which is what Saxy.parse_stream/3 expects. A list chunk matches none of the clauses in Saxy’s parser, which is why the failure surfaces as a FunctionClauseError in parse_prolog instead of a more descriptive message.
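To make the difference concrete, here is a minimal sketch in plain Elixir, with no Unzip or Saxy involved. The chunk below is made up for illustration; the real chunks from Unzip.file_stream!/2 may be nested differently, but they are lists in the same sense.

```elixir
# A made-up iodata chunk: a nested list of binaries and bytes.
# This only stands in for what a decompressor might emit.
chunk = ["<w:t>", [?H, ?i], "</w:t>"]

# The chunk is a list, not a binary, so code that pattern-matches
# on binaries will not accept it.
IO.inspect(is_binary(chunk), label: "chunk is a binary?")
IO.inspect(is_list(chunk), label: "chunk is a list?")

# Flattening the iodata produces the single binary a parser can match on.
binary = IO.iodata_to_binary(chunk)
IO.inspect(binary, label: "converted")
```

Both forms describe the same bytes; only the binary form has the shape that binary pattern matching can work with.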
The Solution
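To fix this, we need to convert the iodata chunks to binaries. Here’s how:

    # `unzip` is the handle for the already-opened DOCX archive
    stream =
      unzip
      |> Unzip.file_stream!("word/document.xml")
      # Each chunk is iodata; flatten it into a binary for Saxy
      |> Stream.map(&IO.iodata_to_binary/1)

    {:ok, result} = Saxy.parse_stream(stream, DocxHandler, [])
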
This approach maps over the stream, converting each chunk of iodata to a binary using IO.iodata_to_binary/1. The resulting stream of binaries is exactly what Saxy.parse_stream/3 expects, resolving the function clause error.
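The post does not show DocxHandler itself, so the following is only a hypothetical sketch of what such a handler could look like, not the one I actually used. It assumes the Saxy.Handler behaviour with its handle_event/3 callback, starts from the [] initial state used above, and simply collects all character data. The to_text/1 helper is an addition of my own for illustration.

```elixir
defmodule DocxHandler do
  # Hypothetical handler: a sketch, not the implementation from the post.
  @behaviour Saxy.Handler

  # Collect every run of character data, newest first.
  @impl true
  def handle_event(:characters, chars, acc), do: {:ok, [chars | acc]}

  # Ignore all other events (document and element start/end, etc.).
  def handle_event(_event, _data, acc), do: {:ok, acc}

  # Illustrative helper: turn the accumulated (reversed) chunks into one binary.
  def to_text(acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()
end
```

With a handler shaped like this, the result returned by Saxy.parse_stream/3 would be the reversed list of text chunks, and DocxHandler.to_text(result) would give the document's text as a single string.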
Key Takeaway
When working with streams from sources like Unzip.file_stream!/2, always check what each element of the stream actually is before handing it to another library. In this case, converting iodata to binaries was crucial for successful parsing with Saxy.
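One quick way to do that check is to look at the first element of the stream. The helper below is a small sketch of the idea; ChunkCheck and first_chunk_type/1 are names I made up for illustration and are not part of Unzip or Saxy.

```elixir
defmodule ChunkCheck do
  # Hypothetical helper: reports the shape of the first element of an enumerable.
  def first_chunk_type(stream) do
    case Enum.take(stream, 1) do
      [chunk] when is_binary(chunk) -> :binary
      [chunk] when is_list(chunk) -> :iodata_list
      [] -> :empty
      [_other] -> :other
    end
  end
end

# Plain lists stand in for real streams here.
IO.inspect(ChunkCheck.first_chunk_type([["<a>", "</a>"]]), label: "list chunks")
IO.inspect(ChunkCheck.first_chunk_type(["<a></a>"]), label: "binary chunks")
```

Note that Enum.take/2 consumes the first element, which is fine for a lazily built stream that can be enumerated again, but worth keeping in mind for one-shot sources.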
I hope this helps anyone else who might run into similar issues when parsing DOCX files or working with streams in Elixir!