TIL: Parsing DOCX files with Saxy - Handling Stream vs String Input

Today I learned about an interesting quirk when parsing DOCX files using Saxy in Elixir. I was trying to parse the word/document.xml file from a DOCX document using Saxy.parse_stream/3, but kept running into a specific error.

The Problem

When using Unzip.file_stream!/2 to read the XML content and passing it directly to Saxy.parse_stream/3, I encountered the following error:

** (FunctionClauseError) no function clause matching in Saxy.Parser.Stream."-inlined-parse_prolog/5-"/5 

    (saxy 1.5.0) Saxy.Parser.Stream."-inlined-parse_prolog/5-"/5
    (saxy 1.5.0) lib/saxy.ex:312: Saxy.reduce_stream/2
    (elixir 1.17.2) lib/stream.ex:1079: Stream.do_transform_each/3
    (elixir 1.17.2) lib/enum.ex:4858: Enumerable.List.reduce/3
    (elixir 1.17.2) lib/stream.ex:1027: Stream.do_transform_inner_list/7
    (elixir 1.17.2) lib/stream.ex:1052: Stream.do_transform_inner_enum/7
    (elixir 1.17.2) lib/enum.ex:2585: Enum.reduce_while/3

However, when I joined the stream into a string and used Saxy.parse_string/3, it worked fine.

The Cause

The error occurs because Unzip.file_stream!/2 returns a stream of iodata (which appears as a list when inspected) rather than a stream of binaries, which is what Saxy.parse_stream/3 expects. This mismatch in data types causes the function clause error in Saxy’s parser.
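To make the mismatch concrete, here is a minimal sketch that uses a hand-written chunk as a stand-in for what the stream emits. The nested shape of `chunk` is an assumption for illustration; the real chunks from Unzip may be nested differently.

```elixir
# A hand-written stand-in for one chunk from the stream (assumed shape).
chunk = ["<?xml version=\"1.0\"", [" encoding=\"UTF-8\"", "?>"]]

# The chunk is a list, so a function clause that only matches
# binaries has nothing to match against.
IO.inspect(is_binary(chunk)) # => false
IO.inspect(is_list(chunk))   # => true

# IO.iodata_to_binary/1 flattens the nested list into one binary.
binary = IO.iodata_to_binary(chunk)
IO.inspect(is_binary(binary)) # => true
IO.puts(binary)               # prints <?xml version="1.0" encoding="UTF-8"?>
```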

The Solution

To fix this, we need to convert the iodata chunks to binaries. Here’s how:

stream =
  unzip
  |> Unzip.file_stream!("word/document.xml")
  |> Stream.map(&IO.iodata_to_binary/1)

{:ok, result} = Saxy.parse_stream(stream, DocxHandler, [])

This approach maps over the stream, converting each chunk of iodata to a binary using IO.iodata_to_binary/1. The resulting stream of binaries is exactly what Saxy.parse_stream/3 expects, resolving the function clause error.
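The snippet above uses a DocxHandler module without showing it. As a rough idea of what such a handler could look like, here is a minimal sketch of a Saxy.Handler that collects the text inside `w:t` elements. The module name `DocxTextHandler` and its state shape are hypothetical, not the handler from my project, and the event list at the bottom is hand-written so the logic can be followed without a DOCX file.

```elixir
defmodule DocxTextHandler do
  # Hypothetical handler for illustration only.
  # Requires the saxy dependency for the behaviour definition.
  @behaviour Saxy.Handler

  # State is {inside_text_element?, collected_chunks_in_reverse_order}.
  @impl true
  def handle_event(:start_element, {"w:t", _attrs}, {_inside, acc}), do: {:ok, {true, acc}}
  def handle_event(:end_element, "w:t", {_inside, acc}), do: {:ok, {false, acc}}
  # Only keep character data that appears inside a w:t element.
  def handle_event(:characters, chars, {true, acc}), do: {:ok, {true, [chars | acc]}}
  # Ignore every other event.
  def handle_event(_event, _data, state), do: {:ok, state}
end

# Drive the handler by hand with the kind of events Saxy emits.
events = [
  {:start_element, {"w:p", []}},
  {:start_element, {"w:t", []}},
  {:characters, "Hello"},
  {:end_element, "w:t"},
  {:end_element, "w:p"}
]

{_inside, acc} =
  Enum.reduce(events, {false, []}, fn {type, data}, state ->
    {:ok, new_state} = DocxTextHandler.handle_event(type, data, state)
    new_state
  end)

text = acc |> Enum.reverse() |> IO.iodata_to_binary()
IO.inspect(text) # => "Hello"
```

With a real stream, the same handler would be passed as `Saxy.parse_stream(stream, DocxTextHandler, {false, []})`, and the collected text would be read from the returned state in the same way.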

Key Takeaway

When working with streams from sources like Unzip.file_stream!/2, always check the format of the data you’re receiving. In this case, converting iodata to binaries was crucial for successful parsing with Saxy.

I hope this helps anyone else who might run into similar issues when parsing DOCX files or working with streams in Elixir!


I did some further testing and saw apparent gains in speed and memory usage from regrouping the iodata into larger binaries instead of converting each chunk individually.

The benchmark shows a very large difference between the methods (roughly 500x), which I’m skeptical about; a gap that size can mean the variants are not doing the same amount of work. Still, it seems to point in a promising direction. I ran into issues when reading straight from the file during benchmarking, so I loaded the data into a variable before running the tests. The document I tested was a standard 5.9 MB MS Word file.

Here’s the function I created:

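defmodule StreamIodata do
  @doc """
  Regroups a stream of iodata chunks into binaries of at least `:chunk_size`
  bytes (default 65_536). The final binary may be smaller.
  """
  def flatten_iodata_stream(iodata_stream, opts \\ []) do
    chunk_size = Keyword.get(opts, :chunk_size, 65_536)

    # Stream.transform/5 (Elixir 1.14+) is needed here: the after function of
    # Stream.transform/4 is for cleanup only and its return value is discarded,
    # so it cannot emit the remaining data.
    Stream.transform(
      iodata_stream,
      fn -> [] end,
      fn iodata, acc ->
        # Nesting the lists keeps acc valid iodata without copying anything
        acc = [acc, iodata]

        if IO.iodata_length(acc) >= chunk_size do
          # Flatten the accumulated iodata into a binary and reset the accumulator
          {[IO.iodata_to_binary(acc)], []}
        else
          # Keep accumulating
          {[], acc}
        end
      end,
      # last_fun: flush any remaining data once the source stream is done
      fn acc ->
        case IO.iodata_to_binary(acc) do
          "" -> {[], []}
          rest -> {[rest], []}
        end
      end,
      # after_fun: nothing to clean up
      fn _acc -> :ok end
    )
  end
end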
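Because a regrouping step like this is easy to get subtly wrong (for example by never emitting the final partial chunk), it is worth confirming that the regrouped stream carries exactly the same bytes as the original before trusting any benchmark. The sketch below uses a hypothetical helper, `StreamCheck.same_bytes?/2`, and hand-written stand-in chunks rather than a real DOCX file.

```elixir
defmodule StreamCheck do
  # Hypothetical helper: true when both enumerables of iodata
  # concatenate to exactly the same bytes.
  def same_bytes?(expected_chunks, actual_chunks) do
    to_binary = fn chunks -> chunks |> Enum.to_list() |> IO.iodata_to_binary() end
    to_binary.(expected_chunks) == to_binary.(actual_chunks)
  end
end

# Stand-in chunks (assumed shape), regrouped here by a plain Stream.map/2.
chunks = [["<doc>", ["<p>"]], ["hi", "</p>"], ["</doc>"]]
regrouped = Stream.map(chunks, &IO.iodata_to_binary/1)

IO.inspect(StreamCheck.same_bytes?(chunks, regrouped)) # => true

# A transformation that loses the tail of the document is caught.
IO.inspect(StreamCheck.same_bytes?(chunks, Enum.take(chunks, 2))) # => false
```

With the names from this post, the check would be `StreamCheck.same_bytes?(data, StreamIodata.flatten_iodata_stream(data))`, run once for each chunk size being benchmarked.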

And here’s the benchmark:

defmodule Bench do
  def benchmark_docx_parsing(data) when is_list(data) do
    Benchee.run(
      %{
        "Stream with flatten_iodata_stream (8kb)" => fn ->
          data
          |> StreamIodata.flatten_iodata_stream(chunk_size: 8_192)
          |> Saxy.parse_stream(TestDocxHandler, [])
        end,
        "Stream with flatten_iodata_stream (64kb)" => fn ->
          data
          |> StreamIodata.flatten_iodata_stream(chunk_size: 65_536)
          |> Saxy.parse_stream(TestDocxHandler, [])
        end,
        "Stream with flatten_iodata_stream (524Kb)" => fn ->
          data
          |> StreamIodata.flatten_iodata_stream(chunk_size: 524_288)
          |> Saxy.parse_stream(TestDocxHandler, [])
        end,
        "Stream with iodata_to_binary map" => fn ->
          data
          |> Stream.map(&IO.iodata_to_binary/1)
          |> Saxy.parse_stream(TestDocxHandler, [])
        end
      },
      time: 10,
      memory_time: 2
    )
  end
end

# Load the chunks into memory first to avoid issues with reading from the file during benchmarking.
# `path` is the path to the .docx file.
zip_file = Unzip.LocalFile.open(path)

{:ok, unzip} = Unzip.new(zip_file)

data =
  unzip
  |> Unzip.file_stream!("word/document.xml")
  |> Enum.to_list()

Bench.benchmark_docx_parsing(data)

Here’s the output:

Error trying to determine erlang version enoent, falling back to overall OTP version
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 8 GB
Elixir 1.17.2
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s

Benchmarking Stream with flatten_iodata_stream (524Kb) ...
Benchmarking Stream with flatten_iodata_stream (64kb) ...
Benchmarking Stream with flatten_iodata_stream (8kb) ...
Benchmarking Stream with iodata_to_binary map ...
Calculating statistics...
Formatting results...

Name                                                ips        average  deviation         median         99th %
Stream with flatten_iodata_stream (524Kb)      872.32 K        1.15 μs  ±4444.97%        0.88 μs           4 μs
Stream with flatten_iodata_stream (64kb)       871.69 K        1.15 μs  ±4267.88%        0.88 μs        3.04 μs
Stream with flatten_iodata_stream (8kb)        753.12 K        1.33 μs  ±6291.89%        0.88 μs        4.33 μs
Stream with iodata_to_binary map                 1.75 K      570.29 μs    ±68.51%      507.23 μs     1230.03 μs

Comparison: 
Stream with flatten_iodata_stream (524Kb)      872.32 K
Stream with flatten_iodata_stream (64kb)       871.69 K - 1.00x slower +0.00083 μs
Stream with flatten_iodata_stream (8kb)        753.12 K - 1.16x slower +0.181 μs
Stream with iodata_to_binary map                 1.75 K - 497.47x slower +569.14 μs

Memory usage statistics:

Name                                         Memory usage
Stream with flatten_iodata_stream (524Kb)         1.77 KB
Stream with flatten_iodata_stream (64kb)          1.77 KB - 1.00x memory usage +0 KB
Stream with flatten_iodata_stream (8kb)           1.77 KB - 1.00x memory usage +0 KB
Stream with iodata_to_binary map                896.40 KB - 505.46x memory usage +894.63 KB

**All measurements for memory usage were the same**