Transforming Unzip.file_stream error

I’m trying to parse the contents of a file inside a zip file. My idea is to use Unzip to stream the contents (the unpacked file might be too large for my taste) and transform them on the fly.

However, something’s weird:

Unzip.file_stream(handle, entry_name) 
|> Stream.map(fn e -> IO.iodata_to_binary(e) end) 
|> Enum.to_list


[
  <<208, 161, 208, 190, 208, 186, 208, 190, 208, 187, 208, 190, 208, 178, 44,
    208, 161, 208, 181, 209, 128, 208, 179, 208, 181, 208, 185, 44, 208, 144,
    208, 189, 208, 176, 209, 130, 208, 190, 208, 187, 209

This works. Replace it with Stream.flat_map or Enum.take(2) or Stream.transform, and

Unzip.file_stream(handle, entry_name) 
|> Stream.transform([], fn e, acc -> {e |> IO.iodata_to_binary, acc} end ) 
|> Stream.run()


** (ErlangError) Erlang error: :data_error
    :zlib.inflateEnd_nif(#Reference<0.60592435.1843265537.68376>)
    (unzip 0.11.0) lib/unzip.ex:188: anonymous fn/1 in Unzip.decompress/2
    (elixir 1.17.0) lib/stream.ex:1039: Stream.do_transform_inner_list/7
    (elixir 1.17.0) lib/stream.ex:1038: Stream.do_transform_inner_list/7
    (elixir 1.17.0) lib/stream.ex:1055: Stream.do_transform_inner_enum/7
    (elixir 1.17.0) lib/stream.ex:690: Stream.run/1
    iex:58: (file)

Something fails somewhere, and I can’t for the life of me figure out where :slight_smile:

For Stream.flat_map and Stream.transform, the callback has to produce an Enumerable.t() (for Stream.transform, as the first element of the returned tuple). In your code snippet it returns a binary.
This should work:

Unzip.file_stream(handle, entry_name) 
- |> Stream.transform([], fn e, acc -> {e |> IO.iodata_to_binary, acc} end ) 
+ |> Stream.transform([], fn e, acc -> {[e |> IO.iodata_to_binary], acc} end ) 
|> Stream.run()

The reason for the Enum.take(2) failure is different; see the GitHub issue "Erlang error: :data_error" on an apparently valid zip file (Issue #27, akash-akya/unzip).
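
To illustrate the Enumerable.t() requirement without needing a zip file, here is a minimal sketch. The `chunks` stream is only a stand-in for Unzip.file_stream (an assumption for illustration: a stream whose elements are iodata), and the byte counter used as the accumulator is hypothetical too.

```elixir
# Stand-in for Unzip.file_stream/2 (assumption): a stream of iodata chunks.
chunks = Stream.map([["ab", "c"], ["d", ?e], "f"], & &1)

# Stream.map/2 accepts any return value per element.
mapped = chunks |> Stream.map(&IO.iodata_to_binary/1) |> Enum.to_list()

# Stream.flat_map/2 needs an enumerable per element, so the binary is
# wrapped in a one-element list (a bare binary is not enumerable).
flat_mapped =
  chunks
  |> Stream.flat_map(fn chunk -> [IO.iodata_to_binary(chunk)] end)
  |> Enum.to_list()

# Stream.transform/3 needs {enumerable, new_acc}; the accumulator here
# counts the bytes seen so far (hypothetical, just to show acc being used).
transformed =
  chunks
  |> Stream.transform(0, fn chunk, bytes_seen ->
    binary = IO.iodata_to_binary(chunk)
    {[binary], bytes_seen + byte_size(binary)}
  end)
  |> Enum.to_list()

IO.inspect({mapped, flat_mapped, transformed})
```

All three pipelines yield the same list of binaries; only the shape of the callback's return value differs.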

There’s still some weirdness in this

Unzip.file_stream(handle, entry_name) 
|> Stream.transform([], fn e, acc -> {[e |> IO.iodata_to_binary], acc} end )
|> Enum.to_list()

[
  <<208, 161, 208, 190, 208, 186, 208, 190, 208, 187, 208, 190, 208, 178, 44,
    208, 161, 208, 181, 209, 128, 208, 179, 208, 181, 208, 185, 44, 208, 144,
    208, 189, 208, 176, 209, 130, 208, 190, 208, 187, 209, 140, 208, 181, 208,
    178, 208, 184, 209, ...>>,
  <<209, 130, 208, 181, 208, 186, 209, 129, 209, 130, 209, 139, 44, 208, 176,
    208, 191, 208, 190, 208, 186, 209, 128, 208, 184, 209, 132, 209, 139, 44,
    208, 184, 209, 129, 209, 130, 208, 190, 209, 128, 208, 184, 209, 143, 32,
    209, 133, 209, ...>>,
  <<56, 54, 50, 4, 48, 4, 102, 98, 50, 4, 50, 48, 49, 55, 45, 48, 55, 45, 50,
    52, 4, 114, 117, 4, 50, 4, 208, 183, 208, 180, 208, 190, 209, 128, 208, 190,
    208, 178, 209, 139, 208, 185, 32, 208, 190, 208, 177, ...>>
]

So, there are three elements in the resulting stream.

However:

> stream = Unzip.file_stream(handle, entry_name) 
           |> Stream.transform([], fn e, acc -> {[e |> IO.iodata_to_binary], acc} end )
> stream |> Enum.take(2)
> # or stream |> Stream.take(2) |> Stream.run()

** (ErlangError) Erlang error: :data_error
    :zlib.inflateEnd_nif(#Reference<0.60592435.1843527681.127169>)
    (unzip 0.11.0) lib/unzip.ex:188: anonymous fn/1 in Unzip.decompress/2

> stream |> Enum.take(6)
> # or stream |> Stream.take(6) |> Stream.run()
[ ... no error, Enum.take(6) returns all three elements ]

"there are three elements in the resulting stream"

What exactly is the issue? I am not sure, but the number of elements may differ because of internal buffering in the streaming operations. Either way, it is still a stream of iodata. If you want the whole file as a binary, you can use Enum.into(<<>>) instead of Enum.to_list(), or better:

Unzip.file_stream(handle, entry_name)  
|> Enum.into(<<>>, &IO.iodata_to_binary/1)
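
As a rough, self-contained illustration of what that Enum.into call does, the sketch below runs it on a plain list of iodata chunks. The list is a made-up stand-in for the Unzip stream, not real output from the library.

```elixir
# Stand-in for the chunks produced by the stream (assumption): nested iodata.
chunks = [["He", "llo"], [", ", ["wor"]], "ld"]

# Enum.into/3 converts each chunk with the given function and collects
# the results into the binary passed as the collectable.
binary = Enum.into(chunks, <<>>, &IO.iodata_to_binary/1)

IO.inspect(binary)
```

Note that this builds the whole entry in memory, so it only fits cases where the unpacked file is small enough for that.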

The second issue has a different cause. Unzip checks the CRC at the end of file streaming to ensure data integrity, so when you terminate prematurely by calling Enum.take(2), without reading the entry completely, it fails because it cannot verify the CRC.
For the same reason, Enum.take(6) succeeds: there are only 3 chunks, so the stream is read to the end and the CRC checks out. I explained this in a bit more detail here.

I am thinking of adding a flag to skip the CRC check to support cases like this. I haven’t added it so far since no one has requested it; let me know if you need it.
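
Until such a flag exists, one way to peek at the first few chunks (for example in a test) is to keep only the first N while still reading the stream to the end. A minimal sketch, assuming the end-of-stream check only needs the stream to be fully consumed; `chunks` is a stand-in for the Unzip stream and `keep` is a hypothetical name:

```elixir
# Stand-in for the Unzip stream (assumption for illustration).
chunks = Stream.map(["a", "b", "c"], & &1)
keep = 2

# Enum.reduce/3 walks the entire stream, so nothing is cut short;
# only the first `keep` chunks are retained, the rest are discarded.
first_chunks =
  chunks
  |> Stream.with_index()
  |> Enum.reduce([], fn
    {chunk, index}, kept when index < keep -> [chunk | kept]
    {_chunk, _index}, kept -> kept
  end)
  |> Enum.reverse()

IO.inspect(first_chunks)
```

The trade-off is that the whole entry still gets decompressed, so this saves memory but not time.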


Ah, the CRC makes sense! Thankfully, in my case I need to consume the entire file, and the only reason I wanted an Enum.take(2) was for testing purposes.

I’ll see if I can properly do what I set out to do in the first place now :slight_smile:

Welp. It works like a charm :slight_smile: All I needed to do was process the entire stream :slight_smile:

Thank you for helping out and rubber-ducking!
