Reading directly from a zipped stream

Hey there,

A while ago I implemented an Excel parser library. So far it’s working, but it has the problem that it doesn’t scale with file size.

Background: The modern (.xlsx) Excel format is just a bunch of XML files compressed into a zip archive. In order to get access to the underlying XML, I extract the whole archive in memory:

    :zip.extract(spreadsheet_filename, [:memory])

This, of course, is very bad in terms of memory usage. In other languages I have found examples where a compressed archive is read directly as a stream, filtered by filename, and then processed.

Is there a way to do this in the Erlang/Elixir world?


Looking at the documentation for the :zip library, I think you can get what you want using the :zip.zip_open interface. Open the handle with the :memory option, list the entries with :zip.zip_list_dir, and then use :zip.zip_get to get each individual file.
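A minimal sketch of that flow, assuming a file called spreadsheet.xlsx (the filename and the entry path xl/worksheets/sheet1.xml are just examples; note that Erlang’s :zip wants charlists):

    {:ok, handle} = :zip.zip_open(~c"spreadsheet.xlsx", [:memory])

    # The head of the listing is the archive comment record.
    {:ok, [_comment | entries]} = :zip.zip_list_dir(handle)

    # Because the handle was opened with :memory, the file contents
    # come back as a binary instead of being written to disk.
    {:ok, {_name, xml}} = :zip.zip_get(~c"xl/worksheets/sheet1.xml", handle)

    :ok = :zip.zip_close(handle)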

You could wrap this with Stream.resource if you want to get all Elixir-y with it; something like the sketch below.
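Untested, but roughly like this (stream_entries is a made-up name, and note that each entry is still read into memory whole):

    def stream_entries(zip_path) do
      Stream.resource(
        # Open the archive once and list its entries.
        fn ->
          {:ok, handle} = :zip.zip_open(String.to_charlist(zip_path), [:memory])
          {:ok, [_comment | entries]} = :zip.zip_list_dir(handle)
          {handle, entries}
        end,
        # Emit one {name, binary} pair per entry until none are left.
        fn
          {handle, []} ->
            {:halt, handle}

          {handle, [{:zip_file, name, _info, _comment, _offset, _size} | rest]} ->
            {:ok, {_, data}} = :zip.zip_get(name, handle)
            {[{name, data}], {handle, rest}}
        end,
        # Close the handle when the stream terminates.
        fn handle -> :zip.zip_close(handle) end
      )
    end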


Thanks for the info. I played around with it, but I’m not sure this is the way to go.

The :zip.zip_get function always blocks and returns the whole content of the referenced file. What I actually need is a function that returns chunks of the content so that I can process them right away.

Side note: I peeked at the zip source code in OTP. Erlang does read the contents in chunks internally; it just doesn’t expose that chunked API.


Did you ever find a solution for this issue? I have a use case where I’d need to read from a very large compressed file (large for me, that is… 22G compressed / 65G+ uncompressed).


For this particular problem I had, I did not find a solution.

However, I recently started another project and found this library: https://github.com/ne-sachirou/stream_gzip

I’m not sure if it works with .zip files, though; gzip is a single compressed stream, while zip is an archive format, so I’d guess not.

If it doesn’t, and you have control over the compression used, you could stream-process with that:

"x.gz"
|> File.stream!
|> StreamGzip.gunzip
|> Enum.into("")

Excellent, thanks :smile:

A little follow-up many months later:

I finally had a little time to play with this and found that StreamGzip would die after a bit of processing. After casting about for other solutions, I found that the core File.stream! supports gzip-compressed files; you just need to feed it the right mode:

    File.stream!("./huge-file.gz", [:compressed])
    |> Stream.map(&IO.inspect(&1))
    |> Stream.run()

File.stream! also defaults to line-by-line output rather than byte-chunk output. With the StreamGzip library I ended up having to use chunk_while to reassemble byte chunks into line chunks, since I wanted to emit individual records into RabbitMQ.
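Roughly like this (a reconstruction rather than my exact code; the file name and chunk size are arbitrary):

    "huge-file.gz"
    |> File.stream!([], 64 * 1024)      # read raw 64 KiB byte chunks
    |> StreamGzip.gunzip()              # decompress each chunk on the fly
    |> Stream.chunk_while(
      "",
      fn chunk, partial ->
        # Prepend the leftover from the previous chunk, split on newlines,
        # and carry the trailing (possibly incomplete) line forward.
        [rest | lines] = (partial <> chunk) |> String.split("\n") |> Enum.reverse()
        {:cont, Enum.reverse(lines), rest}
      end,
      fn
        "" -> {:cont, ""}
        rest -> {:cont, [rest], ""}
      end
    )
    |> Stream.flat_map(& &1)            # lists of lines -> individual lines
    |> Stream.each(&IO.puts/1)          # emit each record (to RabbitMQ, etc.)
    |> Stream.run()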

HTH
