Binary pattern matching when input is a stream

I would like to do some binary pattern matching on very large files (1 to 20 GB). How do I do this using streams?
The spec of some of the files I’ll be parsing is dynamic so I can’t always count on something being the same size in the same place every time.

3 Likes

Might need more of an example of what you are wanting to do, like pseudo-code or something?

You should be able to just parse it out as normal though. And how large are they?

1 Like

They range from 1 to 20 GB.

Basically I need to pattern match on the binary to extract a massive amount of meta-data. I don’t want to use File.read! because that loads the whole file into memory and my VMs don’t have 20gb of ram.

1 Like

You are probably looking for File.stream!(filename, [], chunk_size), which creates a stream that you can consume (and pattern match on) in chunks of the given size.
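
For illustration, a minimal sketch of consuming a file in fixed-size binary chunks and pattern matching on each one (the file name and 4-byte chunk size are just for the example; you'd use something much larger in practice):

```elixir
# Write a small binary file, then stream it back in fixed-size chunks
# and pattern match on the leading byte of each chunk.
path = "example_chunks.bin"
File.write!(path, <<1, 2, 3, 4, 5, 6, 7, 8, 9, 10>>)

chunks =
  path
  |> File.stream!([], 4)
  |> Enum.map(fn
    # Each element of the stream is a binary of (at most) 4 bytes.
    <<first, _rest::binary>> -> first
  end)

IO.inspect(chunks)  # => [1, 5, 9]
File.rm!(path)
```

Note that the last chunk can be shorter than chunk_size, which is exactly why the "hold the leftover until more data arrives" strategy discussed below matters for real formats.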

1 Like

What format is this data? You said it is dynamic, but how? You'd need to know some bounds on how to split it. At the very least, if a 'section' of data cannot be more than, oh, 1 meg, then you can split at 1-meg boundaries, with each section parsing 2 splits at a time and overlap handling to the next. Or, if you do not want to parse a single file concurrently but rather serially, you can get data in chunks and parse on the fly, or…?

1 Like

Yep this if you know the chunk size, but how ‘dynamic’ is the parsing?

EDIT: Also, what are you doing with the data?

1 Like

@OvermindDL1, here is the format:

The first link under "The following documents define the standard:" will show you the format.

1 Like

Well that looks horrifying… Yeah, use a file stream and match as you get data; when you run out of data for a match, just hold it until you get the next set of data and then pick up where you left off. A GenServer listening to the stream would be perfect for that. It looks like a very serialized format, so there's no real boost from concurrent reading; this really might be the best way, but it is entirely doable.

1 Like

The problem is that I have no idea where to set the chunk size for File.stream!. As you can see, this spec is all over the place.

1 Like

Do whatever is efficient for the data size; 1 meg or 10 megs or whatever should all be fine. I'd just send the data over to another process (a GenServer) and, in that other process, hold it in state and parse over it until it is too short to match, at which point just wait for the next message of streamed data and repeat. :slight_smile:
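
A rough sketch of that GenServer idea, assuming a made-up toy record format (a length byte followed by a payload) — the real parse clauses would come from the actual file spec:

```elixir
defmodule ChunkParser do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, <<>>)

  # Synchronous call: the caller blocks until this chunk is parsed,
  # which gives a simple form of backpressure.
  def feed(pid, chunk), do: GenServer.call(pid, {:feed, chunk})

  @impl true
  def init(buffer), do: {:ok, buffer}

  @impl true
  def handle_call({:feed, chunk}, _from, buffer) do
    {records, rest} = parse(buffer <> chunk, [])
    # Reply with the complete records; keep the unmatched tail as state.
    {:reply, records, rest}
  end

  # Toy format: one length byte, then that many payload bytes.
  defp parse(<<len, payload::binary-size(len), rest::binary>>, acc),
    do: parse(rest, [payload | acc])

  # Too short to match a full record: hold it until more data arrives.
  defp parse(rest, acc), do: {Enum.reverse(acc), rest}
end

{:ok, pid} = ChunkParser.start_link([])
first = ChunkParser.feed(pid, <<3, "abc", 2, "d">>)  # => ["abc"]
second = ChunkParser.feed(pid, "e")                  # => ["de"]
```

The second call shows the key behavior: the record `<<2, "d">>` was incomplete in the first chunk, so it stayed in the server's state and was completed by the next chunk.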

1 Like

@OvermindDL1 If you follow this approach, you definitely need to use backpressure. However, I don’t think an extra process is needed here; we already have one (besides the current process), namely the process owning the file.

An alternative approach:

  • Pick a chunk size that is larger than the largest piece you need to match on.
  • Keep track of a ‘buffer’ bytestring.
  • You can pattern match on this buffer’s front, and chop off the parts that are consumed.
  • Whenever this buffer becomes shorter than the chunk size, you append the next file chunk to it, and in this way consume the file lazily.
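
Those steps could be sketched like this, again with a hypothetical length-prefixed record format standing in for the real spec:

```elixir
defmodule BufferedReader do
  # Chunk size should exceed the largest single piece you need to match on;
  # this value is just an illustrative assumption.
  @chunk_size 65_536

  def records(path) do
    {_leftover, acc} =
      path
      |> File.stream!([], @chunk_size)
      |> Enum.reduce({<<>>, []}, fn chunk, {buffer, acc} ->
        # Append the next file chunk to the buffer, then consume its front.
        consume(buffer <> chunk, acc)
      end)

    Enum.reverse(acc)
  end

  # Match a record off the buffer's front and chop off what was consumed.
  defp consume(<<len, payload::binary-size(len), rest::binary>>, acc),
    do: consume(rest, [payload | acc])

  # Buffer shorter than a full record: keep it for the next chunk.
  defp consume(buffer, acc), do: {buffer, acc}
end

path = "buffered_example.bin"
File.write!(path, <<3, "abc", 2, "de">>)
records = BufferedReader.records(path)
IO.inspect(records)  # => ["abc", "de"]
File.rm!(path)
```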
2 Likes

True true, I was thinking of in the style of {:active, :once} when I wrote that, like a TCP stream. ^.^

1 Like

@Qqwy, I like this approach. Once I’m smart enough to implement it, this is what I’m going to do. Right now I’m just using File.read! and I’m having to skip the files that are over 2 GB, as it won’t even attempt to open them.

2 Likes

I’d use Stream.transform/3 or Stream.transform/4 to consume the file, and implement the reducer as some kind of automaton which consumes the input and emits a stream of tokens, which is then transformed into its final form.
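
One way that wiring might look, using the buffer as the Stream.transform accumulator (the length-prefixed token format here is a stand-in for real tokenization rules):

```elixir
defmodule Tokens do
  # Emit every complete token from the buffer's front;
  # the unmatched tail becomes the new accumulator.
  def extract(<<len, payload::binary-size(len), rest::binary>>) do
    {tokens, tail} = extract(rest)
    {[payload | tokens], tail}
  end

  def extract(rest), do: {[], rest}
end

path = "transform_example.bin"
File.write!(path, <<3, "abc", 2, "de">>)

tokens =
  path
  |> File.stream!([], 4)
  |> Stream.transform(<<>>, fn chunk, buffer ->
    # The reducer returns {tokens_to_emit, new_buffer}.
    Tokens.extract(buffer <> chunk)
  end)
  |> Enum.to_list()

IO.inspect(tokens)  # => ["abc", "de"]
File.rm!(path)
```

Because Stream.transform is lazy, the file is read chunk by chunk as tokens are demanded, so memory use stays bounded regardless of file size.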

Constructing the automaton can get pretty tedious once it outgrows a certain size…

2 Likes

@NobbZ,
How would you do that with binary as the input?

1 Like

A stream is not a binary, so you can’t.

That automaton changes its state depending on the current state and the next character of input. Whenever a certain state is reached (an acceptance state) a token is emitted depending on the previously accumulated characters and the current state. This is the very basic idea.
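
To make the idea concrete, here is a deliberately tiny automaton with made-up rules: it scans one byte at a time, and a digit run followed by any non-digit (or end of input) is its acceptance state, at which point an integer token is emitted from the accumulated characters:

```elixir
defmodule NumberAutomaton do
  def tokens(binary), do: step(binary, [], [])

  # Digit byte: stay in the "inside a number" state, accumulate.
  defp step(<<c, rest::binary>>, num, acc) when c in ?0..?9,
    do: step(rest, [c | num], acc)

  # Non-digit while a number was accumulating: acceptance state, emit a token.
  defp step(<<_c, rest::binary>>, [_ | _] = num, acc),
    do: step(rest, [], [to_int(num) | acc])

  # Non-digit, nothing accumulated: keep scanning.
  defp step(<<_c, rest::binary>>, [], acc), do: step(rest, [], acc)

  # End of input: emit any pending token.
  defp step(<<>>, [], acc), do: Enum.reverse(acc)
  defp step(<<>>, num, acc), do: Enum.reverse([to_int(num) | acc])

  defp to_int(num), do: num |> Enum.reverse() |> List.to_string() |> String.to_integer()
end

IO.inspect(NumberAutomaton.tokens("ab12cd345"))  # => [12, 345]
```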

I’m too tired right now to construct even a minimal example, though. I do hope I will be able to do so during the weekend.

1 Like