Binary pattern matching when input is a stream

I would like to do some binary pattern matching on very large files (1 to 20 GB). How do I do this using streams?
The spec of some of the files I’ll be parsing is dynamic so I can’t always count on something being the same size in the same place every time.

3 Likes

Might need more of an example of what you are wanting to do, like pseudo-code or something?

You should be able to just parse it out as normal though. And how large are they?

1 Like

They range from 1 to 20 GB.

Basically I need to pattern match on the binary to extract a massive amount of meta-data. I don’t want to use File.read! because that loads the whole file into memory and my VMs don’t have 20gb of ram.

1 Like

You are probably looking for File.stream!(filename, [], chunk_size), which creates a stream that you can consume (and pattern match on) in chunks of the given size.
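
For illustration, a minimal sketch of consuming a file in fixed-size binary chunks and pattern matching on each one (the file name and 4-byte chunk size are just for the example; you'd use something much larger in practice):

```elixir
# Write a small binary file, then stream it back in fixed-size chunks
# and pattern match on the leading byte of each chunk.
path = "example_chunks.bin"
File.write!(path, <<1, 2, 3, 4, 5, 6, 7, 8, 9, 10>>)

chunks =
  path
  |> File.stream!([], 4)
  |> Enum.map(fn
    # Each element of the stream is a binary of (at most) 4 bytes.
    <<first, _rest::binary>> -> first
  end)

IO.inspect(chunks)  # => [1, 5, 9]
File.rm!(path)
```

Note that the last chunk can be shorter than chunk_size, which is exactly why the "hold the leftover until more data arrives" strategy discussed below matters for real formats.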

1 Like

What format is this data? You said it is dynamic, but how? You'd need to know some bounds on how to split it. At the very least, if a 'section' of data cannot be more than, oh, 1 meg, then you can split at 1-meg boundaries, with each section parsing 2 splits at a time and overlap handling to the next. Or, if you do not want to parse a single file concurrently but rather serially, you can get data in chunks and parse on the fly, or…?

1 Like

Yep this if you know the chunk size, but how ‘dynamic’ is the parsing?

EDIT: Also, what are you doing with the data?

1 Like

@OvermindDL1, here is the format:

The first link under "The following documents define the standard:" will show you the format.

1 Like

Well that looks horrifying… Yeah, use a file stream and match as you get data; when you run out of data for a match, just hold it until you get the next set of data and then pick up where you left off. A GenServer listening to the stream would be perfect for that. It looks like a very serialized format, so there's no real boost from concurrent reading; this really might be the best way, but it is entirely doable.

1 Like

The problem is that I have no idea where to set the chunk size for File.stream!. As you can see, this spec is all over the place.

1 Like

Do whatever is efficient for the data size; 1 meg or 10 megs or whatever should all be fine. I'd just send the data over to another process (a GenServer) and, in that other process, hold it in state and parse over it until it is too short to match, at which point just wait for the next message of streamed data and repeat. :slight_smile:
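
A rough sketch of that GenServer idea, assuming a made-up toy record format (a length byte followed by a payload) — the real parse clauses would come from the actual file spec:

```elixir
defmodule ChunkParser do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, <<>>)

  # Synchronous call: the caller blocks until this chunk is parsed,
  # which gives a simple form of backpressure.
  def feed(pid, chunk), do: GenServer.call(pid, {:feed, chunk})

  @impl true
  def init(buffer), do: {:ok, buffer}

  @impl true
  def handle_call({:feed, chunk}, _from, buffer) do
    {records, rest} = parse(buffer <> chunk, [])
    # Reply with the complete records; keep the unmatched tail as state.
    {:reply, records, rest}
  end

  # Toy format: one length byte, then that many payload bytes.
  defp parse(<<len, payload::binary-size(len), rest::binary>>, acc),
    do: parse(rest, [payload | acc])

  # Too short to match a full record: hold it until more data arrives.
  defp parse(rest, acc), do: {Enum.reverse(acc), rest}
end

{:ok, pid} = ChunkParser.start_link([])
first = ChunkParser.feed(pid, <<3, "abc", 2, "d">>)  # => ["abc"]
second = ChunkParser.feed(pid, "e")                  # => ["de"]
```

The second call shows the key behavior: the record `<<2, "d">>` was incomplete in the first chunk, so it stayed in the server's state and was completed by the next chunk.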

1 Like

@OvermindDL1 If you follow this approach, you definitely need to use backpressure. However, I don’t think an extra process is needed here; we already have one (besides the current process), namely the process owning the file.

An alternative approach:

  • Pick a chunk size that is larger than the largest piece you need to match on.
  • Keep track of a ‘buffer’ bytestring.
  • You can pattern match on this buffer’s front, and chop off the parts that are consumed.
  • Whenever this buffer becomes shorter than the chunk size, you append the next file chunk to it, and in this way consume the file lazily.
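
Those steps could be sketched like this, again with a hypothetical length-prefixed record format standing in for the real spec:

```elixir
defmodule BufferedReader do
  # Chunk size should exceed the largest single piece you need to match on;
  # this value is just an illustrative assumption.
  @chunk_size 65_536

  def records(path) do
    {_leftover, acc} =
      path
      |> File.stream!([], @chunk_size)
      |> Enum.reduce({<<>>, []}, fn chunk, {buffer, acc} ->
        # Append the next file chunk to the buffer, then consume its front.
        consume(buffer <> chunk, acc)
      end)

    Enum.reverse(acc)
  end

  # Match a record off the buffer's front and chop off what was consumed.
  defp consume(<<len, payload::binary-size(len), rest::binary>>, acc),
    do: consume(rest, [payload | acc])

  # Buffer shorter than a full record: keep it for the next chunk.
  defp consume(buffer, acc), do: {buffer, acc}
end

path = "buffered_example.bin"
File.write!(path, <<3, "abc", 2, "de">>)
records = BufferedReader.records(path)
IO.inspect(records)  # => ["abc", "de"]
File.rm!(path)
```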
2 Likes

True true, I was thinking of in the style of {:active, :once} when I wrote that, like a TCP stream. ^.^

1 Like

@Qqwy, I like this approach. Once I’m smart enough to implement it, this is what I’m going to do. Right now I’m just using File.read! and I’m having to skip the files that are over 2 GB, as it won’t even attempt to open them.

2 Likes

I’d use Stream.transform/3 or Stream.transform/4 to consume the file, and implement the reducer as some kind of automaton which consumes the input and emits a stream of tokens, which is then transformed into its final form.
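
One way that wiring might look, using the buffer as the Stream.transform accumulator (the length-prefixed token format here is a stand-in for real tokenization rules):

```elixir
defmodule Tokens do
  # Emit every complete token from the buffer's front;
  # the unmatched tail becomes the new accumulator.
  def extract(<<len, payload::binary-size(len), rest::binary>>) do
    {tokens, tail} = extract(rest)
    {[payload | tokens], tail}
  end

  def extract(rest), do: {[], rest}
end

path = "transform_example.bin"
File.write!(path, <<3, "abc", 2, "de">>)

tokens =
  path
  |> File.stream!([], 4)
  |> Stream.transform(<<>>, fn chunk, buffer ->
    # The reducer returns {tokens_to_emit, new_buffer}.
    Tokens.extract(buffer <> chunk)
  end)
  |> Enum.to_list()

IO.inspect(tokens)  # => ["abc", "de"]
File.rm!(path)
```

Because Stream.transform is lazy, the file is read chunk by chunk as tokens are demanded, so memory use stays bounded regardless of file size.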

Constructing the automaton can get pretty tedious once it outgrows a certain size…

2 Likes

@NobbZ,
How would you do that with binary as the input?

1 Like

A stream is not a binary, so you can’t.

That automaton changes its state depending on the current state and the next character of input. Whenever a certain state is reached (an acceptance state) a token is emitted depending on the previously accumulated characters and the current state. This is the very basic idea.
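
To make the idea concrete, here is a deliberately tiny automaton with made-up rules: it scans one byte at a time, and a digit run followed by any non-digit (or end of input) is its acceptance state, at which point an integer token is emitted from the accumulated characters:

```elixir
defmodule NumberAutomaton do
  def tokens(binary), do: step(binary, [], [])

  # Digit byte: stay in the "inside a number" state, accumulate.
  defp step(<<c, rest::binary>>, num, acc) when c in ?0..?9,
    do: step(rest, [c | num], acc)

  # Non-digit while a number was accumulating: acceptance state, emit a token.
  defp step(<<_c, rest::binary>>, [_ | _] = num, acc),
    do: step(rest, [], [to_int(num) | acc])

  # Non-digit, nothing accumulated: keep scanning.
  defp step(<<_c, rest::binary>>, [], acc), do: step(rest, [], acc)

  # End of input: emit any pending token.
  defp step(<<>>, [], acc), do: Enum.reverse(acc)
  defp step(<<>>, num, acc), do: Enum.reverse([to_int(num) | acc])

  defp to_int(num), do: num |> Enum.reverse() |> List.to_string() |> String.to_integer()
end

IO.inspect(NumberAutomaton.tokens("ab12cd345"))  # => [12, 345]
```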

I’m too tired right now to construct even a minimal example, though. I do hope I will be able to do so during the weekend.

1 Like