Matching unstructured strings at runtime

More of a thought experiment than anything else. Does anyone know of an efficient way to match against unstructured strings?

The scenario would be a high volume of incoming messages (with no control over the format, other than that they’re binary strings) that need to be routed to a specific pipeline based on user-defined rules.

I don’t actually get what you mean by “unstructured”, maybe give a more concrete example?

Anyway, you can match strings using bitstring pattern matching.

There are limitations, and it won’t work for every use case. But that’s the best thing you can do if it fits your need.
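To make that concrete, here is a minimal sketch. The module name and the message prefixes are made up for illustration. A binary pattern can fix a literal prefix and bind the remainder, but only the last segment may have an unknown length, so "anything followed by a known suffix" can't be expressed this way.

```elixir
defmodule PrefixMatch do
  # Hypothetical prefixes, chosen only for illustration.
  def route(<<"ERROR ", rest::binary>>), do: {:errors, rest}
  def route(<<"WARN ", rest::binary>>), do: {:warnings, rest}

  # Fallback for any binary that does not start with a known prefix.
  def route(other) when is_binary(other), do: {:default, other}
end
```

The patterns here are fixed at compile time, which is part of the limitation: rules that users define at runtime can't become function heads without generating and compiling code.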

Thanks vfsoraki. By unstructured I mean undefined patterns.

If I use logging as an example, there are a lot of different formats. You can pass them through a tokenizer to break them up into tuples, but without knowing the format ahead of time, it’s difficult to parse them into structured data like a map.

So my question is, what is the most efficient way of matching arbitrary patterns in strings?

Hopefully that makes more sense.

Without resorting to external parsers, as a general rule of thumb:

  • If your data can be parsed incrementally, reading bytes (or chunks of them) from the beginning of the string, you can use bitstring matching. For example, you read the first byte, and based on it you know the structure of the following bytes (the formats can differ, but each must be well defined). I think this is more performant than the next option.
  • If your data is solely text, you can also use a regex (or several of them) to match the whole string in one call (not necessarily in one pass, since a regex engine may scan the string more than once).

I don’t think regex will need a sample, but for the bitstring this could be a sample:

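# The first byte selects the protocol; the remainder is passed on.
def protocol_type(<<1::8, rest::binary>>), do: {:a, rest}
def protocol_type(<<2::8, rest::binary>>), do: {:b, rest}

# Protocol A: 16-bit id, 10-byte name, 1-byte compression flag, then the payload.
def parse_a_protocol(<<id::16, name::binary-size(10), compressed::8, data::binary>>) do
  compressed? = compressed == 1
  # somehow_decompress/1 is a placeholder for whatever decompression applies.
  data = if compressed?, do: somehow_decompress(data), else: data

  %{
    id: id,
    name: name,
    compressed?: compressed?,
    data: data
  }
end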

* and so on for another protocol *

data = *some binary data*
case protocol_type(data) do
  {:a, data} -> parse_a_protocol(data)
  {:b, data} -> parse_b_protocol(data)
end

That’s just an example, untested. I hope you get the point.
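For rules that users define at runtime, function heads can't be written ahead of time, so one option is to compile each rule once and reuse it for every message. Below is a rough sketch of that idea, not a benchmarked solution. The RuleRouter module, the {kind, pattern, pipeline} rule shape and the pipeline names are all assumptions for illustration; :binary.compile_pattern/1, :binary.match/2, Regex.compile!/1 and Regex.match?/2 are the standard library calls.

```elixir
defmodule RuleRouter do
  # Hypothetical rule format: {kind, pattern, pipeline}.
  # Compile each rule once, when it is defined, not once per message.
  def compile({:contains, substring, pipeline}),
    do: {:contains, :binary.compile_pattern(substring), pipeline}

  def compile({:regex, source, pipeline}),
    do: {:regex, Regex.compile!(source), pipeline}

  # Returns the pipeline of the first matching rule, or :default if none match.
  def route(message, compiled_rules) when is_binary(message) do
    Enum.find_value(compiled_rules, :default, fn
      {:contains, pattern, pipeline} ->
        if :binary.match(message, pattern) != :nomatch, do: pipeline

      {:regex, regex, pipeline} ->
        if Regex.match?(regex, message), do: pipeline
    end)
  end
end

rules =
  Enum.map(
    [
      {:contains, "timeout", :alerts},
      {:regex, "^\\d{4}-\\d{2}-\\d{2} ", :dated_logs}
    ],
    &RuleRouter.compile/1
  )

RuleRouter.route("upstream timeout after 30s", rules)
# => :alerts
RuleRouter.route("2024-01-01 service started", rules)
# => :dated_logs
RuleRouter.route("hello", rules)
# => :default
```

Rules are tried in order and the first match wins, so cheap substring rules can go before the more expensive regexes. Whether this is fast enough for a high message volume would need measuring against real traffic.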