Why can't I use a variable to the left of `<>` in a pattern match? Compiler or Theory?

spencer.christensen · February 8, 2024, 8:15pm

iex(26)> "hello " <> matched = "hello world"
"hello world"
iex(27)> matched
"world"
iex(28)> matched <> " world" = "hello world"
** (ArgumentError) the left argument of <> operator inside a match should always be a literal binary because its size can't be verified. Got: matched

I’m new to Elixir. I understand the basics of pattern matching. I am trying to understand conceptually why the above doesn’t work.

My understanding is that the evaluation of a pattern moves left → right. Therefore, when "hello " <> matched is evaluated the compiler is given a fixed starting place. Since it has a fixed location at the start, it can infer anything moving forward as it knows where it started.

The opposite, matched <> " world" = "hello world", fails. To my human brain this seems like a very simple match operation. However I understand that compilers and human brains have different skillsets and abilities.

I’m fine with this not working in Elixir. I don’t need it to. My main question is whether this a case where the compiler could theoretically be rewritten to have some sort of logic like “if pattern starts with variable, go right → left” or is there a deeper problem here where going “backwards” is a Hard Problem in computer science (CS)? I’m a self-taught dev so please forgive me is there is a rather obvious answer from a formal CS background.

Thanks!

ken-kost · February 8, 2024, 9:40pm

Your intuition is right, and the error is telling

left argument of <> operator inside a match should always be a literal binary

the left side must be a literal binary i.e. a string whose value is known at compile time because compiler needs to know its size to perform the match. I think it is related to going left from right. Not sure if this is the case but compiler could match h with h, e with e… and then bind what’s left to the variable. It could not do that other way around, it would have to infer which is not what compilers are meant to do.

caslu · February 8, 2024, 9:48pm

What is happening to your code is that when you type a:

"hello " <> matched = "hello world"

Your matched variable will be assign with “world” because of elixir = operator (which actually is a matching operator). So when you type matched <> " world" this would be evaluated as “world world” instead a “hello world”. That’s why not works.

matched <> " world" = "world world"

Sorry if i miss interpreted your question bro, my first contribution here as well

benwilson512 · February 8, 2024, 9:51pm

Hey @caslu I think @ken-kost’s answer is more accurate here. Matching isn’t a search of a binary rather it’s a “destructuring” of the binary where you’re pulling out a specific offset from the start into a value. This requires knowing what that offset is. Technically this does NOT need to be at compile time, you can do variable length offsets eg:

iex(2)> n=1; <<stuff::binary-size(n), rest :: binary>> = "hello world"; stuff
"h"
iex(3)> n=2; <<stuff::binary-size(n), rest :: binary>> = "hello world"; stuff
"he"
iex(4)> n=3; <<stuff::binary-size(n), rest :: binary>> = "hello world"; stuff
"hel"

However the point is that at the time the match executes the offset into the binary is known.

derek-zhou · February 8, 2024, 10:36pm

A smarter compiler might be able to do what you want; however, allowing variable at the beginning of a binary match will open a big can of worms. For example:

x <> "b" <> y = "abba"

Which one is correct?

x = "a", y = "ba"
x="ab", y="a"

caslu · February 8, 2024, 10:51pm

that’s fine, i thought i was a syntax question

spencer.christensen · February 8, 2024, 10:56pm

@benwilson512

That use of dynamically sized binaries is super interesting! Your observation that matching isn’t searching but is destructuring is a very helpful one.

I did the next logical thing and broke it on purpose to see the error message with your use of binary-size:

iex(2)> n=3; <<stuff::binary-size(n), rest :: binary>> = "hello world"; stuff
"hel"
iex(3)> n=3; <<rest :: binary, stuff::binary-size(n)>> = "hello world"; stuff
error: a binary field without size is only allowed at the end of a binary pattern, at the right side of binary concatenation and never allowed in binary generators. The following examples are invalid:

    rest <> "foo"
    <<rest::binary, "foo">>

They are invalid because there is a bits/bitstring component not at the end. However, the "reverse" would work:

    "foo" <> rest
    <<"foo", rest::binary>>


└─ iex:3

** (CompileError) cannot compile code (errors have been logged)

What a lovingly crafted error message.

@derek-zhou

Ok, this is what I was getting at. I sensed there was some deeper “this is a Hard Problem ” type of thing going on and you certainly highlighted an example.

x <> "b" <> y = "abba" is obviously impossible to destructure “correctly”.

But, I wonder, is x <> "ba" = "abba" impossible to destructure? Could something be programmed in the language itself to use negative offsets or is there some CS theory going on where backtracking is too potentially resource intensive etc.

derek-zhou · February 8, 2024, 11:01pm

No, but there is another can of worms that you may encounter. What if there is a multi-byte UTF-8 sequence end with the byte “b”? After matching you could have a x that is not a valid UTF-8 string.

For strings, you always want to match from left. You could do right side match with regular expression but that should be frown upon.

sabiwara · February 8, 2024, 11:18pm

I don’t think it would be impossible per se, here is a proof of concept where you can achieve something similar:

message = "hello world"
suffix = "world"

case message do
  <<prefix::binary-size(byte_size(message) - byte_size(suffix)), ^suffix::binary>> -> prefix
end
# "hello "

(I’m cheating because I’m using the message variable from the parent scope, so this wouldn’t be viable as an implementation strategy).

The real problem I assume would be to be able to do it efficiently especially when dispatching over multiple clauses. Binary matching is very optimized in its current implementation, nimble_parsec is a great example leveraging it.
I found this paper when looking for resources explaining how it is being optimized, quite interesting.