Pattern matching against a string

sergio_101 · March 20, 2019, 3:06pm

I am VERY much an Elixir newbie. I have taken one Elixir course and one Phoenix course on Udemy. During that course, I saw the instructor do a pattern match against a string. This is exactly the first step I need to take for my first Phoenix project.

in my project, I am receiving a ton of strings that look like:

<li>Kiwi Jr. (33)</li>
<li>Deep State (29)</li>
<li>Piroshka (29)</li>

where he did something like:

iex(1)> string = "<li>Kiwi Jr. (33)</li>"
"<li>Kiwi Jr. (33)</li>"
iex(2)> "<li>"<>artist<>"("<>playcount<>"</li>" = string
** (ArgumentError) the left argument of <> operator inside a match should be always a literal binary as its size can't be verified, got: artist

obviously, i am missing something … I would like to end up with:

artist = Kiwi Jr.
play count = 33

anyone see where i am putting it in the ditch?
thanks!

LostKobrakai · March 20, 2019, 3:09pm

Unless you know exactly how many characters your artist has you’ll need to resort to using regex or some other tool for parsing the string. Pattern matches cannot do what you seem to be looking for.
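
For reference, a minimal sketch of the regex route with named captures. The pattern and the variable names here are assumptions for illustration, not something from the course:

```elixir
# Sketch only: the regex and names below are illustrative assumptions.
string = "<li>Kiwi Jr. (33)</li>"

# Lazy artist name, optional whitespace, then digits in parentheses.
regex = ~r{^<li>(?<artist>.+?)\s*\((?<playcount>\d+)\)</li>$}

# Regex.named_captures/2 returns a map with string keys (or nil on no match).
%{"artist" => artist, "playcount" => playcount} = Regex.named_captures(regex, string)

IO.inspect({artist, String.to_integer(playcount)})
```

If a line does not fit the pattern, Regex.named_captures/2 returns nil and the match on the map raises, so real input would probably want a case around it.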

blatyo · March 20, 2019, 3:21pm

That error is complaining about this part of your code specifically:

artist <> "("

The way matching on binaries works is that:

1. it can match a binary literal like "(" because it knows its size
2. it can bind to a variable, where the length is specified
3. it can bind the rest of the binary to a variable, when it appears at the end of the match

Because artist and playcount are not at the end of the match and have no size specified, the match fails: they don't satisfy the second or third rule.
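
A small sketch of the three cases above, using the string from the question. The hard-coded sizes (8 and 2) are assumptions that only happen to fit this one input:

```elixir
string = "<li>Kiwi Jr. (33)</li>"

# Literal prefix: its size is known, so this matches.
"<li>" <> rest = string

# Variables in the middle are fine when their byte size is given.
# The sizes 8 and 2 are hard-coded for this particular input.
<<"<li>", artist::binary-size(8), " (", count::binary-size(2), ")</li>">> = string

# An unsized variable is allowed only as the last segment.
<<tag::binary-size(4), tail::binary>> = string

IO.inspect({rest, artist, count, tag, tail})
```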

Hence @LostKobrakai’s suggestion

mudasobwa · March 20, 2019, 3:22pm

Almost true

defmodule Matcher do
  for artist_len <- 1..100, num_len <- 1..10 do
    def li(<<
          "<li>",
          artist::binary-size(unquote(artist_len)),
          " (",
          num::binary-size(unquote(num_len)),
          ")</li>"
        >>),
        do: {artist, num}
  end

  # Last-resort clause. Note that Regex.scan/3 returns a list of lists,
  # e.g. [["Kiwi Jr.", "33"]], not an {artist, num} tuple.
  def li(input),
    do:
      Regex.scan(
        ~r"<li>(.*?)\s*\((.*?)\)</li>",
        input,
        capture: :all_but_first
      )
end

Matcher.li("<li>Kiwi Jr. (33)</li>")
#⇒ {"Kiwi Jr.", "33"}

In 99% of cases it would go through pattern match, making the code faster than Regex.

sergio_101 · March 20, 2019, 3:41pm

okay, this is assuming that the data matches the length requirement in line 2?

in the example i saw, it was something simple like:

iex(6)> s = "categories:1"
"categories:1"
iex(7)> "categories:"<>index=s
"categories:1"
iex(8)> index
"1"

I’ll give this a shot.

Thanks!

mudasobwa · March 20, 2019, 3:45pm

Nope. The code above generates 1001 functions: 1000 handling all possible combinations of lengths for artist and num in the intervals 1..100 and 1..10 respectively, plus one sink-all clause in case the lengths are not in these intervals.

One cannot pattern-match the binary of arbitrary length in the middle, but a match to the binary of the explicit length is allowed.
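
One way to get that explicit length at runtime, as a rough sketch: look up the position of the separator first and then use it as the size. This assumes " (" first appears right after the artist name, which is not guaranteed for every artist:

```elixir
string = "<li>Kiwi Jr. (33)</li>"

# :binary.match/2 returns {start, length} of the first occurrence.
{pos, _len} = :binary.match(string, " (")
artist_len = pos - byte_size("<li>")

# A variable bound beforehand can be used as the size of a middle segment.
<<"<li>", artist::binary-size(artist_len), " (", rest::binary>> = string

# What remains is "33)</li>"; strip the closing part.
[count, ""] = String.split(rest, ")</li>")

IO.inspect({artist, count})
```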

NobbZ · March 20, 2019, 3:47pm

Close, but wrong: it generates a single function with 1001 heads.
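
This can be checked on a scaled-down version. HeadsDemo and bytes/1 are made-up names for this sketch, with 3 generated clauses instead of 1000:

```elixir
defmodule HeadsDemo do
  # Generates three clauses of the same function bytes/1, one per length.
  for len <- 1..3 do
    def bytes(<<_::binary-size(unquote(len))>>), do: unquote(len)
  end

  # Catch-all clause, still part of bytes/1.
  def bytes(_other), do: :too_long
end

# Lists each exported function once, regardless of how many clauses it has.
IO.inspect(HeadsDemo.__info__(:functions))
IO.inspect({HeadsDemo.bytes("ab"), HeadsDemo.bytes("abcd")})
```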

mudasobwa · March 20, 2019, 3:54pm

Indeed. We call it Matching Dragon.

sergio_101 · March 20, 2019, 3:58pm

okay… so, my next question. is there something more “functional” about doing it this way, rather than a straight pattern match? i have spent the past 30 years in the OOP world. I used Lisp maybe 30 years ago, but didn’t know enough to really make the distinction back then.

I know I initially asked about doing it with a pattern match, just because i saw that in a course, and EVERY TIME i need to do regex, i need to look at the docs…

Thanks!

mudasobwa · March 20, 2019, 4:14pm

You probably meant “rather than a straight regex.” Well, it depends. In most cases Regex is just fine. Also, if you are after parsing the (contrived example) ISO8601 representation of a date, you might extract year, month and day straight away:

<<
    year :: binary-size(4), "-",
    month :: binary-size(2), "-",
    day :: binary-size(2)>> = "2019-03-20"
year
#⇒ "2019"
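
Note that the match above yields strings and does no validation. A short sketch of two possible follow-ups, converting the pieces by hand or handing the whole string to the standard library:

```elixir
<<year::binary-size(4), "-", month::binary-size(2), "-", day::binary-size(2)>> = "2019-03-20"

# The match yields strings; convert them if numbers are needed.
IO.inspect({String.to_integer(year), String.to_integer(month), String.to_integer(day)})

# Date.from_iso8601/1 parses and validates in one step,
# returning an {:ok, date} or {:error, reason} tuple.
IO.inspect(Date.from_iso8601("2019-03-20"))
IO.inspect(Date.from_iso8601("2019-13-40"))
```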

There is no silver bullet.

dimitarvp · March 20, 2019, 5:06pm

If these are guaranteed to be small HTML pieces I’d parse them with Floki or Meeseeks and then apply a simpler regex on the text to get the two pieces of data you require.

Regex for HTML or XML is a hard “NO!” even if you do a two-days educational throwaway project.
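
Roughly what that could look like with Floki, as a sketch only. It assumes a standalone script with network access for Mix.install and the floki package from Hex; the regex for the text part is a made-up illustration:

```elixir
# Assumes network access and the floki package from Hex.
Mix.install([{:floki, "~> 0.30"}])

html = """
<li>Kiwi Jr. (33)</li>
<li>Deep State (29)</li>
<li>Piroshka (29)</li>
"""

{:ok, fragment} = Floki.parse_fragment(html)

plays =
  for li <- Floki.find(fragment, "li"),
      text = Floki.text(li),
      # Hypothetical pattern: artist, then the play count in parentheses.
      # Items whose text does not match are skipped.
      [artist, count] <- [Regex.run(~r/^(.+?)\s*\((\d+)\)$/, text, capture: :all_but_first)] do
    {artist, String.to_integer(count)}
  end

IO.inspect(plays)
```

The HTML parser deals with the tags, so the regex only ever sees plain text like "Kiwi Jr. (33)".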

sergio_101 · March 20, 2019, 5:13pm

SNAP! okay… Floki looks like the jam! this can be parsed much cleaner, as there are a bazillion lines in this file… all LIs…

Thanks!

1player · March 20, 2019, 11:20pm

Here’s some details on how to easily parse HTML with regexes:

OvermindDL1 · March 21, 2019, 10:49pm

That is horrifyingly beautiful, lol! ^.^

This This This!

Using Meeseeks if you want to read it or Floki if you want to write it or something instead.

Ah the classic. ^.^

sergio_101 · March 22, 2019, 12:43pm

I ended up using Floki… took 2… maybe 3 seconds…