Best way to convert this Ruby code to Elixir?

I have a custom DSL in Ruby that I’ve started to convert to Elixir. The tokenizer and parser are similar to what’s in Marc-André Cournoyer’s book Create Your Own Programming Language; they rely on regex pattern matching. In my tokenizer, I have a scan function that uses a case statement with assignments in the when clauses. Normally I wouldn’t do an assignment inside a conditional, but in this case it minimized the Ruby code. What’s the best way to implement or refactor this in Elixir?

Here’s the Ruby code:

    def scan(line)
      case
        when text = scan_new_line then
          tokenize_new_line(text, line)
        when text = scan_whitespace then
          tokenize_whitespace(text, line)
        when text = scan_comment then
          tokenize_comment(text, line)
        when text = scan_version then
          tokenize_version(text, line)
        when text = scan_string then
          tokenize_string(text, line)
        when text = scan_constant then
          tokenize_constant(text, line)
        when text = scan_variable then
          tokenize_variable(text, line)
        when text = scan_type then
          tokenize_type(text, line)
        when text = scan_where then
          tokenize_where(text, line)
        when text = scan_calc then
          tokenize_calc(text, line)
        when text = scan_cell then
          tokenize_cell(text, line)
        when text = scan_identifier then
          tokenize_identifier(text, line)
        when text = scan_ruby_date_time then
          tokenize_ruby_date_time(text, line)
        when text = scan_other_date_time then
          tokenize_other_date_time(text, line)
        when text = scan_ruby_date then
          tokenize_ruby_date(text, line)
        when text = scan_other_date then
          tokenize_other_date(text, line)
        when text = scan_number then
          tokenize_number(text, line)
        else
          options = {token: :unknown, value: @scanner.getch, line: line}
          create_token options
      end
    end

Basically I want to try each scan function until one returns a non-nil value, and then use that value in the corresponding tokenize function.

I wonder if there’s a way to pattern match on a regex in a function head, or to somehow use cond and capture the result of each condition. I know I could do it with a bunch of assignments and if/else statements, but if there’s a better way, I’d like to use it.

I thought about using Erlang’s leex/yecc support, but I’m not really keen on converting the substantial parser logic into Erlang rules. It may be less painful (for me) to convert the parser from Ruby to Elixir.

Any thoughts?


I’m not sure exactly what those scan_* functions do, since they take no parameters. In Elixir they’ll need explicit parameters rather than relying on contextual state. I’m going to write my example as if they take line; adjust as needed.

    processors = [
      {&scan_new_line/1,   &tokenize_new_line/2},
      {&scan_whitespace/1, &tokenize_whitespace/2}
      # etc.
    ]

    result = Enum.find_value(processors, fn {scanner, tokenizer} ->
      text = scanner.(line)
      if text, do: tokenizer.(text, line), else: nil
    end)

It’s a HOF (higher-order function) approach. Look at the docs for Enum.find_value for more details (including how to supply a default value), but the gist is that it keeps trying your scanner/tokenizer pairs until one of them returns a truthy value.
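
For example, the default value can play the role of your Ruby else branch. Something like this (the {:unknown, line} tuple is just a placeholder I made up, since I don’t know what your create_token produces):

    # Sketch only: {:unknown, line} is a made-up placeholder for whatever
    # your real "unknown token" value should be.
    result =
      Enum.find_value(processors, {:unknown, line}, fn {scanner, tokenizer} ->
        case scanner.(line) do
          nil -> nil
          text -> tokenizer.(text, line)
        end
      end)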


I came here to write a very similar reply to the one provided by @gregvaughn. Enum.find and Enum.find_value are very helpful in situations like this.

I would also suggest trying out leex and yecc, the lexer and parser generators that ship with Erlang. It’s quite easy to work with them with just basic Erlang knowledge. Here’s a great article that can get you started: http://andrealeopardi.com/posts/tokenizing-and-parsing-in-elixir-using-leex-and-yecc/


Double ++ on the use of leex. There was a recent post by @rvirding that showed how you can use leex to build a custom parser for tasks you’d normally handle with regex parsing.


Note also that a very similar HOF approach can be done in Ruby with procs/lambdas/method objects. However, since Ruby lacks a find_value equivalent, it’s a bit more wordy.


Thanks! That looks like something I can use. I’ll give it a try…

Yeah, I saw that article too. As I mentioned at the bottom of my post, I’m not too keen on representing the parser via Erlang rules. The tokenizer is simple enough that I could probably use leex, though my lexer carries forward the line number and position so I can display them later in the parser if there’s an error. Also, my parser is a little complex; it does look-ahead scanning. Maybe yecc supports look-ahead, since it supports custom Erlang functions? I don’t know.

In any case, I’ll spend some time diving more deeply into it. Thanks.

Thanks for pointing that out. I found the post you’re referring to. One aspect I wasn’t aware of: @rvirding mentions that leex/yecc is faster since it doesn’t need to evaluate all of my regex patterns one at a time. This is desirable, since the Ruby version could slow down when lexing/parsing large files. I think I’ll spend some time diving into this more deeply. Thanks!

Yep. I’ve done something similar in the past with Ruby, using an array of hashes with lambdas assigned to a key. Good to see something similar can be done in Elixir too. Thanks.

I wonder how many of your regular expressions could actually be represented as simple recursive functions. Some possible candidates might be the ones that parse newlines, whitespace, comments, constants and variable names.

Do I understand correctly that each scan_* function returns the next n characters that match it (or nil if it doesn’t match), and that tokenize_* strips that text from the beginning of the line and tokenizes it as the proper value before calling scan recursively?

If this is the case, it might also be optimized by combining these steps.
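
If so, something along these lines might work for the simpler cases: a single recursive pass over the input using binary pattern matching. This is only a rough sketch; the token shapes and function names are placeholders, not your actual DSL:

    # Minimal sketch: scan and tokenize in one recursive pass using binary
    # pattern matching. Token shapes here are placeholders.
    defmodule LexerSketch do
      def tokenize(input, line \\ 1, acc \\ [])

      def tokenize("", _line, acc), do: Enum.reverse(acc)

      # A newline produces a token and bumps the line counter.
      def tokenize("\n" <> rest, line, acc),
        do: tokenize(rest, line + 1, [{:newline, line} | acc])

      # Spaces and tabs are skipped.
      def tokenize(<<c, rest::binary>>, line, acc) when c in [?\s, ?\t],
        do: tokenize(rest, line, acc)

      # Anything else falls through as an "unknown" single character.
      def tokenize(<<c, rest::binary>>, line, acc),
        do: tokenize(rest, line, [{:unknown, line, <<c>>} | acc])
    end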

The Higher Order Function approach is a great idea. Under the hood, what you end up with is a monadic parser combinator. (You can abstract the HOF approach a little further and it will become something similar to e.g. Haskell’s Parsec library.) If that sounds scary: it isn’t. :grin:

It basically means that you have a way to combine multiple functions: if the first one matches, its output is used; otherwise, if the second one matches, that one’s output is used (and if that also fails, the third is tried, and so on). These can be nested, and the nesting follows the associativity law, so the result of nesting them is the same as if you flattened the whole thing (but nesting makes it all the more composable and readable).
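
To make that concrete, here is a tiny sketch of the "choice" idea (just an illustration of the shape, not a full library): each parser is a function that takes input and returns either {:ok, value, rest} or :error, and choice tries them in order:

    # Sketch of a "choice" combinator: try each parser in order and return
    # the first success. The {:ok, value, rest} / :error shapes are just one
    # possible convention.
    choice = fn parsers ->
      fn input ->
        Enum.find_value(parsers, :error, fn parser ->
          case parser.(input) do
            {:ok, _value, _rest} = ok -> ok
            :error -> nil
          end
        end)
      end
    end

Since a nested choice itself returns {:ok, value, rest} or :error, choice.([choice.([p1, p2]), p3]) behaves the same as choice.([p1, p2, p3]), which is the associativity mentioned above.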

Here is a very good introduction that assumes very little prior knowledge.

In the end, this makes the tokenizer boil down to one big recursive pattern match. I don’t have the time to write a fuller example right now (hopefully tomorrow), but it is supposed to be quite readable as well (it reads a lot like a definition in Backus-Naur Form (BNF)).


Could you please link that article? I seem to have missed it and I’d like to see what he’s got to say.

Simulate Regex match guards in functions - Elixir Questions - Elixir Forum

At my company we’ve been doing some rich lexing and parsing using @bitwalker’s Combine. It works nicely for us because we can maintain the column number while lexing (which is AFAIK not possible with leex). Another nice benefit is that all the code is in Elixir and doesn’t require a custom DSL.

We implemented the thing in two passes. First we run a Combine-based tokenizer, which produces a list of terms. Then we pass this to a Combine-based parser, which produces the AST. The second part required some improvisations, since Combine can currently only work with strings; we wrote a custom parser to handle that. A colleague made an initial PR to Combine, but later we found we had to change it a bit. We didn’t push our final changes upstream, partly because of lack of time, but also because they are a bit hacky, so we still want to think about a better solution.
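
To give a rough idea of the first pass, it looks something like the sketch below. This is simplified and written from memory rather than taken from our code, so treat the exact parser names as approximations:

    # Simplified sketch of a Combine-based tokenizing pass (parser names
    # recalled from memory; not production code).
    import Combine.Parsers.Base
    import Combine.Parsers.Text

    token =
      choice([
        map(integer(), &{:number, &1}),
        map(word(), &{:identifier, &1}),
        map(space(), fn _ -> :whitespace end)
      ])

    Combine.parse("calc 42", many(token))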

We also wrote a couple of other custom parsers that give us look-ahead, support for recursive grammars, and better error reporting. I expect we’ll push them upstream once we clean up the code.

Overall, I’m quite happy with Combine. We had to dive a bit into the code and create some custom parsers, but once we got the hang of it, it was pretty simple to implement whatever we wanted, including recursive grammars and informative error reporting.


Thanks @sasajuric! That sounds very interesting; I’ll definitely look into Combine. I spent yesterday investigating porting my tokenizer to leex and found that some of my regex patterns aren’t supported there, so Combine is a welcome alternative I’ll look into now. Thanks!

Thanks for that link.