Why is column information stripped from elixir's AST?

Why is column information stripped from elixir’s AST? I’ve been working with the debug tools and it’s a little odd that the column information is stripped after parsing.

This is in no way a criticism of elixir, I’m just curious

Because the parsing library that Elixir uses is leex and so forth, which does not have column information, so it is not that it is stripped but that it never existed to begin with.

Remember I have my SpiritEx library if anyone wants to do parsing while having line, column, and byte information, UTF8 aware for line and column. ^.^

1 Like

Actually, ExSpirit, according to the hexdocs :smile:

But that’s a tool to build parser, right? Not an elixir parser (one can write an elixir parser with it, of course)

Er, yes that. ^.^;

Correct, it’s the same level as leex.

Hm… offtopic but regarding my pygmentd question: is it good to lex programming languages? :smile:

For syntax highlighting

Almost no language can be accurately colorized with regex, you almost always need a parser for it to be accurate, that is why so many colorizers, like the one on this very forum, are absolutely horrible. ^.^;

It gets much worse when you have code-changing code too, like lisp or Elixir! ^.^;

yecc the tool that Elixir uses for its parser does not support column information. yecc is used because it’s part of OTP and does not require any additional tools. The easiest way to change it is to add support for column information to yecc itself (and that’s probably quite hard :disappointed:) .

1 Like

yecc! That’s it, I keep getting the name wrong! >.>

It is sadly, it’s been looked at before and the work was, painful.

@OvermindDL1 Leex is the lexer generator that accompanies the Yecc parser generator. These often exist in duos so that a ‘dumb parser’ can quickly perform rudimental work(chunking text into lexemes), allowing the smart(er) parser to remain simpler (and as a result faster) as well.

Leex and Yecc are pretty much two sides of the same coin, so you’re not that far off. :slight_smile:

Yep yep, need both if using those.

Some time ago I started working on a “parsing” toolset for Elixir. So far I managed to implement a lexer generator based very much on leex with two major differences - uses binaries instead of lists and tracks column information. Additionally, it has support for what flex calls “entry states” and in practice, this means switching the lexing function in the middle of the stream - this can simplify many lexers, though steps a bit into the territory of the parser.

The lexer syntax is based on macros and you can have multiple lexers in a module.

The performance is comparable to leex (with about 10% improvement), but there are couple more possible optimisations. Where there’s a huge difference is the memory. On some example lexers, it used even 10x less memory than leex and usually it was about 5x less. This primarily stems from the fact that we never copy the input string only creating sub-binaries into it. It’s also possible to compile the lexer with HiPE (compiling the leex generated one took ages, I waited as long as 2 hours before resigning) - this boosts the performance significantly (about twice as fast) as the lexer does not call to any external module (so there’s no issue with switching between HiPE and non-HiPE code).

The name is a bit pretentions :laughing:, but that’s just work-in-progress. https://github.com/michalmuskala/panacea

Implementing the parser is somewhere in the future.

2 Likes

If you can make your lexer compatible with the elixir parser it could be swapped with the official one. Of course, this assumes that column information is worth the inevitable headaches. It probably isn’t.

On the other hand, if the lexer is compatible with the official one (produces the same output) one can just test for equality or output against the huge corpus of already written elixir code.

The elixir lexer already tracks the column information. But it’s being thrown away by the parser, since there’s no support for it there.

Everyone is confusing leex and yecc. :slight_smile:

Leex is a tokenizer and it doesn’t produce column information but Elixir doesn’t use leex. We have a hand written tokenizer.

Elixir does use Yecc which is a parser. Yecc doesn’t care about column information and we could add column information to the AST.

The reason we don’t pass the column information forward is because when we started, Erlang did not use the column information. However Erlang does use it today, so I don’t see any reason for not doing that now.

We just need to be careful with constructs such as quote, as the column information will be out of place, but for everything else it should be fine.

5 Likes

Making the line information into a keyword list was great. That way you can just add a new key value pair and mantain backward compatibility in most cases!

Heh, awesome. My SpiritEx library I wrote for simplicity (to the way I think anyway) and having column information and better errors and so forth. I’ve not really benched it to much of anything as of yet, but I’ve never had speed issues yet. ^.^

Hmm, I wonder how mine compares to leex… I use a map to hold the context information (ease of use) so that would incur a hit…

@josevalim, I seem to have read/heard somewhere that Elixir 1.6 would have column information. If that true, or am I misremembering?

Misremembering afaik. :slight_smile:

1 Like

That’s a shame. I can hack my way around that limitation pretty easily, though. Thanks :slight_smile:

Actually I can’t hack my way around it in my literate programming tool with hyperlinks (Guaxinim).

The problem is obvious here: https://tmbb.github.io/guaxinim/guaxinim.ex.html#L112 (This is Guaxinim’s output when run on its source). From the compiler tools I can find the line number of all function calls. I use that information to add hyperlinks to the tokens from the syntax highlighter. The problem is that due to the nature of the highlighter (which is dumb and operates on a lexical level only), I can’t distinguish between A.func, B.func and func. If all three function occur in the same line, I won’t know where to link them… This is what happens here. The token dirname appears twice: first as a variable and second as a function call, and I don’t have a good way of distinguish those cases.

This could be fixed if there was column information in Elixir’s AST, because it would be possible to pinpoint the exact location of each token, and not only the line. Maybe I can write a smarter highlighter that takes it into account, but I doubt I’ll be able to do much better without having a real Elixir parser.

That said, would you be open to having column information in Elixir’s AST? I really can’t give you a compelling reason beyond the one I bring here. Maybe for IDEs that might wish to implement Guaxinim-like hyperlinks in the source? I thought I’d better ask here before opening a feature request on Github.

1 Like