Erlang/Elixir and leex

221V · January 9, 2018, 1:16am

I want to use leex to create an AST for transforming BBCode into HTML.

Here is my code.
Can you tell me why it does not work correctly?
How can I solve that?
Thanks)

rvirding · January 9, 2018, 2:04am

When multiple rules match leex is defined to return the longest matching pattern. If there is more than one pattern with the same length it will return the first one. This is the same as in lex, on which leex is based.
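The selection rule described above can be modelled in a few lines of Elixir. This is only a toy sketch: the module name `MaximalMunch` and its three rules are made up for illustration, and it tries each `Regex` in turn, whereas leex compiles all rules into a single automaton. It also only behaves like leex for simple patterns such as these, since `Regex` alternations are not longest-match.

```elixir
defmodule MaximalMunch do
  @moduledoc "Toy model of leex rule selection; not how leex is implemented."

  # Hypothetical rules, in priority order. \A anchors each pattern to the
  # current position in the input.
  defp rules do
    [
      {:keyword, ~r/\Aif/},
      {:ident, ~r/\A[a-z]+/},
      {:space, ~r/\A +/}
    ]
  end

  def tokens(""), do: []

  def tokens(input) do
    # Collect every rule that matches at the current position.
    candidates =
      for {name, regex} <- rules(),
          [match] <- [Regex.run(regex, input)],
          do: {name, match}

    case candidates do
      [] ->
        raise ArgumentError, "no rule matches at: #{inspect(input)}"

      _ ->
        # Enum.max_by/2 returns the first element among equal maxima,
        # which mirrors "first rule wins on a tie".
        {name, match} = Enum.max_by(candidates, fn {_name, m} -> byte_size(m) end)
        size = byte_size(match)
        rest = binary_part(input, size, byte_size(input) - size)
        [{name, match} | tokens(rest)]
    end
  end
end

IO.inspect(MaximalMunch.tokens("if iffy"))
# => [keyword: "if", space: " ", ident: "iffy"]
```

For the first token both `:keyword` and `:ident` match two characters, so the earlier rule wins. For `iffy` the `:ident` rule matches four characters and beats `:keyword` on length.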

221V · January 9, 2018, 2:28am

So how can I fix my problem?

Can you show a few small examples so I can better understand your advice, please?

kip · January 9, 2018, 3:31am

In your definitions you have

%L = [A-za-zА-Яа-я0-9.]

which is a very loose regex. When leex processes your text it looks for the longest match, as Robert indicated. It turns out that this regex will match the entire string you submitted, since the rule

%{L}+ : {token, {any_text, TokenLen, TokenChars}}.

will match the entire string. Since that is the longest match, that is what leex returns. You can test it in iex with:

iex(1)> r = ~r/[A-za-zА-Яа-я0-9.]+/
~r/[A-za-zА-Яа-я0-9.]+/
iex(2)> Regex.match? r, "test:ninja::Otest:alien:"
true

Test against your regexes and you’ll see that [^\[]+ also matches the entire string.

If I can make a suggestion: separate tokenising (leex) from parsing (yecc). You are trying to do too much in one step, I think, which makes it more complicated.
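One caveat when testing this way: `Regex.match?/2` only reports whether the pattern matches somewhere in the input, not how much a lexer rule would consume. A small sketch for measuring that, anchoring each pattern at the start of the input, might look like this. The `consumed` helper is a made-up name, and the first character class is an ASCII-only simplification of the one in the question.

```elixir
input = "test:ninja::Otest:alien:"

# Returns the text a pattern consumes at the start of the input, or ""
# when it does not match there. (Illustrative helper, not a library call.)
consumed = fn regex ->
  case Regex.run(regex, input) do
    [match | _] -> match
    nil -> ""
  end
end

# The letter/digit class stops at the first ":".
IO.inspect(consumed.(~r/\A[A-Za-z0-9.]+/))
# => "test"

# "Anything except [" swallows the whole input, because it contains no "[".
IO.inspect(consumed.(~r/\A[^\[]+/))
# => "test:ninja::Otest:alien:"
```

With longest-match semantics, a rule like the second one will win over every emoticon rule whenever it can start at the same position.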

kip · January 9, 2018, 4:23am

A simple (but incomplete) tokeniser which can process your test string follows.

Definitions.

Tag        = [a-z]
OpenTag    = \[
CloseTag   = \]
Emoticon   = :

Rules.

:\)      : {token, {smile, 1}}.
8-\)     : {token, {smile, 4}}.
:'\(     : {token, {smile, 5}}.
:ermm:   : {token, {smile, 6}}.
:D       : {token, {smile, 7}}.
<3       : {token, {smile, 8}}.
:\(      : {token, {smile, 9}}.
:O       : {token, {smile, 10}}.
:P       : {token, {smile, 11}}.
;\)      : {token, {smile, 12}}.

{OpenTag} : {token, {open_tag, TokenLine, TokenChars}}.
{Tag}+ : {token, {tag, TokenLine, TokenChars}}.
{CloseTag} : {token, {close_tag, TokenLine, TokenChars}}.
{Emoticon}{Tag}+{Emoticon} : {token, {emoticon, TokenLine, TokenChars}}.

Erlang code.
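
To try a grammar like this from Elixir without setting up a project, one option is to compile it at runtime. The sketch below is only that: the module name `bb_demo_lexer`, the temp directory, and the cut-down grammar are made up for illustration, and it assumes the `parsetools` application (which ships leex) is installed with your Erlang/OTP. In a Mix project, placing the `.xrl` file under `src/` should have Mix compile it for you.

```elixir
# A cut-down version of the grammar above, written to a temporary file.
grammar = ~S"""
Definitions.

Tag = [a-z]

Rules.

\[        : {token, {open_tag, TokenLine, TokenChars}}.
\]        : {token, {close_tag, TokenLine, TokenChars}}.
{Tag}+    : {token, {tag, TokenLine, TokenChars}}.
:{Tag}+:  : {token, {emoticon, TokenLine, TokenChars}}.

Erlang code.
"""

dir = Path.join(System.tmp_dir!(), "leex_demo")
File.mkdir_p!(dir)
xrl = Path.join(dir, "bb_demo_lexer.xrl")
File.write!(xrl, grammar)

# leex generates bb_demo_lexer.erl next to the .xrl file.
{:ok, erl} = :leex.file(String.to_charlist(xrl))

# Compile the generated Erlang source in memory and load the module.
{:ok, mod, bin} = :compile.file(erl, [:binary])
{:module, ^mod} = :code.load_binary(mod, erl, bin)

# Generated lexers work on charlists, not Elixir binaries.
{:ok, tokens, _end_line} = mod.string(String.to_charlist("[b]:ninja:"))
IO.inspect(tokens)
```

Input containing characters that no rule covers (a space or `/`, for example) makes `string/1` return an error tuple, which is why the grammar is described as incomplete.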

rvirding · January 9, 2018, 11:35am

Note one thing: the regular expressions in leex and Regex are not the same. Regex supports full Perl regexes; it is implemented internally by the :re module, which uses the PCRE library, while leex only allows a more limited set. A major benefit of the leex regexes, which are the same as in lex, is that they never need to backtrack, which the Perl ones do. With PCRE you can easily define a simple regex which will take a very long time to match, for example check here. The :re module gets around this by putting a limit on how long the match is allowed to take before giving up.

kip · January 9, 2018, 11:55am

Thanks Robert, and understood. I learnt that the hard way in my first lexers (I know, it is documented in the Erlang docs). My example above was only to illustrate your point about leex using the longest match.

rvirding · January 9, 2018, 12:03pm

Yes, my point was less a comment on your comment and code, rather just a general comment on the regexes which are allowed in leex and why.

For those who haven’t looked at this before, it means that all the regexes in leex, and in lex, can be compiled together, so you only have to make one pass over the input string to get the next token instead of testing all the regexes one after the other until one matches. The difference in efficiency is significant.

OvermindDL1 · January 9, 2018, 5:31pm

Also, just as a comparison with another parser (not just a lexer; it can lex if you want, but it is more powerful and can do it all): with my ExSpirit parsing library it would be something like the following (typed in the post, not in an IDE). You can output whatever format you want, such as an IOList or an AST. Here I’ll output an IOList of HTML, though you could just as easily output the same format as the original Erlang code, which I imagine is not what you want:

defmodule BBCodeParser do
  use ExSpirit.Parser, text: true

  # Main interface

  def parse_bbcode(string) do
    case parse(string, bbcode()) do
      %{error: nil, result: result} -> result
      %{error: _error} -> "Error will not actually happen in this example, but I'm showing you how to handle an error"
    end
  end

  # Parser definitions
  defp simple_symbols() do
    import ExSpirit.TreeMap
    new()
    |> add_text(":-)", "☺️")
    |> add_text(":-(", "☹️")
    |> add_text(":cat:", "😺")
    # ... etc more...  You can even dynamically build this if you want and store it into ets or something
  end

  defp run_tag(context, tag) do # Could even add an attributes
    body = context.result
    # Get tag from where-ever and run it with the contents, so ets or inline map or so, using an inline map here
    %{
      "code" => fn body -> ~E"<pre class='code'><%= body %></pre>" end, # Could even render a template here  :-)
      "bold" => fn body -> ~E"<span class='bold'><%= parse_bbcode(body) %></span>" end,
      "italic" => fn body -> ~E"<span class='italic'><%= parse_bbcode(body) %></span>" end
      # Etc... more...
    }[tag]
    |> case do
      fun when is_function(fun, 1) -> %{context | result: fun.(body)}
      fun when is_function(fun, 2) -> %{context | result: fun.(tag, body)}
      nil -> %{context | error: %ParseException{message: "Unknown tag `#{tag}`", context: context, extradata: body}}
    end
  end

  defrule tag_name( char([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]) )

  defrule tag(seq([
    lit(?[), tag_name() |> put_state(:tagname, :result), lit(?]),
      lexeme(repeat(lookahead_not(lit("[/") |> get_state_into(:tagname, lit(&1)) |> lit(?])) |> char()))
      |> get_state_into(:tagname, pipe_context_into(run_tag(&1))), # Convert body
      lit("[/"), get_state_into(:tagname, lit(&1)), lit(?]),
    ])
  )

  defrule bbcode(alt([
    symbols(simple_symbols()),
    tag(),
    char(),
  ]))
end

Or you could even have it output the exact same thing as the lexer in the original post if you want (though working), but I find it nice to output directly. Plus, this version can easily be made extensible with new BBCodes without needing a recompilation.

If you add an expect around the part of the tag rule that follows the open tag, then you can even have it give you an error back if the user forgets or mismatches a closing tag. ^.^

tmbb · January 9, 2018, 11:03pm

Spamming the thread just to say that ExSpirit is the most powerful and user-friendly parser I’ve ever used in any programming language.

gon782 · January 9, 2018, 11:37pm

Have you ever used Attoparsec or Megaparsec?

OvermindDL1 · January 9, 2018, 11:45pm

Those have the benefit of a static type system though.

Plus mine was modeled on C++'s Spirit::Qi, which was designed for parsing and transformation (as you can see I can even output a plain string back out if I wanted with the above example).

Mine is not ‘the’ fastest; a plain simple matcher in basic Erlang would beat it (though it would be more painful to write), and C++, Haskell and such aren’t slow interpreted languages. But for having good errors, line/column/byte positions, and recursive re-entry while passing optional state and returning it, it is quite powerful.

gon782 · January 9, 2018, 11:49pm

Indeed they do. I didn’t ask in order to disparage your library, but rather to ask if @tmbb has worked with these incredible monadic parser combinator libraries. It’s one of the areas where I feel like Haskell might offer potentially the best or at least a top 3 experience out there. (And yes, I’m aware other languages offer good parser libraries, but I think this is one particular area where the Haskell community has really hit the mark.)

OvermindDL1 · January 9, 2018, 11:52pm

I know, no worry.

Haskell does have great parsers, and OCaml has similarly modeled ones as well. It seems to be an attribute of the super-strict, statically typed languages that they inevitably get created. ^.^

tmbb · January 10, 2018, 3:23pm

I have used those libraries indeed. The abstractions they use were too cumbersome for me, and trivial things were way too hard. I just wanted to parse some text into a structured representation and suddenly I’m reading about Monads, Arrows, Applicative typeclasses and operator explosions… The way the context is mostly implicit, instead of explicit as in ExSpirit, also makes it more complex to do what I want.

I still don’t get why you need to bring up monads, arrows and the like for a parser, instead of having a simple pure function that converts a context into a context like ExSpirit. I mean, monads, arrows and functors are simple. They’re just typeclasses, but often the fact that something is a monad or an arrow or whatever doesn’t really help with dealing with the concrete problem at hand: A parser is not a monad. It’s a parser! Making it “more monad” makes it “less parser”…

So I do think that ExSpirit is way more user-friendly than Haskell’s stuff. And I do like static typing.

This is a general criticism of Haskell. Haskell programmers seem to be in a competition over who uses the highest number of typeclasses per program, often when a simple Enum would suffice.

gon782 · January 10, 2018, 3:26pm

Well, yeah, you have to know the language and its abstractions to use it. The fact that it uses the type system so heavily is what enables composing these parsers into bigger parsers.

OvermindDL1 · January 10, 2018, 4:48pm

Teeeeeechnically the Context in ExSpirit ‘is’ a monad, I even expose a ‘bind’ function (called pipe_result_into) and even a ‘functor’ function (called pipe_context_into), as well as having lots of pre-built specialized binders (which is all the transformation functions). ^.^;

There is one big side-effect of this style though, it is hard to have arbitrary type transformations then. Like take the original Spirit::Qi in C++, you parse out something that kind of looks like the format you want, and it transforms ‘that’ into a multitude of formats as output, where the Haskell versions have to output the proper specific type to start with. You could work around that of course but I don’t really see them doing that as-is.

Also, code in the Spirit style tends to be shorter.

221V · January 11, 2018, 8:26am

Thanks, I understand how I can capture emoticons,
but how can I capture text?

Not tags and not emoticons, just text without changes?

Update: hmm, OK,
I can change

[^\[]+? : {token, {any_text2, TokenLen, TokenChars}}.

to

[^\[] : {token, {any_text2, TokenLen, TokenChars}}.

and this works, with some overhead, but it works)

Thanks))
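
A note on why the first form did not help: as far as I know, leex has no lazy quantifiers, so the trailing `?` in `[^\[]+?` is read as "optional" and the rule stays greedy. With the single-character rule, the overhead is one token per character. If that matters, the runs can be glued back together after lexing. The sketch below assumes tokens shaped like `{:any_text2, length, chars}` with `chars` as a charlist, as the rule above produces; the module name `TextMerge` and the `{:smile, 1}` token in the usage line are made up for illustration.

```elixir
defmodule TextMerge do
  # Collapse each run of adjacent :any_text2 tokens into a single token.
  # All other tokens pass through unchanged and keep their order.
  def merge(tokens) do
    tokens
    |> Enum.chunk_by(&match?({:any_text2, _, _}, &1))
    |> Enum.flat_map(fn
      [{:any_text2, _, _} | _] = run ->
        chars = Enum.flat_map(run, fn {:any_text2, _len, chars} -> chars end)
        [{:any_text2, length(chars), chars}]

      other ->
        other
    end)
  end
end

tokens = [{:any_text2, 1, ~c"h"}, {:any_text2, 1, ~c"i"}, {:smile, 1}, {:any_text2, 1, ~c"!"}]
IO.inspect(TextMerge.merge(tokens))
```

Here the first two text tokens become one token of length 2, while the smile and the trailing `!` stay as they are.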

tmbb · January 11, 2018, 3:54pm

Indeed, it is a monad with a bind function.

Well, pretty much everything that contains something is a Functor… I’m not exactly surprised xD

But remember you have specialized things like pipe_context_around. I think it might be an fmap implementation too, but the point is that those functions are thought of as parser helpers first and category theory abstractions second. In Haskell it seems to be the other way around.

221V · January 22, 2018, 12:58am

The same problem.

How do I fix it?

Please help!

(Why does that fffuuu regex eat so much?((( )