Exoml - parse xml strings into a tree

A module to decode/encode xml into a tree structure.

The aim of this parser is to be able to represent any xml document as a tree-like structure, but be able to put it back together in a sane way.

In comparison to other xml parsers, this one preserves broken input. The goal is to be able to decode the typical broken html document, modify it, and encode it again without losing too much of its original content.

Currently the parser preserves whitespace between <xml> nodes, so <pre> or <textarea> tags should be unaffected by a round trip through decode/1 and encode/1.

The only place where the parser tidies up is the attr="part" portion of a <xml attr="part"> node.

With well-formed XML, the parser does work really well.


Wow, I saw this too late. Started something similar a couple of days ago (https://github.com/mlankenau/elixml). But I am not aiming for html parsing. It is pure XML with XMLNS but other than that there are a lot of things in common. What I needed was the ability to construct and save xml (that you are supporting as well) and searching with an xpath-subset.

I was missing the feature-complete xml libs that we have in the ruby world. That’s the long term goal.


I’m still curious what the built-in-to-the-BEAM XML libraries are lacking though (a better interface perhaps)?

Honestly I don’t have any strong points for crafting my own stuff. I think it is probably ok to go with the erlang lib as well. My experience with SweetXML was a little bit painful but that might not be the fault of the erlang lib.

My only points

  • Looks like they use atoms for element names. Not sure if that is a potential DoS when parsing external data (not critical in our case)
  • They use a tuple/list combination to represent the data. Is there a big performance advantage compared to using a map/list combination?
  • If there is a bug/missing feature, for me it would be much harder to participate and fix it
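To make the first point concrete: atoms on the BEAM are never garbage collected and the atom table has a fixed size (1,048,576 entries by default), so turning arbitrary external element names into atoms can eventually exhaust it. A minimal sketch of one defensive pattern, assuming element names arrive as binaries; the `AtomSafety` module and its `safe_name/1` function are hypothetical and not part of xmerl, SweetXML, or exoml:

```elixir
defmodule AtomSafety do
  # Hypothetical helper, not taken from any XML library.
  # Converts a name to an atom only if that atom already exists,
  # so untrusted input cannot grow the atom table.
  def safe_name(name) when is_binary(name) do
    {:ok, String.to_existing_atom(name)}
  rescue
    ArgumentError -> {:error, :unknown_element}
  end
end

# "ok" maps to the atom :ok, which always exists.
AtomSafety.safe_name("ok")

# A name that was never used as an atom is rejected instead of created.
AtomSafety.safe_name("surely_unseen_" <> Integer.to_string(:erlang.unique_integer([:positive])))
```

Keeping names as plain binaries avoids the question entirely, at the cost of slightly more verbose pattern matching.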

Cool, I’ll have a look. Maybe we can join efforts for a great Elixir-native XML-library. (see my dismissal of parsing HTML in my reply below)

Update on parsing HTML. I initially thought of using exoml to parse HTML as well, but I have dismissed that idea.

Practically HTML cannot just be tokenized into a tree like XML.
HTML is its own very weird language, and cannot be treated as XML at all. They look very much the same, but are insanely different in their inner workings.

So using exoml to build a tree out of a HTML document will lead to problems.

I refer to the introduction of Alexander Borisov, the author of myhtml, explaining the tokenization/parsing issue in greater detail: http://lexborisov.github.io/benchmark-html-persers/

An HTML parser is:

1. Tokenizer: breaking text down into tokens
2. Tree Builder: placing tokens in “correct positions” in a tree
3. Tree follow-up

Someone out of the blue might say: “There’s no need to build a tree for HTML parsing, it’s enough to get tokens.” Unfortunately, they’re wrong.
Actually, for correct HTML tokenization, we should have a tree at hand. Points 1 and 2 go as one.

He gives a great concise example:

Example 2:

<svg><desc><style><a>

The result of correct processing:

<html>
  <head>
  <body>
    <svg:svg>
      <desc:svg>
        <style>
          <-text>: <a>

That is also why I started working on Bindings for myhtml.
See this thread: Myhtmlex - bindings to lexborisov's fast html parser myhtml

HTML is based on SGML, not XML.

Yes, SGML is horrible.

They might be lacking performance when parsing big xml files.

I’m struggling to parse / load an 80 MB xml file.

Isn’t that the purpose of the sax/event built-in erlang xml parsers though?
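For reference, a minimal sketch of the event-based approach using `:xmerl_sax_parser` from OTP. It counts elements without building a tree, so memory use does not grow with the number of nodes. The `SaxCount` module and the `<row>` element name are made up for illustration:

```elixir
defmodule SaxCount do
  # Counts start tags with the given local name using SAX events only.
  def count(xml, name) when is_binary(xml) and is_binary(name) do
    # xmerl reports element names as charlists.
    target = String.to_charlist(name)

    event_fun = fn
      {:startElement, _uri, ^target, _qname, _attrs}, _location, acc -> acc + 1
      _event, _location, acc -> acc
    end

    {:ok, total, _rest} =
      :xmerl_sax_parser.stream(xml, event_fun: event_fun, event_state: 0)

    total
  end
end

SaxCount.count("<rows><row/><row/><other/></rows>", "row")
```

The same options can be passed to `:xmerl_sax_parser.file/2` to read from disk instead of from a binary that is already in memory. Whether this is fast enough for an 80 MB file is something you would have to measure.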

I found that just reading the .xml file directly with File.read!, splitting on \r\n, and then parsing out the data row by row is far, far faster and more flexible than using any xml parsing library.

I would certainly hope it would be. ^.^
It’s not doing any parsing or validation, so you just have to hope the file keeps exactly the structure you expect. At that point it’s not being treated as XML anyway. :)
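A rough sketch of what the line-based approach can look like, with the caveat above built in: lines that do not match the expected shape are silently skipped. The `RowScan` module and the `<row id="..." name="..."/>` format are assumptions for illustration, since the actual file layout was not posted:

```elixir
defmodule RowScan do
  # Assumes one self-closing <row .../> element per line, with attributes
  # in a fixed order. This is pattern matching on text, not XML parsing.
  def parse(data) when is_binary(data) do
    row = ~r{<row id="(\d+)" name="([^"]*)"\s*/>}

    data
    |> String.split("\r\n")
    |> Enum.flat_map(fn line ->
      case Regex.run(row, line) do
        [_, id, name] -> [{String.to_integer(id), name}]
        nil -> []
      end
    end)
  end
end

RowScan.parse("<rows>\r\n<row id=\"1\" name=\"a\"/>\r\n<row id=\"2\" name=\"b\"/>\r\n</rows>")
```

For a large file, `File.stream!/1` yields the file line by line and would avoid holding the whole content in memory, but any change in attribute order, quoting, or line breaks will break the match.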


@Overbryd I am looking for an easy way to pretty-print/tidy/indent XML with Elixir/Erlang, but I cannot find any library that e.g. provides me an indent: " " option when encoding. Do you maybe have some pointers?

That seems like a nice feature to add to exoml. I have added an issue to track progress: https://github.com/Overbryd/exoml/issues/3