Exoml - parse xml strings into a tree

Overbryd · August 23, 2017, 6:26am

A module to decode/encode xml into a tree structure.

The aim of this parser is to represent any xml document as a tree-like structure, while still being able to put it back together in a sane way.

In comparison to other xml parsers, this one preserves broken stuff. The goal is to be able to decode the typical broken html document, modify it, and encode it again, without losing too much of its original content.

Currently the parser preserves whitespace between <xml> nodes, so <pre> or <textarea> tags should be unaffected by a decode/1 followed by an encode/1.

The only place where the parser tidies up is the attr="part" portion of a <xml attr="part"> node.

With well-formed XML, the parser does work really well.

marcuslankenau · August 28, 2017, 1:47pm

Wow, I saw this too late. Started something similar a couple of days ago (https://github.com/mlankenau/elixml). But I am not aiming for html parsing. It is pure XML with XMLNS but other than that there are a lot of things in common. What I needed was the ability to construct and save xml (that you are supporting as well) and searching with an xpath-subset.

I was missing the feature-complete xml libs that we have in the ruby world. That's the long-term goal.

OvermindDL1 · August 28, 2017, 3:09pm

I’m still curious what the built-in-to-the-BEAM XML libraries are lacking though (a better interface perhaps)?

marcuslankenau · August 29, 2017, 7:04am

Honestly I don’t have any strong points for crafting my own stuff. I think it is probably ok to go with the erlang lib as well. My experience with SweetXML was a little bit painful but that might not be the fault of the erlang lib.

My only points:

- It looks like they use atoms for element names. Not sure if that is a potential DoS when parsing external data (not critical in our case).
- They use a tuple/list combination to represent the data. Is there a big performance advantage compared to using a map/list combination?
- If there is a bug or a missing feature, it would be much harder for me to participate and fix it.
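
The first point can be checked directly against the built-in xmerl library. This is a minimal sketch; the module name AtomCheck is made up for illustration, and it relies on the element name being the second field of the xmlElement record that :xmerl_scan.string/1 returns.

```elixir
defmodule AtomCheck do
  # Hypothetical helper, not part of exoml or elixml.
  # Parses an XML string with the built-in :xmerl_scan and returns
  # the name of the root element as xmerl stores it.
  def root_name(xml) when is_binary(xml) do
    {doc, _rest} = :xmerl_scan.string(String.to_charlist(xml))
    # doc is an xmlElement record (a tuple); its name is at index 1.
    elem(doc, 1)
  end
end

IO.inspect(AtomCheck.root_name("<root><child/></root>"))
```

Since atoms are not garbage collected, a large number of distinct element names in untrusted input would grow the atom table, which is the concern raised above.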

Overbryd · August 30, 2017, 1:29pm

Cool, I’ll have a look. Maybe we can join efforts for a great Elixir-native XML-library. (see my dismissal of parsing HTML in my reply below)

Overbryd · August 30, 2017, 1:37pm

Update on parsing HTML. I initially thought of using exoml to parse HTML as well, but I have dismissed that idea.

Practically HTML cannot just be tokenized into a tree like XML.
HTML is its own very weird language, and cannot be treated as XML at all. They look very much the same, but are insanely different in their inner workings.

So using exoml to build a tree out of an HTML document will lead to problems.

I refer to the introduction of Alexander Borisov, the author of myhtml, explaining the tokenization/parsing issue in greater detail: http://lexborisov.github.io/benchmark-html-persers/

An HTML parser is:

1. Tokenizer - breaking text down into tokens
2. Tree Builder - placing tokens in “correct positions” in a tree
3. Tree follow-up

Someone out of the blue might say: “There’s no need to build a tree for HTML parsing, it’s enough to get tokens.” Unfortunately, they’re wrong.
Actually, for correct HTML tokenization, we should have a tree at hand. Points 1 and 2 go as one.

He gives a great concise example:

Example 2:

<svg><desc><style><a>

The result of correct processing:

<html>
  <head>
  <body>
    <svg:svg>
      <desc:svg>
        <style>
          <-text>: <a>

That is also why I started working on bindings for myhtml.
See this thread: Myhtmlex - bindings to lexborisov's fast html parser myhtml

OvermindDL1 · August 30, 2017, 3:21pm

HTML is based on SGML, not XML.

Yes, SGML is horrible.

CharlesO · October 18, 2018, 12:49pm

They might be lacking performance for parsing big xml files.

I’m struggling to parse / load an 80 MB xml file.

OvermindDL1 · October 18, 2018, 3:24pm

Isn’t that the purpose of the sax/event built-in erlang xml parsers though?
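
For reference, a rough sketch of what the event-based approach looks like with the built-in :xmerl_sax_parser. The module name SaxCount and the idea of counting elements are only illustrative; the point is that the handler receives one event at a time and no tree is built.

```elixir
defmodule SaxCount do
  # Hypothetical example module: counts start tags with a given name
  # using the SAX parser that ships with Erlang/OTP.
  def count(xml, name) when is_binary(xml) and is_binary(name) do
    # xmerl reports element names as charlists.
    target = String.to_charlist(name)

    handler = fn
      {:startElement, _uri, ^target, _qualified_name, _attrs}, _location, acc ->
        acc + 1

      _other_event, _location, acc ->
        acc
    end

    {:ok, total, _rest} =
      :xmerl_sax_parser.stream(xml, event_fun: handler, event_state: 0)

    total
  end
end

IO.inspect(SaxCount.count("<rows><row/><row/></rows>", "row"))
```

For a file on disk, :xmerl_sax_parser.file/2 takes the same options, so the document does not have to be loaded as one string first.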

CharlesO · October 27, 2018, 11:22am

I found that just reading the .xml file directly with File.read!, splitting on \r\n, and then parsing out the data row by row is far, far faster and more flexible than using any xml parsing library.
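
A minimal sketch of that row-by-row approach, assuming a made-up layout where every record sits on its own line as <row id="..." name="..."/>. The module name RowScan and the row format are assumptions for illustration, not the actual file from this post.

```elixir
defmodule RowScan do
  # Hypothetical example: treats the file as plain text, not as XML.
  # Lines that do not match the assumed row layout are skipped silently.
  def parse(text) when is_binary(text) do
    row = ~r/<row id="(\d+)" name="([^"]*)"\s*\/>/

    text
    |> String.split(["\r\n", "\n"])
    |> Enum.flat_map(fn line ->
      case Regex.run(row, line) do
        [_full, id, name] -> [%{id: String.to_integer(id), name: name}]
        nil -> []
      end
    end)
  end
end

# Usage with a file on disk ("data.xml" is a placeholder path):
# "data.xml" |> File.read!() |> RowScan.parse()
```

This does no escaping, entity handling, or validation, so it only holds up as long as the file keeps exactly the expected line layout.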

OvermindDL1 · October 29, 2018, 6:57pm

I would certainly hope it would be. ^.^
It’s not doing parsing, validation, nothing of the sort, so you just kind of have to hope the file is proper for the structure you expect without changing, but at that point it’s not treated as XML anyway.

tobstarr · January 16, 2020, 8:13am

@Overbryd I am looking for an easy way to pretty-print/tidy/indent XML with Elixir/Erlang, but I cannot find any library that e.g. provides an indent: " " option when encoding. Do you maybe have some pointers?

Overbryd · January 16, 2020, 1:46pm

That seems like a nice feature to add to exoml. I have added an issue to track progress: https://github.com/Overbryd/exoml/issues/3