Myhtmlex - bindings to lexborisov's fast html parser myhtml

Code: https://github.com/Overbryd/myhtmlex

The current state of development is that the binding is able to parse a given HTML document into a tree structure.
The intentions of myhtml are noble: it aims to be portable, fast and correct.

On a 2.5 GHz Core i7, it takes 3.4 ms to parse a 131 KB HTML document into a tree like this:

{:html, [{"lang", "en-US-x-Hixie"}],
 [{:head, [],
   [{:meta, [{"charset", "utf-8"}], []}, {:title, [], ["HTML5"]},
    {:script, [{"src", "link-fixup.js"}], []},
    {:style, [{"type", "text/css"}],
     ["\n\n     .applies thead th > ...

Here is the output from Benchfella on decode/1:

Settings:
  duration:      1.0 s

## BasicHtmlBench
[13:37:40] 1/1: decode

Finished in 2.08 seconds

## BasicHtmlBench
benchmark name  iterations   average time
decode                 500   3401.20 µs/op

But a word of caution: the binding is currently implemented as a dirty NIF (no joke intended, dirty NIF is a real term).
So it will not load on systems that do not enable dirty schedulers in the Erlang VM.
And since it is implemented as a plain NIF, any failure in the binding or in myhtml will bring down the whole Erlang VM!
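If you are unsure whether your VM has them, you can ask the runtime directly. A quick check (this call raises an ArgumentError on emulators built without dirty scheduler support):

# number of dirty CPU schedulers; raises if the emulator
# was built without dirty scheduler support
:erlang.system_info(:dirty_cpu_schedulers)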

So see this as a proof-of-concept for now; I am still working on the Port / C-Node versions of this binding.
In the long run, running it as a C-Node might be the best option.

5 Likes

I am very interested to see where this project goes.

I need a fast HTML-parsing library in Erlang/Elixir, so falling back to C, and to myhtml especially, is a natural move.

Another direction you can go is to use a Rust NIF (via Rustler). Given Rust’s focus on safety, this might make using a NIF a little less dangerous.

There are, in fact, Elixir libraries that leverage Rust to parse HTML: html5ever_elixir, which parses HTML into a structure rather like Myhtmlex's, and my own meeseeks_html5ever, which is specially adapted for Meeseeks.

I can't compare the performance of Myhtmlex with these Rust parsers because I couldn't get Myhtmlex to build, but one of the Rust-based parsers takes about 25 ms on my machine to parse the 349 KB HTML file from this benchmark.

2 Likes

because I couldn’t get Myhtmlex to build

Thanks for giving it a try. Would you mind opening an issue for your broken build?

Another direction you can go is to use a Rust NIF (via Rustler).

I have seen html5ever/html5ever_elixir.

My decision to write another binding was made for several reasons.

I saw that html5ever does not (yet) pass all of the html5lib tree-construction tests.

Another reason was to experiment with whether I could bring the insane speed advantage of myhtml over html5ever to Elixir.
At least in this benchmark, myhtml outperforms html5ever by 9x.

A dependency like myhtml, which is built on plain C and nothing else, keeps the whole binding very small and concise.

And naturally I really love to experiment and try out new things :slight_smile:

I’ll keep you posted on how this goes.

2 Likes

Small update on performance:

I spoke with lexborisov and fixed a few mistakes in my binding.

  • Unnecessary calls to free and unnecessary tree cleanups have been removed. Most of them are managed by myhtml_parse.
  • I removed the unnecessary initialisation of empty lists.
  • The micro-benchmark now has a correct context setup.

My micro-benchmark now gives a fair comparison between a from-scratch tree build, which includes initialising a new myhtml tree, and building a tree from a referenced myhtml tree. The referenced myhtml tree only needs to be parsed once on the myhtml side; the rest is pure tree-building code.

As expected, most of the time is spent building the tree as Erlang terms (that takes as long as parsing the HTML in C). But it is still pretty damn fast.
These small improvements got me close to the performance I was looking for.
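In API terms, the two decode benchmarks below roughly correspond to the following sketch (open/1 and decode_tree/1 are the names suggested by the "decode with ref" label; check the repo for the exact API):

html = File.read!("w3c_html5.html")

# "decode": parse and build the Erlang term tree in one call
tree = Myhtmlex.decode(html)

# "decode with ref": parse once into a myhtml tree held behind a
# reference, then build the Erlang terms from that reference
ref = Myhtmlex.open(html)
tree = Myhtmlex.decode_tree(ref)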

## BasicHtmlBench
[13:30:05] 1/5: decode
[13:30:07] 2/5: decode with ref
## FileSizesBench
[13:30:09] 3/5: github_trending_js.html 341k
[13:30:12] 4/5: w3c_html5.html 131k
[13:30:14] 5/5: wikipedia_hyperlink.html 97k

Finished in 10.55 seconds

## BasicHtmlBench
benchmark name                iterations   average time 
decode with ref                     1000   1776.38 µs/op
decode                               500   3061.24 µs/op
## FileSizesBench
benchmark name                iterations   average time 
wikipedia_hyperlink.html 97k        1000   1185.00 µs/op
w3c_html5.html 131k                 1000   1799.45 µs/op
github_trending_js.html 341k         500   5313.43 µs/op
3 Likes

I would like that.

I would like that even more.

2 Likes

The NIF variant is now available as a package.

2 Likes

@Overbryd: This is only my opinion, so don't be sad about it :slight_smile:
I have already worked on scraping projects where a library like this was used, and from my experience parsing speed is not that important there. I don't see many use cases for adding it to a normal web app, i.e. where a normal James Smith is waiting for the result.

From what I have seen and personally felt, a scraper's owner does not care that much about the speed of the project. Of course faster is better, but not when it could fail, so to me this library is not that useful. But there are really similar tasks where performance matters much more. One good example is parsing spreadsheet files, especially if you add the ability to stream row by row (if that is possible with a NIF) and support all the math features used in fields.

So if I were to honestly suggest something, I would say that you should focus on spreadsheet documents, because:

  1. both projects are similar: parsing HTML vs. XML
  2. parsing data in a format that users often generate (some users generate it really, really often) should be as fast as possible
  3. I believe that even a learning project (again, a similar one, so your work is not wasted) with a bigger chance of actually being used gives you double the motivation to finish it

Personally, I'm really interested in similar projects that have more use cases for end users. Maybe your experience from this library could be used to parse spreadsheet files? I would really like to see something like:

MySpreadSheetEx.stream(path)
# and
MySpreadSheetEx.get(spreadsheet, row, column) # complex math in this cell

If you release a version of such a library, I could be the first developer to use and/or test it. :smiley:
Please consider my suggestion in light of your experience with this library.

I can also see another really similar parsing project where speed could be a big advantage: a library for MathML! Imagine writing functions that parse user input and convert it to an Elixir function or to MathML, plus functions covering the remaining conversions (MathML -> Elixir function, Elixir function -> user output, and so on). Right from the start I can see cases where a library like that would be really interesting. Imagine a developer using an HTML form to more easily generate a string and convert it to MathML without even knowing it.

Summary: performance is not as big an advantage when parsing HTML files as it would be in similar projects, but I believe your project could evolve into one or more projects that could be used even in production! So keep going, and let me know how much faster your C parsers turn out than the Rust parsers in similar projects. I will definitely be interested in trying them (again, only in cases where speed is at least as important as safety)! I'm really interested in watching your work become more useful in some similar tasks.

Parsing speed may or may not be that important - a couple dozen ms is probably nothing compared to the time required to fetch a document, but something like 4x+ faster parsing could lead to less infrastructure being required for large scraping projects.

What is important is an accurate HTML parser that doesn't require Rust. You yourself struggled with Meeseeks's Rust dependency for a while, so a C NIF could lower that particular barrier to entry.

1 Like

@mischov: When parsing speed is not that important, code safety matters more than dependencies. Yes, it does not require compiling Rust NIF code, but now it requires compiling C NIF code. :slight_smile:
Of course, any rule has its edge cases, but in the scraping case generally nobody cares which NIF you are using, unless the project has a specific requirement like code safety.

Dependencies are compiled rarely (compared to the main project), so I at least don't care about that, especially after finding the asdf-rust plugin for asdf, which means I don't need to compile Rust from source.

Anyway, as a developer I prefer to compile Rust + NIF and have a safe environment that won't confuse me in small 5-minute home projects, over a faster parser (even a 100x faster one) in any other language. Developers always have their hands full of work, so one or two compilations in the background are nothing surprising, and they usually don't care about them unless they consume so many resources that they can't continue their work.

As I mentioned, this changes when a normal end user is using the project. The user does not care what the developer uses. It should work and be as fast as possible. So parser speed and project dependencies don't matter unless you are providing a solution for end users.

Look at how most Windows users will keep using that OS for years. There are lots of awesome projects and lots of really hard work already done. We, as developers, understand their motivations and really appreciate their work and the skills they have trained. We can wait for the next releases, but... when it all reaches the end user, those things suddenly don't matter. When you are making a project for a client, you need to choose the dependencies that match their needs. They don't care whether your project requires Rust or not; that is an advantage only for us, not for end users.

From that point of view I can see that a really similar project, like a spreadsheet parser, could really interest end users, because their documents get parsed faster and they gain time. That is especially important when working with far more than a few spreadsheet documents. I know lots of people who use spreadsheets every day, importing and exporting them into lots of apps. Here speed is really important, because it's not a home project: you have thousands of documents from hundreds of users or even more, and every additional second means exactly one lost second, because other users depend on the result of that work before they can continue their own.

I wanted to say that starting with (again, only as an example) a fast spreadsheet parser could be a better idea, because it would be tested by a bigger number of interested people, and your project might even be tested in a production environment, which is a really big advantage. Once your skills have grown and you have received lots of support, yet another parser, even one used only in home projects, is both much more rewarding and just a matter of time. That project becomes much easier, even if you get no support for it, because you already have experience with a similar project that more people use.

Ah, by the way, we already talked about compiling Rust.
I have already used your Rust HTML parser and it works awesomely. I have automatically parsed lots of small pages, and personally I don't feel that I need a faster parser. It's already fast, and I don't know of any scraping project, and can't imagine any future private project of mine, that would require a faster parser, especially one that does not guarantee the same stability as yours.

I don’t disagree with you that speed is less important than stability (and accuracy), but myhtml seems to be a well tested, accurate HTML5 parser, and as long as the NIF is well implemented I think that people may find it a stable, lower-friction alternative to depending on a Rust library (everybody can compile C… in theory).

I rather like Rust, and am happy to use html5ever, but I am also interested in making it simpler for people to get started with Meeseeks. Myhtmlex hints at a way of doing that.

Keep up the good work, Lukas! :+1:

2 Likes

Have you had any success in this?

@mischov: Yup, as long as he has 100% coverage, all errors caught, and uses the same dependency versions, it should work. But he is human, and I can't expect more from him than from myself :slight_smile: everyone can miss something.

Maybe someone could implement a compiler for C and other languages that automates this task. That would be awesome (if possible).

Rust makes it much easier: once I have compiled my project's dependencies, I'm almost sure they will work as expected; the rest depends only on coverage.

But hey! I'm not someone who joined here just to say bad things. I really like some of the points, like:

Asynchronous Parsing, Build Tree and Indexation
Passes all tree construction tests from html5lib-tests
Tested by 1 billion HTML pages (by commoncrawl.org)

Those points are really interesting!

Yes, please do!

Personally, I haven't written any parser, hmm, not yet. :slight_smile:
Maybe it's time to change that.
I used one in RoR.

@CharlesO: Can you collect some info about it for me?
I would like to see complete example spreadsheet documents (for all formats), let's say in UTF-8.
If I can see how to parse them, then maybe I will write a parser in my free time.
But I can only do it in Elixir; I haven't worked with NIFs yet.

My attempt was at parsing .xls spreadsheets without a NIF. It was fast enough. It pushed data into ETS, so it could work hand in hand with the xlsxir lib, which handles .xlsx.

1 Like

Update: {:myhtmlex, "~> 0.2.0"} is out now!

It comes with the best of both worlds: stability and ultimate parsing speed.
You can now operate in two configurable modes: Myhtmlex.Nif for ultimate parsing speed, and Myhtmlex.Safe (the default), which transparently operates a supervised C-Node for you.

I am very happy to announce these changes and to provide a showcase of how to implement fast but safe C integrations using a C-Node.
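A sketch of switching modes (assuming the mode is selected via application config, as the module names suggest; see the README for the exact key):

# config/config.exs
# default: the supervised C-Node, a parser crash cannot take down your VM
config :myhtmlex, mode: Myhtmlex.Safe

# opt into the raw NIF for maximum parsing speed
# config :myhtmlex, mode: Myhtmlex.Nif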

6 Likes

Just in time! Thanks, and well done on the work!

Thank you very much for the hard work. I came across this precisely because I wanted an HTML parser without installing Rust. I like the safety of Rust, but I don't have the luxury of installing Rust everywhere I deploy my app, while C is almost everywhere.

The extra speed gain is very welcome, especially when I need to parse multiple responses.

Again, keep up the good work.

P.S. I have a small suggestion. You indicate that this is an Elixir/Erlang binding. I would suggest writing the binding in Erlang and providing an Elixir layer on top of it. That way, both Erlang users and Elixir users can use your library independently.

At the moment, it’s not straightforward to use the library in Erlang (without Elixir installed).

1 Like

Technically, I don't think you need Rust where you deploy your application; you need Rust where you build your release. They can be one and the same, but they don't have to be.

1 Like