HTML tools in Elixir

f34nk · March 5, 2018, 11:03am

The landscape of available Elixir packages for HTML tooling is small, but for that reason also very focused. Each library serves a distinct use case.

For that reason I compiled a quick overview of the available HTML tools in Elixir.
So far I have only covered: Floki, HtmlSanitizeEx, Meeseeks, Myhtmlex, and ModestEx.

My humble conclusion is that benchmarking these libraries against each other is not very useful, since the goal and main strength of each one is different. The tested functions are also not really comparable, because the overhead involved differs a lot from library to library. It is safe to say that all of them are very fast.

All in all, I would say the focused nature of the tools makes it easy to pick the right tool for the job.

However, the ecosystem of tools is still quite young. There is room for improvement.

Feel free to discuss missing features, or differences from comparable libraries in other languages.

Best, f34nk

mischov · March 5, 2018, 2:44pm

I think it might be worthwhile to indicate whether a parser is HTML5 compliant (so Floki with html5ever parser but not default parser, Meeseeks, Myhtmlex, and ModestEx).

Also, Floki arguably has facilities for manipulating nodes: https://hexdocs.pm/floki/Floki.html#map/2.

Edit: Also, it’s probably beyond the scope of an overview of HTML tools, but I kind of wish there was an explicit indicator of whether a library is intended to be used with XML. I see a fair number of people reaching for Floki to work with XML when it just doesn’t have parsers intended for it.

f34nk · March 5, 2018, 7:01pm

Good point. I will add a column to the table.

Also true. I forgot. It is a feature that is easy to overlook, however. I will also add this info to the table. Is Floki.map/2 limited to changing attributes?

Yes I also agree here. But how would you like to see this information represented?

Really? That seems odd. XML and HTML are completely different languages. Maybe they use the default parser and hope for the best.

Also, popular resources like awesome-elixir only have a section for “XML”, not an explicit one for “HTML”. I think this is misleading!

I opened up an issue on awesome-elixir to add a new section “HTML”.

mischov · March 5, 2018, 8:00pm

As far as I can tell it only works for changing an element’s tag or its attributes.
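
To make that concrete, here is a minimal stdlib-only sketch of what "changing an element’s tag or its attributes" means on a tree of `{tag, attributes, children}` tuples. `TreeSketch` is a hypothetical module written for this illustration, not Floki’s code, and the tuple shape is assumed from the list-of-tuples format discussed in this thread.

```elixir
defmodule TreeSketch do
  # Hypothetical helper, not part of Floki: walks a tree of
  # {tag, attributes, children} tuples and applies `fun` to the
  # {tag, attributes} pair of every element. The nesting itself is
  # preserved, so nodes can be renamed or re-attributed but not
  # added, removed, or moved.
  def map(nodes, fun) when is_list(nodes), do: Enum.map(nodes, &map(&1, fun))

  def map({tag, attrs, children}, fun) do
    {new_tag, new_attrs} = fun.({tag, attrs})
    {new_tag, new_attrs, map(children, fun)}
  end

  # Text nodes and anything else are passed through unchanged.
  def map(other, _fun), do: other
end

tree = [{"div", [{"class", "old"}], ["Hello ", {"b", [], ["world"]}]}]

updated =
  TreeSketch.map(tree, fn
    {"b", attrs} -> {"strong", attrs}
    {"div", _attrs} -> {"div", [{"class", "new"}]}
    other -> other
  end)

IO.inspect(updated)
```

Because the callback only ever sees `{tag, attrs}`, inserting or deleting nodes is out of reach for this kind of function.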

Yeah, a lot of people use the mochiweb_html parser, which does an ok job with XML, but I’ve even seen people using the html5ever parser and that’s just a bad idea.

People using html5ever for parsing XML was one of the main motivations I had for adding an XML parser to Meeseeks: if they were going to try to do it anyway, I wanted to provide a good parser for them to use.

mischov · March 5, 2018, 8:10pm

Also it might be worth adding some kind of footnote for the Meeseeks/Floki benchmark.

f34nk · March 5, 2018, 8:35pm

Do you plan to add functions that let you manipulate nodes to Meeseeks?

Actually that was the motivation for me to start ModestEx, because I missed those features in other libraries. Or is it just me and people only need to parse HTML and not change it?

mischov · March 5, 2018, 8:45pm

I’m pretty hesitant to at the moment. Adding nodes in the right place as per the HTML5 spec, etc., is a pretty complicated topic, as is figuring out efficient ways to update what I currently treat as an immutable structure (the Document).

Meeseeks has from the first been designed as a tool to search for and extract data from HTML (and now XML), which is the purpose I use it for, and for the foreseeable future I plan to limit it to that.

mbenatti · March 5, 2018, 11:33pm

Hey @f34nk, I have only used Floki in my projects and it suits my needs.
I’m taking a look at your project https://github.com/f34nk/modest_ex and it seems awesome!

Good work, dude. If I find something useful or missing in the HTML tools, I’ll post it here!

f34nk · March 6, 2018, 8:37am

Thanks man.
But what about Drab, EEx and Phoenix.HTML?
Strictly speaking these are also “HTML tools”.

I would love to see more HTML processing on the backend side, rather than pushing half-baked data to the client and letting it figure out the rest.
I mean, website performance (and with that user experience) has gotten so much worse since everything is being done “on load” or “async”.

f34nk · March 6, 2018, 5:26pm

https://github.com/h4cc/awesome-elixir#html

mbenatti · March 6, 2018, 11:36pm

You’re right, I agree with you!
I have used both in my projects, including Drab!
I would like to see more community activity around tools like Drab and components like https://nico-amsterdam.github.io/awesomplete-util/phoenix.html…

I like Drab and it’s powerful; however, I haven’t seen companies or projects showcasing it yet.

f34nk · March 7, 2018, 8:31am

PhoenixFormAwesomplete looks interesting. How is it different from Drab?

IMO those projects are born out of the unique connection between Phoenix and websockets. It just makes it much more obvious that you can do something like that. Or am I missing something? I am not a UI or even frontend guy, so maybe I don’t see the full picture here.

mbenatti · March 7, 2018, 6:53pm

Drab is a framework that works with the DOM over websockets. It has its “own” kind of controller, called a “commander”, and it uses Phoenix as a base, so it is a complement to Phoenix.

https://github.com/nico-amsterdam/phoenix_form_awesomplete/#installation is more like an “HTML component” for creating autosuggest inputs with on-demand Ajax calls. It uses Lea Verou’s Awesomplete widget (https://github.com/LeaVerou/awesomplete) behind the scenes!

f34nk · April 25, 2018, 8:24am

Hello!
I updated the parsing benchmarks on the elixir_html_tools repo and added some test cases with small file sizes, which were missing before.

To my surprise, Floki and Meeseeks are incredibly fast for inputs smaller than 1 kB. You can check out the runtime distribution here.

So packages like Myhtmlex or ModestEx, which use C Nodes, come with additional latency.

Use this potentially helpful and fairly unscientific table to help you decide:

Type	Isolation	Complexity	Latency
Node	Network	Highest	Highest
Port	Process	High	High
Port Driver	Shared	Low	Low
NIF	Shared	Lowest	Lowest

When there is time I will set up a dirty-NIF test, too.

Cheers

dimitarvp · April 25, 2018, 1:14pm

IMO 90% of the JS devs just have no idea how to use them. And keep in mind I am a JS hater, so I ain’t gonna be one of these guys that tell you “just use it right”, but in this case it’s partially true. I’ve seen some rare JS website gems that are incredibly fast and smooth even on spotty 3G on an iPhone 5c.

That being said, I fully agree with you. And I am going back to server-side rendering more and more with time.

mischov · April 25, 2018, 1:28pm

@f34nk Feedback for the updated benchmarks.

“All parsers except ModestEx return html encoded into a list of tuples.”

Meeseeks returns it as a Document, which is a flat map of node id to node struct.
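
To illustrate the difference between the two shapes, here is a rough stdlib-only sketch that flattens a nested `{tag, attrs, children}` tree into a map of node id to node. `FlatSketch` and the node fields (`:type`, `:parent`, `:children`, and so on) are made up for this example; they are assumptions for illustration and not Meeseeks’ actual structs.

```elixir
defmodule FlatSketch do
  # Hypothetical illustration: turns a nested {tag, attrs, children}
  # tree into a flat map of integer id => node map. Ids are assigned
  # in document order, starting at 1 for the root.
  def flatten(tree) do
    {_root_id, {nodes, _next_id}} = walk(tree, nil, {%{}, 1})
    nodes
  end

  defp walk({tag, attrs, children}, parent, {nodes, id}) do
    # Reserve this element's id, then walk its children left to right.
    {child_ids, {nodes, next_id}} =
      Enum.map_reduce(children, {nodes, id + 1}, fn child, acc ->
        walk(child, id, acc)
      end)

    element = %{type: :element, tag: tag, attrs: attrs, parent: parent, children: child_ids}
    {id, {Map.put(nodes, id, element), next_id}}
  end

  defp walk(text, parent, {nodes, id}) when is_binary(text) do
    {id, {Map.put(nodes, id, %{type: :text, content: text, parent: parent}), id + 1}}
  end
end

nested = {"p", [], ["Hi ", {"b", [], ["there"]}]}
doc = FlatSketch.flatten(nested)

IO.inspect(doc[1].children)
IO.inspect(doc[4])
```

In the flat form every node is reachable by id and knows its parent, which suits searching and extraction; in the nested form the tree structure is the data itself.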

You also appear to be using the :mochiweb_html parser for Floki, which is the non-HTML5 compliant one, so you’re comparing apples to the HTML5 compliant oranges of the other parsers. Of course, AFAIK it’s impossible to run Floki’s HTML5 parser on the latest version of OTP, but that’s a different problem altogether.

Also curious: when you’re benchmarking, are you disabling CPU throttling (as mentioned here)? I’ve found that can reduce variation in run times when benchmarking.

Finally, I’m curious why the averages shown in the text results don’t appear to be reflected in the images. For instance, looking at the images it appears that 50k Floki is faster than 50k Meeseeks, but according to the text 50k Floki averaged 16633.17 µs/op while 50k Meeseeks averaged 12018.79 µs/op.
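
For anyone who wants to sanity-check averages like these independently of either benchmarking library, a rough stdlib-only sketch follows. `BenchSketch` is a hypothetical helper, and the workload is a stand-in (a plain `String.split/2`), not a call into any of the parsers discussed here; it also does no warm-up, so treat its numbers as a coarse cross-check only.

```elixir
defmodule BenchSketch do
  # Hypothetical helper: runs `fun` `runs` times and returns
  # {average, minimum, maximum} wall-clock time in microseconds.
  def measure(fun, runs \\ 100) when runs > 0 do
    times =
      for _ <- 1..runs do
        {micros, _result} = :timer.tc(fun)
        micros
      end

    {Enum.sum(times) / runs, Enum.min(times), Enum.max(times)}
  end
end

# Stand-in workload; swap in the parse call you want to compare.
html = String.duplicate("<p>hello</p>", 1_000)
{avg, min, max} = BenchSketch.measure(fn -> String.split(html, "<p>") end)

IO.puts("avg #{Float.round(avg, 2)} µs/op (min #{min}, max #{max})")
```

A wide gap between min and max on repeated runs is a hint that something like CPU throttling or other system load is adding noise to the average.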

f34nk · April 25, 2018, 1:37pm

Thank you! I will update that.

Also true. I will change that, too.

Interesting. I will check that out.

That’s odd.
One is the output of benchfella, the other benchee.
Do you have time to clone and repeat the bench yourself?

I actually did some benchmarking in C for my package and came out with different results too. I will have to investigate further to be sure what’s going on.

mischov · April 25, 2018, 2:38pm

Hardware variation aside, it seems like the included graphs might just be a little wonky.

The graphs I generated were more in line with the textual output and clearly showed Meeseeks parsing smaller input slower and larger input faster than Floki (which is not surprising when Floki is using the :mochiweb parser).

f34nk · April 26, 2018, 9:07am

@mischov

I repeated the benchmarks, and my benchee and benchfella results differ from each other.

My benchfella run shows the same result you described. But the benchee measurements are somehow different.

I removed the benchee graphs from the repo and did some updates in the README. I hope this is better now.

Thanks for the feedback!