This release changes how extractors are implemented and makes various other extractor-related improvements.
Reimplementation
Previously, extractors were implemented as callbacks in the private Document.Node behaviour. This didn’t play to the strengths of behaviours: document nodes are a closed set of structs, so the extensibility of having extractors as callbacks provided no concrete benefit. It did, however, set an example that obscured the fact that you can easily write your own extractor functions, and it split extractor implementations across the various node types, making them harder to understand.
I refactored this, removing the Document.Node behaviour and moving that functionality to modules under Meeseeks.Extractor.
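To illustrate the point about writing your own extractor functions, here is a minimal sketch of one that collects every href from a tuple tree. It assumes a tree of {tag, attributes, children} tuples with binaries for text, which is my understanding of what Meeseeks.tree returns; the MyExtractors module and hrefs function are hypothetical names, not part of Meeseeks.

```elixir
defmodule MyExtractors do
  # Hypothetical custom extractor: walks a tuple tree shaped like
  # {tag, attributes, children} and returns every href in document order.
  def hrefs({_tag, attrs, children}) when is_list(attrs) and is_list(children) do
    own = for {"href", value} <- attrs, do: value
    own ++ Enum.flat_map(children, &hrefs/1)
  end

  # Text, comments, and any other node shape contribute nothing.
  def hrefs(_other), do: []
end

# Hand-built tree standing in for the output of a real parse (assumption).
tree =
  {"div", [],
   [
     {"a", [{"href", "/one"}], ["One"]},
     "some text",
     {"p", [], [{"a", [{"href", "/two"}], ["Two"]}]}
   ]}

links = MyExtractors.hrefs(tree)
```

Because the function only pattern matches on plain tuples, it needs no callbacks or behaviour, which is the kind of simplicity the old design obscured.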
Fixes and Improvements
I also made a number of improvements to extractors, such as using iodata in string-building extractors instead of string concatenation, improving the performance of whitespace collapsing, and making whitespace collapsing optional in those extractors that do it (data, own_text, text).
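As a rough illustration of the iodata change and the optional collapsing, the sketch below builds text from chunks both ways. It is not the Meeseeks implementation: TextSketch and its functions are hypothetical, and the collapse_whitespace option name is an assumption made for this example.

```elixir
defmodule TextSketch do
  # Naive approach: each <> builds a new, longer binary.
  def join_concat(chunks) do
    Enum.reduce(chunks, "", fn chunk, acc -> acc <> chunk end)
  end

  # iodata approach: nest the chunks in a list and flatten once at the end.
  def join_iodata(chunks, opts \\ []) do
    # Assumed option name; collapsing stays on unless explicitly disabled.
    collapse? = Keyword.get(opts, :collapse_whitespace, true)

    text =
      chunks
      |> Enum.reduce([], fn chunk, acc -> [acc, chunk] end)
      |> IO.iodata_to_binary()

    if collapse?, do: collapse(text), else: text
  end

  # Replace each run of whitespace with a single space.
  defp collapse(text), do: String.replace(text, ~r/\s+/, " ")
end

chunks = ["Hello,  ", "\n world"]
collapsed = TextSketch.join_iodata(chunks)
raw = TextSketch.join_iodata(chunks, collapse_whitespace: false)
```

Here collapsed is "Hello, world", while raw keeps the original run of spaces and the newline.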
The release also contains a couple of extractor-related fixes: one from pclewis that removes the unnecessary spaces that comments were being wrapped in when encoding them to HTML, and another that adds a space between sibling nodes during text extraction only when the preceding sibling did not end in whitespace[1].
A big “thank you” to Philip for his fix!
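The sibling spacing rule described above can be sketched as follows. This is a simplified illustration rather than the code in the release: SiblingText is a hypothetical module, each sibling is reduced to its text, and a regex stands in for the real trailing-whitespace check.

```elixir
defmodule SiblingText do
  # Joins sibling texts, inserting a single space only when the previous
  # sibling did not already end in whitespace.
  def join(texts) do
    {iodata, _ended_in_ws?} =
      Enum.reduce(texts, {[], true}, fn
        # Empty siblings neither add text nor change the spacing state.
        "", state ->
          state

        text, {acc, prev_ended_in_ws?} ->
          sep = if prev_ended_in_ws?, do: "", else: " "
          {[acc, sep, text], String.match?(text, ~r/\s\z/)}
      end)

    IO.iodata_to_binary(iodata)
  end
end

no_trailing = SiblingText.join(["Hello", "world"])
trailing = SiblingText.join(["Hello ", "world"])
```

Both calls yield "Hello world". Adding the separator unconditionally would give "Hello  world", with two spaces, for the second input.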
Compatibility
By this point you may be concerned about how this release will break your existing use of Meeseeks. Good news: as long as you use the public API, the changes to how extractors are implemented are completely backwards compatible. The two fixes mentioned above do, however, mean that using data, html, own_text, or text may yield slightly different results than before.
If you are for some reason using the private Document.Node callbacks directly on document nodes, sorry: that behaviour is gone and the callbacks are no longer implemented on nodes. If you are using the private Document.Node helper functions that called into those callbacks, you will be happy to learn those still work fine.
Is it fast tho?
I want to get an updated version of the Meeseeks vs Floki bench out, but there are some technical difficulties: html5ever_elixir won’t compile on Erlang/OTP 22, so I have nothing to compare against. I am considering comparing against the :mochiweb_html parser for a round since that’s what a lot of people are being forced to use anyway, but I’m still a bit reluctant because that’s more of an apples-to-oranges comparison.
[1] Efficiently finding out if a binary ends in whitespace is hard. I am really glad José had to figure it out for String.trim and I was able to adapt his solution.
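To give a feel for the problem, here is a deliberately simplified, ASCII-only sketch that inspects just the last byte of the binary. It is not the String.trim approach mentioned in the footnote and does not recognise multi-byte Unicode whitespace; the Tail module is a hypothetical name.

```elixir
defmodule Tail do
  # ASCII whitespace bytes: space, tab, newline, carriage return,
  # form feed, vertical tab.
  @ascii_whitespace [?\s, ?\t, ?\n, ?\r, ?\f, ?\v]

  # An empty binary has no last byte, so it cannot end in whitespace.
  def ends_in_whitespace?(<<>>), do: false

  # :binary.last/1 returns the final byte without scanning the binary.
  def ends_in_whitespace?(bin) when is_binary(bin) do
    :binary.last(bin) in @ascii_whitespace
  end
end

with_newline = Tail.ends_in_whitespace?("text\n")
without = Tail.ends_in_whitespace?("text")
```

Since ASCII bytes never occur inside a multi-byte UTF-8 sequence, this check gives no false positives on valid UTF-8, but it will miss whitespace such as a trailing non-breaking space, which is part of what makes the general problem hard.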