Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors

Release v0.11.2

This release includes several fixes to CSS selectors.

Previously, names, idents, and strings in CSS selectors would not accept escaped or unicode characters. CSS selectors now accept both escaped characters and Elixir-style unicode code points. A big thank you to @Ripster for creating the issue that led to these fixes.

iex(1)> import Meeseeks.CSS
Meeseeks.CSS

iex(2)> Meeseeks.one("<div id=\"~\"></div>", css("#\\~"))
#Meeseeks.Result<{ <div id="~"></div> }>

iex(3)> Meeseeks.one("<div class=\"❤\"></div>", css(".❤"))  
#Meeseeks.Result<{ <div class="❤"></div> }>

This release also includes some minor improvements to the CSS selector parser errors.

Release v0.12.0

This release fixes Meeseeks.html/1 in two ways. When encoding attribute values, double quotes are always used and & and " are escaped as character entities. When encoding text, <, >, and & are escaped as character entities.

This should be considered a breaking change since the output of Meeseeks.html/1 may be slightly different, but it means, for instance, that round tripping

"<span>&lt;script&gt;Hello&lt;/script&gt;</span>"

through Meeseeks.parse then Meeseeks.html produces

"<span>&lt;script&gt;Hello&lt;/script&gt;</span>"

instead of

"<span><script>Hello</script></span>"
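The escaping rules described above can be sketched in plain Elixir. This is only an illustration of the rules, not how Meeseeks implements them; EscapeSketch and its function names are hypothetical.

```elixir
defmodule EscapeSketch do
  # Hypothetical helper for illustration only; not part of Meeseeks.

  # Text nodes: escape &, <, and >.
  # & is replaced first so the entities added afterwards are not escaped again.
  def escape_text(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
  end

  # Attribute values (always wrapped in double quotes): escape & and ".
  def escape_attribute(value) do
    value
    |> String.replace("&", "&amp;")
    |> String.replace("\"", "&quot;")
  end
end

IO.puts(EscapeSketch.escape_text("<script>Hello</script>"))
# prints: &lt;script&gt;Hello&lt;/script&gt;
```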

A big thanks to @ericlathrop for bringing the issues to my attention and proposing a solution.

Release v0.13.0

This release adds support for Erlang/OTP 22 (and Elixir 1.9, though that was working fine before), and removes support for Elixir 1.4, Elixir 1.5, and OTP 19. I had only planned on removing support for Elixir 1.4 and OTP 19, but a recent change in Rustler made Elixir 1.6 its minimum version. Sorry for any inconvenience this might cause.

Release v0.13.1

Since the minimum supported version of Erlang/OTP is now 20, it was possible to switch the NIF from working asynchronously to using a dirty scheduler, which simplified the NIF’s implementation and may speed things up (I’ll get an updated version of the Meeseeks vs. Floki bench out soon).

Ooo very cool!

Release v0.14.0

This release changes how extractors are implemented and makes various other extractor-related improvements.

Reimplementation

Previously extractors were implemented as callbacks in the private Document.Node behaviour. This didn’t play to the strengths of behaviours because document nodes are a closed set of structs and the extensibility of having extractors as callbacks didn’t provide any concrete benefits. It did, however, set an example that obscured the fact that you can easily write your own extractor functions, and split up extractor implementations over the various node types making them harder to understand.

I refactored this, removing the Document.Node behaviour and moving that functionality to modules under Meeseeks.Extractor.

Fixes and Improvements

I also made a number of improvements to extractors, such as using iodata in string-building extractors instead of string concatenation, improving the performance of whitespace collapsing, and making whitespace collapsing optional in those extractors that do it (data, own_text, text).
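To illustrate the iodata point, here is a minimal sketch that contrasts repeated string concatenation with collecting the pieces as iodata and flattening once at the end. IodataSketch and its functions are hypothetical names, not the actual extractor code.

```elixir
defmodule IodataSketch do
  # Hypothetical example; not the Meeseeks extractor implementation.

  # Repeated concatenation: each step copies into a new, larger binary.
  def concat(parts) do
    Enum.reduce(parts, "", fn part, acc -> acc <> part end)
  end

  # iodata: cheaply nest the pieces in a list, then flatten a single time.
  def build(parts) do
    parts
    |> Enum.reduce([], fn part, acc -> [acc, part] end)
    |> IO.iodata_to_binary()
  end
end

IO.puts(IodataSketch.build(["Hello", ", ", "world"]))
# prints: Hello, world
```

Both functions return the same string; the iodata version just defers building the final binary until the end.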

The release also contains a couple extractor-related fixes, one from pclewis that removes the unnecessary spaces that comments were being wrapped in when encoding them to HTML, and another that limits the adding of a space between sibling nodes during text extraction to when the preceding sibling did not end in whitespace[1].

A big “thank you” to Philip for his fix!

Compatibility

By this point you may be concerned about how this release will break your existing use of Meeseeks. Good news: as long as you use the public API, the changes to how extractors are implemented are completely backwards compatible. The two fixes mentioned above do, however, mean that using data, html, own_text, or text may yield slightly different results than before.

If you are for some reason using the private Document.Node callbacks directly on document nodes, sorry, that behaviour is gone and the callbacks are no longer implemented on nodes. If you are using the private Document.Node helper functions that called into those callbacks you will be happy to learn those still work fine.

Is it fast tho?

I want to get an updated version of the Meeseeks vs Floki bench out, but there are some technical difficulties there because html5ever_elixir won’t compile on Erlang/OTP 22 so I have nothing to compare against. I am considering comparing against the :mochiweb_html parser for a round since that’s what a lot of people are being forced to use anyway, but I’m still a bit reluctant because that’s more of an apples to oranges comparison.


[1] Efficiently finding out if a binary ends in whitespace is hard. I am really glad José had to figure it out for String.trim and I was able to adapt his solution.
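For a feel of the problem, here is a deliberately simplified sketch that only looks at the last byte and only recognizes ASCII whitespace. It is not the approach used in String.trim or in Meeseeks, which also have to deal with multi-byte Unicode whitespace; WhitespaceSketch is a hypothetical name.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical, ASCII-only simplification for illustration.
  @ascii_whitespace [?\s, ?\t, ?\n, ?\r, ?\v, ?\f]

  # An empty binary has no last byte, so it cannot end in whitespace.
  def ends_in_whitespace?(<<>>), do: false

  # :binary.last/1 reads the final byte without walking the whole binary.
  def ends_in_whitespace?(binary) when is_binary(binary) do
    :binary.last(binary) in @ascii_whitespace
  end
end

IO.inspect(WhitespaceSketch.ends_in_whitespace?("some text "))
# prints: true
```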

Release v0.15.0

This release adds support for Elixir 1.10 and makes a couple of correctness-related improvements.

Safer tuple tree parser

The first of these improvements is that the tuple tree parser is now much more strict about not parsing invalid input. Thanks to pclewis for pointing out an issue that led to this work.

No XPath attribute steps outside of predicates

The second of these improvements is that XPath attribute steps outside of predicates are now prohibited (rather than just broken).

For example, xpath("//p[@class]") which returns elements with class attributes is allowed, but xpath("//p/@class") which would return the class attributes themselves is prohibited. If you do need to extract a selected element’s attribute, use the attr extractor.

Meeseeks.all(doc, xpath("//p[@class]")) |> Enum.map(&Meeseeks.attr(&1, "class"))

Thanks to @OldhamMade for reporting the issue that prompted this work.

Other changes

There are also some minor improvements to the project documentation, and contribution guidelines have been added.

Since html5ever_elixir recently released a new version that works on Erlang/OTP 22 I have finally been able to release a Meeseeks vs. Floki Benchmark update.

Here’s an excerpt from the Trending JS scenario, but go ahead and check out the whole benchmark for a more complete description.

Name                     ips        average  deviation         median         99th %
Meeseeks CSS           23.22       43.07 ms     ±2.73%       42.79 ms       47.22 ms
Meeseeks XPath         19.47       51.35 ms     ±4.03%       50.77 ms       60.82 ms
Floki CSS              14.01       71.39 ms     ±3.85%       71.31 ms       83.36 ms

Comparison: 
Meeseeks CSS           23.22
Meeseeks XPath         19.47 - 1.19x slower +8.28 ms
Floki CSS              14.01 - 1.66x slower +28.32 ms

Memory usage statistics:

Name              Memory usage
Meeseeks CSS           3.66 MB
Meeseeks XPath         6.57 MB - 1.80x memory usage +2.91 MB
Floki CSS             22.23 MB - 6.08x memory usage +18.57 MB

One problem I lately had with Meeseeks was that it complained about an invalid encoding while Floki just parsed the document in UTF-8. I can try and find a few such defective HTML files if you like. But in such situations I appreciate my tool’s ability to cope with the problem instead of erroring out.

Please do create an issue if you believe you’ve found an error, yes.

If the issue is that a document with content type text/html; charset=ISO-8859-1 (or some other non-UTF-8 charset) is not being parsed as UTF-8, then yes, that’s a known issue for Meeseeks.

And for Floki

It’s just that the mochiweb_html parser will try to treat the other encoding as UTF-8 and give you back gibberish or incorrect results at times while the html5ever parser that Meeseeks uses by default will just not work.

Erroring is the correct response, imo. If there’s a problem it’s better to know about it early and handle it correctly (as suggested in the Floki issue, by converting from whatever charset to UTF-8, then parsing) than to incorrectly parse and potentially return wrong answers or gibberish and not provide any context as to why.
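As a rough sketch of the convert-then-parse approach, assuming you already know the document is ISO-8859-1 (latin1), Erlang's :unicode module can do the conversion. EncodingSketch is a hypothetical name, and other charsets would need a different tool.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper; assumes the input really is ISO-8859-1 (latin1).
  def latin1_to_utf8(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, :latin1, :utf8)
  end
end

# "café" encoded as latin1: the final byte 0xE9 is not valid UTF-8 on its own.
latin1 = <<"caf", 0xE9>>
IO.inspect(String.valid?(latin1))
# prints: false

utf8 = EncodingSketch.latin1_to_utf8(latin1)
IO.inspect(String.valid?(utf8))
# prints: true

# The converted binary can then be handed to the parser as usual.
```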

At one point html5ever attempted to provide a mechanism for parsing from other encodings (from_bytes) but that was removed. In practice it’s difficult: I believe to do content sniffing correctly you need to provide information from the HTTP request as well, so it’s not just a matter of parsing HTML any more and consequently not necessarily appropriate for an HTML parser to handle.

I agree and I usually write my own libraries like this. But HTML and XML are a very notable exception. I remember back in 2007 a guy was writing an RSS parser and aggregator client and he actually had to normalise invalid XML in his program so it becomes a valid XML that’s parseable. It’s the sad reality of the web, and HTML is much, much worse.

What I would do if I was in the place of html5ever would be to add a setting that allows several encodings to be “tried” before giving up with an error.
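One way such a "try several encodings" step might look as a pre-parsing pass in Elixir is sketched below. This is purely hypothetical (neither html5ever nor Meeseeks offers this), FallbackSketch is an invented name, and it is limited to encodings that Erlang's :unicode module understands.

```elixir
defmodule FallbackSketch do
  # Hypothetical pre-parsing step; not an option in Meeseeks or html5ever.
  # Tries each encoding in order and returns the first successful conversion.
  def decode(bytes, encodings \\ [:utf8, :latin1]) when is_binary(bytes) do
    Enum.find_value(encodings, :error, fn encoding ->
      case :unicode.characters_to_binary(bytes, encoding, :utf8) do
        utf8 when is_binary(utf8) -> {:ok, encoding, utf8}
        # {:error, _, _} or {:incomplete, _, _}: move on to the next candidate.
        _error_or_incomplete -> nil
      end
    end)
  end
end

IO.inspect(FallbackSketch.decode(<<"caf", 0xE9>>))
```

Note that latin1 accepts any byte sequence, so it has to come last, and a successful conversion only means the bytes were decodable, not that the guess was right. That is part of why guessing can silently produce wrong results.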

If you think it would be valuable to add such functionality to Meeseeks somehow (at the Elixir or Rust levels, maybe as a pre-parsing step, maybe as an option in parsing, maybe ???) and are willing to implement such functionality I suggest opening an issue so we can discuss how it might make sense to proceed.

Here was how from_bytes was implemented in html5ever: Add Parser::from_bytes, with BOM detection and Content-Type charset. · servo/html5ever@2f4f64b · GitHub

And here is the HTML5 spec for encoding sniffing: HTML Standard

Release v0.15.1

This is a tiny release that fixes XPath selector parsing to allow unicode characters.

The problem was reported and fixed by @yanshiyason. Thank you!

Release v0.16.0

Wow, it’s been over a year since the last release! Hope everybody has been as well as circumstances allow.

This release adds support for Elixir 1.12 and Erlang/OTP 24, and is brought to you mostly due to the excellent work of @jeroenvisser101 who figured out how to upgrade from Rustler v0.21 to v0.22.

This release also drops support for Elixir 1.6 and Erlang/OTP 20.

Meeseeks is super fast! I’m parsing Wikipedia XML dumps and it churns through it nicely.

Release v0.16.1

This release uses an updated version of meeseeks_html5ever that supports compilation on Apple M1 devices. Thank you to @Sgoettschkes who reported the issue and tested the solution.

Any plans for support for Rustler Precompiled and update ex_doc dependencies?

Are you having issues with ExDoc?

There was recently an issue created with a related PR on meeseeks_html5ever addressing Rustler precompilation. I need to find an opportunity to review the situation. It may require me to stop supporting a number of versions of Elixir long before I planned, and I’ll need to weigh up the pros and cons there if I do.

No issues, just the look and feel of generated documentation was improved. Simply compare documentation for floki and meeseeks.

Hello Meeseeks users, I am considering changing Meeseeks’ minimum supported Elixir version to 1.11, which moves up the minimum version a bit more than I usually would in one go.

As of Elixir 1.14’s release my standard operating procedure would have been to move the minimum supported version up to 1.9, which was released about 3 years ago, but I believe supporting Rustler precompilation will require 1.11 (released 2ish years ago).

I don’t like making such a drastic change in the minimum supported version, but supporting Rustler precompilation could help people use Meeseeks without having Rust installed and speed up compile times.

If you would have an issue with the minimum supported version of Elixir being moved to 1.11 please let me know so I can try to account for that in my decision making. Thank you!
