Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors

mischov · June 13, 2017, 1:09am

Yep, it parses the XML into a Meeseeks.Document, so CSS or XPath (or custom) selectors just work.

I saw people trying to parse XML with Floki or Meeseeks because they were readily available and easy to use, and it was driving me a bit crazy because the html5ever parser in particular is awful at parsing XML. To preserve my sanity, I added an XML parser.
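As a rough illustration of parsing XML into a Meeseeks.Document and querying it with both selector types, here is a minimal sketch. It assumes :meeseeks (v0.7.0 or later) is already a dependency of the project; the module name and the sample feed are made up for the example.

```elixir
defmodule XmlSelectExample do
  # Assumes :meeseeks is listed as a dependency in mix.exs.
  import Meeseeks.CSS
  import Meeseeks.XPath

  # Hypothetical sample feed, not taken from the thread.
  @xml """
  <rss>
    <channel>
      <item><title>First</title></item>
      <item><title>Second</title></item>
    </channel>
  </rss>
  """

  # Parse with the XML parser, then select with a CSS selector.
  def titles_css do
    @xml
    |> Meeseeks.parse(:xml)
    |> Meeseeks.all(css("item > title"))
    |> Enum.map(&Meeseeks.text/1)
  end

  # Same query expressed as an XPath selector.
  def titles_xpath do
    @xml
    |> Meeseeks.parse(:xml)
    |> Meeseeks.all(xpath("//item/title"))
    |> Enum.map(&Meeseeks.text/1)
  end
end
```

Both functions should return the same list of title strings, since the selectors describe the same elements.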

OvermindDL1 · June 13, 2017, 2:07pm

Entirely this, HTML5 is not XML (oh I so wish we got XHTML2 instead of HTML5, bleh) and things will absolutely not parse as expected in an XML document for certain tags and certain constructs.

mischov · June 29, 2017, 10:54pm

Release v0.7.1

Fixed a compilation problem involving OTP 20 and NIFs.

mischov · July 13, 2017, 8:52pm

Release v0.7.2

The largest change is that Meeseeks.html/1 and Meeseeks.tree/1 can now be called on a Meeseeks.Document to output the whole document’s HTML or tuple-tree structure.

A bug related to parsing doctypes was also fixed.
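A small sketch of what calling those two functions on a whole document might look like, assuming :meeseeks v0.7.2 or later is available. The module name and input markup are hypothetical, and the exact output depends on how the HTML parser normalizes the input (it may add html, head, and body elements).

```elixir
defmodule DocumentOutputExample do
  # Assumes :meeseeks is listed as a dependency in mix.exs.
  def run do
    # With no parser argument, Meeseeks.parse/1 uses the HTML parser.
    doc = Meeseeks.parse(~s(<div id="main"><p>Hello</p></div>))

    # Both functions accept the Meeseeks.Document itself, not just a result.
    html = Meeseeks.html(doc)
    tree = Meeseeks.tree(doc)

    {html, tree}
  end
end
```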

mischov · August 29, 2017, 5:20pm

Release v0.7.3

A super minor release, just fixing a couple warnings that popped up with Elixir 1.5.

On a tangentially related note, the much slowed rate of releases has nothing to do with my commitment to the project; I just haven’t encountered any bugs or thought of any improvements.

The big next steps for the library (barring the appearance of a bug or improvement) will be chores such as style unification and further improving the documentation (if anybody has felt let down by some portion of the documentation, please speak up). I also plan to release some articles about web scraping with Elixir and Meeseeks (if you have any ideas for examples or articles you wish were written on the subject, get in touch).

OvermindDL1 · August 29, 2017, 6:25pm

Nor have I, I’ve recently used it on my discord/irc bot for web page scraping (instead of my previous regex horrors) and it is such bliss.

brightball · August 29, 2017, 8:52pm

That is a pretty significant endorsement right there.

mischov · August 29, 2017, 9:22pm

It provided me a moment of happiness on what has otherwise been a CSS-filled day.

mischov · September 18, 2017, 6:14pm

Release v0.7.4

This release adds nil input propagation to the extractors, a convenience which enables code like

def media_url(item) do
  Meeseeks.one(item, css("enclosure")) |> Meeseeks.attr("url") ||
  Meeseeks.one(item, css("media|content")) |> Meeseeks.attr("url")
end

The above will work correctly whether item has an enclosure element or not, whereas before it would have thrown an error when it tried piping nil into Meeseeks.attr("url").
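For anyone who wants to try this, here is a self-contained variant of the same function, assuming :meeseeks v0.7.4 or later. The module name, the demo function, and the sample items are made up for illustration.

```elixir
defmodule MediaUrlExample do
  # Assumes :meeseeks is listed as a dependency in mix.exs.
  import Meeseeks.CSS

  # Falls back to the second selector when the first finds nothing,
  # because Meeseeks.attr/2 passes a nil input through as nil.
  def media_url(item) do
    Meeseeks.one(item, css("enclosure")) |> Meeseeks.attr("url") ||
      Meeseeks.one(item, css("media|content")) |> Meeseeks.attr("url")
  end

  def demo do
    # Hypothetical items: one with an enclosure, one with no media at all.
    with_enclosure =
      Meeseeks.parse(~s(<item><enclosure url="http://example.com/a.mp3"/></item>), :xml)

    without_media = Meeseeks.parse("<item><title>No media</title></item>", :xml)

    {media_url(with_enclosure), media_url(without_media)}
  end
end
```

The second item has neither element, so both branches evaluate to nil instead of raising.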

I also fixed a bug related to CSS selector tokenization.

mischov · September 23, 2017, 9:12pm

Release v0.7.5

This release fixes a fun bug (discovered by @aclemmensen) where meeseeks_html5ever was incorrectly panicking when it tried to call remove_from_parent on a node with no parent.

Thank you for your help, Asbjørn!

aclemmensen · September 23, 2017, 9:24pm

You’re very welcome! Thanks for making this library in the first place. It’s a pleasure to work with.

mischov · September 24, 2017, 11:24pm

Release v0.7.6

This release fixes another panic-related problem discovered by @aclemmensen (thanks!), and makes miscellaneous other fixes, generally in an attempt to get to them before Asbjørn does.

mischov · January 27, 2018, 8:32pm

No release this time, though once Rustler releases with a fix for a compilation problem related to ERTS 9.2, I’ll be releasing as well.

Instead, I’d like to get feedback and suggestions for an issue that recently came to my attention regarding how much memory Meeseeks uses when parsing large HTML files (spoiler, it’s a lot).

mischov · February 8, 2018, 9:08pm

Release v0.7.7

Fixes a compilation problem involving OTP 20.2 and NIFs.

OvermindDL1 · February 9, 2018, 3:56pm

Whoooo! And it works.
Thanks much!

brightball · February 9, 2018, 5:31pm

Is there an option to use a streaming parser like an XML SAX parser?

mischov · February 9, 2018, 6:35pm

While I think it would be correct to say that the underlying parsers (html5ever and xml5ever) are streaming parsers, Meeseeks does not support using them in any streaming fashion (Meeseeks expects a document to query).

OvermindDL1 · February 9, 2018, 7:26pm

Maybe using meeseeks_html5ever or so directly could do it?

mischov · February 9, 2018, 7:30pm

Not really, the document is built in the Rust portion of meeseeks_html5ever.

If you wanted a stream of events you’d need to create some kind of consumer (in Rust) for the html/xml5ever parsers that somehow streamed the events into Elixir?

brightball · February 9, 2018, 7:52pm

Just wondering. With XML at least, DOM parsing makes it easier to hop around a document and get various pieces of data, but SAX parsers are tremendously more efficient if you’re scripting a large volume.
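To make the contrast concrete, here is a minimal sketch of SAX-style processing in Elixir. It does not use Meeseeks at all (which, as noted above, has no streaming mode); it assumes OTP's bundled :xmerl_sax_parser instead, handles well-formed XML only, and uses a made-up module name.

```elixir
defmodule SaxCountExample do
  # Counts <item> start tags as events arrive, without ever building
  # a document tree in memory. Uses OTP's xmerl, not Meeseeks.
  def count_items(xml) when is_binary(xml) do
    {:ok, count, _rest} =
      :xmerl_sax_parser.stream(xml,
        event_state: 0,
        event_fun: &handle_event/3
      )

    count
  end

  # xmerl reports element names as charlists.
  defp handle_event({:startElement, _uri, ~c"item", _qname, _attrs}, _location, count),
    do: count + 1

  # Ignore every other event and keep the running count.
  defp handle_event(_event, _location, count), do: count
end
```

The only state carried between events is the integer counter, which is where the memory savings over a DOM-style parse come from.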

I haven’t used meeseeks yet to know if it was an option but just figured I’d toss it out there because of the memory issue on large documents mentioned above.

HTML parsers tend to have to correct a lot of malformed HTML so I didn’t know if SAX was viable there.