Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors


#21

Yep, it parses the XML into a Meeseeks.Document, so CSS or XPath (or custom) selectors just work.

I saw people trying to parse XML with Floki or Meeseeks because they were readily available and easy to use, and it was driving me a bit crazy because the html5ever parser in particular is awful at parsing XML. So, to preserve my sanity, I added an XML parser.
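As a rough illustration of the XML support described above, here is a minimal sketch. It assumes Meeseeks can be fetched with Mix.install (in a Mix project you would add it to your deps instead), and the feed content is made up for the example.

```elixir
# Assumption: run as a standalone script; Mix.install fetches Meeseeks.
Mix.install([:meeseeks])

import Meeseeks.CSS
import Meeseeks.XPath

# Hypothetical feed snippet, invented for this example.
xml = """
<rss>
  <channel>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
</rss>
"""

# Passing :xml selects the XML parser instead of the default HTML parser.
document = Meeseeks.parse(xml, :xml)

# The same document can be queried with either selector style.
titles_css =
  for title <- Meeseeks.all(document, css("item title")), do: Meeseeks.text(title)

titles_xpath =
  for title <- Meeseeks.all(document, xpath("//item/title")), do: Meeseeks.text(title)

IO.inspect(titles_css)
IO.inspect(titles_xpath)
```

Both queries should select the same two title elements; which style to use is mostly a matter of taste.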


#22

Entirely this, HTML5 is not XML (oh I so wish we got XHTML2 instead of HTML5, bleh) and things will absolutely not parse as expected in an XML document for certain tags and certain constructs.


#23

Release v0.7.1

Fixed a compilation problem involving OTP 20 and NIFs.


#24

Release v0.7.2

The largest change is that Meeseeks.html/1 and Meeseeks.tree/1 can now be called on a Meeseeks.Document to output the whole document’s HTML or tuple-tree structure.

A bug related to parsing doctypes was also fixed.
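A small sketch of what calling these on a whole document might look like, assuming Meeseeks is fetched with Mix.install and using a made-up HTML snippet. The exact shape of the output depends on how the HTML parser normalizes the input, so none is shown here.

```elixir
# Assumption: run as a standalone script; Mix.install fetches Meeseeks.
Mix.install([:meeseeks])

# Hypothetical input, invented for this example.
document = Meeseeks.parse(~s(<div id="main"><p>Hello</p></div>))

# Called on a Meeseeks.Document rather than on a single selected result.
whole_html = Meeseeks.html(document)
whole_tree = Meeseeks.tree(document)

IO.puts(whole_html)
IO.inspect(whole_tree)
```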


#25

Release v0.7.3

A super minor release, just fixing a couple warnings that popped up with Elixir 1.5.

On a tangentially related note, the much slower rate of releases has nothing to do with my commitment to the project; I just haven’t encountered any bugs or thought of any improvements.

The big next steps for the library (barring the appearance of a bug or improvement) will be chores such as style unification and further improving the documentation (if anybody has felt let down by some portion of the documentation, please speak up). I also plan to release some articles about web scraping with Elixir and Meeseeks (if you have any ideas for examples or articles you wish were written on the subject, get in touch).


#26

Nor have I, I’ve recently used it on my discord/irc bot for web page scraping (instead of my previous regex horrors) and it is such bliss. :slight_smile:


#27

That is a pretty significant endorsement right there.


#28

It provided me a moment of happiness on what has otherwise been a CSS-filled day. :slight_smile:


#29

Release v0.7.4

This release adds nil input propagation to the extractors, a convenience which enables code like

def media_url(item) do
  Meeseeks.one(item, css("enclosure")) |> Meeseeks.attr("url") ||
  Meeseeks.one(item, css("media|content")) |> Meeseeks.attr("url")
end

The above will work correctly whether item has an enclosure element or not, whereas before it would have thrown an error when it tried piping nil into Meeseeks.attr("url").
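To see the nil propagation in isolation, here is a simplified sketch. FeedExample and enclosure_url/1 are hypothetical names, the XML is invented, and Meeseeks is assumed to be fetched with Mix.install.

```elixir
# Assumption: run as a standalone script; Mix.install fetches Meeseeks.
Mix.install([:meeseeks])

defmodule FeedExample do
  import Meeseeks.CSS

  # Hypothetical helper: returns the enclosure URL, or nil if there is none.
  def enclosure_url(item) do
    item
    |> Meeseeks.one(css("enclosure"))  # nil when nothing matches
    |> Meeseeks.attr("url")            # nil in, nil out
  end
end

with_media =
  Meeseeks.parse(~s(<item><enclosure url="http://example.com/a.mp3"/></item>), :xml)

without_media = Meeseeks.parse("<item><title>No media</title></item>", :xml)

IO.inspect(FeedExample.enclosure_url(with_media))
IO.inspect(FeedExample.enclosure_url(without_media))
```

Because the extractor passes nil through, the second call simply yields nil, which is what lets the `||` fallback in the snippet above work.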

I also fixed a bug related to CSS selector tokenization.


#30

Release v0.7.5

This release fixes a fun bug (discovered by @aclemmensen) where meeseeks_html5ever was incorrectly panicking when it tried to call remove_from_parent on a node with no parent.

Thank you for your help, Asbjørn!


#31

You’re very welcome! Thanks for making this library in the first place. It’s a pleasure to work with. :slight_smile:


#32

Release v0.7.6

This release fixes another panic-related problem discovered by @aclemmensen (thanks!), and makes miscellaneous other fixes, generally in an attempt to get to them before Asbjørn does.


#33

No release this time, though once Rustler releases with a fix for a compilation problem related to ERTS 9.2, I’ll be releasing as well.

Instead, I’d like to get feedback and suggestions for an issue that recently came to my attention regarding how much memory Meeseeks uses when parsing large HTML files (spoiler: it’s a lot).


#34

Release v0.7.7

Fixes a compilation problem involving OTP 20.2 and NIFs.


#35

Whoooo! And it works. :slight_smile:
Thanks much!


#36

Is there an option to use a streaming parser like an XML SAX parser?


#37

While I think it would be correct to say that the underlying parsers (html5ever and xml5ever) are streaming parsers, Meeseeks does not support using them in any streaming fashion (Meeseeks expects a document to query).


#38

Maybe using meeseeks_html5ever or so directly could do it?


#39

Not really, the document is built in the Rust portion of meeseeks_html5ever.

If you wanted a stream of events you’d need to create some kind of consumer (in Rust) for the html/xml5ever parsers that somehow streamed the events into Elixir?


#40

Just wondering. With XML at least, DOM parsing makes it easier to hop around a document and get various pieces of data, but SAX parsers are tremendously more efficient if you’re scraping a large volume.

I haven’t used meeseeks yet to know if it was an option but just figured I’d toss it out there because of the memory issue on large documents mentioned above.

HTML parsers tend to have to correct a lot of malformed HTML so I didn’t know if SAX was viable there.