Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors

mischov · March 27, 2017, 1:44am

```elixir
import Meeseeks.CSS

html = HTTPoison.get!("https://news.ycombinator.com/").body

for story <- Meeseeks.all(html, css("tr.athing")) do
  title = Meeseeks.one(story, css(".title a"))
  %{title: Meeseeks.text(title),
    url: Meeseeks.attr(title, "href")}
end
#=> [%{title: "...", url: "..."}, %{title: "...", url: "..."}, ...]
```

Meeseeks is a library for parsing and extracting data from HTML and XML with CSS or XPath selectors.

GitHub: https://github.com/mischov/meeseeks
HexDocs: https://hexdocs.pm/meeseeks/Meeseeks.html

Features

Friendly API
Browser-grade HTML5 parser
Permissive XML parser
CSS and XPath selectors
Rich, extensible selector architecture
Helpers to extract data from selections

Why?

Meeseeks exists in the same space as an earlier library called Floki, so why was Meeseeks created and why would you use it instead of Floki?

Floki is a couple years older than Meeseeks, so why does Meeseeks even exist?

Meeseeks exists because Floki used to be unable to do what I needed.

When I started learning Elixir I reimplemented a small project I had written in another language. Part of that project involved extracting data from HTML, and unbeknownst to me some of the HTML I needed to extract data from was malformed.

This had never been a problem before because the HTML parser I was using in the other language was HTML5 spec compliant and handled the malformed HTML just as well as a browser. Unfortunately for me, Floki used (and still uses by default) the :mochiweb_html parser which is nowhere near HTML5 spec compliant, and just silently dropped the data I needed when parsing.

Meeseeks started out as an attempt to write an HTML5 spec compliant parser in Elixir (spoiler: it’s really hard), then switched to using Mozilla’s html5ever via Rustler after Hans wrote html5ever_elixir.

Floki gained optional support for using html5ever_elixir as its parser around the same time, but it still used :mochiweb_html (which doesn’t require Rust to be part of the build process) by default and I released Meeseeks as a safer alternative.

Why should I use Meeseeks instead of Floki?

When Meeseeks was released it came with a safer default HTML parser, a more complete collection of CSS selectors, and a more extensible selector architecture than Floki.

Since then Meeseeks has been further expanded with functionality Floki just doesn’t have, such as an XML parser and XPath selectors.

It won’t matter to most users, but the selection architecture is much richer than Floki’s, and permits the creation of all kinds of interesting custom, stateful selectors (in fact, both the CSS and XPath selector strings compile down to the same selector structs that anybody can define).
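As a rough illustration of what a custom selector can look like, here is a sketch in the style of the Meeseeks README. It assumes a recent Meeseeks release in which selectors implement a `match/4` callback; the callback name and arity may differ in the versions discussed in this thread, and `Selector.Text.Contains` is only an example module name.

```elixir
defmodule Selector.Text.Contains do
  # Example custom selector: matches text nodes whose content contains `value`.
  # Assumes the Meeseeks.Selector behaviour with a match/4 callback.
  use Meeseeks.Selector

  alias Meeseeks.Document

  defstruct value: ""

  # Text nodes match when their content contains the selector's value.
  def match(selector, %Document.Text{} = text, _document, _context) do
    String.contains?(text.content, selector.value)
  end

  # Every other node type is rejected.
  def match(_selector, _node, _document, _context) do
    false
  end
end

selector = %Selector.Text.Contains{value: "2"}

# The struct is passed wherever a css(...) or xpath(...) selector would go.
result = Meeseeks.one("<div id=main><p>1</p><p>2</p><p>3</p></div>", selector)
```

Under those assumptions, `result` should wrap the text node inside the second paragraph.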

What probably will matter more to users is the friendly API, extensive documentation, and the attention to the details of usability seen in such places as the custom formatting for result structs (#Meeseeks.Result<{ <p>1</p> }>) and the descriptive errors.

Is Floki ever a better choice than Meeseeks?

Yes, there are two main cases when Floki is clearly a better choice than Meeseeks.

Firstly, if you absolutely can’t include Rust in your build process AND you know that the HTML you’ll be working with is well-formed and won’t require an HTML5 spec compliant parser then using Floki with the :mochiweb_html parser is a reasonable choice.

However, if you have any doubts about the HTML you’ll be parsing you should probably figure out a way to use a better parser because using :mochiweb_html in that situation may be a timebomb.

Secondly, if you want to make updates to an HTML document Floki provides facilities to do so while Meeseeks, which is entirely focused on selecting and extracting data, does not.

How does performance compare between Floki and Meeseeks?

Performance is similar enough between the two that it’s probably not worth choosing one over the other for that reason.

For details and benchmarks, see Meeseeks vs. Floki Performance.

Eiji · March 30, 2017, 3:51pm

I don’t have Rust compiled, so I can’t test it yet, but I have some questions:

Do you want to support the dataset JavaScript API? It could be useful for fetching some data.
You are using some structs, which are good for pattern matching, so how could I change a Meeseeks.Result into (for example) a Meeseeks.Document.Element?
Do you want to support custom CSS selectors? For example, the parent ! selector from CSS 4 Selectors (example: !div.parent > p.child), or user-defined selectors like p:custom-selector, so that custom selectors could be added dynamically, for example by combining a parameter (the CSS selector) with a plug-in (a dynamically loaded custom handler).

mischov · March 30, 2017, 5:14pm

Thank you for your questions.

I haven’t looked into supporting the dataset API, but I can imagine making a helper function to convert a node or result into a map whose keys and values would come from data- attributes.
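A minimal sketch of that idea in plain Elixir, assuming the attributes are available as a list of `{name, value}` tuples. `DatasetSketch` and `from_attributes/1` are hypothetical names for illustration, not part of Meeseeks.

```elixir
defmodule DatasetSketch do
  # Hypothetical helper, not part of Meeseeks.
  # Keeps only attributes whose name starts with "data-" and strips that prefix.
  def from_attributes(attributes) do
    for {"data-" <> key, value} <- attributes, into: %{} do
      {key, value}
    end
  end
end

attributes = [{"class", "story"}, {"data-id", "42"}, {"data-rank", "1"}]

DatasetSketch.from_attributes(attributes)
#=> %{"id" => "42", "rank" => "1"}
```

The generator pattern silently skips attributes that do not start with `data-`, so no separate filter step is needed.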

Do you think that would be enough?

I haven’t provided a helper function to go directly from a result to a node, and maybe I should. Currently you would need to do:

`Meeseeks.Document.get_node(result.document, result.id)`

I am currently only targeting CSS3 selectors, but that might change in the future.

I am undecided on whether I want to support custom selectors in the CSS selector macro I provide, but I can definitely see the utility of allowing users to hook their own custom selectors into the CSS selector syntax.

I have, however, done my best to provide support for custom Meeseeks selectors, and one could without much difficulty adapt the code I use in my `css` macro to make a custom `css` macro, which would be as easy to use as:

```elixir
iex> import Your.CSS # instead of Meeseeks.CSS
...
iex> Meeseeks.all(source, css("!div.parent > p.child"))
...
```

Edit: Brainfart, ! doesn’t mean “not,” it means “select me.” Also, forum has no strike-through?

brightball · March 30, 2017, 5:33pm

Just wanted to give you credit for a great name choice.

Eiji · March 30, 2017, 9:27pm

I have two ideas. Simpler:

```elixir
iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset
       |> Map.fetch!("id") # data-id attribute
       |> String.to_integer
     end
[1, 2, 4, 9, 13]
```

And a version with casting:

```elixir
iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset(cast: :auto)
       |> Map.fetch!("id") # data-id attribute
     end
# or:
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset(cast: %{"id" => :integer}) # cast only data-id attribute
       |> Map.fetch!("id") # data-id attribute
     end
[1, 2, 4, 9, 13]
```

So what we need is to create simple callbacks like:

```elixir
result = Meeseeks.one(story, css(".title a"))
first_node_result = Meeseeks.Document.get_node(result.document, result.id)
second_node_result = Meeseeks.one_node(story, css(".title a"))
assert first_node_result == second_node_result
```

mischov · March 30, 2017, 11:59pm

I opened an issue on dataset, but I’m going to think about node a bit.

At the very least, your Meeseeks.one_node suggestion would need to return {document, node} because otherwise there could be no guarantee that the caller would have the Meeseeks.Document capable of resolving the node ids contained in node. This is why Meeseeks.Result is how it is.

mischov · April 3, 2017, 11:55pm

Release v0.3.1

I added the discussed dataset extractor and now raise a more helpful error when you try to select using a string instead of selectors.
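A short usage sketch of the extractor, assuming `Meeseeks.dataset/1` behaves as described in the current Meeseeks documentation (a map keyed by attribute name with the `data-` prefix removed). The HTML snippet is made up for illustration.

```elixir
import Meeseeks.CSS

# Made-up markup with a single data attribute.
html = ~s(<a href="/item" data-id="42">Story</a>)

id =
  html
  |> Meeseeks.one(css("a"))
  |> Meeseeks.dataset()
  |> Map.fetch!("id")      # data-id attribute
  |> String.to_integer()
```

Values come back as strings, so any casting (here `String.to_integer/1`) is left to the caller.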

I’ve been dipping my toes into Rust and I should have an interesting (performance related) release soon.

OvermindDL1 · April 4, 2017, 2:57pm

Ooo, any details? Are you using Rustler to integrate to the VM or using a port?

mischov · April 4, 2017, 3:01pm

Rustler. I’ve already been using html5ever_elixir which is hansihe’s Rustler NIF for html5ever but I’ve specialized things a bit for Meeseeks.

More details: “Use meeseeks_html5ever instead of html5ever_elixir” (issue #2 on mischov/meeseeks).

mischov · April 9, 2017, 12:21am

Release v0.4.0

The largest change was the switch from html5ever_elixir to meeseeks_html5ever, which was a performance driven change (see the issue for details).

Additionally, the :not() CSS selector now supports lists of selectors.
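For example, a selector list inside `:not()` could be used like this. This is only a sketch with made-up markup, assuming the `css` macro accepts a comma-separated list inside `:not()` as described above.

```elixir
import Meeseeks.CSS

# Made-up markup: three list items with different classes.
html = "<ul><li class=a>1</li><li class=b>2</li><li class=c>3</li></ul>"

# Exclude items matching either .a or .b.
remaining =
  html
  |> Meeseeks.all(css("li:not(.a, .b)"))
  |> Enum.map(&Meeseeks.text/1)
```

With this input, `remaining` should contain only the text of the third item.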

Meeseeks vs. Floki Performance

Since a lot of this release was focused around performance, I put together a benchmark comparing performance between Meeseeks and Floki for a couple real-world-ish scenarios. Benchmarking is tricky, but I’ve done my best to create something useful.

The benchmark itself goes into a lot more detail, but in short the results are:

In the “Wiki Links” benchmark, Meeseeks and Floki perform similarly
In the “Trending JS” benchmark, Floki is about 1.4x slower than Meeseeks

It’s performance benchmarking, though, so take that with a grain of skepticism.

mischov · April 10, 2017, 9:23pm

Release v0.4.1

I recently ran across a bug in the CSS selector tokenizer that was breaking descendant combinators that were followed by a wildcard or pseudo-class, so I wanted to get a patch out for that before it confused somebody.

I also added CI.

mischov · May 13, 2017, 6:47am

Release v0.5.0

XPATH SELECTORS!

They weren’t particularly easy to implement, but the library is stronger for the work that went into them. I doubled my test count trying to make sure the implementation was accurate, but XPath is… interesting and there’s every chance a few bugs slipped by, so let me know if you find one.

Fun fact: in XPath 1.0 it’s valid to take a substring starting at -INFINITY and continuing for INFINITY characters.

This release also fixes some bugs related to element namespaces and a bug related to lack of html5ever version limits that was causing compilation problems for meeseeks_html5ever.

brightball · May 13, 2017, 3:50pm

I would be really interested in seeing benchmarks of these taken while Phoenix is also being hit with a benchmark. Since BEAM does such a good job of balancing concerns I’m interested to see the effect on web requests when the server is running heavy workloads with a NIF.

Buttons840 · May 13, 2017, 5:10pm

Very excited about the xpath support. Thank you!

zambal · May 13, 2017, 5:46pm

Those results would depend entirely on how well behaved the NIF is. In the case of Meeseeks, I think it shouldn’t behave very differently from a normal Elixir library, as it uses a thread pool for parsing, so the impact on the BEAM scheduler should be minimal.

mischov · May 13, 2017, 11:23pm

What @zambal said.

meeseeks_html5ever parses most HTML (all but the smallest) on non-Erlang threads before sending the result to the appropriate process, so it shouldn’t interfere with the BEAM scheduler.

mischov · May 15, 2017, 3:47pm

I added XPath versions to this Meeseeks vs. Floki benchmark.

Results generally suggest XPath selectors are a little slower than CSS selectors, but not so much slower that one should feel bad about using XPath selectors.

What makes a fast XPath selector is different from what makes a fast CSS selector, however. Avoiding early filters and providing elements (i.e. div) instead of wildcards before filters are both simple ways to improve XPath performance.
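To make the wildcard advice concrete, here is a small sketch with made-up markup. Both selectors pick the same links here; the second names the element before the filter. No timings are claimed, only the shape of the two selectors.

```elixir
import Meeseeks.XPath

# Made-up markup for illustration.
html = ~s(<div class="story"><a href="/a">A</a></div><div class="other"><a href="/b">B</a></div>)

# Wildcard before the filter: every element is a candidate for the class test.
with_wildcard = xpath("//*[@class='story']/a")

# Element name before the filter: only div elements are tested.
with_element = xpath("//div[@class='story']/a")

wildcard_links = Meeseeks.all(html, with_wildcard)
element_links = Meeseeks.all(html, with_element)
```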

mischov · May 24, 2017, 12:11am

Release v0.6.0

The largest change is the addition of Meeseeks.select, which allows users to provide a custom Meeseeks.Accumulator.

Users should probably stick with Meeseeks.all or Meeseeks.one unless they have a very good reason to need a custom accumulator, but Meeseeks.select is another step forward in making Meeseeks as flexible as possible.
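A rough sketch of calling `Meeseeks.select`, assuming the accumulator is supplied through a context map built with `Meeseeks.Context.add_accumulator/2` as in the current documentation (the exact shape may differ in v0.6.0). It reuses the built-in `Meeseeks.Accumulator.One` rather than defining a custom accumulator.

```elixir
import Meeseeks.CSS

alias Meeseeks.{Accumulator, Context}

# Put an accumulator into an otherwise empty context.
context = Context.add_accumulator(%{}, %Accumulator.One{})

# With Accumulator.One this should behave like Meeseeks.one/2.
result = Meeseeks.select("<div id=main><p>1</p><p>2</p><p>3</p></div>", css("#main p"), context)
```

A custom accumulator would be swapped in at the `add_accumulator` step; the rest of the call stays the same.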

I also added a Document.ProcessingInstruction node type which will never occur if you’re parsing HTML with meeseeks_html5ever (because in HTML5 processing instructions are parsed as comments), but which will occur if you’re parsing from a tuple-tree with {:pi, ...} nodes. I also updated XPath selectors to properly work with Document.ProcessingInstruction nodes.

mischov · June 5, 2017, 8:42pm

Release v0.7.0

Meeseeks now ships with a permissive XML parser based on xml5ever.

```elixir
Meeseeks.parse("<random>XML</random>", :xml)
```

I also updated Meeseeks.data/1 so that it gets the data from CDATA nodes that were (correctly) parsed as comments by the html5ever parser.
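Once parsed as XML, a document can be queried like any other. The following is a sketch with a made-up XML snippet, assuming `Meeseeks.parse/2` returns a document that `Meeseeks.all/2` accepts directly.

```elixir
import Meeseeks.XPath

# Made-up XML for illustration.
xml = """
<catalog>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</catalog>
"""

document = Meeseeks.parse(xml, :xml)

# Collect the text of every title element under a book.
titles =
  for title <- Meeseeks.all(document, xpath("//book/title")) do
    Meeseeks.text(title)
  end
```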

OvermindDL1 · June 12, 2017, 3:12pm

Ooo, general XML with selectors? I may have use for this.