Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors


#1
```elixir
iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get!("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       title = Meeseeks.one(story, css(".title a"))
       %{title: Meeseeks.text(title),
         url: Meeseeks.attr(title, "href")}
     end
[%{title: "...", url: "..."}, %{title: "...", url: "..."}, ...]
```

Meeseeks fills a role similar to Floki's, but puts more emphasis on usability and extensibility.

GitHub: https://github.com/mischov/meeseeks
Hexdocs: https://hexdocs.pm/meeseeks/Meeseeks.html

Features

  • Parses HTML with html5ever

    This is a plus because html5ever is HTML5 spec compliant, but also a minus because it requires that the Rust compiler be installed. Being HTML5 spec compliant was a pretty big deal for me, particularly because mochiweb_html can do unexpected things when trying to parse malformed HTML (it was unusable for some of my purposes).

  • Well fleshed-out CSS and XPath selector implementations

    Chances are that the CSS selector you want to use is supported by Meeseeks. A couple of potentially useful pseudo-classes aren’t implemented, but they are definitely on the “do you really need that?” end of the spectrum.

    Meeseeks also supports most of the XPath 1.0 selector syntax.


  • Extensible selectors

    Need to do something CSS or XPath selectors won’t let you do? No problem! Meeseeks selectors are just structs implementing the Meeseeks.Selector behaviour, and it’s really easy to create custom selectors.

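
To make the "just structs implementing the Meeseeks.Selector behaviour" point concrete, here is a minimal sketch of a custom selector. It assumes a recent Meeseeks release where the `match` callback takes four arguments (older 0.x releases may differ); the `CommentContains` module and the version requirement are made up for illustration.

```elixir
# Assumption: a recent Meeseeks release from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

defmodule CommentContains do
  # Hypothetical selector (not part of Meeseeks): matches comment nodes
  # whose content contains `value`.
  use Meeseeks.Selector

  alias Meeseeks.Document

  defstruct value: ""

  # Only comment nodes can match.
  def match(selector, %Document.Comment{} = node, _document, _context) do
    String.contains?(node.content, selector.value)
  end

  # Every other node type is rejected.
  def match(_selector, _node, _document, _context), do: false
end

html = "<div><!-- TODO: tidy up --><p>done</p></div>"

# A selector struct can be passed wherever css(...) or xpath(...) would go.
found = Meeseeks.one(html, %CommentContains{value: "TODO"})
missing = Meeseeks.one(html, %CommentContains{value: "FIXME"})
```

Here `found` should be a `Meeseeks.Result` wrapping the comment, and `missing` should be `nil` because no comment contains "FIXME".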

Usability

Usability is always a bit subjective, but beyond creating a simple, documented API, I’ve made an effort to provide helpful errors and print useful data in IEx.

For example, Meeseeks.Results show up in IEx like #Meeseeks.Result<{ <p>1</p> }>.

There is still plenty of room for improvement, though, and, as you use the library, if you hit a pain point (like a particularly opaque error) I would like to hear about it.

Performance

The short answer is: if Floki was fast enough for you, Meeseeks should be too.

It appears that Meeseeks can be faster than Floki in some circumstances, but not all. For more details (and some numbers), see this benchmark.

Next Steps

As more people start using the library I hope to get feedback on usability and other issues that will help me continue to improve things, but either way I’ll be rounding off corners when I run into them.

This is 0.x.x software without any production use, so there may be bugs and I won’t promise not to make breaking changes, but it’s working pretty well for me and ready for you to use. Give it a go!


#2

I don’t have Rust installed, so I can’t test it yet, but I have some questions:

  1. Do you want to support the dataset JavaScript API? It could be useful for fetching some data.
  2. You are using some structs, which are good for pattern matching, so how could I convert a Meeseeks.Result to (for example) a Meeseeks.Document.Element?
  3. Do you want to support custom CSS selectors, for example the parent selector (!) from CSS Selectors Level 4, as in !div.parent > p.child? Or user-defined selectors like p:custom-selector, so that custom selectors could be added dynamically, for example by combining a parameter (the CSS selector) with a plug-in (a dynamically loaded custom handler)?

#3

Thank you for your questions.

  1. I haven’t looked into supporting the dataset API, but I can imagine making a helper function to convert a node or result into a map whose keys and values would come from data- attributes. Do you think that would be enough?
  2. I haven’t provided a helper function to go directly from a result to a node, and maybe I should. Currently you would need to do: `Meeseeks.Document.get_node(result.document, result.id)`.
  3. I am currently only targeting CSS3 selectors, but that might change in the future. I am undecided on whether I want to support custom selectors in the CSS selector macro I provide, but I can definitely see the utility of allowing users to hook their own custom selectors into the CSS selector syntax.

I have, however, done my best to provide support for custom Meeseeks selectors, and one could without much difficulty adapt the code I use in my `css` macro to make a custom `css` macro, which would be as easy to use as:

```elixir
iex> import Your.CSS # instead of Meeseeks.CSS
...
iex> Meeseeks.all(source, css("!div.parent > p.child"))
...
```

Edit: Brainfart, ! doesn’t mean “not,” it means “select me.” Also, forum has no strike-through?


#4

Just wanted to give you credit for a great name choice.


#5

I have two ideas. The simpler one:

```elixir
iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get!("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset()
       |> Map.fetch!("id") # data-id attribute
       |> String.to_integer()
     end
[1, 2, 4, 9, 13]
```

and a version with casting:

```elixir
iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get!("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset(cast: :auto)
       |> Map.fetch!("id") # data-id attribute
     end
# or:
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       story
       |> Meeseeks.one(css(".title a"))
       |> Meeseeks.dataset(cast: %{"id" => :integer}) # cast only data-id attribute
       |> Map.fetch!("id") # data-id attribute
     end
[1, 2, 4, 9, 13]
```

So what we need is a simple helper for which this holds:

```elixir
result = Meeseeks.one(story, css(".title a"))
first_node_result = Meeseeks.Document.get_node(result.document, result.id)
second_node_result = Meeseeks.one_node(story, css(".title a")) # proposed helper
assert first_node_result == second_node_result
```

#6

I opened an issue on dataset, but I’m going to think about node a bit.

At the very least, your Meeseeks.one_node suggestion would need to return {document, node} because otherwise there could be no guarantee that the caller would have the Meeseeks.Document capable of resolving the node ids contained in node. This is why Meeseeks.Result is how it is.
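
A small sketch of why the document has to travel with the node: a node only stores the ids of its children, and those ids mean nothing without the `Meeseeks.Document` they came from. This assumes a recent Meeseeks release in which `Meeseeks.Document.Element` has `tag` and `children` fields; the sample HTML is made up.

```elixir
# Assumption: a recent Meeseeks release from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.CSS

result = Meeseeks.one("<div id=main><p>1</p><p>2</p></div>", css("#main"))

# Going from a result to its underlying node needs both the document and the id.
node = Meeseeks.Document.get_node(result.document, result.id)

# The node holds child ids, not child nodes, so resolving them
# requires the same document again.
child_tags =
  node.children
  |> Enum.map(&Meeseeks.Document.get_node(result.document, &1))
  |> Enum.map(& &1.tag)
```

With this input, `node.tag` should be `"div"` and `child_tags` should be `["p", "p"]`.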


#7

Release v0.3.1

I added the discussed dataset extractor and now raise a more helpful error when you try to select using a string instead of selectors.
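
For anyone wondering what the dataset extractor returns, here is a minimal sketch. It assumes a recent Meeseeks release, where (as far as I can tell from the docs) keys are camelCased the same way the JavaScript dataset API does it and values stay strings; the sample HTML and attribute names are made up.

```elixir
# Assumption: a recent Meeseeks release from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.CSS

html = ~s(<div id="story" data-id="42" data-story-rank="1"></div>)

result = Meeseeks.one(html, css("#story"))

# data-id becomes "id", data-story-rank becomes "storyRank".
dataset = Meeseeks.dataset(result)

# Values are not cast, so convert explicitly where a number is needed.
id = dataset |> Map.fetch!("id") |> String.to_integer()
```

Since no casting happens inside `Meeseeks.dataset/1`, the explicit `String.to_integer/1` step from the earlier proposal is still needed.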

I’ve been dipping my toes into Rust and I should have an interesting (performance related) release soon.


#8

Ooo, any details? Are you using Rustler to integrate to the VM or using a port? :)


#9

Rustler. I’ve already been using html5ever_elixir which is hansihe’s Rustler NIF for html5ever but I’ve specialized things a bit for Meeseeks.

More details: https://github.com/mischov/meeseeks/issues/2


#10

Release v0.4.0

The largest change was the switch from html5ever_elixir to meeseeks_html5ever, which was a performance-driven change (see the issue for details).

Additionally, the :not() CSS selector now supports lists of selectors.

Meeseeks vs. Floki Performance

Since a lot of this release was focused around performance, I put together a benchmark comparing performance between Meeseeks and Floki for a couple real-world-ish scenarios. Benchmarking is tricky, but I’ve done my best to create something useful.

I go into a lot more detail on the benchmark, but in short the results are:

  • In the “Wiki Links” benchmark, Meeseeks and Floki perform similarly
  • In the “Trending JS” benchmark, Floki is about 1.4x slower than Meeseeks

It’s performance benchmarking, though, so take that with a grain of skepticism.


#11

Release v0.4.1

I recently ran across a bug in the CSS selector tokenizer that was breaking descendant combinators that were followed by a wildcard or pseudo-class, so I wanted to get a patch out for that before it confused somebody.

I also added CI.


#12

Release v0.5.0

XPATH SELECTORS!

They weren’t particularly easy to implement, but the library is stronger for the work that went into them. I doubled my test count trying to make sure the implementation was accurate, but XPath is… interesting, and there’s every chance a few bugs slipped by, so let me know if you find one.

Fun fact: in XPath 1.0 it’s valid to take a substring starting at -INFINITY and continuing for INFINITY characters.

This release also fixes some bugs related to element namespaces and a bug related to lack of html5ever version limits that was causing compilation problems for meeseeks_html5ever.


#13

I would be really interested in seeing benchmarks of these taken while Phoenix is also being hit with a benchmark. Since BEAM does such a good job of balancing concerns I’m interested to see the effect on web requests when the server is running heavy workloads with a NIF.


#14

Very excited about the xpath support. Thank you!


#15

Those results would depend entirely on how well-behaved the NIF is. In the case of Meeseeks, I think it shouldn’t behave very differently from a normal Elixir library, as it uses a thread pool for parsing, so the impact on the BEAM scheduler should be minimal.


#16

What @zambal said.

meeseeks_html5ever parses most HTML (all but the smallest) on non-Erlang threads before sending the result to the appropriate process, so it shouldn’t interfere with the BEAM scheduler.


#17

I added XPath versions to this Meeseeks vs. Floki benchmark.

Results generally suggest XPath selectors are a little slower than CSS selectors, but not so much slower that one should feel bad about using XPath selectors.

What makes a fast XPath selector is different than what makes a fast CSS selector, however. Avoiding early filters and providing elements (i.e. div) instead of wildcards before filters are both simple ways to improve XPath performance.
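
As a rough illustration of the wildcard advice, the two selectors below pick out the same nodes from a made-up snippet, but the second names the element before the filter. This is only a sketch against a recent Meeseeks release; it shows the two forms and makes no timing claim.

```elixir
# Assumption: a recent Meeseeks release from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.XPath

html = ~s(<div><p class="hit">1</p><p>2</p><p class="hit">3</p></div>)

# Wildcard before the filter: every element is a candidate for the predicate.
wildcard = html |> Meeseeks.all(xpath("//*[@class='hit']")) |> Enum.map(&Meeseeks.text/1)

# Element name before the filter: only p elements are candidates.
by_tag = html |> Meeseeks.all(xpath("//p[@class='hit']")) |> Enum.map(&Meeseeks.text/1)
```

Both lists should contain the text of the two matching paragraphs; any speed difference is something to measure on your own documents.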


#18

Release v0.6.0

The largest change is the addition of Meeseeks.select, which allows users to provide a custom Meeseeks.Accumulator.

Users should probably stick with Meeseeks.all or Meeseeks.one unless they have a very good reason to need a custom accumulator, but Meeseeks.select is another step forward in making Meeseeks as flexible as possible.

I also added a Document.ProcessingInstruction node type which will never occur if you’re parsing HTML with meeseeks_html5ever (because in HTML5 processing instructions are parsed as comments), but which will occur if you’re parsing from a tuple-tree with {:pi, ...} nodes. I also updated XPath selectors to properly work with Document.ProcessingInstruction nodes.


#19

Release v0.7.0

Meeseeks now ships with a permissive XML parser based on xml5ever.

```elixir
Meeseeks.parse("<random>XML</random>", :xml)
```

I also updated Meeseeks.data/1 so that it gets the data from CDATA nodes that were (correctly) parsed as comments by the html5ever parser.
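
A minimal sketch of selecting from a parsed XML document, assuming a recent Meeseeks release in which `Meeseeks.parse/2` returns the document on success. The feed structure and element names are made up for illustration.

```elixir
# Assumption: a recent Meeseeks release from Hex; the version requirement is illustrative.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.CSS

xml = ~s(<feed><entry id="1"><name>First</name></entry><entry id="2"><name>Second</name></entry></feed>)

# Parse once with the XML parser, then query the resulting document.
doc = Meeseeks.parse(xml, :xml)

entries =
  for entry <- Meeseeks.all(doc, css("entry")) do
    name = Meeseeks.one(entry, css("name"))
    %{id: Meeseeks.attr(entry, "id"), name: Meeseeks.text(name)}
  end
```

The same selectors and extractors used for HTML apply here; only the parser choice changes.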


#20

Ooo, general XML with selectors? I may have use for this. :)