Meeseeks - A library for extracting data from HTML and XML with CSS or XPath selectors

Release v0.8.0

This release:

  • Ensures Elixir 1.6 compatibility
  • Adds a .formatter.exs and formats the project
  • Fixes some typespec errors, thanks to @OvermindDL1 for raising the issue
  • Adds Document.delete_node/2, thanks to @willbarrett for the contribution
  • Adds get_root_ids/1, get_node_ids/1, and fetch_node/2 to Document
  • Improves the safety of many Document functions by raising when node_id does not exist in the Document (previously, they might have raised or might have handled the problem gracefully)
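
As a rough sketch of how the new Document functions fit together: the HTML and the p.ad selector below are made up for illustration, meeseeks is assumed to be a dependency, and the exact error value returned by fetch_node/2 may differ between versions, so the example only matches on the success case.

```elixir
import Meeseeks.CSS
alias Meeseeks.Document

# Illustrative input, not taken from the release notes.
doc = Meeseeks.parse("<div><p class=\"ad\">Buy now</p><p>Content</p></div>")

# Ids of the top-level nodes, and of every node in the document.
root_ids = Document.get_root_ids(doc)
node_ids = Document.get_node_ids(doc)

# A selection result carries the id of the node it points at.
ad = Meeseeks.one(doc, css("p.ad"))

# fetch_node/2 returns a tagged tuple rather than raising.
case Document.fetch_node(doc, ad.id) do
  {:ok, node} -> IO.inspect(node, label: "about to delete")
  _other -> IO.puts("node not found")
end

# delete_node/2 returns a new document without the node.
trimmed = Document.delete_node(doc, ad.id)
remaining_ids = Document.get_node_ids(trimmed)
```

Selecting css("p.ad") against trimmed should no longer find anything, and remaining_ids should be shorter than node_ids.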

Special thanks to Will, who becomes the first code contributor other than myself.

Looking for feedback as to whether I should add Meeseeks.fetch_one and Meeseeks.fetch_all, and what the :error value should look like if I do.

@mischov Is there an easy way to get the XPath of a found node based on that node's node_id? I simply need to get the path for a given piece of text, then get the element at that path on another page to check whether they match. All the info I need is there, but to be honest I don't want to reinvent the wheel :wink:

I was unable to find any such functionality, though. If it doesn't exist, I'm curious whether you would accept a PR for it: getting an XPath based on the id of a node. I'm not saying I will find time to write it soon, just exploring options here :wink:

Ha! That's an interesting one!

No, that functionality doesn't exist in Meeseeks.

I would be open to a PR for that functionality (probably as a Document function, but perhaps as an extractor too).

Cool, I'll keep that in mind when writing stuff for the app I need to write.

Release v0.9.0

Meeseeks.Error

Prior to v0.9.0, errors in Meeseeks were all over the place: sometimes functions returned {:error, string} or :error, sometimes they raised RuntimeErrors or ArgumentErrors or one of an assortment of custom Meeseeks exceptions.

To combat this, I have added a Meeseeks.Error struct that implements the Exception behaviour and used it throughout the library.

I go into more detail about the rationale and implementation in this issue, but the quick takeaway is that this kind of error struct is flexible, plays nicely with pattern matching in places like case and with, and makes it easier to provide useful errors to users.

This is a breaking change because it modifies the returned or raised type of errors. If your Meeseeks-related code handles {:error, ???} or catches one of the old Meeseeks exception types, you will need to make changes.

I apologize for the inconvenience, but this change should lead to safer, more friendly code in the future.
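
To make the migration concrete, here is a minimal sketch of handling the new struct in one place. The PageTitle module and its fetch/1 function are hypothetical names for illustration, and the example leans on Meeseeks.fetch_one, which was added in this same release.

```elixir
defmodule PageTitle do
  import Meeseeks.CSS

  # Hypothetical helper: returns {:ok, title} or {:error, {type, reason}}.
  def fetch(html) do
    with %Meeseeks.Document{} = doc <- Meeseeks.parse(html),
         {:ok, title} <- Meeseeks.fetch_one(doc, css("title")) do
      {:ok, Meeseeks.text(title)}
    else
      # Parse and select failures both arrive as the same struct,
      # so a single clause can match on its type and reason fields.
      {:error, %Meeseeks.Error{type: type, reason: reason}} ->
        {:error, {type, reason}}
    end
  end
end

found = PageTitle.fetch("<html><head><title>Hi</title></head><body></body></html>")
missing = PageTitle.fetch("<p>no title here</p>")
```

Because Meeseeks.Error implements the Exception behaviour, the same struct could also be passed to raise instead of being returned.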

Meeseeks.fetch_all and Meeseeks.fetch_one

The more that I use Elixir in anger, the more I appreciate functions that return {:ok, ...} or {:error, ...}.

In light of the feedback I received on this issue, I decided to add Meeseeks.fetch_all and Meeseeks.fetch_one. They work like Meeseeks.all and Meeseeks.one respectively, but wrap the result in {:ok, ...} if there is a match or return {:error, %Meeseeks.Error{type: :select, reason: :no_match}} if there is not.

Now it's easier to write code like:

with {:ok, qt} <- Meeseeks.fetch_one(doc, css(".qt")) do
  ...
else
  {:error, %Meeseeks.Error{type: :select, reason: :no_match}} ->
    ...
end

My thanks to those who provided feedback.

Other

A bug related to Meeseeks.html was fixed; see this issue for more details.

Just popping in to say that I'm still loving Meeseeks. I think I've used every selector it has (and made one as well) for parsing large amounts of both HTML and XML. ^.^

Release v0.9.1

A small release fixing a couple of bugs and some typespec problems, primarily thanks to work by @asonge.

The first bug fix is that Document.get_nodes/1 now raises instead of adding a nil to the returned nodes if a node is - impossibly - not found in the document.

The second bug fix is that Document.get_nodes/2 now actually works right.

Release v0.9.2

Super tiny update to allow the css and xpath macros to accept vars.

iex> import Meeseeks.XPath
Meeseeks.XPath
iex> path = "//li[last()]"
"//li[last()]"
iex> xpath(path)
%Meeseeks.Selector.Element{...}

It is worth noting, however, that using a var (or string interpolation) in the css or xpath macros moves the creation of the selector to run time, while using a static string literal allows it to be created at compile time. If your use case permits, prefer xpath("//li[last()]").
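
A small sketch of the two styles side by side. The ListSelectors module and its function names are hypothetical, and the markup in the usage lines is made up for illustration.

```elixir
defmodule ListSelectors do
  import Meeseeks.XPath

  # Static string literal: the selector can be created at compile time.
  def last_item, do: xpath("//li[last()]")

  # String interpolation: the selector string is only known at run time,
  # so the selector is created on every call.
  def nth_item(n) when is_integer(n) and n > 0, do: xpath("//li[#{n}]")
end

doc = Meeseeks.parse("<ul><li>a</li><li>b</li><li>c</li></ul>")

last = doc |> Meeseeks.one(ListSelectors.last_item()) |> Meeseeks.text()
second = doc |> Meeseeks.one(ListSelectors.nth_item(2)) |> Meeseeks.text()
```

For a fixed set of selectors, keeping them as literals in functions like last_item/0 avoids paying the construction cost on each call.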

Release v0.9.3

This release fixes a Dialyzer-related problem identified by @sztosz and correctly diagnosed by @NobbZ. Thanks for your help.

Release v0.9.4

This release fixes some XPath selection bugs discovered by anulman.

Release v0.9.5

This release fixes another selection bug, again discovered by anulman.

Lol, awesome finds by @anulman, and a great update as always. This explains why my bot got update notifications. Thanks much!

Release v0.10.0

This release adds support for OTP 21.

Release v0.10.1

This is a very minor release adding "support" for Elixir 1.7. In truth it's been working fine with 1.7 this whole time, but now Travis CI ensures that fact.

In addition to that, I added a bunch of older Elixir+OTP combinations to also be tested by Travis CI. Meeseeks started out on Elixir 1.3 and OTP 19 (a combination on which it still runs fine, thanks to the awesome Elixir team), and rather than just testing that and the latest combination I now also test some of the combinations between those two.

Release v0.11.0

As of this release Elixir 1.3 is no longer tested or otherwise supported, and Elixir 1.8 is tested and supported. The minimum tested combination is now Elixir 1.4.0 and Erlang/OTP 19.3, and the maximum tested combination is now Elixir 1.8.1 and Erlang/OTP 21.0. Sorry, Elixir 1.3 users, it was time.

This release also pulls in the recently released Meeseeks_Html5ever v0.11.0, which makes parsing faster and more memory efficient on Erlang/OTP 21.

Finally, I'd like to note that as of a couple of days ago Meeseeks is over two years old!

I want to give a big "Thank you!" to all the people who have helped me design, implement, and test Meeseeks over the past couple years. Your collaboration and feedback have made the headaches worthwhile.

In honor of Meeseeks's 2nd birthday, the first post in this thread has been updated and now includes an examination of why Meeseeks was created when Floki already existed in the space, and why you might (or might not) want to use Meeseeks instead of Floki.

I've added an update to the memory use issue discussing the positive impact of Erlang/OTP 21.

I've updated the Meeseeks vs. Floki benchmark to the latest versions of both, as well as the latest Benchee, which means there are now memory numbers.

Meeseeks's edge in speed is even smaller than before - Meeseeks does a little more now than before, and Floki got faster on OTP 21 - though the memory measurements seem to clearly favor Meeseeks.

Release v0.11.1

This release makes a couple of small improvements.

Firstly, Meeseeks now returns a better error when trying to parse a string of non-UTF-8 encoded content. Thank you to anulman for reporting that problem.

Secondly, parse/2 now accepts :tuple_tree as a type, and parsing tuple trees with parse/1 has been soft-deprecated and will emit a warning. Additionally, parsing tuple trees with unexpected node structure will now return a parse error instead of silently ignoring the node. Thank you to @axelson for reporting the parsing issue and discussing whether parse/1 should accept tuple trees.
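
For anyone migrating, a minimal sketch of the explicit form. The tuple tree below is made up for illustration and follows the usual {tag, attributes, children} shape.

```elixir
import Meeseeks.CSS

# Illustrative tuple tree: {tag, attributes, children}.
tuple_tree = {"div", [{"class", "greeting"}], ["Hello"]}

# Passing the type explicitly avoids the deprecation warning from parse/1.
doc = Meeseeks.parse(tuple_tree, :tuple_tree)

greeting = doc |> Meeseeks.one(css("div.greeting")) |> Meeseeks.text()
```

A tree with an unexpected node structure should come back as an {:error, %Meeseeks.Error{}} tuple rather than a document, so it is worth matching on the result of parse/2 before selecting.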
