FastRSS - A very quick RSS feed parser

FastRSS

Parse RSS feeds very quickly:

  • This is a Rust NIF built using Rustler
  • Uses the rss Rust crate to do the actual RSS parsing

Speed

Currently this is much faster than most of the pure Elixir/Erlang packages out there that I tested.

In benchmarks, FastRSS showed speed improvements of anywhere between 6.12x and 50.09x over the next fastest package tested (feeder_ex).

Compared to the slowest elixir options tested (feed_raptor, elixir_feed_parser), FastRSS was sometimes 259.91x faster and used 5,412,308.17x less memory (0.00156 MB vs 8423.70 MB).

See full benchmarks

Usage

There is only one function. It takes an RSS string and returns an {:ok, map()} tuple with string keys.

iex(1)> {:ok, map_of_rss} = FastRSS.parse("...rss_feed_string...")
iex(2)> Map.keys(map_of_rss)
["categories", "cloud", "copyright", "description", "docs", "dublin_core_ext",
 "extensions", "generator", "image", "items", "itunes_ext", "language",
 "last_build_date", "link", "managing_editor", "namespaces", "pub_date",
 "rating", "skip_days", "skip_hours", "syndication_ext", "text_input", "title",
 "ttl", "webmaster"]
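Once parsed, everything is plain maps and lists with string keys. A minimal sketch of pulling data back out (the sample feed below is made up, and the assumption that each entry in "items" is also a string-keyed map with a "title" key is mine, not confirmed above):

```elixir
# Trimmed-down sample of the shape FastRSS.parse/1 returns:
# string keys throughout, with "items" as a list of maps.
feed = %{
  "title" => "My Podcast",
  "items" => [
    %{"title" => "Episode 1"},
    %{"title" => "Episode 2"}
  ]
}

# Pull the title out of every item map
item_titles = Enum.map(feed["items"], & &1["title"])
# item_titles => ["Episode 1", "Episode 2"]
```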

The docs can be found at https://hexdocs.pm/fast_rss.

Supported Feeds

Reading from the following RSS versions is supported:

  • RSS 0.90
  • RSS 0.91
  • RSS 0.92
  • RSS 1.0
  • RSS 2.0
  • iTunes
  • Dublin Core

Links

GitHub: https://github.com/avencera/fast_rss
Hex: https://hex.pm/packages/fast_rss
HexDocs: https://hexdocs.pm/fast_rss/FastRSS.html
Benchmarks: https://github.com/avencera/fast_rss#benchmark

Why?

I needed to parse some podcast RSS feeds from iTunes. At first I tried elixir_feed_parser but I noticed it was a bit slow on some of the larger feeds. Recently, I have also been enjoying working with Rust. I remembered that Rustler was a thing, and I always thought it was interesting. But I never had a chance to use it.

I thought trying to make a Rust NIF to parse RSS feeds would be a fun learning exercise. It turned out to not be too much effort (thanks @hansihe and @scrogson). The hardest problem I had was dealing with some annoying issues when deploying on Alpine.

I wasn’t planning on releasing this as a hex package until I did some benchmarks. The first version was pretty dumb: I would pass the parsed XML data from Rust as stringified JSON and decode it on the Elixir side using Jason, so I wasn’t expecting much in terms of performance. But I was surprised to see it being between 16x and 42x faster. That’s when I decided to release it as a hex package.

Since then I’ve made it a bit smarter (I encode the Rust struct directly into an Elixir map), and I added some other packages to the benchmarks. I’m sure it can still be made much smarter.

Of all the other packages I tested, FeederEx was the fastest pure Elixir/Erlang package. But FastRSS is still 6.12x - 50.09x faster.

11 Likes

I don’t plan on using RSS soon (although plans can change rapidly) but I thank you for posting your project because it gave me a good reference comparison with my own Rustler NIF efforts. :+1:

2 Likes

You’re welcome! Do you have any feedback for me? This is my first NIF and I’m relatively new to Rust, so any feedback is definitely appreciated. Thanks!

Are you respecting the need of the BEAM for NIFs to return within about 1ms?

If you aren’t, the simple solutions would be to either make the NIF behave asynchronously or, if you’re willing to work with a minimum version of Erlang/OTP 20, use a dirty scheduler (which Rustler makes easy).

1 Like

Yes I think this will be a problem on larger RSS feeds. I think I want to handle it async like this: https://github.com/rusterlium/html5ever_elixir/blob/master/lib/html5ever.ex
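For reference, the html5ever_elixir approach linked above has the NIF return immediately while a Rust thread does the work and sends the result back to the calling process as a message. A rough, self-contained sketch of that message-passing shape, with a plain Elixir spawn standing in for the Rust thread (the module, function names, and message tag here are all hypothetical, not FastRSS's actual API):

```elixir
defmodule AsyncNifSketch do
  # In html5ever_elixir the NIF returns immediately, a Rust thread does
  # the parsing off the BEAM schedulers, and the result arrives as a
  # message in the caller's mailbox. A plain spawn stands in for the
  # Rust-side thread here.

  def parse_async(caller, _rss_string) do
    spawn(fn ->
      # ...the real work (the Rust-side parse) would happen here...
      send(caller, {:fast_rss_result, {:ok, %{"title" => "stub feed"}}})
    end)

    :ok
  end

  def await_result(timeout \\ 5_000) do
    receive do
      {:fast_rss_result, result} -> result
    after
      timeout -> {:error, :timeout}
    end
  end
end
```

The caller kicks things off with `AsyncNifSketch.parse_async(self(), rss_string)` and later blocks (with a timeout) in `AsyncNifSketch.await_result/1`, so the scheduler thread is never tied up by the parse itself.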

1 Like

I suggest dirty scheduler unless you know you need to use Erlang/OTP before 20.

This would probably be all you need to change for that:

use rustler::schedule::SchedulerFlags;

rustler::rustler_export_nifs! {
    "Elixir.FastRSS.Native",
    [
        ("parse", 1, parse, SchedulerFlags::DirtyCpu)
    ],
    None
}
1 Like

Good idea, I’ll do that for now.

I’ll run some benchmarks and maybe later do it the other way.

Thank you

Release v0.3.0

Thanks @mischov

3 Likes

It shouldn’t really matter until your system gets loaded down, in which case the dirty schedulers can let other actors keep working instead of freezing up the thread that it was executed on. Long-running NIFs (more than 1ms or so) should always be dirty. :slight_smile:

3 Likes

I actually thought that it would decrease performance. I remember reading something about the performance overhead of dirty schedulers? But was that for the entire BEAM?

They do have a performance overhead, but it’s fairly small overall (although it’s massive compared to work being done in less than 1ms).

1 Like

This article on dirty scheduler overhead (from 2015) suggests their overhead is (or was at the time) something like 10us.

2 Likes

Nice, not bad at all.

That’s awesome, basically nothing in this case.