I would like to parse Atom and RSS feeds in an application I am working on. I’ve looked at the existing packages on Hex to help me with this task and they all seem to be unmaintained. Sure, the world of Atom/RSS feeds isn’t changing much, so once you have a package working, it doesn’t have to change much either. I could fork one of those and fix warnings for the latest Elixir version.
From those packages, fast_rss seems to be the best choice, although I would rather not have to install the Rust compiler only to parse feeds.
I thought about using Elixir Ports to rely on a pre-compiled binary of an RSS/Atom parser from another programming language. This would certainly widen my options, but add complexity.
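For the record, the Port route I had in mind would look roughly like the sketch below, with a hypothetical pre-compiled `feedparse` executable that speaks 4-byte-length-prefixed frames on stdio (Jason just stands in for whatever decoder):

```elixir
xml = File.read!("feed.xml")  # some feed body obtained elsewhere

port =
  Port.open({:spawn_executable, System.find_executable("feedparse")}, [
    :binary,
    {:packet, 4}
  ])

# send one framed request, wait for one framed JSON reply
true = Port.command(port, xml)

result =
  receive do
    {^port, {:data, json}} -> Jason.decode!(json)
  after
    5_000 -> {:error, :timeout}
  end
```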
What would you recommend? Do you have experience with some of the Atom/RSS parsers?
Last time I ran into this I didn’t like any of the options so I just wrote the parsers by hand using Saxy, which worked great.
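A minimal sketch of what that looks like (illustrative, not my actual code): a `Saxy.Handler` that collects `<title>` contents, with CDATA arriving via the `:cdata` event:

```elixir
defmodule FeedTitles do
  @behaviour Saxy.Handler

  # state: {inside_title?, titles collected in reverse order}
  def handle_event(:start_document, _prolog, _state), do: {:ok, {false, []}}

  def handle_event(:start_element, {"title", _attrs}, {_, titles}),
    do: {:ok, {true, titles}}

  def handle_event(:end_element, "title", {_, titles}),
    do: {:ok, {false, titles}}

  # plain text and CDATA content inside a <title>
  def handle_event(:characters, chars, {true, titles}),
    do: {:ok, {true, [chars | titles]}}

  def handle_event(:cdata, chars, {true, titles}),
    do: {:ok, {true, [chars | titles]}}

  def handle_event(:end_document, _data, {_, titles}),
    do: {:ok, Enum.reverse(titles)}

  # ignore every other event
  def handle_event(_event, _data, state), do: {:ok, state}
end

# {:ok, titles} = Saxy.parse_string(xml, FeedTitles, nil)
```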
I am not a fan of the NIF approach for stuff like this because the BEAM’s guarantees are actually very useful for parsing arbitrary internet content in a multitenant system. Throwing them away feels unwise. At some point I will probably write a new RSS library but it’s not a top priority at the moment.
You can probably write a faster parser in Elixir anyway; NIFs come with a huge cost, especially if you are moving data back and forth. Good rule of thumb: most BIFs are very simple. I wonder why…
RSS/Atom feeds are small, simple XML files, and Floki with its pure Elixir parser works very well on them. I have tracked tens of thousands of feeds and have not seen any problems.
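To give an idea, the whole extraction can be as small as this (the selector is illustrative, and `feed_body` is fetched elsewhere):

```elixir
# Floki's default pure-Elixir (Mochiweb) HTML parser, applied to a feed
{:ok, doc} = Floki.parse_document(feed_body)

titles =
  doc
  |> Floki.find("item > title")
  |> Enum.map(&Floki.text/1)
```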
Because they are meant to be timesliced! But yeah, in general if you’re writing code in a beamlang it’s because you want to take the multitenant side of the tradeoffs rather than the batch processing side.
Does the HTML parser support CDATA? I would think not, right?
It’s very common to embed content in RSS feeds using CDATA (e.g. HTML content). Saxy handles it fine.
My impression was that real men would be blowing up their atom tables, but it seems like maybe the :xmerl SAX parser does use strings? I’ll keep that in mind.
It is actually somewhat annoying to parse RSS in practice: the spec is extended in a bunch of places and there is a lot of weird stuff out there in the real world. I do think there’s room for a good RSS package; I just didn’t find one that I liked at the time. I would like to write one, but I currently have a storage-engine-shaped backlog that I’m trying to squash before the new year.
I am a recovering perfectionist (long process). To me you either settle for one good ready-made complete software package, use a library and patch over its imperfections, or roll your own and enrich it on demand.
I have found maintaining other people’s stuff to be thankless work that very rarely yields time and energy savings over long-enough usage, so I end up doing either option one or option three. (And option one is very rarely useful if you want to integrate stuff into your own program, so you’d use e.g. a Port and such, which becomes a hassle after an hour of work.)
And yes, if memory serves, :xmerl does not use atoms, but it has been years and I might not remember well. Same as with JSON: people overdo the parsing conveniences. I often parse JSON as recursive string-keyed maps and just work with that; Elixir’s excellent pattern matching does multiple things at once there and is unbeaten by any language in those areas, except maybe OCaml / Haskell / Rust (if you can stomach the increased number of lines of code).
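To illustrate the string-keyed approach (the JSON shape below is made up):

```elixir
{:ok, %{"items" => items}} = Jason.decode(json)

# The generator pattern destructures and filters at the same time:
# entries missing "title" or "url" are simply skipped.
for %{"title" => title, "url" => url} <- items do
  {title, url}
end
```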
I don’t love dependencies (as you know) but RSS is one of those things where it’s better to have a battle-tested library because there are a lot of edge cases in the real world. HTML also used to be one of those things until everyone sat down, wrote down all of the edge cases (html5), and said “enough”. And then there’s Markdown, where they tried to write down the edge cases but it didn’t work (lol).
I think if I were writing a new RSS parser right now, I would collect a large corpus of feeds and test against a well-worn parser. That should surface the bugs quickly enough.
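Sketched as an ExUnit suite, with `MyParser`, `ReferenceParser`, and the corpus path all hypothetical:

```elixir
defmodule MyParser.CorpusTest do
  use ExUnit.Case, async: true

  # one generated test per collected feed
  for path <- Path.wildcard("test/corpus/*.xml") do
    @path path
    test "matches the reference on #{Path.basename(path)}" do
      xml = File.read!(@path)
      assert MyParser.parse(xml) == ReferenceParser.parse(xml)
    end
  end
end
```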
That’s my option three. I’d literally start collecting RSS and Atom feeds and just extend my newborn library with the edge cases I find when processing the feeds I care about (and I’d say as much in the README of the library). I see no point in bringing in a big dependency “just in case”.
Frak “just in case”. I am done with that.
…Well, not quite, but let’s just say that I am mostly done with it. I am paranoid at work but being a senior and even half-PM and half-CTO, I need to prioritize aggressively. It helps a lot with killing the misguided perfectionism.
Absolutely. I shifted topics to home-grown stuff, my apologies.
I share your reservations: I too think using a NIF for something as trivial as ingesting the occasional XML is heavy-handed. And I am very intrigued by how @derek-zhou wrote his software, though it’s likely commercial and will never be open-sourced.
The only thing that annoys me when parsing outside input with Elixir is that the BEAM stores binaries longer than 64 bytes off-heap as reference-counted binaries, and you have to take special care not to keep sub-binaries pointing into them around, or the large originals will never be GC’d. If my Golang weren’t as rusty as it currently is, I’d likely write anything in that ballpark with it.
Or learn how to copy segments of binaries without keeping the big originals alive and just use :xmerl – that’s what I would do if I absolutely had to stay with the BEAM (which all of us who work with Elixir for money do).
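Concretely, “copying segments” is one extra call (a sketch):

```elixir
defmodule Keep do
  # Extract a small slice from a large feed binary without pinning
  # the whole refc binary in memory.
  def slice(feed_xml, start, len) do
    sub = binary_part(feed_xml, start, len)  # zero-copy sub-binary
    :binary.copy(sub)                        # independent copy, safe to keep
  end
end
```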
Shameless plug here, but take a look at Gluttony. It’s SAX-based, and it’s pretty fast. I haven’t been using it for a while, but the code is stable and pretty extendable / easy to maintain.
On the other hand, if you know what you’re doing, the refc binary slicing behavior is extremely powerful. I managed to design a whole storage engine that never actually deserializes the blocks from disk and only copies out what it needs just-in-time. But more on that later.
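Roughly the flip side of the copy-it-out advice (a sketch):

```elixir
# Matching a length-prefixed record yields sub-binaries that point
# into the original block: no bytes are copied.
block = File.read!("block.bin")  # hypothetical on-disk block

<<len::32, payload::binary-size(len), _rest::binary>> = block
# `payload` shares memory with `block`; copy out (with :binary.copy/1)
# only what must outlive the block
```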
In this case you would indeed have to be careful to copy out the binaries while parsing. I would think libraries like Saxy and Floki do this by default, right?
There are huge advantages to be had staying within the BEAM, and that’s what I was getting at with the NIF comment. In Elixir you can build server-rendered multitenant apps that can just do stuff like fetching an RSS feed from the web and parsing it without any indirection. It’s a very unique value proposition. My understanding is that Go is the closest (goroutines) but they still have a global GC which is not enough.
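e.g. something like this in ordinary application code, with Req as one possible HTTP client and the kind of Saxy handler sketched earlier in the thread:

```elixir
# One process does it all: no sidecar, no queue, no FFI boundary.
{:ok, %Req.Response{body: xml}} = Req.get("https://example.com/feed.xml")
{:ok, titles} = Saxy.parse_string(xml, FeedTitles, nil)
```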
Now you got me interested. This is one of those things that I know are important but never got around to them. If you plan a write-up on the topic then do ping me.
No idea. I would not bet on it unless I specifically went and checked.
Theoretically correct, but in practice Golang’s GC is one of the most iterated-on and polished out there (except maybe the JVM’s). They get GC pauses measured in literal microseconds, even on high-traffic services.
It uses saxy, xmerl, or erlsom. I was actually trying to make my own RSS feed aggregator and reader and got pretty far using the structural parser in my ducharmemp/nosh project on GitHub, but it definitely didn’t reach the compliance standards of Gluttony or other libs.