Recommendations for parsing Atom and RSS feeds?

I was just playing with xmerl this week, and sadly it parses XML into atom-keyed trees.

Yes, we have the option of using xmerl_sax_parser and handling the string parsing ourselves.
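
For illustration, here is a minimal sketch of that approach, assuming a plain RSS 2.0 feed held in a binary. The module name `FeedTitles` and its `titles/1` function are made up for this example; only `:xmerl_sax_parser.stream/2` and its event tuples come from OTP. Element names and text arrive as charlists and are kept as binaries in the state, so no atoms are created from feed content.

```elixir
defmodule FeedTitles do
  # Hypothetical module, for illustration only.
  # In a Mix project, :xmerl must be listed under :extra_applications.

  # Returns the text of every <title> found inside an <item>, as binaries.
  def titles(xml) when is_binary(xml) do
    initial = %{in_item: false, in_title: false, buffer: [], titles: []}

    {:ok, state, _rest} =
      :xmerl_sax_parser.stream(xml, event_fun: &handle_event/3, event_state: initial)

    Enum.reverse(state.titles)
  end

  defp handle_event({:startElement, _uri, ~c"item", _qname, _attrs}, _loc, state),
    do: %{state | in_item: true}

  defp handle_event({:endElement, _uri, ~c"item", _qname}, _loc, state),
    do: %{state | in_item: false}

  # Only titles inside an item are collected; the channel title is skipped.
  defp handle_event(
         {:startElement, _uri, ~c"title", _qname, _attrs},
         _loc,
         %{in_item: true} = state
       ),
       do: %{state | in_title: true, buffer: []}

  # Text can arrive in several chunks (also for CDATA), so accumulate it.
  defp handle_event({:characters, chars}, _loc, %{in_title: true} = state),
    do: %{state | buffer: [state.buffer, chars]}

  defp handle_event({:endElement, _uri, ~c"title", _qname}, _loc, %{in_title: true} = state) do
    title = state.buffer |> List.to_string() |> String.trim()
    %{state | in_title: false, titles: [title | state.titles]}
  end

  # Every other event is ignored and the state passes through unchanged.
  defp handle_event(_event, _loc, state), do: state
end

feed = """
<rss><channel>
  <title>Example feed</title>
  <item><title>First post</title></item>
  <item><title><![CDATA[Second & third]]></title></item>
</channel></rss>
"""

IO.inspect(FeedTitles.titles(feed))
```

The same state-machine shape should extend to other elements (links, dates, Atom `entry`) by adding clauses, at the cost of more bookkeeping than a tree-based parser.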

Thank you for the grounding; I was unsure myself.

Thanks for confirming. Note that there is some further guidance from the ERLEF on using the SAX parser.

CDATA is just a string, and Floki’s built-in parser handles it.

Yes, it is common practice to embed HTML content in feeds. I handle the two most common cases:

  • The HTML DOM tree is inserted directly within the RSS item. This is probably not 100% kosher XML, but it is very easy to handle: I just take the HTML sub-node verbatim.
  • The HTML fragment is embedded as an encoded string. I just take the string and call Floki on it again to parse the HTML structure.

I believe the two cases above cover 99% of the feeds in the wild. They are embarrassingly easy to handle, so I will skip posting my not very elegant code. If you see something more exotic than this, please give me a URL; I’d like to play with it.
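
For anyone who wants a starting point anyway, here is a rough sketch of those two cases. It is not the code referred to above. It assumes Floki is installed and uses `Floki.parse_document/1`, `Floki.find/2`, `Floki.parse_fragment!/1`, and `Floki.raw_html/1`; the helper `FeedBody.body_tree/1`, the sample feed, and the `"item description"` selector are made up for the example.

```elixir
# Assumes Floki can be fetched; Mix.install/2 needs Elixir 1.12+ and network access.
Mix.install([{:floki, "~> 0.36"}])

defmodule FeedBody do
  # Hypothetical helper: takes one node as returned by Floki.find/2
  # and returns the embedded HTML as a Floki tree.
  def body_tree({_tag, _attrs, children}) do
    if Enum.all?(children, &is_binary/1) do
      # Encoded-string case: the first parse already decoded the entities
      # (or unwrapped the CDATA), so parse the resulting text again as HTML.
      children |> Enum.join() |> Floki.parse_fragment!()
    else
      # Direct-insert case: the HTML is already a sub-tree; take it verbatim.
      children
    end
  end
end

feed = """
<rss><channel>
  <item><description><p>Inline <b>markup</b></p></description></item>
  <item><description>&lt;p&gt;Encoded markup&lt;/p&gt;</description></item>
</channel></rss>
"""

{:ok, doc} = Floki.parse_document(feed)

for node <- Floki.find(doc, "item description") do
  node |> FeedBody.body_tree() |> Floki.raw_html() |> IO.puts()
end
```

A description that contains only plain text also goes through the second parse, which simply yields a single text node.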
