Library to safely parse XML (by avoiding random atom creation)

NobbZ · April 29, 2022, 5:35am

I am writing an application where I need to parse XML, and intuitively I reached for sweet_xml, as it’s a wrapper around xmerl.

Though while I was inspecting the parsed result, I realized that attribute and element names appeared as atoms, and as the XML I receive will be beyond my control, I want to avoid “random” atom creation. Some HTML documents that slip through could bring the system down…
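
To make the concern concrete: atoms are never garbage collected, and the VM stops once the atom table is full. The following is a minimal sketch using only the standard library; the tag names are made up and stand in for names read from an untrusted document.

```elixir
# Hypothetical tag names standing in for names found in an untrusted document.
# The unique suffix makes sure they do not already exist as atoms in this VM.
suffix = System.unique_integer([:positive])
names = for i <- 1..100, do: "untrusted_tag_#{suffix}_#{i}"

before_count = :erlang.system_info(:atom_count)

# Roughly what an atom-based parser does for every new name it meets.
Enum.each(names, &String.to_atom/1)

created = :erlang.system_info(:atom_count) - before_count

IO.puts("atoms created: #{created}")
IO.puts("atom limit:    #{:erlang.system_info(:atom_limit)}")

# String.to_existing_atom/1 raises for names the VM has not seen before,
# so unknown input cannot grow the atom table.
result =
  try do
    String.to_existing_atom("never_seen_#{suffix}")
  rescue
    ArgumentError -> :rejected
  end

IO.inspect(result)
```

Every distinct name costs one slot that is never given back, which is why a parser that keeps names as binaries is the safer choice for input you do not control.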

So does anyone have a tip for an XML library I could use instead?

If it can directly transform the XML into a struct, that’s a plus, but not a strict requirement. Though if it can’t do structs directly, XPath is a requirement.

I explicitly do not want approaches that work like Python’s xml2dict.

voltone · April 29, 2022, 11:04am

There’s erlsom (available on Hex), which uses strings rather than atoms for element and attribute names…

josevalim · April 29, 2022, 12:18pm

I have also used saxy successfully. It is especially useful if you want to get only certain parts of the XML.
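
As a rough illustration of picking out only certain parts, here is a sketch of a SAX-style handler in the shape Saxy expects (`handle_event/3` returning `{:ok, state}`). To keep it runnable without dependencies, the events are written by hand instead of coming from the parser; the module name, the OPML-like document and the URLs are all made up for the example.

```elixir
defmodule OutlineHandler do
  # With Saxy installed this module would declare `@behaviour Saxy.Handler`;
  # that line is left out so the sketch runs without any dependency.

  # Keep only the xmlUrl attribute of <outline> elements.
  def handle_event(:start_element, {"outline", attributes}, urls) do
    case List.keyfind(attributes, "xmlUrl", 0) do
      {"xmlUrl", url} -> {:ok, [url | urls]}
      nil -> {:ok, urls}
    end
  end

  # Restore document order once the document has ended.
  def handle_event(:end_document, _data, urls), do: {:ok, Enum.reverse(urls)}

  # Ignore everything else (other elements, text, document start).
  def handle_event(_event, _data, urls), do: {:ok, urls}
end

# Hand-written events for a small, made-up OPML-like document.
# Element and attribute names are binaries, never atoms.
events = [
  {:start_document, []},
  {:start_element, {"opml", []}},
  {:start_element, {"body", []}},
  {:start_element, {"outline", [{"text", "A"}, {"xmlUrl", "https://example.com/a.xml"}]}},
  {:end_element, "outline"},
  {:start_element, {"outline", [{"text", "Folder"}]}},
  {:end_element, "outline"},
  {:start_element, {"outline", [{"text", "B"}, {"xmlUrl", "https://example.com/b.xml"}]}},
  {:end_element, "outline"},
  {:end_element, "body"},
  {:end_element, "opml"},
  {:end_document, {}}
]

# Drive the handler the way a SAX parser would, one event at a time.
urls =
  Enum.reduce(events, [], fn {event, data}, state ->
    {:ok, new_state} = OutlineHandler.handle_event(event, data, state)
    new_state
  end)

IO.inspect(urls)
```

With Saxy as a dependency, the same module could be passed to `Saxy.parse_string(xml, OutlineHandler, [])` instead of reducing over a hand-made event list.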

derek-zhou · April 29, 2022, 3:45pm

I just use Floki with its built-in mochiweb-based parser. Safe enough for me (string keys). It will not validate the XML though.

odix67 · April 29, 2022, 8:10pm

I’ve tried a few XML parser implementations for a “proof of concept” piece of software, and I had the same concerns with xmerl regarding atom creation. Our use case was to process large, very large XML documents, and as I have some SAX knowledge (> 20 years), I finally chose the same as Jose: saxy. Yes, it doesn’t support XPath, it doesn’t support namespaces, it doesn’t support validation and whatnot, but it is quite fast and easy to use. It was my first challenge in Elixir to implement some of the missing things, namely some kind of namespace support (normalizing aliases), building structs, and partial parsing. These tasks weren’t really hard. (I hope these extensions can be open sourced some time, but this has to be clarified.)

Adzz · April 30, 2022, 11:03am

We currently do this at work:

Saxy (to get xmerl)
Then we use sweet_xml with data_schema to query that xmerl.

There is also meeseeks (mischov/meeseeks on GitHub), an Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.

Adzz · April 30, 2022, 11:09am

I would also say the atom thing might not be a problem. You can monitor the atom table and raise the default limit if you need to. It depends on the kinds of responses you are parsing, but atoms will probably reduce the memory footprint compared to binaries; I would guess that’s why xmerl uses them.
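
For reference, monitoring the atom table can be sketched with two `:erlang.system_info/1` calls. The module name and the 0.8 threshold below are made up for illustration; the limit itself is set at VM start with the `+t` emulator flag.

```elixir
defmodule AtomTable do
  # Fraction of the atom table currently in use, between 0.0 and 1.0.
  def usage do
    :erlang.system_info(:atom_count) / :erlang.system_info(:atom_limit)
  end

  # Hypothetical alert check; 0.8 is an arbitrary illustration value.
  def near_limit?(threshold \\ 0.8), do: usage() >= threshold
end

IO.puts("atom table usage: #{Float.round(AtomTable.usage() * 100, 2)}%")
IO.puts("near limit? #{AtomTable.near_limit?()}")
```

A check like this only reports how full the table is; it does not free anything, since atoms are never reclaimed.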

NobbZ · April 30, 2022, 3:19pm

The XMLs that I expect to parse are user-provided, so atoms are out of the question. Reading OPML is sadly a strict requirement, and like RSS it is based on XML, so anything can be in the server’s response.

So trusting those external inputs to not spill my atom table, monitored or not, is out of the question.

Though indeed I started playing with Meeseeks yesterday, as I found some articles online that complained about xmerl’s XPath performance.

So it’ll be Meeseeks for now.

Adzz · April 30, 2022, 6:03pm

Nice. Incidentally, I am working on speeding up and reducing the memory footprint of our XML parsing at work at the moment. If I land on something I can share, I will, though we may have different use cases.

brightball · April 30, 2022, 6:07pm

A SAX parser is definitely what you’re reaching for.

Adzz · April 30, 2022, 6:10pm

Yeah, we are using Saxy already.

brightball · April 30, 2022, 6:20pm

In that case the only likely way to reduce the footprint is to try to clear out whatever is processed before you move to the next step. I’ve never done it before but I’ve thought about it a lot because I expected it might be a challenge with the copy on write approach.

Potentially send out pieces you’re working with to their own process so that the smaller chunks can easily be garbage collected?

Just speculating.

ityonemo · May 1, 2022, 8:19pm

One thing to look out for: if you’re using ETS tables to store binary fragments retrieved from the XML, you might be causing your source XMLs to never get GC’d, since a fragment is just an offset + length into the original binary.
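
The effect can be observed with `:binary.referenced_byte_size/1`, and `:binary.copy/1` is the usual way to detach a fragment before keeping it around. A small sketch, with a generated string standing in for a large source document:

```elixir
# A stand-in for a large source document (about 1 MB of text).
source = String.duplicate("<item>payload</item>", 50_000)

# Slicing gives a sub-binary that still points into `source`.
fragment = binary_part(source, 6, 7)

# :binary.copy/1 produces an independent binary holding only these bytes.
detached = :binary.copy(fragment)

IO.inspect(fragment, label: "fragment")
IO.inspect(byte_size(fragment), label: "fragment size")
IO.inspect(:binary.referenced_byte_size(fragment), label: "bytes kept alive by fragment")
IO.inspect(:binary.referenced_byte_size(detached), label: "bytes kept alive by copy")
```

Storing `detached` rather than `fragment` means the long-lived value no longer references the source document, at the cost of one small copy per stored value.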

Rich_Morin · May 2, 2022, 8:43pm

I find myself in violent agreement with NobbZ’s statement “trusting those external inputs to not spill my atom table, monitored or not, is out of the question.” I also wonder how many Elixir libraries (to say nothing of app-specific modules) are trusting external inputs not to crash the BEAM.

One of the most pleasant things about Elixir is that we don’t have to worry about all of the pitfalls introduced by thread programming. However, it strikes me that there may be a set of Elixir-specific pitfalls. Is there a “best practices” document that lays out potentially unsafe coding practices of this sort? If not, perhaps some of the smarter folks on the list could create one…

-r

cmo · May 2, 2022, 11:17pm

Slightly tangential but there is the efficiency guide.

https://www.erlang.org/doc/efficiency_guide/users_guide.html

voltone · May 3, 2022, 6:54am

crusso · May 3, 2022, 6:17pm

Very cool list of things to inspect and keep in mind. Interesting reading.

markholmes · May 4, 2022, 12:36am

I needed to parse large (1-3 GB) XML documents with hundreds of thousands of records recently. My first attempt was using SweetXML and XPath, but the memory consumption was very high, so I ultimately switched to Saxy, successfully. It was a bit confusing to get started, dealing with pretty complex XML files, but my solution ended up working well.

I’d be happy to go into greater detail if anyone’s interested, but just wanted to second using Saxy.

SweetXML was great for a smaller set of data, it just got unwieldy.

Adzz · May 6, 2022, 1:52pm

Just to call out you can use them in tandem - have your Saxy handler spit out xmerl and then you can use SweetXML to query into that.

Adzz · July 12, 2022, 1:17pm

Just to follow up with what we ended up doing at work after a lot of research.

We now use the default Saxy handler to spit out simple form, then we use data_schema to query the simple form. We implemented our own querying for simple form, which was not too difficult to do in the end. (If you are interested I may be able to open source it.)

This means you can cast the strings in the XML into data values quickly, then traverse the results using Elixir functions rather than xpath.
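
To give an idea of what traversing simple form with plain Elixir functions can look like, here is a minimal sketch. It is not the querying code described above and does not use data_schema; the `SimpleFormQuery` module, its functions and the sample document are all made up. The only assumption about Saxy is the `{name, attributes, children}` tuple shape of simple form, with binaries for names and text.

```elixir
defmodule SimpleFormQuery do
  # Direct child elements with the given name; text nodes (binaries) are skipped.
  def children({_name, _attrs, children}, name) do
    for {^name, _, _} = child <- children, do: child
  end

  # Follows a path of element names below `element` and returns all matches.
  def at(element, []), do: [element]

  def at(element, [name | rest]) do
    element |> children(name) |> Enum.flat_map(&at(&1, rest))
  end

  # Concatenated direct text of an element.
  def text({_name, _attrs, children}) do
    children |> Enum.filter(&is_binary/1) |> Enum.join()
  end

  # Attribute value by name, or nil when the attribute is absent.
  def attr({_name, attrs, _children}, key) do
    case List.keyfind(attrs, key, 0) do
      {^key, value} -> value
      nil -> nil
    end
  end
end

# Hand-written simple form for a made-up document:
# <feed>
#   <entry id="1"><title>First</title><views>10</views></entry>
#   <entry id="2"><title>Second</title><views>32</views></entry>
# </feed>
doc =
  {"feed", [],
   [
     {"entry", [{"id", "1"}], [{"title", [], ["First"]}, {"views", [], ["10"]}]},
     {"entry", [{"id", "2"}], [{"title", [], ["Second"]}, {"views", [], ["32"]}]}
   ]}

# Cast the strings into data values while walking the tree.
entries =
  for entry <- SimpleFormQuery.at(doc, ["entry"]) do
    [title] = SimpleFormQuery.at(entry, ["title"])
    [views] = SimpleFormQuery.at(entry, ["views"])

    %{
      id: String.to_integer(SimpleFormQuery.attr(entry, "id")),
      title: SimpleFormQuery.text(title),
      views: String.to_integer(SimpleFormQuery.text(views))
    }
  end

IO.inspect(entries)
```

The map keys here are atoms written in the source code, so nothing from the document itself ever becomes an atom.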

Not using xmerl and SweetXML drastically reduced our memory impact (we are talking from 1 GB to 80 MB for large XMLs) and sped it all up by about 12 times.

On top of that I’ve also implemented a custom Saxy handler that will only create the intermediate simple_form representation for values that are required by a data schema. So you pass a schema to the saxy handler then Saxy only creates simple_form for the values that the schema wants.

This keeps the memory impact very low as it never spikes higher than what is required by the schema. It adds a little complexity though.