Library to safely parse XML (by avoiding random atom creation)

I am writing an application where I need to parse XML, and intuitively I reached for sweet_xml, as it’s a wrapper around xmerl.

While inspecting the parsed result, though, I realized that attribute and element names appear as atoms. As the XML I receive will be beyond my control, I want to avoid “random” atom creation. Some HTML documents that slip through could bring the system down…

So does anyone have a tip for an XML library I could use instead?

If it can directly transform the XML into a struct, that’s a plus, but not a strict requirement. If it can’t do structs directly, though, XPath is a requirement.

I explicitly do not want approaches that work like Python’s xml2dict.

3 Likes

There’s erlsom (on Hex), which uses strings rather than atoms for element and attribute names…

6 Likes

I have also used saxy successfully. It is especially useful if you want to get only certain parts of the XML.
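To illustrate the "only certain parts" idea, here is a minimal sketch of a Saxy handler for an OPML-like document. The module name `FeedUrlHandler`, the sample document, and the version requirement in `Mix.install` are assumptions made up for this example; the handler simply keeps the `xmlUrl` attribute of each `outline` element and skips every other event.

```elixir
# Assumes the Saxy package from Hex; the version requirement is a guess.
Mix.install([{:saxy, "~> 1.5"}])

defmodule FeedUrlHandler do
  # Hypothetical handler: collects the "xmlUrl" attribute of every
  # <outline> element and ignores everything else in the document.
  @behaviour Saxy.Handler

  @impl true
  def handle_event(:start_element, {"outline", attributes}, urls) do
    # Names and values arrive as binaries, so nothing is turned into an atom.
    case List.keyfind(attributes, "xmlUrl", 0) do
      {"xmlUrl", url} -> {:ok, [url | urls]}
      nil -> {:ok, urls}
    end
  end

  # Every other event (document start/end, other elements, text) is skipped.
  def handle_event(_event, _data, urls), do: {:ok, urls}
end

opml = """
<opml version="2.0">
  <body>
    <outline text="Elixir" xmlUrl="https://example.com/elixir.xml"/>
    <outline text="Folder">
      <outline text="Erlang" xmlUrl="https://example.com/erlang.xml"/>
    </outline>
  </body>
</opml>
"""

{:ok, urls} = Saxy.parse_string(opml, FeedUrlHandler, [])
feed_urls = Enum.reverse(urls)
```

Because the handler state is just a list of binaries, memory use should track the number of matching elements rather than the size of the document.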

12 Likes

I just use Floki with its built-in mochiweb-based parser. Safe enough for me (string keys). It will not validate the XML though.

3 Likes

I’ve tried a few XML parser implementations for a “proof of concept” piece of software, and I had the same concerns about atom creation with xmerl. Our use case was processing large, very large XML documents, and as I have some SAX knowledge (> 20 years), I finally chose the same as Jose: saxy. Yes, it doesn’t support XPath, it doesn’t support namespaces, it doesn’t support validation and whatnot, but it is quite fast and easy to use. It was my first challenge in Elixir to implement some missing things, namely some kind of namespace support (normalizing aliases), building structs, and partial parsing, and these tasks weren’t really hard. (I hope these extensions can be open sourced some time, but this has to be clarified.)

2 Likes

We currently do this at work:

Saxy (to get xmerl)
Then we use sweet_xml with data_schema to query that xmerl.

There is also meeseeks (GitHub: mischov/meeseeks), an Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.

3 Likes

I would also say the atom thing might not be a problem. You can monitor the atom table and increase the default limit if you need to. It depends on the kinds of responses you are parsing, but atoms will probably reduce the memory footprint compared with binaries; I would guess that’s why xmerl uses them.
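For reference, a rough sketch of what monitoring the atom table could look like. `AtomTable` is a hypothetical helper module, not part of any library; it relies only on `:erlang.system_info/1` with `:atom_count` and `:atom_limit`. The second half shows why untrusted names are the concern: every previously unseen name passed to `String.to_atom/1` adds an entry that is never garbage collected.

```elixir
defmodule AtomTable do
  # Hypothetical helper module, not part of any library.

  # Returns how full the atom table is as a fraction between 0.0 and 1.0.
  def usage do
    :erlang.system_info(:atom_count) / :erlang.system_info(:atom_limit)
  end

  # True once usage reaches the given threshold (default 80%).
  def near_limit?(threshold \\ 0.8), do: usage() >= threshold
end

before = :erlang.system_info(:atom_count)

# Simulate 100 element names that the VM has never seen before.
for i <- 1..100 do
  String.to_atom("untrusted_tag_#{System.unique_integer([:positive])}_#{i}")
end

grown_by = :erlang.system_info(:atom_count) - before
```

The limit itself is set with the `+t` emulator flag, for example `elixir --erl "+t 2000000"`. Raising it only delays the problem when the names come from user input.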

2 Likes

The XMLs that I expect to parse are user provided, so atoms are out of the question. Reading the OPMLs is sadly a strict requirement, and, similar to how RSS is based on XML, anything can be in the server’s response there too.

So trusting those external inputs to not spill my atom table, monitored or not, is out of the question.

Though indeed I started playing with Meeseeks yesterday, as I found some articles online that complained about xmerl’s XPath performance.

So it’ll be Meeseeks for now.

3 Likes

Nice. Incidentally, I am working on speeding up and reducing the memory footprint of our XML parsing at work at the moment. If I land on something, I can share it, though we may have different use cases.

1 Like

A SAX parser is definitely what you’re reaching for.

1 Like

Yeah, we are using saxy already.

1 Like

In that case the only likely way to reduce the footprint is to try to clear out whatever has been processed before you move to the next step. I’ve never done it before, but I’ve thought about it a lot because I expected it might be a challenge with the copy-on-write approach.

Potentially send out pieces you’re working with to their own process so that the smaller chunks can easily be garbage collected?

Just speculating.
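A rough sketch of that idea, in the same speculative spirit: each chunk is handled in a short-lived process so that whatever garbage the work produces disappears when the process exits. `ChunkWorker` and the sample chunks are hypothetical stand-ins; whether this helps in practice depends on how much data has to be copied into the worker in the first place.

```elixir
defmodule ChunkWorker do
  # Hypothetical helper: runs `fun` on one chunk inside a short-lived process
  # and returns only the (ideally small) result to the caller.
  def process(chunk, fun) when is_function(fun, 1) do
    task = Task.async(fn -> fun.(chunk) end)
    Task.await(task, :infinity)
  end
end

# Stand-in for parsed records; in practice these would come from the parser.
chunks = [["a", "b"], ["c"], ["d", "e", "f"]]

counts = Enum.map(chunks, &ChunkWorker.process(&1, fn records -> length(records) end))
```

The worker’s heap is reclaimed as a whole when the task finishes, so only the returned value stays with the caller.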

1 Like

One thing to look out for if you’re using ETS tables to store binary fragments retrieved from the XML: you might be causing your source XMLs to never get GC’d, since a fragment is just an offset+length on the original binary.
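A small sketch of the usual workaround, assuming the fragments are plain binaries: copy each fragment with `:binary.copy/1` before storing it, so that the table entry no longer points into the source document. The table name, key, and sample data are made up for the example.

```elixir
# Stand-in for a large XML document held as one binary.
source = String.duplicate("<item>payload</item>", 10_000)

# Matching out a piece may yield a sub-binary that points into `source`.
<<fragment::binary-size(20), _rest::binary>> = source

table = :ets.new(:fragments, [:set, :private])

# :binary.copy/1 makes a standalone binary, so the stored value no longer
# keeps the whole source document alive.
detached = :binary.copy(fragment)
:ets.insert(table, {:first_item, detached})

[{:first_item, stored}] = :ets.lookup(table, :first_item)
```

`:binary.referenced_byte_size/1` can be used to check how much memory a given binary is actually holding on to.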

1 Like

I find myself in violent agreement with NobbZ’s statement “trusting those external inputs to not spill my atom table, monitored or not, is out of the question.” I also wonder how many Elixir libraries (to say nothing of app-specific modules) are trusting external inputs not to crash the BEAM.

One of the most pleasant things about Elixir is that we don’t have to worry about all of the pitfalls introduced by thread programming. However, it strikes me that there may be a set of Elixir-specific pitfalls. Is there a “best practices” document that lays out potentially unsafe coding practices of this sort? If not, perhaps some of the smarter folks on the list could create one…

-r

1 Like

Slightly tangential but there is the efficiency guide.

https://www.erlang.org/doc/efficiency_guide/users_guide.html

4 Likes

Very cool list of things to inspect and keep in mind. Interesting reading.

I needed to parse large (1-3 GB) XML documents with hundreds of thousands of records recently. My first attempt used SweetXML and XPath, but the memory consumption was very high, so I ultimately switched to Saxy, successfully. It was a bit confusing to get started, dealing with pretty complex XML files, but my solution ended up working well.

I’d be happy to go into greater detail if anyone’s interested, but just wanted to second using Saxy.

SweetXML was great for a smaller set of data, it just got unwieldy.

2 Likes

Just to call out that you can use them in tandem: have your Saxy handler spit out xmerl and then you can use SweetXML to query into that.

2 Likes

Just to follow up with what we ended up doing at work after a lot of research.

We now use the default Saxy handler to spit out simple form, then we use data_schema to query the simple form. We implemented our own querying for simple form, which was not too difficult to do in the end. (If you are interested, I may be able to open source it.)

This means you can cast the strings in the XML into data values quickly, then traverse the results using Elixir functions rather than xpath.
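To give an idea of what traversing simple form with plain Elixir functions can look like, here is a minimal sketch. It is not the querying code described above and does not use data_schema; `SimpleFormQuery`, the sample document, and the version requirement are assumptions for illustration. It relies on `Saxy.SimpleForm.parse_string/1` returning nested `{name, attributes, children}` tuples with binaries for names, values, and text.

```elixir
# Assumes the Saxy package from Hex; the version requirement is a guess.
Mix.install([{:saxy, "~> 1.5"}])

defmodule SimpleFormQuery do
  # Hypothetical helpers for walking Saxy's simple form, where every element
  # is a `{name, attributes, children}` tuple and text nodes are binaries.

  # Direct child elements with the given tag name (text nodes are skipped).
  def children({_name, _attrs, children}, tag) do
    for {^tag, _, _} = child <- children, do: child
  end

  # Concatenated text directly inside an element.
  def text({_name, _attrs, children}) do
    children |> Enum.filter(&is_binary/1) |> Enum.join()
  end
end

xml = """
<orders>
  <order id="1"><total>19.90</total></order>
  <order id="2"><total>5.10</total></order>
</orders>
"""

{:ok, root} = Saxy.SimpleForm.parse_string(xml)

orders =
  for {_, attrs, _} = order <- SimpleFormQuery.children(root, "order") do
    [total] = SimpleFormQuery.children(order, "total")
    {"id", id} = List.keyfind(attrs, "id", 0)
    # Cast the strings into data values right away.
    %{id: String.to_integer(id), total: String.to_float(SimpleFormQuery.text(total))}
  end
```

Simple form keeps whitespace between elements as text nodes, which is why the helpers match on tuples instead of assuming every child is an element.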

Not using xmerl and SweetXML drastically reduced our memory impact (we are talking from 1 GB to 80 MB for large XMLs) and sped it all up by about 12 times.

On top of that, I’ve also implemented a custom Saxy handler that will only create the intermediate simple_form representation for values that are required by a data schema. So you pass a schema to the Saxy handler, and Saxy then only creates simple_form for the values that the schema wants.

This keeps the memory impact very low as it never spikes higher than what is required by the schema. It adds a little complexity though.

2 Likes