Streaming XML to file

Hello :slight_smile:

We are trying to export a “huge” amount of data from a database as XML.
Our XML exporter service runs on a cluster with a limited amount of memory (e.g. 4GiB of RAM), so we cannot keep the whole XML document in memory.
Does anyone know a good Elixir (or Erlang) library that lets us stream the data, so we don’t have to hold the whole document in memory before writing it?

Maybe this search will help https://www.google.com/search?client=firefox-b-d&ei=dFQcXtWTOI-uUoK0togM&q=elixir+stream+xml+async&oq=elixir+stream+xml+async&gs_l=psy-ab.3...88881.89326..89677...1.0..0.106.410.1j3…0…1…gws-wiz.3N-Syrda4Qg&ved=0ahUKEwjVndS9vIDnAhUPlxQKHQKaDcEQ4dUDCAo&uact=5

This emerged https://github.com/processone/fast_xml

This does not help. Its description reads:

Fast Expat based Erlang XML parsing library.

I’m looking for a writer, not a parser.

I hate to bring this suffering upon you but for excellent performance you should definitely just use Erlang’s :xmerl. It’s not exactly easy to work with, ask away if you get stuck.

It does support streaming, both reading and writing, so RAM will not be a problem.

Thanks, I’ll look for it. IMHO there is a huge lack of documentation about how to use it for streaming purposes. Are you aware of any useful information?

I have a few scattered links that can help you get started. xmerl is not very intuitive for a newcomer so I see some cursing in your future. :003:


Notes on xmerl and processing XML in Erlang. This one can help you understand how you can construct :xmerl XML nodes programmatically (as opposed to receiving them when stream-parsing a file) and then stream them to a destination file. You’ll likely have to learn to import and use Erlang records in Elixir at some point, but mind you, this is only for reading Erlang records from Elixir, not for writing them.

It’s quite simple:

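defmodule Meh.Xmerl do
  # Record.defrecord/2 is a macro, so the Record module must be required first.
  require Record

  # Pull the :xmlElement record definition out of xmerl's header file.
  Record.defrecord(
    :xmlElement,
    Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl")
  )
end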

Read more about it:


Two blog articles on stream-reading XML files. You’ll need these even if you only write XML:

  1. SAX Xml parsing in Erlang – Part 1
  2. A Simple XML State Machine Accepting SAX Events to Build xmerl Compitable XML Tree: icalendar demo

xmerl's docs on the various hook functions that you can pass to its processors (trust me, you’ll need some of these as well, depending on the complexity of your task):


Basically, xmerl works by accumulating XML elements, texts, namespaces etc. in a reverse-accumulator manner (prepending elements to a list; at least that’s how it works when stream-parsing and processing as you go, I haven’t checked the writing functionality for that idiom), using its own in-memory representation of XML. You can then pass that representation to its export functions if you want a file (or only a single XML element serialised to a charlist).
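As a rough illustration of the "build a node, then export it" step, here is a minimal sketch. It uses xmerl's "simple form" tuples ({tag, attributes, content}) rather than the records mentioned above, which is my own shortcut and not something the linked notes prescribe. The element name and values are made up for the example. In a Mix project you would also need :xmerl in extra_applications.

```elixir
# Simple-form element: {tag, attributes, content}. Text content is a charlist.
# The :item tag and its values are hypothetical example data.
element = {:item, [id: ~c"42"], [String.to_charlist("tea & biscuits")]}

xml =
  [element]
  # Serialise just this element (no XML prolog) using the stock XML callback module.
  |> :xmerl.export_simple_content(:xmerl_xml)
  # xmerl returns a deep list of characters; turn it into a UTF-8 binary.
  |> :unicode.characters_to_binary()

IO.puts(xml)
```

Since only one element is serialised at a time, the memory cost is bounded by the largest single element, not by the whole document.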

As I said before, it’s going to be a ride.

Thanks for your reply :slight_smile: I’m still reading and trying to understand them. Anyway, most of your links are not useful for us right now, since we need to write an XML file without keeping it in memory, while I think most of your links are good for stream parsing (which is the opposite).

You might consider keeping it simple and handling the opening/closing tags yourself. For example, use :xmerl to format chunks of XML into binary or IO list form with Stream.map, then write those to a custom file stream. For the file streams, it might be easiest to write your own in one of a few ways:

  1. Write a custom start tag manually (e.g. with File.write/2), then stream into that file with something like Stream.map(data_stream, &xml_serialize/1) |> Stream.into(file). Alternatively, Stream.transform lets you add start and after functions, so it’d be possible to write a stream accumulator (e.g. a file writer) that automatically writes the root open/close tags when the stream starts and completes.
  2. Write a stream of xml nodes to different files (as xml fragments) and when done either manually concatenate them using Elixir or use a standard xml tool (I’m sure there’s something to concat xml fragments on the command line) to concat the files. This would let you parallelize file writing.
  3. Write your xml node stream into a file as an xml fragment. Then when done do a couple of file seek/insert’s to add the root xml tags.
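The first option could look roughly like the sketch below. It is only one way to do it: the module name, the row shape ({id, name}) and the root tag are assumptions made for the example, and the root tags are added with Stream.concat instead of a separate File.write call so that everything goes through a single file stream.

```elixir
defmodule XmlExport do
  # Hypothetical row shape: {integer_id, name_string}.
  # Each row is serialised on its own via xmerl's simple form.
  def row_to_xml({id, name}) do
    [{:row, [id: Integer.to_charlist(id)], [String.to_charlist(name)]}]
    |> :xmerl.export_simple_content(:xmerl_xml)
    |> :unicode.characters_to_binary()
  end

  # Lazily writes prolog + root open tag, then every row, then the root close tag.
  def export(rows, path, root \\ "rows") do
    [["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<", root, ">\n"]]
    |> Stream.concat(Stream.map(rows, &[row_to_xml(&1), "\n"]))
    |> Stream.concat([["</", root, ">\n"]])
    |> Stream.into(File.stream!(path))
    |> Stream.run()
  end
end
```

With a lazy source (for example a database cursor wrapped in a Stream), a call such as XmlExport.export(rows, "out.xml") should only hold one row's worth of XML at a time.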

Personally, I’d consider just using a set of EEx templates to write serializers for your data. It’s really not harder than writing HTML, and since it’s XML you could verify your files against a schema. Using a few recursive to_xml functions and pattern matching you could handle things like “int => <%= @val %>”.
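A small sketch of what the EEx idea might look like. The element names (int, string, list) and the module name are invented for illustration. Note that plain EEx does not escape XML, so the sketch escapes text by hand.

```elixir
defmodule XmlTemplates do
  require EEx

  # Each template is compiled into a private function taking `assigns`,
  # which is what makes `@val` and `@inner` available inside the template.
  EEx.function_from_string(:defp, :int_tpl, "<int><%= @val %></int>", [:assigns])
  EEx.function_from_string(:defp, :string_tpl, "<string><%= @val %></string>", [:assigns])
  EEx.function_from_string(:defp, :list_tpl, "<list><%= @inner %></list>", [:assigns])

  # Pattern matching and guards pick the template; lists recurse.
  def to_xml(val) when is_integer(val), do: int_tpl(val: val)
  def to_xml(val) when is_binary(val), do: string_tpl(val: escape(val))
  def to_xml(vals) when is_list(vals), do: list_tpl(inner: Enum.map_join(vals, &to_xml/1))

  # Minimal escaping for text content.
  defp escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
  end
end

IO.puts(XmlTemplates.to_xml([1, "a & b"]))
```

Each call returns one fragment as a binary, so it can be plugged into a Stream.map step and written out chunk by chunk.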

Elixir’s Stream primitives and IO list mechanisms really simplify much of this problem, so a library might not add much. You’ll still have to consider the size of the various child elements, and you might break them into smaller sub-chunks using the above to ensure your memory usage doesn’t explode. Of course, even writing a little library yourself with a reverse-SAX-style “write tag start” and “write tag end” wouldn’t take too much effort (I’d guess a few hundred lines to cover 99% of useful XML) using a Collectable, stream-based approach.
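To give a feel for the "write tag start" / "write tag end" idea, here is a deliberately tiny sketch. It writes to a plain IO device rather than implementing Collectable, and the module and function names are hypothetical. It ignores namespaces, CDATA, tag name validation and so on.

```elixir
defmodule TinyXmlWriter do
  # Hypothetical helper names; every call writes straight to the device,
  # so nothing is accumulated in memory.
  def start_tag(device, name, attrs \\ []) do
    attr_io =
      for {key, value} <- attrs do
        [" ", to_string(key), "=\"", escape(to_string(value)), "\""]
      end

    IO.binwrite(device, ["<", name, attr_io, ">"])
  end

  def end_tag(device, name), do: IO.binwrite(device, ["</", name, ">"])

  def text(device, text), do: IO.binwrite(device, escape(text))

  defp escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
  end
end

# Example output location, chosen only for the demo.
path = Path.join(System.tmp_dir!(), "tiny_xml_writer_demo.xml")

File.open!(path, [:write], fn device ->
  TinyXmlWriter.start_tag(device, "orders")

  # Any enumerable (including a lazy Stream) can drive the writer.
  Enum.each(1..3, fn id ->
    TinyXmlWriter.start_tag(device, "order", id: id)
    TinyXmlWriter.text(device, "item <#{id}>")
    TinyXmlWriter.end_tag(device, "order")
  end)

  TinyXmlWriter.end_tag(device, "orders")
end)
```

The caller is responsible for balancing start and end tags here; a fuller version would probably track a stack of open tags and close them automatically.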

Best of luck!

The first link talks about constructing :xmerl structures to then serialise to files.

As already mentioned, the rest of the links are for gaining understanding in the library.

BTW, :xmerl has that – for stream parsing and stream writing. I haven’t exercised the writing but the library functions are there.

We had a similar use case, the difference being that we needed to generate xlsx files (which are just zip files containing multiple XML files). Apparently neither Google nor hex.pm shows this library for XML streaming

I’ll have some free time this weekend. If you like, put together a small project on GitHub where you put one (anonymised) big input file and your desired output XML file format. I’d be interested in trying to help with a solution. Data formats are a hobby. :slight_smile:

Thanks for your help.
I’ve already created this project: https://github.com/ispirata/xml_stream_writer .
I will push it to hex.pm this week.

I’ll take a look!