Apologies, a busy day caught up with me. I did a bit of poking at the XML structure, and it’s a bit unclear what you’re looking for in there, since I’m not sure the XPaths actually match up with the structure of release, so sorry for not having much in the way of example code. But to give a tl;dr on how SAX generally works in Elixir: the idea is that it’s a recursive-ish function that carries along some additional state. A special-cased reduce, if you will.
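To make the “special-cased reduce” analogy concrete, here’s a tiny standalone sketch with no parser involved at all. The event tuples are made up for illustration, but the shape is the same: each event plus the current accumulator produces the next accumulator.

```elixir
# A pretend stream of SAX events, hand-written for illustration.
events = [
  {:start_element, "releases"},
  {:start_element, "release"},
  {:end_element, "release"},
  {:end_element, "releases"}
]

# "Handling" the events is just a reduce with pattern-matched clauses.
count =
  Enum.reduce(events, 0, fn
    {:start_element, "release"}, acc -> acc + 1
    _event, acc -> acc
  end)

IO.puts(count)
```

A SAX handler is exactly this, except the library drives the reduce for you as it reads the file.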
So
def handle_event(:start_element, {name, attributes}, state) do
  IO.inspect("Start parsing element #{name} with attributes #{inspect(attributes)}")
  {:ok, [{:start_element, name, attributes} | state]}
end
In this function specifically, [{:start_element, name, attributes} | state] says “prepend the 3-tuple of start_element, name, and attributes onto the head of the state list”. Since that happens for every element in the document, the accumulator grows without bound, and that’s why Saxy runs out of memory. If you want to look at only a specific tag, you’ll have to do some filtering. For example (illustrative only):
def handle_event(:start_element, {name, _attributes}, state) do
  if name == "release" do
    {:ok, %{state | in_release: true}}
  else
    {:ok, state}
  end
end
state in the above example doesn’t have to be a list; it can be anything at all, just like the accumulator in Enum.reduce.
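To show that idea end to end, here’s a runnable sketch of a handler that only collects data while inside a release element. It mimics the Saxy.Handler callback shape but is driven by a plain Enum.reduce over hand-written events, so it runs standalone; the event tuples are stand-ins, and the real Discogs structure may well differ.

```elixir
defmodule ReleaseHandler do
  # Only starts collecting when a <release> opens.
  def handle_event(:start_element, {"release", attrs}, state) do
    {:ok, %{state | in_release: true, current: %{attrs: attrs, chars: []}}}
  end

  # Character data is kept only while we're inside a release.
  def handle_event(:characters, chars, %{in_release: true} = state) do
    {:ok, update_in(state.current.chars, &[chars | &1])}
  end

  # On </release>, finish the current record and stash it.
  def handle_event(:end_element, "release", state) do
    release = %{state.current | chars: Enum.reverse(state.current.chars)}
    {:ok, %{state | in_release: false, current: nil, releases: [release | state.releases]}}
  end

  # Everything else passes the state through untouched.
  def handle_event(_type, _data, state), do: {:ok, state}
end

initial = %{in_release: false, current: nil, releases: []}

# Hand-written stand-in events; a real parser would emit these.
events = [
  {:start_element, {"releases", []}},
  {:start_element, {"release", [{"id", "1"}]}},
  {:characters, "Stockholm"},
  {:end_element, "release"},
  {:end_element, "releases"}
]

final =
  Enum.reduce(events, initial, fn {type, data}, state ->
    {:ok, state} = ReleaseHandler.handle_event(type, data, state)
    state
  end)

IO.inspect(final.releases)
```

The catch-all clause at the bottom is what keeps memory flat: anything you don’t explicitly care about just flows through without touching the accumulator.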
That said, I tried throwing some solutions at your particular problem, and I think this is one of those situations where I’m going to have to be more vague than I’d like. With a 72GB file, you’re going to have to do some engineering to get the data transformed in a reasonable amount of time, or pull in some Rust via Rustler to really let the CPU rip. Even after swapping Saxy (the fastest XML parser when I tested it for my declarative XML parsing lib, GitHub - ducharmemp/saxaboom; check the benchmarks) out for fast_xml, I was still looking at multiple tens of minutes to do basically no real work. Script below for completeness:
Mix.install([:fast_xml])

defmodule CSVHandler do
  use GenServer

  @impl true
  def init(fname) do
    {:ok, File.open!(fname, [:write])}
  end

  # Really rough, I wouldn't recommend productionizing this.
  @impl true
  def handle_info({:"$gen_event", {:xmlstreamelement, {:xmlel, tag, _attrs, _children}}}, state) do
    # This only writes the release tag name, as a vague gauge of how far we've come in the file.
    IO.puts(state, tag)
    {:noreply, state}
  end

  @impl true
  def handle_info(_arg, state) do
    {:noreply, state}
  end
end
{:ok, handler} = GenServer.start_link(CSVHandler, "out.csv")
stream_parser = :fxml_stream.new(handler)
File.stream!("./discogs_20231001_releases.xml")
|> Enum.reduce(stream_parser, fn chunk, parser -> :fxml_stream.parse(parser, chunk) end)
Also, just to do my due diligence: I’d actually recommend against using the library I posted above if you want to expand this parser to hold more data. Looking at the data overall, it really seems like you need a home-rolled solution, and on top of that my lib doesn’t support streaming out data.