I’m having an issue parsing a JSON file that is larger than the available memory.
I know how the JSON is structured and I’d like to “emit” sub-structures based on a query.
The well-known Poison parsing library does not seem to fit my needs.
I’m trying to build an app from Open Data documents, but I can’t afford a server powerful enough to parse them in memory.
I started my own implementation of a SAX parser (basically an event-based parser), but I’m relatively new to Elixir, as you can deduce from the quality of the code.
I managed to emit “structure” events (object_start, array_start, array_end, key, value and so on…), but I’m now having issues building the sub-structures.
Are there already libraries that could help me achieve this? (in case I missed some)
Is GenServer the best way to implement an “event emitter”? (Is it even a valid Elixir pattern?)
Bonus algo question:
Now that I have my “structure” events, do you have any idea of how to build a “structure path” to match the queries, and how to build a valid Elixir structure from those events?
Thank you for your time!
(Don’t hesitate to ask questions if I can be more precise on some points)
Well, jsx is an Erlang application, but I’m pretty sure it can parse JSON on the fly (via its callback interface), so you do not need to hold the whole document in memory.
There might be other libraries, but I’ve not looked.
However, jsx has no ‘query’ syntax like the one you show. You could build one, though: something built on a jsx callback handler that parses out what is requested. That would be a great library to add to hex.pm.
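To make the callback idea a bit more concrete, here is a minimal sketch of a handler that tracks the current path while events arrive. It assumes a handler shaped like init/1 plus handle_event/2 and events such as :start_object, {:key, k}, {:integer, n} and :end_json, which is how I remember jsx's interface; treat those shapes as assumptions and check the jsx README. PathTracker is a hypothetical name, and the events are fed by hand here so the example runs without any dependency.

```elixir
defmodule PathTracker do
  # Hypothetical handler; the event shapes are assumed to resemble jsx's.
  # Each stack frame is {:object, current_key} or {:array, next_index},
  # innermost frame first.
  def init(_args), do: %{stack: [], out: []}

  def handle_event(:start_object, state), do: push(state, {:object, nil})
  def handle_event(:start_array, state), do: push(state, {:array, 0})

  # Closing a container pops its frame and moves the parent array index on.
  def handle_event(event, state) when event in [:end_object, :end_array] do
    %{state | stack: advance(tl(state.stack))}
  end

  def handle_event({:key, key}, %{stack: [{:object, _} | rest]} = state) do
    %{state | stack: [{:object, key} | rest]}
  end

  def handle_event(:end_json, state), do: Enum.reverse(state.out)

  # Any other tagged tuple is treated as a scalar value.
  def handle_event({_type, value}, state) do
    %{state | out: [{path(state.stack), value} | state.out], stack: advance(state.stack)}
  end

  defp push(state, frame), do: %{state | stack: [frame | state.stack]}

  defp advance([{:array, index} | rest]), do: [{:array, index + 1} | rest]
  defp advance(stack), do: stack

  defp path(stack), do: stack |> Enum.reverse() |> Enum.map(fn {_kind, pos} -> pos end)
end

# Events for {"foo": {"bar": [1, 2]}}, written out by hand.
events = [
  :start_object, {:key, "foo"}, :start_object, {:key, "bar"},
  :start_array, {:integer, 1}, {:integer, 2}, :end_array,
  :end_object, :end_object, :end_json
]

events
|> Enum.reduce(PathTracker.init([]), &PathTracker.handle_event/2)
|> IO.inspect()
# prints [{["foo", "bar", 0], 1}, {["foo", "bar", 1], 2}]
```

A query layer would then only have to compare each emitted path against the requested ones, and a plain fold like this needs no process of its own.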
Thank you! It’s almost what I need!
However, I want to emit whole structures that match a query, not only values:
given {foo: {bar: [1,2]}}, the query “foo.bar” should emit [1,2].
If I’m not mistaken, your lib will only match ["foo", "bar", _] and call handle_value twice with 1 and 2
To achieve that I need to pass queries when initializing my “filter” module.
Once a query is matched, the filter enters a “building” mode (I’ll likely use your :jsx_to_term module).
When the matched structure is fully built, I’ll just have to call a handle_value kind of function and remove the query from my state, to keep memory usage low.
I’ll take inspiration from your code; it’s clean and neat.
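A rough sketch of what that “building” mode could look like: a small stack-based builder that is fed the events of one matched value and reports when the value is complete. This is a hand-rolled stand-in, not :jsx_to_term, and the event shapes (:start_array, {:key, k}, {:integer, n} and so on) are assumed to resemble jsx's. TermBuilder and feed/2 are hypothetical names.

```elixir
defmodule TermBuilder do
  # Hypothetical builder: folds the events of ONE value into an Elixir term.
  # feed/2 returns {:cont, stack} while building and {:done, term} when the
  # value is complete, so the caller can emit it and drop the state.
  def new, do: []

  def feed(:start_object, stack), do: {:cont, [{:object, %{}, nil} | stack]}
  def feed(:start_array, stack), do: {:cont, [{:array, []} | stack]}

  def feed({:key, key}, [{:object, map, _} | rest]) do
    {:cont, [{:object, map, key} | rest]}
  end

  def feed(:end_object, [{:object, map, _} | rest]), do: insert(map, rest)
  def feed(:end_array, [{:array, items} | rest]), do: insert(Enum.reverse(items), rest)

  # Any other tagged tuple is treated as a scalar value.
  def feed({_type, value}, stack), do: insert(value, stack)

  # A finished value either completes the whole term or goes into its parent.
  defp insert(value, []), do: {:done, value}

  defp insert(value, [{:array, items} | rest]) do
    {:cont, [{:array, [value | items]} | rest]}
  end

  defp insert(value, [{:object, map, key} | rest]) do
    {:cont, [{:object, Map.put(map, key, value), nil} | rest]}
  end
end

# Events seen after the path "foo.bar" has matched in {foo: {bar: [1,2]}}.
events = [:start_array, {:integer, 1}, {:integer, 2}, :end_array]

events
|> Enum.reduce({:cont, TermBuilder.new()}, fn event, {:cont, stack} ->
  TermBuilder.feed(event, stack)
end)
|> IO.inspect()
# prints {:done, [1, 2]}
```

Only the structure currently being built lives in memory; once {:done, term} comes back, the term can be passed on and the builder state discarded.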
oh i see what you mean by queries now. jsonfilter is fine if you know upfront the shape of the json you’re decoding but not so great for the dynamic case. you could try something with :lists.prefix/2 like this:
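defmodule PrefixMatcher do
  # `queries` is a list of paths, e.g. [["foo", "bar"]]
  def init(queries) do
    {queries, %{}}
  end

  def handle_value(path, value, {queries, result}) do
    case find_query(path, queries) do
      # :unmatched must come first, a bare variable would match it too
      :unmatched -> {queries, result}
      query -> {queries, accumulate(path, value, query, result)}
    end
  end

  def finish({_, result}), do: result

  # named find_query because match?/2 clashes with Kernel.match?/2;
  # returns the first query that is a prefix of the path using
  # `:lists.prefix/2`, or `:unmatched` if there is no match
  defp find_query(path, queries) do
    Enum.find(queries, :unmatched, &(:lists.prefix(&1, path)))
  end

  # results are kept in a map keyed by query (one possible reading of
  # "return the value directly"); only the simple cases are covered
  defp accumulate(path, value, query, result) do
    case Enum.drop(path, length(query)) do
      # the path minus the query prefix is empty: store the value directly
      [] -> Map.put(result, query, value)
      # it is an integer: allocate a list if needed and append the value
      [index] when is_integer(index) -> Map.update(result, query, [value], &(&1 ++ [value]))
      # it is a key: allocate a map if needed and insert the key and the value
      [key] -> Map.update(result, query, %{key => value}, &Map.put(&1, key, value))
      # anything else is still an exercise for the reader, things get more complex...
      rest -> raise ArgumentError, "nested remainder not handled: #{inspect(rest)}"
    end
  end
end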
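For the “things get more complex” case, one possible approach (a sketch of my own, not something jsonfilter or jsx provides) is to store array elements in index-keyed maps while values stream in, and to convert those maps to lists only at the end. NestedAccumulator, put/3 and finalize/1 are hypothetical names, and the sketch assumes handle_value is only ever called with scalar values.

```elixir
defmodule NestedAccumulator do
  # Hypothetical helper. `rest` is the path minus the matched query prefix.
  # An empty remainder means the value itself is the result.
  def put(_acc, [], value), do: value

  def put(acc, [segment | rest], value) do
    # Allocate a map on first use; arrays are index-keyed maps for now.
    acc = acc || %{}
    Map.put(acc, segment, put(Map.get(acc, segment), rest, value))
  end

  # Turns index-keyed maps back into lists once the stream is finished.
  def finalize(map) when is_map(map) do
    entries = Enum.map(map, fn {key, value} -> {key, finalize(value)} end)

    if entries != [] and Enum.all?(entries, fn {key, _} -> is_integer(key) end) do
      entries |> Enum.sort_by(&elem(&1, 0)) |> Enum.map(&elem(&1, 1))
    else
      Map.new(entries)
    end
  end

  def finalize(other), do: other
end

nil
|> NestedAccumulator.put(["a", 0, "b"], 1)
|> NestedAccumulator.put(["a", 1, "b"], 2)
|> NestedAccumulator.finalize()
|> IO.inspect()
# prints %{"a" => [%{"b" => 1}, %{"b" => 2}]}
```

The trade-off is that the matched sub-structure stays in the intermediate map form until finalize/1 runs, and empty arrays or objects never show up because no scalar is reported inside them.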
Reopening an old topic with new needs: decoding large JSON files with Elixir in 2021?
Jason/Poison obviously can’t do the job. I need to stream these files somehow.
My question: are there any new/good streaming JSON libraries since 2017?
Other than jsx/exjsx, of course.