Parsing large JSON files

Hi folks,

I’m having trouble parsing a JSON file that is larger than the available memory.
I know how the JSON is structured and I’d like to “emit” sub-structures based on a query.
The well-known Poison parsing library does not seem to fit my needs.
I’m trying to build an app from Open Data documents, but I can’t afford a server powerful enough to parse them in memory.

I started my own implementation of a SAX parser (basically an event-based parser), but I’m relatively new to Elixir, as you can deduce from the quality of the code :head_bandage:
I managed to emit “structure” events (object_start, array_start, array_end, key, value and so on…), but I’m now having trouble building the sub-structures.

Ideally I’d like to have the following api:

# data.json
# {
#   "foo": {
#     "bar": [{"baz":1},{"baz":2}]
#    }
# }

Parser.parse("data.json", [...queries], fn(struct, query) -> end)

Where queries would emit the following structs:

"foo" -> %{"bar" => [%{"baz" => 1}, %{"baz" => 2}]}
"foo.bar" -> [%{"baz" => 1}, %{"baz" => 2}]
"foo.bar[]" -> %{"baz" => 1}
            -> %{"baz" => 2}
"foo.bar[1]" -> %{"baz" => 2}
"foo.bar[0].baz" -> 1
"foo.bar[].baz" -> 1
                -> 2
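
One way to make these queries comparable with parser paths is to turn each query string into a list of segments first. The sketch below is only an illustration: the `QueryPath` module, the `:any` marker for `[]`, and the idea that a path is a list of string keys and integer indices are all assumptions, not part of any existing library.

```elixir
defmodule QueryPath do
  # Hypothetical helper: parses "foo.bar[0].baz" into ["foo", "bar", 0, "baz"].
  # An empty index "[]" becomes the atom :any (matches every array index).
  def parse(query) do
    ~r/[^.\[\]]+|\[\d*\]/
    |> Regex.scan(query)
    |> Enum.map(fn [token] -> to_segment(token) end)
  end

  defp to_segment("[]"), do: :any

  defp to_segment("[" <> rest) do
    rest |> String.trim_trailing("]") |> String.to_integer()
  end

  defp to_segment(key), do: key

  # True when every segment equals the path element at the same position,
  # with :any accepting any integer index.
  def matches?(segments, path) when length(segments) == length(path) do
    segments
    |> Enum.zip(path)
    |> Enum.all?(fn
      {:any, index} -> is_integer(index)
      {segment, part} -> segment == part
    end)
  end

  def matches?(_segments, _path), do: false
end

QueryPath.parse("foo.bar[].baz")
# ["foo", "bar", :any, "baz"]
```

With this shape, `"foo.bar[].baz"` matches both `["foo", "bar", 0, "baz"]` and `["foo", "bar", 1, "baz"]`, which corresponds to the two emissions listed above.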

Are there already libraries that could help me achieve this? (in case I missed some)
Is GenServer the best way to implement an “event emitter”? (Is it even a valid Elixir pattern?)
Bonus algo question:
Now that I have my “structure” events, do you have any idea of how to build a “structure path” to match the queries, and how to build a valid Elixir structure from those events?

Thank you for your time!
(Don’t hesitate to ask questions if I can be more precise on some points)

Well, jsx is an Erlang application, but I’m pretty sure it can parse JSON on the fly through its callback interface, so you do not need to hold the whole document in memory.

There might be others, I’ve not looked, but as I recall jsx can do that. :slight_smile:

However, jsx has no ‘query’ syntax like the one you show. You could build one, though: something that sits on top of a jsx callback handler and picks out what is requested. That would be a great library to add to hex.pm. :slight_smile:
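
To give an idea of what such a handler has to do, here is a minimal sketch that tracks the current path while events arrive. It does not call jsx at all: the `PathTracker` module is hypothetical, and the `init/1` / `handle_event/2` shape and the event names (`:start_object`, `{:key, key}`, `{:integer, n}`, `:end_json` and so on) are my recollection of what jsx passes to a handler, so check them against the jsx documentation before relying on them.

```elixir
defmodule PathTracker do
  # Hypothetical handler sketch; event names are assumed to match jsx.
  # The stack holds one frame per open container: {:object, current_key}
  # or {:array, current_index}.
  def init(_args), do: %{stack: [], seen: []}

  def handle_event(:start_object, state), do: push(state, {:object, nil})
  def handle_event(:start_array, state), do: push(state, {:array, 0})

  def handle_event({:key, key}, %{stack: [{:object, _} | rest]} = state) do
    %{state | stack: [{:object, key} | rest]}
  end

  # Closing a container also completes one element of the parent array, if any.
  def handle_event(event, state) when event in [:end_object, :end_array] do
    state |> pop() |> advance()
  end

  def handle_event(:end_json, state), do: state

  # Any other tuple is treated as a scalar: record {path, value}.
  def handle_event({_type, value}, state) do
    advance(%{state | seen: [{path(state.stack), value} | state.seen]})
  end

  # Scalars in document order, each with the path where it was found.
  def paths(state), do: Enum.reverse(state.seen)

  defp push(state, frame), do: %{state | stack: [frame | state.stack]}
  defp pop(%{stack: [_ | rest]} = state), do: %{state | stack: rest}

  defp advance(%{stack: [{:array, i} | rest]} = state) do
    %{state | stack: [{:array, i + 1} | rest]}
  end

  defp advance(state), do: state

  defp path(stack) do
    stack |> Enum.reverse() |> Enum.map(fn {_kind, position} -> position end)
  end
end

# Events for {"foo": {"bar": [{"baz": 1}, {"baz": 2}]}}, written by hand.
events = [
  :start_object, {:key, "foo"}, :start_object, {:key, "bar"}, :start_array,
  :start_object, {:key, "baz"}, {:integer, 1}, :end_object,
  :start_object, {:key, "baz"}, {:integer, 2}, :end_object,
  :end_array, :end_object, :end_object, :end_json
]

events
|> Enum.reduce(PathTracker.init([]), &PathTracker.handle_event/2)
|> PathTracker.paths()
# [{["foo", "bar", 0, "baz"], 1}, {["foo", "bar", 1, "baz"], 2}]
```

The sketch collects every scalar in `seen` only to keep the example easy to inspect. For a file larger than memory, the handler would call a user-supplied function at that point instead of accumulating.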

Aaaah thank you!!
I had found this lib, but I was so focused on the “SAX” keyword that I did not notice they were doing this callback stuff :confused:

FYI, there is also an Elixir wrapper: exjsx

I just have to implement this “query” feature now :smiley:

Do you intend to submit a PR to exjsx for that functionality?

If my implementation works, why not :slight_smile:
I’ll post updates in this topic!

Feel free to try it too if you want :stuck_out_tongue:

Thank you! It’s almost what I need!
However, I want to emit whole structures that match a query, not only values:
{"foo": {"bar": [1, 2]}} => “foo.bar” should emit [1, 2].
If I’m not mistaken, your lib will only match ["foo", "bar", _] and call handle_value twice, with 1 and then 2.

To achieve that, I need to pass the queries when initializing my “filter” module.
Once a query is matched, the filter enters a “building” mode (I’ll likely use your :jsx_to_term module).
When the structure for that query is fully built, I’ll just have to call a handle_value kind of function and remove the query from my state, to keep memory usage low.
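
The “building” mode could be sketched as a small stack machine that consumes events from the moment a query matches until the matching container closes. This is only an illustration under assumptions: `TermBuilder` is a hypothetical module (it is not :jsx_to_term), and the event shapes are the ones I believe jsx emits.

```elixir
defmodule TermBuilder do
  # Hypothetical sketch. The state is a stack of open containers, or
  # {:done, term} once the outermost value has been closed.
  def new, do: []

  def handle_event(:start_object, stack), do: [{:object, %{}, nil} | stack]
  def handle_event(:start_array, stack), do: [{:array, []} | stack]

  # Remember the key; the next completed value is stored under it.
  def handle_event({:key, key}, [{:object, map, _} | rest]) do
    [{:object, map, key} | rest]
  end

  def handle_event(:end_object, [{:object, map, _} | rest]), do: insert(map, rest)

  # Array items are prepended while building, so reverse them on close.
  def handle_event(:end_array, [{:array, items} | rest]) do
    insert(Enum.reverse(items), rest)
  end

  def handle_event(:end_json, state), do: state
  def handle_event({:literal, :null}, stack), do: insert(nil, stack)
  def handle_event({_type, value}, stack), do: insert(value, stack)

  # Put a completed value into its parent, or finish if there is no parent.
  defp insert(value, []), do: {:done, value}

  defp insert(value, [{:object, map, key} | rest]) do
    [{:object, Map.put(map, key, value), nil} | rest]
  end

  defp insert(value, [{:array, items} | rest]) do
    [{:array, [value | items]} | rest]
  end
end

events = [
  :start_object, {:key, "bar"}, :start_array,
  {:integer, 1}, {:integer, 2}, :end_array, :end_object, :end_json
]

{:done, term} = Enum.reduce(events, TermBuilder.new(), &TermBuilder.handle_event/2)
# term is %{"bar" => [1, 2]}
```

Because the builder only holds the sub-structure for the query being built, the state can be dropped as soon as `{:done, term}` has been handed to the callback.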

I’ll get inspiration from your code, it’s clean and neat :slight_smile:

oh i see what you mean by queries now. jsonfilter is fine if you know upfront the shape of the json you’re decoding but not so great for the dynamic case. you could try something with :lists.prefix/2 like this:

defmodule PrefixMatcher do
  def init(queries) do
    {queries, %{}}
  end

  def handle_value(path, value, {queries, result}) do
    # `:unmatched` has to be checked first: a bare `query` pattern
    # matches anything, including the atom `:unmatched`
    case find_query(path, queries) do
      :unmatched -> {queries, result}
      query -> {queries, accumulate(path, value, query, result)}
    end
  end

  def finish({_, result}), do: result

  # named `find_query` because a local `match?/2` conflicts with the
  # auto-imported `Kernel.match?/2` macro
  defp find_query(path, queries) do
    # return the first query that is a prefix of the path using
    # `:lists.prefix/2`, or `:unmatched` if there is no match
    Enum.find(queries, :unmatched, &:lists.prefix(&1, path))
  end

  defp accumulate(path, value, query, result) do
    # left as an exercise for the reader! if the path minus the query prefix
    # is empty return the value directly, if the path minus query
    # is an integer possibly allocate a list and append the value to
    # it, if the path minus query is a key, possibly allocate a map and
    # insert the key and the value. if it is anything else, things get
    # more complex...
  end
end
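
For anyone who wants a starting point for the `accumulate` exercise, here is one possible sketch. It makes several assumptions that the post above does not state: paths are lists of string keys and integer indices, values arrive in document order, queries are plain prefixes (no wildcards), and the result map is keyed by the query itself. `PathAccumulator` is a hypothetical module name.

```elixir
defmodule PathAccumulator do
  # Hypothetical sketch of the exercise, under the assumptions listed above.
  # Stores the structure built so far for each query under `result[query]`.
  def accumulate(path, value, query, result) do
    rest = Enum.drop(path, length(query))
    Map.put(result, query, put_at(Map.get(result, query), rest, value))
  end

  # Nothing left of the path: the value itself is the answer.
  defp put_at(_current, [], value), do: value

  # Next segment is a key: allocate a map if needed and recurse into it.
  defp put_at(current, [key | rest], value) when is_binary(key) do
    map = current || %{}
    Map.put(map, key, put_at(Map.get(map, key), rest, value))
  end

  # Next segment is an index: allocate a list if needed, then either
  # descend into an existing element or append a new one.
  defp put_at(current, [index | rest], value) when is_integer(index) do
    list = current || []

    if index < length(list) do
      List.update_at(list, index, &put_at(&1, rest, value))
    else
      list ++ [put_at(nil, rest, value)]
    end
  end
end

result =
  %{}
  |> then(&PathAccumulator.accumulate(["foo", "bar", 0, "baz"], 1, ["foo"], &1))
  |> then(&PathAccumulator.accumulate(["foo", "bar", 1, "baz"], 2, ["foo"], &1))

# result[["foo"]] is %{"bar" => [%{"baz" => 1}, %{"baz" => 2}]}
```

Appending with `++` is linear in the list length, so very long arrays would be better built in reverse and flipped at the end. Empty objects and arrays are also not covered if the parser never reports a value for them.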

Reopening an old topic with new needs: decoding large JSON files with Elixir in 2021?
Jason/Poison obviously can’t do the job. I need to stream these files somehow.

My question: have any new, good streaming JSON libraries appeared since 2017?
Other than jsx/exjsx, of course.

It seems that Jaxon does the job.