Parse JSON without a library?

Hi all, this is my first quick question.

I have the following JSON response from the GitHub API.

'[
  {
    "sha": "7c826371c492d673eab5a20b185e4ac312deeae5",
    "node_id": "",
    "commit": {
      "author": {
        "name": ""
      }
    }
  },
  {
    "sha": "33aa225a9d9a9a8877f14a0e2bf0077216709db1",
    "node_id": "",
    "commit": {
      "author": {
        "name": ""
      }
    }
  },
  {
    "...": "..."
  }
]'

Out of this JSON I need only the first sha, but I don’t want to use one of the excellent JSON libraries, because I’m using this in an .exs script.

So for now I’m reaching my goal by converting the response to a string and then running Regex.run(~r/(?<="sha":")[a-f0-9]{40}/, string).
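For reference, here is that regex approach as a minimal runnable sketch (the JSON is a shortened stand-in for the real API response):

```elixir
# Shortened stand-in for the GitHub API response
json = ~s([{"sha":"7c826371c492d673eab5a20b185e4ac312deeae5","node_id":""}])

# The lookbehind anchors the match to the bytes right after `"sha":"`,
# so this only works while there is no whitespace after the colon
[sha] = Regex.run(~r/(?<="sha":")[a-f0-9]{40}/, json)
IO.puts(sha)
```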

I’d like to hear if there are other solutions for doing this in Elixir.
Thanks. All the best, Chris

The most obvious solution is to create a proper Mix project and build an escript from it; the mix escript.build task exists for exactly that. Then you can use a proper JSON library.

Your current solution using a regex might extract valid-looking data from invalid responses, which in my opinion is a no-go.


Curiously I was just reading this paper last week: Filter Before You Parse: Faster Analytics on Raw Data with Sparser · Stanford DAWN

I would say that if you only need the SHA, I would use :binary.matches to find the exact positions of the "sha" keys and then explicitly read the values at the proper offsets. Similar to a regex, but it may be a bit faster. Something like this:

defmodule ExtractSHAs do
  def extract(contents) do
    # :binary.matches/2 returns a {start, length} tuple per occurrence
    for {start, _length} <- :binary.matches(contents, "\"sha\":") do
      # Build the lookup scope from after the key until the end of the binary
      scope_start = start + 6
      scope_length = byte_size(contents) - scope_start

      # Find the next quote
      {quote_start, _} = :binary.match(contents, "\"", scope: {scope_start, scope_length})

      # Extract the SHA, which is always 40 bytes
      :binary.part(contents, quote_start + 1, 40)
    end
  end
end

IO.inspect ExtractSHAs.extract """
[
  {
    "sha": "7c826371c492d673eab5a20b185e4ac312deeae5",
    "node_id": "",
    "commit": {
      "author": {
        "name": ""
      }
    }
  },
  {
    "sha": "33aa225a9d9a9a8877f14a0e2bf0077216709db1",
    "node_id": "",
    "commit": {
      "author": {
        "name": ""
      }
    }
  },
  {
    "...": "..."
  }
]
"""

which returns:

["7c826371c492d673eab5a20b185e4ac312deeae5",
 "33aa225a9d9a9a8877f14a0e2bf0077216709db1"]

When could it return invalid results? Is there any other place where "sha": could appear in JSON except as a key?


Yes, if the response is not proper JSON at all but happens to contain something that matches by coincidence.

It also has to be said that both variants will fail on valid JSON that is formatted differently.

Both of you assume that the value appears in the JSON without any whitespace after the key, which does not have to be true. GitHub may change the pretty-printing at any time, and the regex given by the OP cannot even parse the example JSON from the OP because of this.
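To illustrate with a quick sketch, one space after the colon is enough to defeat the lookbehind:

```elixir
# Pretty-printed variant: a space separates the key from the value,
# so the lookbehind `(?<="sha":")` finds nothing and Regex.run returns nil
pretty = ~s({"sha": "7c826371c492d673eab5a20b185e4ac312deeae5"})
IO.inspect(Regex.run(~r/(?<="sha":")[a-f0-9]{40}/, pretty))
```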


@josevalim: Very very cool. Perfectly serves my needs. Thank you - also for creating Elixir :heart_eyes:.

@NobbZ:
An escript sadly won’t work for this special purpose of mine, or rather, it is not a practical solution I want to work with in this case.
But your point about losing the certainty of getting the sha out of valid JSON, or breaking on changes in the GitHub JSON response formatting, is of course valid.
Fortunately my script won’t break anything important if it fails.
Otherwise I would of course stop just scripting it. For now it works (nothing more; I’m still early into Elixir).

Thank you for your thoughts on my question.

If you are interested - my script: Script to Update the Projects Nix Pkgs Version

My implementation looks for the first quote right after "sha":, so it should work just fine if the JSON is formatted differently.
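As a quick check, here is the same scan as a one-off sketch, with extra whitespace deliberately inserted between key and value:

```elixir
# Same idea as the module above: find `"sha":`, skip ahead to the
# next quote (however much whitespace is in between), take 40 bytes
contents = ~s({ "sha":    "33aa225a9d9a9a8877f14a0e2bf0077216709db1" })

{start, _length} = :binary.match(contents, ~s("sha":))
scope_start = start + 6
scope_length = byte_size(contents) - scope_start
{quote_start, _} = :binary.match(contents, ~s("), scope: {scope_start, scope_length})

IO.puts(:binary.part(contents, quote_start + 1, 40))
```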

The author’s name could be "sha":"foobar", i.e. something like {..., "name":"\"sha\":\"foobar\""}. Admittedly it’s probably just a theoretical issue in this example, since only the first sha is needed.

Yeah, okay, I misread the offset calculation on first read. Still, you might hit cases like the one @sasajuric pointed out: you treat everything up to the next quote as skippable whitespace and then simply take 40 characters after it, which might or might not be what you actually want; we don’t even know whether what we got is JSON.

Your example won’t match, because the quotes inside the name need to be escaped. So "sha": is not the same as "sha\":.

If you can identify the key, then it is guaranteed that the following entry is the value. Since this is a publicly documented endpoint, I am not worried about the risk of the result changing. I can always expect SHAs to be strings (maybe not always 40 chars; I would need to check their docs, but that’s straightforward to change). Anything else would be a breaking change. So I am rather interested in false positives/negatives.

To be more concrete, here is an example where the approach above would decidedly break: if there are nested documents and the inner documents also contain the “sha” key. The paper I linked above talks about this, and the Mison paper it links to goes even more in depth. So I don’t believe there is a valid JSON document that would give false positives/negatives, except for the already mentioned case of duplicated keys.

Good point!
I guess the only false positive I can think of would be something like {"\"sha": "foobar"}, but that’s admittedly far-fetched.
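The raw bytes of that document really do contain the six-byte pattern "sha":, so a plain substring scan would report a hit there. A quick sketch:

```elixir
# Key is literally `\"sha` (backslash, quote, sha), but the raw bytes
# still contain the pattern `"sha":` starting right after the backslash
tricky = ~S({"\"sha": "foobar"})
IO.inspect(:binary.match(tricky, ~s("sha":)))
# finds a match at {3, 6}
```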

This is IMO the key to your approach. In the context of this problem, we can assume a well-defined response, and the fact that quotes must be escaped (i.e. single quotes are not valid JSON) prevents possible ambiguities in free-form fields (e.g. name).

Normally, I’d still go for JSON parsing (because it requires fewer assumptions). But I agree that this is a valid approach in some circumstances (e.g. when using deps is not an option, or speed is critical).


Another option might be to shell out to a command line tool like jq to do the actual JSON work. Not sure what the performance needs are here though.
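A rough sketch of what that could look like from Elixir, assuming jq is installed and on the PATH (the temp-file path and filename here are just illustrative):

```elixir
# Write the response to a temp file and let jq do the JSON work
json = ~s([{"sha": "7c826371c492d673eab5a20b185e4ac312deeae5"}])
path = Path.join(System.tmp_dir!(), "commits.json")
File.write!(path, json)

# `-r` prints the raw string without surrounding quotes
{out, 0} = System.cmd("jq", ["-r", ".[0].sha", path])
IO.puts(String.trim(out))
```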

That would of course always be an option, but then there is less of an argument for writing the script in Elixir in the first place.

Except that it’s really nice that Elixir, from a script, can use shell tools to accomplish some goals. :slight_smile:

My thought was that if you need to traverse the whole body of the JSON and extract just the sha fields, then something like jq could simply ingest the file and give them all back to you at once.

But you’re right… that’s more “get this done” than a “Pure Elixir” solution.

Another solution is to use mix_script, which lets you use libraries in scripts. The trick is that a project is generated from the script automatically and then compiled like a normal project. Maybe this situation is not the best example, but for me it’s really useful.